Documentation clarifications.

Philip.Hazel 2015-01-26 14:21:45 +00:00
parent 5438fc8a6a
commit 221cf10975
7 changed files with 84 additions and 53 deletions

16
README

@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-library, and UTF-32 Unicode character strings in the 32-bit library, you can
+library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
When Unicode support is available, the use of a UTF encoding still has to be
-enabled by an option at run time. When PCRE2 is compiled with Unicode
-support, its input can only either be ASCII or UTF-8/16/32, even when running
-on EBCDIC platforms. It is not possible to use both --enable-unicode and
---enable-ebcdic at the same time.
+enabled by setting the PCRE2_UTF option at run time or starting a pattern
+with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+not possible to use both --enable-unicode and --enable-ebcdic at the same
+time.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
However, only the basic two-letter properties such as Lu are supported.
+Escape sequences such as \d and \w in patterns do not by default make use of
+Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
of the preceding, or any of the Unicode newline sequences, as indicating the
@ -825,4 +829,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015


@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-library, and UTF-32 Unicode character strings in the 32-bit library, you can
+library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
When Unicode support is available, the use of a UTF encoding still has to be
-enabled by an option at run time. When PCRE2 is compiled with Unicode
-support, its input can only either be ASCII or UTF-8/16/32, even when running
-on EBCDIC platforms. It is not possible to use both --enable-unicode and
---enable-ebcdic at the same time.
+enabled by setting the PCRE2_UTF option at run time or starting a pattern
+with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+not possible to use both --enable-unicode and --enable-ebcdic at the same
+time.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
However, only the basic two-letter properties such as Lu are supported.
+Escape sequences such as \d and \w in patterns do not by default make use of
+Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
of the preceding, or any of the Unicode newline sequences, as indicating the
@ -825,4 +829,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015


@ -127,8 +127,10 @@ in the same configuration.
</P>
<P>
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call <b>pcre2_compile()</b> to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
</P>
<P>
UTF support allows the libraries to process character code points up to
@ -139,6 +141,12 @@ as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
+<P>
+Pattern escapes such as \d and \w do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
+</P>
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiler support is included in the build by specifying
@ -471,9 +479,9 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 November 2014
+Last updated: 26 January 2015
<br>
-Copyright &copy; 1997-2014 University of Cambridge.
+Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
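
Reviewer's note: the added paragraphs describe two ways of switching on UTF and UCP handling (an option set by the application, or a leading verb in the pattern) plus a lockout. The following is a toy Python model of that rule, not PCRE2 code. The constant names, the `effective_options` function, and the choice to raise an error when a locked-out verb is seen are all assumptions made for illustration.

```python
# Toy model only: these names are illustrative and are not PCRE2's C API.
UTF, UCP, NEVER_UTF, NEVER_UCP = 1, 2, 4, 8

# Leading pattern verb -> (option it turns on, option that locks it out)
VERBS = {"(*UTF)": (UTF, NEVER_UTF), "(*UCP)": (UCP, NEVER_UCP)}


def effective_options(pattern, app_options=0):
    """Return (options in force, pattern with its leading verbs removed)."""
    options = app_options
    progressed = True
    while progressed:
        progressed = False
        for verb, (enable, lockout) in VERBS.items():
            if pattern.startswith(verb):
                # Assumption: a locked-out verb is reported as an error.
                if app_options & lockout:
                    raise ValueError(verb + " is locked out by the application")
                options |= enable
                pattern = pattern[len(verb):]
                progressed = True
    return options, pattern


if __name__ == "__main__":
    # Both features requested by the pattern itself.
    print(effective_options(r"(*UTF)(*UCP)\w+"))
    # The same features requested by the application instead.
    print(effective_options(r"\w+", UTF | UCP))
```

In this model the application keeps the final say: passing `NEVER_UTF` or `NEVER_UCP` makes the corresponding verb fail instead of silently enabling the feature.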


@ -110,7 +110,7 @@ Unicode property support
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
table.
</P>
<P>
@ -572,8 +572,8 @@ Unicode is discouraged.
</P>
<P>
By default, characters whose code points are greater than 127 never match \d,
-\s, or \w, and always match \D, \S, and \W, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\s, or \w, and always match \D, \S, and \W, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
@ -1369,11 +1369,12 @@ syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
</P>
<P>
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-<b>pcre2_compile()</b>, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to <b>pcre2_compile()</b>, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
<pre>
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
@ -1408,12 +1409,12 @@ not controls, that is, characters with the Zs property.
<P>
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
property.
</P>
<P>
The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
</P>
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
<P>
@ -3248,7 +3249,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 02 January 2015
+Last updated: 26 January 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>


@ -2874,9 +2874,10 @@ UNICODE AND UTF SUPPORT
another without, in the same configuration.
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
-UTF-16 or UTF-32. To do that, applications that use the library have to
-set the PCRE2_UTF option when they call pcre2_compile() to compile a
-pattern.
+UTF-16 or UTF-32. To do that, applications that use the library can set
+the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
+tern. Alternatively, patterns may be started with (*UTF) unless the
+application has locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for
@ -2885,6 +2886,11 @@ UNICODE AND UTF SUPPORT
such as Lu and Nd are supported. Details are given in the pcre2pattern
documentation.
+Pattern escapes such as \d and \w do not by default make use of Unicode
+properties. The application can request that they do by setting the
+PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
+pattern may also request this by starting with (*UCP).
JUST-IN-TIME COMPILER SUPPORT
@ -3226,8 +3232,8 @@ AUTHOR
REVISION
-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 26 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "26 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@ -113,8 +113,10 @@ is not possible to build one library with Unicode support, and another without,
in the same configuration.
.P
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call \fBpcre2_compile()\fP to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
.P
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for
@ -125,6 +127,11 @@ as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
\fBpcre2pattern\fP
.\"
documentation.
+.P
+Pattern escapes such as \ed and \ew do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
.
.
.SH "JUST-IN-TIME COMPILER SUPPORT"
@ -487,6 +494,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 26 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
.fi


@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PATTERN 3 "26 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -73,7 +73,7 @@ appearance in a pattern causes an error.
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
table.
.P
Some applications that allow their users to supply patterns may wish to
@ -575,8 +575,8 @@ accented letters, and these are then matched by \ew. The use of locales with
Unicode is discouraged.
.P
By default, characters whose code points are greater than 127 never match \ed,
-\es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
@ -1369,11 +1369,12 @@ matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
.P
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-\fBpcre2_compile()\fP, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
.sp
[:alnum:] becomes \ep{Xan}
[:alpha:] becomes \ep{L}
@ -1404,11 +1405,11 @@ not controls, that is, characters with the Zs property.
.TP 10
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
property.
.P
The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
.
.
.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
@ -3292,6 +3293,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 02 January 2015
+Last updated: 26 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
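
Reviewer's note: the pcre2pattern changes state that characters above code point 127 never match \d by default, and that PCRE2_UCP switches such escapes to Unicode properties. The sketch below models that difference in Python rather than calling PCRE2. It assumes \d corresponds to general category Nd when UCP is set, and it ignores locale-specific tables, so the 128-255 caveat is not modelled. Both function names are hypothetical.

```python
import unicodedata


def matches_d_default(ch):
    """Default \\d as modelled here: ASCII digits only."""
    return "0" <= ch <= "9"


def matches_d_ucp(ch):
    """\\d with UCP as modelled here: Unicode general category Nd."""
    return unicodedata.category(ch) == "Nd"


if __name__ == "__main__":
    # ASCII digit, ARABIC-INDIC DIGIT THREE, LATIN SMALL LETTER E WITH ACUTE
    for ch in ("7", "\u0663", "\u00e9"):
        print("U+%04X" % ord(ch), matches_d_default(ch), matches_d_ucp(ch))
```

Only the Arabic-Indic digit changes classification between the two functions, which is the kind of difference the UCP option is meant to expose.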