Documentation clarifications.
parent 5438fc8a6a
commit 221cf10975

README (16 lines changed)
@@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.

 . If you do not want to make use of the support for UTF-8 Unicode character
 strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-library, and UTF-32 Unicode character strings in the 32-bit library, you can
+library, or UTF-32 Unicode character strings in the 32-bit library, you can
 add --disable-unicode to the "configure" command. This reduces the size of
 the libraries. It is not possible to configure one library with Unicode
 support, and another without, in the same configuration.

 When Unicode support is available, the use of a UTF encoding still has to be
-enabled by an option at run time. When PCRE2 is compiled with Unicode
-support, its input can only either be ASCII or UTF-8/16/32, even when running
-on EBCDIC platforms. It is not possible to use both --enable-unicode and
---enable-ebcdic at the same time.
+enabled by setting the PCRE2_UTF option at run time or starting a pattern
+with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+not possible to use both --enable-unicode and --enable-ebcdic at the same
+time.

 As well as supporting UTF strings, Unicode support includes support for the
 \P, \p, and \X sequences that recognize Unicode character properties.
 However, only the basic two-letter properties such as Lu are supported.
+Escape sequences such as \d and \w in patterns do not by default make use of
+Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+or starting a pattern with (*UCP).

 . You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
 of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015

@@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.

 . If you do not want to make use of the support for UTF-8 Unicode character
 strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-library, and UTF-32 Unicode character strings in the 32-bit library, you can
+library, or UTF-32 Unicode character strings in the 32-bit library, you can
 add --disable-unicode to the "configure" command. This reduces the size of
 the libraries. It is not possible to configure one library with Unicode
 support, and another without, in the same configuration.

 When Unicode support is available, the use of a UTF encoding still has to be
-enabled by an option at run time. When PCRE2 is compiled with Unicode
-support, its input can only either be ASCII or UTF-8/16/32, even when running
-on EBCDIC platforms. It is not possible to use both --enable-unicode and
---enable-ebcdic at the same time.
+enabled by setting the PCRE2_UTF option at run time or starting a pattern
+with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+not possible to use both --enable-unicode and --enable-ebcdic at the same
+time.

 As well as supporting UTF strings, Unicode support includes support for the
 \P, \p, and \X sequences that recognize Unicode character properties.
 However, only the basic two-letter properties such as Lu are supported.
+Escape sequences such as \d and \w in patterns do not by default make use of
+Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+or starting a pattern with (*UCP).

 . You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
 of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015

@@ -127,8 +127,10 @@ in the same configuration.
 </P>
 <P>
 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call <b>pcre2_compile()</b> to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
 </P>
 <P>
 UTF support allows the libraries to process character code points up to
@@ -139,6 +141,12 @@ as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation.
 </P>
+<P>
+Pattern escapes such as \d and \w do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
+</P>
 <br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
 <P>
 Just-in-time compiler support is included in the build by specifying
@@ -471,9 +479,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 23 November 2014
+Last updated: 26 January 2015
 <br>
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.

@@ -110,7 +110,7 @@ Unicode property support
 Another special sequence that may appear at the start of a pattern is (*UCP).
 This has the same effect as setting the PCRE2_UCP option: it causes sequences
 such as \d and \w to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
 table.
 </P>
 <P>
@@ -572,8 +572,8 @@ Unicode is discouraged.
 </P>
 <P>
 By default, characters whose code points are greater than 127 never match \d,
-\s, or \w, and always match \D, \S, and \W, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\s, or \w, and always match \D, \S, and \W, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
 These escape sequences retain their original meanings from before Unicode
 support was available, mainly for efficiency reasons. If the PCRE2_UCP option
 is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@ syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
 supported, and an error is given if they are encountered.
 </P>
 <P>
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-<b>pcre2_compile()</b>, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to <b>pcre2_compile()</b>, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
 <pre>
 [:alnum:] becomes \p{Xan}
 [:alpha:] becomes \p{L}
@@ -1408,12 +1409,12 @@ not controls, that is, characters with the Zs property.
 <P>
 [:punct:]
 This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
 property.
 </P>
 <P>
 The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
 </P>
 <br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
 <P>
@@ -3248,7 +3249,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 January 2015
+Last updated: 26 January 2015
 <br>
 Copyright © 1997-2015 University of Cambridge.
 <br>

@@ -2874,9 +2874,10 @@ UNICODE AND UTF SUPPORT
 another without, in the same configuration.

 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
-UTF-16 or UTF-32. To do that, applications that use the library have to
-set the PCRE2_UTF option when they call pcre2_compile() to compile a
-pattern.
+UTF-16 or UTF-32. To do that, applications that use the library can set
+the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
+tern. Alternatively, patterns may be started with (*UTF) unless the
+application has locked this out by setting PCRE2_NEVER_UTF.

 UTF support allows the libraries to process character code points up to
 0x10ffff in the strings that they handle. It also provides support for
@@ -2885,6 +2886,11 @@ UNICODE AND UTF SUPPORT
 such as Lu and Nd are supported. Details are given in the pcre2pattern
 documentation.

+Pattern escapes such as \d and \w do not by default make use of Unicode
+properties. The application can request that they do by setting the
+PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
+pattern may also request this by starting with (*UCP).
+

 JUST-IN-TIME COMPILER SUPPORT

@@ -3226,8 +3232,8 @@ AUTHOR

 REVISION

-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 26 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
 ------------------------------------------------------------------------------


@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "26 January 2015" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .
@@ -113,8 +113,10 @@ is not possible to build one library with Unicode support, and another without,
 in the same configuration.
 .P
 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call \fBpcre2_compile()\fP to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
 .P
 UTF support allows the libraries to process character code points up to
 0x10ffff in the strings that they handle. It also provides support for
@@ -125,6 +127,11 @@ as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
 \fBpcre2pattern\fP
 .\"
 documentation.
+.P
+Pattern escapes such as \ed and \ew do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
 .
 .
 .SH "JUST-IN-TIME COMPILER SUPPORT"
@@ -487,6 +494,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 26 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
 .fi

@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PATTERN 3 "26 January 2015" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -73,7 +73,7 @@ appearance in a pattern causes an error.
 Another special sequence that may appear at the start of a pattern is (*UCP).
 This has the same effect as setting the PCRE2_UCP option: it causes sequences
 such as \ed and \ew to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
 table.
 .P
 Some applications that allow their users to supply patterns may wish to
@@ -575,8 +575,8 @@ accented letters, and these are then matched by \ew. The use of locales with
 Unicode is discouraged.
 .P
 By default, characters whose code points are greater than 127 never match \ed,
-\es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
 These escape sequences retain their original meanings from before Unicode
 support was available, mainly for efficiency reasons. If the PCRE2_UCP option
 is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@ matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
 supported, and an error is given if they are encountered.
 .P
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-\fBpcre2_compile()\fP, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
 .sp
 [:alnum:] becomes \ep{Xan}
 [:alpha:] becomes \ep{L}
@@ -1404,11 +1405,11 @@ not controls, that is, characters with the Zs property.
 .TP 10
 [:punct:]
 This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
 property.
 .P
 The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
 .
 .
 .SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
@@ -3292,6 +3293,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 02 January 2015
+Last updated: 26 January 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi