Documentation clarifications.
parent 5438fc8a6a
commit 221cf10975
16
README
16
README
|
@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.
|
||||||
|
|
||||||
. If you do not want to make use of the support for UTF-8 Unicode character
|
. If you do not want to make use of the support for UTF-8 Unicode character
|
||||||
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
|
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
|
||||||
library, and UTF-32 Unicode character strings in the 32-bit library, you can
|
library, or UTF-32 Unicode character strings in the 32-bit library, you can
|
||||||
add --disable-unicode to the "configure" command. This reduces the size of
|
add --disable-unicode to the "configure" command. This reduces the size of
|
||||||
the libraries. It is not possible to configure one library with Unicode
|
the libraries. It is not possible to configure one library with Unicode
|
||||||
support, and another without, in the same configuration.
|
support, and another without, in the same configuration.
|
||||||
|
|
||||||
When Unicode support is available, the use of a UTF encoding still has to be
|
When Unicode support is available, the use of a UTF encoding still has to be
|
||||||
enabled by an option at run time. When PCRE2 is compiled with Unicode
|
enabled by setting the PCRE2_UTF option at run time or starting a pattern
|
||||||
support, its input can only either be ASCII or UTF-8/16/32, even when running
|
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
|
||||||
on EBCDIC platforms. It is not possible to use both --enable-unicode and
|
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
|
||||||
--enable-ebcdic at the same time.
|
not possible to use both --enable-unicode and --enable-ebcdic at the same
|
||||||
|
time.
|
||||||
|
|
||||||
As well as supporting UTF strings, Unicode support includes support for the
|
As well as supporting UTF strings, Unicode support includes support for the
|
||||||
\P, \p, and \X sequences that recognize Unicode character properties.
|
\P, \p, and \X sequences that recognize Unicode character properties.
|
||||||
However, only the basic two-letter properties such as Lu are supported.
|
However, only the basic two-letter properties such as Lu are supported.
|
||||||
|
Escape sequences such as \d and \w in patterns do not by default make use of
|
||||||
|
Unicode properties, but can be made to do so by setting the PCRE2_UCP option
|
||||||
|
or starting a pattern with (*UCP).
|
||||||
|
|
||||||
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
|
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
|
||||||
of the preceding, or any of the Unicode newline sequences, as indicating the
|
of the preceding, or any of the Unicode newline sequences, as indicating the
|
||||||
|
@ -825,4 +829,4 @@ The distribution should contain the files listed below.
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
Email local part: ph10
|
Email local part: ph10
|
||||||
Email domain: cam.ac.uk
|
Email domain: cam.ac.uk
|
||||||
Last updated: 20 January 2015
|
Last updated: 26 January 2015
|
||||||
|
|
|
@ -127,8 +127,10 @@ in the same configuration.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
||||||
or UTF-32. To do that, applications that use the library have to set the
|
or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
|
||||||
PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
|
option when they call <b>pcre2_compile()</b> to compile a pattern.
|
||||||
|
Alternatively, patterns may be started with (*UTF) unless the application has
|
||||||
|
locked this out by setting PCRE2_NEVER_UTF.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
|
@ -139,6 +141,12 @@ as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation.
|
documentation.
|
||||||
</P>
|
</P>
|
||||||
|
<P>
|
||||||
|
Pattern escapes such as \d and \w do not by default make use of Unicode
|
||||||
|
properties. The application can request that they do by setting the PCRE2_UCP
|
||||||
|
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
||||||
|
request this by starting with (*UCP).
|
||||||
|
</P>
|
||||||
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
Just-in-time compiler support is included in the build by specifying
|
Just-in-time compiler support is included in the build by specifying
|
||||||
|
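The two added paragraphs in this file say that a pattern may switch on UTF or UCP handling itself by starting with (*UTF) or (*UCP), unless the application has locked that out. The following is a minimal sketch of that rule, not PCRE2's own code: the flag values and the function name `apply_leading_verbs` are hypothetical, and a real program would pass the PCRE2_UTF, PCRE2_UCP, PCRE2_NEVER_UTF and PCRE2_NEVER_UCP constants from pcre2.h to pcre2_compile() instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flag values for this sketch only. Real programs should use
   PCRE2_UTF, PCRE2_UCP, PCRE2_NEVER_UTF and PCRE2_NEVER_UCP from pcre2.h. */
#define OPT_UTF        0x01u
#define OPT_UCP        0x02u
#define OPT_NEVER_UTF  0x04u
#define OPT_NEVER_UCP  0x08u

/* Scan any leading (*UTF) and (*UCP) items in a pattern.
   On success, returns 0, adds the requested options to *options, and stores
   the number of bytes consumed in *skip.
   Returns -1 (leaving *skip untouched) if the pattern asks for something the
   application has locked out with the matching NEVER flag. */
int apply_leading_verbs(const char *pattern, uint32_t *options, size_t *skip)
{
    size_t pos = 0;

    for (;;) {
        if (strncmp(pattern + pos, "(*UTF)", 6) == 0) {
            if (*options & OPT_NEVER_UTF) return -1;   /* locked out */
            *options |= OPT_UTF;
            pos += 6;
        } else if (strncmp(pattern + pos, "(*UCP)", 6) == 0) {
            if (*options & OPT_NEVER_UCP) return -1;   /* locked out */
            *options |= OPT_UCP;
            pos += 6;
        } else {
            break;   /* first item that is not one of the two verbs */
        }
    }

    *skip = pos;
    return 0;
}
```

Under these assumptions, a pattern such as `(*UTF)(*UCP)\w+` ends up with both options set, while `(*UTF)abc` is rejected when the caller has set the lockout flag for UTF.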
@ -471,9 +479,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 November 2014
|
Last updated: 26 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -110,7 +110,7 @@ Unicode property support
|
||||||
Another special sequence that may appear at the start of a pattern is (*UCP).
|
Another special sequence that may appear at the start of a pattern is (*UCP).
|
||||||
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
||||||
such as \d and \w to use Unicode properties to determine character types,
|
such as \d and \w to use Unicode properties to determine character types,
|
||||||
instead of recognizing only characters with codes less than 128 via a lookup
|
instead of recognizing only characters with codes less than 256 via a lookup
|
||||||
table.
|
table.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
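The corrected sentence in this hunk describes the default behaviour: character types are decided by a lookup table that covers only codes less than 256. A minimal sketch of that idea follows. It is an illustration built on the standard `<ctype.h>` functions, not PCRE2's internal table layout, and the names `CT_DIGIT`, `CT_SPACE`, `CT_WORD`, `build_ctype_table` and `table_has_type` are hypothetical.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical type bits for this sketch, not PCRE2's internal names. */
enum { CT_DIGIT = 0x01, CT_SPACE = 0x02, CT_WORD = 0x04 };

/* One entry per code 0..255. */
static unsigned char ctype_table[256];

/* Fill the table from the C library's classification functions. These
   follow the current locale, which is "C" unless setlocale() was called,
   so entries for 128..255 can differ between locales. */
void build_ctype_table(void)
{
    for (int c = 0; c < 256; c++) {
        unsigned char bits = 0;
        if (isdigit(c)) bits |= CT_DIGIT;
        if (isspace(c)) bits |= CT_SPACE;
        if (isalnum(c) || c == '_') bits |= CT_WORD;
        ctype_table[c] = bits;
    }
}

/* Table-only classification: a code point of 256 or more has no table
   entry, so it never has any of the types. */
int table_has_type(uint32_t cp, unsigned type_bit)
{
    return cp < 256 && (ctype_table[cp] & type_bit) != 0;
}
```

After `build_ctype_table()` has run, `table_has_type('7', CT_DIGIT)` is true, whereas a digit outside the table, such as U+0663 ARABIC-INDIC DIGIT THREE, is not recognized. Recognizing such characters is what property-based matching under PCRE2_UCP or (*UCP) provides.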
@ -572,8 +572,8 @@ Unicode is discouraged.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
By default, characters whose code points are greater than 127 never match \d,
|
By default, characters whose code points are greater than 127 never match \d,
|
||||||
\s, or \w, and always match \D, \S, and \W, although this may vary for
|
\s, or \w, and always match \D, \S, and \W, although this may be different
|
||||||
characters in the range 128-255 when locale-specific matching is happening.
|
for characters in the range 128-255 when locale-specific matching is happening.
|
||||||
These escape sequences retain their original meanings from before Unicode
|
These escape sequences retain their original meanings from before Unicode
|
||||||
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
|
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
|
||||||
is set, the behaviour is changed so that Unicode properties are used to
|
is set, the behaviour is changed so that Unicode properties are used to
|
||||||
|
@ -1369,11 +1369,12 @@ syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
|
||||||
supported, and an error is given if they are encountered.
|
supported, and an error is given if they are encountered.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
By default, characters with values greater than 128 do not match any of the
|
By default, characters with values greater than 127 do not match any of the
|
||||||
POSIX character classes. However, if the PCRE2_UCP option is passed to
|
POSIX character classes, although this may be different for characters in the
|
||||||
<b>pcre2_compile()</b>, some of the classes are changed so that Unicode
|
range 128-255 when locale-specific matching is happening. However, if the
|
||||||
character properties are used. This is achieved by replacing certain POSIX
|
PCRE2_UCP option is passed to <b>pcre2_compile()</b>, some of the classes are
|
||||||
classes by other sequences, as follows:
|
changed so that Unicode character properties are used. This is achieved by
|
||||||
|
replacing certain POSIX classes with other sequences, as follows:
|
||||||
<pre>
|
<pre>
|
||||||
[:alnum:] becomes \p{Xan}
|
[:alnum:] becomes \p{Xan}
|
||||||
[:alpha:] becomes \p{L}
|
[:alpha:] becomes \p{L}
|
||||||
|
@ -1408,12 +1409,12 @@ not controls, that is, characters with the Zs property.
|
||||||
<P>
|
<P>
|
||||||
[:punct:]
|
[:punct:]
|
||||||
This matches all characters that have the Unicode P (punctuation) property,
|
This matches all characters that have the Unicode P (punctuation) property,
|
||||||
plus those characters with code points less than 128 that have the S (Symbol)
|
plus those characters with code points less than 256 that have the S (Symbol)
|
||||||
property.
|
property.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The other POSIX classes are unchanged, and match only characters with code
|
The other POSIX classes are unchanged, and match only characters with code
|
||||||
points less than 128.
|
points less than 256.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
|
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -3248,7 +3249,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 02 January 2015
|
Last updated: 26 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -2874,9 +2874,10 @@ UNICODE AND UTF SUPPORT
|
||||||
another without, in the same configuration.
|
another without, in the same configuration.
|
||||||
|
|
||||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
|
||||||
UTF-16 or UTF-32. To do that, applications that use the library have to
|
UTF-16 or UTF-32. To do that, applications that use the library can set
|
||||||
set the PCRE2_UTF option when they call pcre2_compile() to compile a
|
the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
|
||||||
pattern.
|
tern. Alternatively, patterns may be started with (*UTF) unless the
|
||||||
|
application has locked this out by setting PCRE2_NEVER_UTF.
|
||||||
|
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
0x10ffff in the strings that they handle. It also provides support for
|
0x10ffff in the strings that they handle. It also provides support for
|
||||||
|
@ -2885,6 +2886,11 @@ UNICODE AND UTF SUPPORT
|
||||||
such as Lu and Nd are supported. Details are given in the pcre2pattern
|
such as Lu and Nd are supported. Details are given in the pcre2pattern
|
||||||
documentation.
|
documentation.
|
||||||
|
|
||||||
|
Pattern escapes such as \d and \w do not by default make use of Unicode
|
||||||
|
properties. The application can request that they do by setting the
|
||||||
|
PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
|
||||||
|
pattern may also request this by starting with (*UCP).
|
||||||
|
|
||||||
|
|
||||||
JUST-IN-TIME COMPILER SUPPORT
|
JUST-IN-TIME COMPILER SUPPORT
|
||||||
|
|
||||||
|
@ -3226,8 +3232,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 23 November 2014
|
Last updated: 26 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
|
.TH PCRE2BUILD 3 "26 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.
|
.
|
||||||
|
@ -113,8 +113,10 @@ is not possible to build one library with Unicode support, and another without,
|
||||||
in the same configuration.
|
in the same configuration.
|
||||||
.P
|
.P
|
||||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
||||||
or UTF-32. To do that, applications that use the library have to set the
|
or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
|
||||||
PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
|
option when they call \fBpcre2_compile()\fP to compile a pattern.
|
||||||
|
Alternatively, patterns may be started with (*UTF) unless the application has
|
||||||
|
locked this out by setting PCRE2_NEVER_UTF.
|
||||||
.P
|
.P
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
0x10ffff in the strings that they handle. It also provides support for
|
0x10ffff in the strings that they handle. It also provides support for
|
||||||
|
@ -125,6 +127,11 @@ as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
documentation.
|
documentation.
|
||||||
|
.P
|
||||||
|
Pattern escapes such as \ed and \ew do not by default make use of Unicode
|
||||||
|
properties. The application can request that they do by setting the PCRE2_UCP
|
||||||
|
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
||||||
|
request this by starting with (*UCP).
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "JUST-IN-TIME COMPILER SUPPORT"
|
.SH "JUST-IN-TIME COMPILER SUPPORT"
|
||||||
|
@ -487,6 +494,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 23 November 2014
|
Last updated: 26 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
|
.TH PCRE2PATTERN 3 "26 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -73,7 +73,7 @@ appearance in a pattern causes an error.
|
||||||
Another special sequence that may appear at the start of a pattern is (*UCP).
|
Another special sequence that may appear at the start of a pattern is (*UCP).
|
||||||
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
||||||
such as \ed and \ew to use Unicode properties to determine character types,
|
such as \ed and \ew to use Unicode properties to determine character types,
|
||||||
instead of recognizing only characters with codes less than 128 via a lookup
|
instead of recognizing only characters with codes less than 256 via a lookup
|
||||||
table.
|
table.
|
||||||
.P
|
.P
|
||||||
Some applications that allow their users to supply patterns may wish to
|
Some applications that allow their users to supply patterns may wish to
|
||||||
|
@ -575,8 +575,8 @@ accented letters, and these are then matched by \ew. The use of locales with
|
||||||
Unicode is discouraged.
|
Unicode is discouraged.
|
||||||
.P
|
.P
|
||||||
By default, characters whose code points are greater than 127 never match \ed,
|
By default, characters whose code points are greater than 127 never match \ed,
|
||||||
\es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
|
\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
|
||||||
characters in the range 128-255 when locale-specific matching is happening.
|
for characters in the range 128-255 when locale-specific matching is happening.
|
||||||
These escape sequences retain their original meanings from before Unicode
|
These escape sequences retain their original meanings from before Unicode
|
||||||
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
|
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
|
||||||
is set, the behaviour is changed so that Unicode properties are used to
|
is set, the behaviour is changed so that Unicode properties are used to
|
||||||
|
@ -1369,11 +1369,12 @@ matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
|
||||||
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
|
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
|
||||||
supported, and an error is given if they are encountered.
|
supported, and an error is given if they are encountered.
|
||||||
.P
|
.P
|
||||||
By default, characters with values greater than 128 do not match any of the
|
By default, characters with values greater than 127 do not match any of the
|
||||||
POSIX character classes. However, if the PCRE2_UCP option is passed to
|
POSIX character classes, although this may be different for characters in the
|
||||||
\fBpcre2_compile()\fP, some of the classes are changed so that Unicode
|
range 128-255 when locale-specific matching is happening. However, if the
|
||||||
character properties are used. This is achieved by replacing certain POSIX
|
PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
|
||||||
classes by other sequences, as follows:
|
changed so that Unicode character properties are used. This is achieved by
|
||||||
|
replacing certain POSIX classes with other sequences, as follows:
|
||||||
.sp
|
.sp
|
||||||
[:alnum:] becomes \ep{Xan}
|
[:alnum:] becomes \ep{Xan}
|
||||||
[:alpha:] becomes \ep{L}
|
[:alpha:] becomes \ep{L}
|
||||||
|
@ -1404,11 +1405,11 @@ not controls, that is, characters with the Zs property.
|
||||||
.TP 10
|
.TP 10
|
||||||
[:punct:]
|
[:punct:]
|
||||||
This matches all characters that have the Unicode P (punctuation) property,
|
This matches all characters that have the Unicode P (punctuation) property,
|
||||||
plus those characters with code points less than 128 that have the S (Symbol)
|
plus those characters with code points less than 256 that have the S (Symbol)
|
||||||
property.
|
property.
|
||||||
.P
|
.P
|
||||||
The other POSIX classes are unchanged, and match only characters with code
|
The other POSIX classes are unchanged, and match only characters with code
|
||||||
points less than 128.
|
points less than 256.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
|
.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
|
||||||
|
@ -3292,6 +3293,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 02 January 2015
|
Last updated: 26 January 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|