Documentation for PCRE2_UCP handling of upper/lower casing.
parent f50ee03f5d
commit 4e8f13cbd6
|
@ -1481,13 +1481,13 @@ documentation.
|
||||||
</pre>
|
</pre>
|
||||||
If this bit is set, letters in the pattern match both upper and lower case
|
If this bit is set, letters in the pattern match both upper and lower case
|
||||||
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
||||||
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
|
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
|
||||||
properties are used for all characters with more than one other case, and for
|
PCRE2_UCP is set, Unicode properties are used for all characters with more than
|
||||||
all characters whose code points are greater than U+007F. For lower valued
|
one other case, and for all characters whose code points are greater than
|
||||||
characters with only one other case, a lookup table is used for speed. When
|
U+007F. For lower valued characters with only one other case, a lookup table is
|
||||||
PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
|
used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
|
||||||
and higher code points (available only in 16-bit or 32-bit mode) are treated as
|
used for all code points less than 256, and higher code points (available only
|
||||||
not having another case.
|
in 16-bit or 32-bit mode) are treated as not having another case.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_DOLLAR_ENDONLY
|
PCRE2_DOLLAR_ENDONLY
|
||||||
</pre>
|
</pre>
|
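The lookup rule described in the hunk above can be modelled as a small decision function. This is a minimal sketch, not PCRE2's actual implementation: `case_source`, `case_lookup_source`, and the `multi_case` flag are hypothetical names introduced for illustration, and only the thresholds (U+007F and 256) are taken from the documentation text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the mechanism that decides a character's other case. */
typedef enum { CASE_NONE, CASE_TABLE, CASE_UNICODE } case_source;

/*
 * Hypothetical model of the documented rule.
 *   cp          code point being case-matched
 *   utf_or_ucp  true if either PCRE2_UTF or PCRE2_UCP is set
 *   multi_case  true if the character has more than one other case
 */
case_source case_lookup_source(uint32_t cp, bool utf_or_ucp, bool multi_case)
{
    if (utf_or_ucp) {
        /* Unicode properties: everything above U+007F, plus multi-case characters. */
        if (cp > 0x7F || multi_case)
            return CASE_UNICODE;
        /* Low code points with a single other case use the fast table. */
        return CASE_TABLE;
    }
    /* Neither option set: table below 256, no other case above that. */
    return cp < 256 ? CASE_TABLE : CASE_NONE;
}
```

For instance, 'k' (U+006B) has two other cases, 'K' and U+212A KELVIN SIGN, so a caller of this sketch would pass `multi_case` as true for it.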
||||||
|
@ -1820,16 +1820,23 @@ are not representable in UTF-16.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_UCP
|
PCRE2_UCP
|
||||||
</pre>
|
</pre>
|
||||||
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
|
This option has two effects. Firstly, it changes the way PCRE2 processes \B,
|
||||||
\w, and some of the POSIX character classes. By default, only ASCII characters
|
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
|
||||||
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
|
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
|
||||||
classify characters. More details are given in the section on
|
properties are used instead to classify characters. More details are given in
|
||||||
|
the section on
|
||||||
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
|
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
|
||||||
in the
|
in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
||||||
longer. The option is available only if PCRE2 has been compiled with Unicode
|
longer.
|
||||||
support (which is the default).
|
</P>
|
||||||
|
<P>
|
||||||
|
The second effect of PCRE2_UCP is to force the use of Unicode properties for
|
||||||
|
upper/lower casing operations on characters with code points greater than 127,
|
||||||
|
even when PCRE2_UTF is not set. This makes it possible, for example, to process
|
||||||
|
strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
|
||||||
|
been compiled with Unicode support (which is the default).
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_UNGREEDY
|
PCRE2_UNGREEDY
|
||||||
</pre>
|
</pre>
|
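The first effect of PCRE2_UCP described in the hunk above (ASCII-only classification by default, Unicode properties when the option is set) can be illustrated for \d with a short sketch. This is an assumption-laden model rather than PCRE2 code: `matches_backslash_d` and `nd_ranges` are hypothetical names, and the table holds only three decimal-digit ranges as an excerpt, whereas the real property data covers many more scripts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny excerpt of Unicode decimal-digit (Nd) ranges, for illustration only. */
static const uint32_t nd_ranges[][2] = {
    { 0x0030, 0x0039 },  /* ASCII digits */
    { 0x0660, 0x0669 },  /* Arabic-Indic digits */
    { 0x0966, 0x096F },  /* Devanagari digits */
};

/* Hypothetical model: does code point cp match \d, with or without PCRE2_UCP? */
bool matches_backslash_d(uint32_t cp, bool ucp)
{
    if (!ucp)
        return cp >= '0' && cp <= '9';   /* default: ASCII digits only */

    for (size_t i = 0; i < sizeof nd_ranges / sizeof nd_ranges[0]; i++)
        if (cp >= nd_ranges[i][0] && cp <= nd_ranges[i][1])
            return true;                 /* UCP: any listed decimal digit */
    return false;
}
```

The range scan also hints at why the documentation warns that matching the affected items takes longer with PCRE2_UCP: a property lookup replaces a single comparison.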
||||||
|
@ -1997,14 +2004,20 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
|
||||||
digits, or whatever, by reference to a set of tables, indexed by character code
|
digits, or whatever, by reference to a set of tables, indexed by character code
|
||||||
point. However, this applies only to characters whose code points are less than
|
point. However, this applies only to characters whose code points are less than
|
||||||
256. By default, higher-valued code points never match escapes such as \w or
|
256. By default, higher-valued code points never match escapes such as \w or
|
||||||
\d. When PCRE2 is built with Unicode support (the default), all characters can
|
\d.
|
||||||
be tested with \p and \P, or, alternatively, the PCRE2_UCP option can be set
|
</P>
|
||||||
when a pattern is compiled; this causes \w and friends to use Unicode property
|
<P>
|
||||||
support instead of the built-in tables.
|
When PCRE2 is built with Unicode support (the default), the Unicode properties
|
||||||
|
of all characters can be tested with \p and \P, or, alternatively, the
|
||||||
|
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
|
||||||
|
friends to use Unicode property support instead of the built-in tables.
|
||||||
|
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
||||||
|
points greater than 127 to use Unicode properties. These effects apply even
|
||||||
|
when PCRE2_UTF is not set.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The use of locales with Unicode is discouraged. If you are handling characters
|
The use of locales with Unicode is discouraged. If you are handling characters
|
||||||
with code points greater than 128, you should either use Unicode support, or
|
with code points greater than 127, you should either use Unicode support, or
|
||||||
use locales, but not try to mix the two.
|
use locales, but not try to mix the two.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -3494,7 +3507,10 @@ terminating a \Q quoted sequence) reverts to no case forcing. The sequences
|
||||||
\u and \l force the next character (if it is a letter) to upper or lower
|
\u and \l force the next character (if it is a letter) to upper or lower
|
||||||
case, respectively, and then the state automatically reverts to no case
|
case, respectively, and then the state automatically reverts to no case
|
||||||
forcing. Case forcing applies to all inserted characters, including those from
|
forcing. Case forcing applies to all inserted characters, including those from
|
||||||
capture groups and letters within \Q...\E quoted sequences.
|
capture groups and letters within \Q...\E quoted sequences. If either
|
||||||
|
PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
|
||||||
|
properties are used for case forcing characters whose code points are greater
|
||||||
|
than 127.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Note that case forcing sequences such as \U...\E do not nest. For example,
|
Note that case forcing sequences such as \U...\E do not nest. For example,
|
||||||
|
@ -3915,7 +3931,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 16 February 2020
|
Last updated: 24 February 2020
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2020 University of Cambridge.
|
Copyright © 1997-2020 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -114,7 +114,8 @@ Another special sequence that may appear at the start of a pattern is (*UCP).
|
||||||
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
||||||
such as \d and \w to use Unicode properties to determine character types,
|
such as \d and \w to use Unicode properties to determine character types,
|
||||||
instead of recognizing only characters with codes less than 256 via a lookup
|
instead of recognizing only characters with codes less than 256 via a lookup
|
||||||
table.
|
table. It also causes upper/lower casing operations to use Unicode properties
|
||||||
|
for characters with code points greater than 127, even when UTF is not set.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Some applications that allow their users to supply patterns may wish to
|
Some applications that allow their users to supply patterns may wish to
|
||||||
|
@ -3833,7 +3834,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 27 January 2020
|
Last updated: 24 February 2020
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2020 University of Cambridge.
|
Copyright © 1997-2020 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -19,7 +19,7 @@ UNICODE AND UTF SUPPORT
|
||||||
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
||||||
can build it without, in which case the library will be smaller. With Unicode
|
can build it without, in which case the library will be smaller. With Unicode
|
||||||
support, PCRE2 has knowledge of Unicode character properties and can process
|
support, PCRE2 has knowledge of Unicode character properties and can process
|
||||||
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
|
strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
|
||||||
width), but this is not the default. Unless specifically requested, PCRE2
|
width), but this is not the default. Unless specifically requested, PCRE2
|
||||||
treats each code unit in a string as one character.
|
treats each code unit in a string as one character.
|
||||||
</P>
|
</P>
|
||||||
|
@ -134,14 +134,16 @@ However, the special horizontal and vertical white space matching escapes (\h,
|
||||||
not PCRE2_UCP is set.
|
not PCRE2_UCP is set.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
CASE-EQUIVALENCE IN UTF MODE
|
UNICODE CASE-EQUIVALENCE
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Case-insensitive matching in UTF mode makes use of Unicode properties except
|
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
|
||||||
for characters whose code points are less than 128 and that have at most two
|
of Unicode properties except for characters whose code points are less than 128
|
||||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
and that have at most two case-equivalent values. For these, a direct table
|
||||||
few Unicode characters such as Greek sigma have more than two code points that
|
lookup is used for speed. A few Unicode characters such as Greek sigma have
|
||||||
are case-equivalent, and these are treated specially.
|
more than two code points that are case-equivalent, and these are treated
|
||||||
|
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
|
||||||
|
processing for non-UTF character encodings such as UCS-2.
|
||||||
<a name="scriptruns"></a></P>
|
<a name="scriptruns"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
SCRIPT RUNS
|
SCRIPT RUNS
|
||||||
|
@ -484,9 +486,9 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 24 May 2019
|
Last updated: 23 February 2020
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2020 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -1454,14 +1454,14 @@ COMPILING A PATTERN
|
||||||
|
|
||||||
If this bit is set, letters in the pattern match both upper and lower
|
If this bit is set, letters in the pattern match both upper and lower
|
||||||
case letters in the subject. It is equivalent to Perl's /i option, and
|
case letters in the subject. It is equivalent to Perl's /i option, and
|
||||||
it can be changed within a pattern by a (?i) option setting. If
|
it can be changed within a pattern by a (?i) option setting. If either
|
||||||
PCRE2_UTF is set, Unicode properties are used for all characters with
|
PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
|
||||||
more than one other case, and for all characters whose code points are
|
characters with more than one other case, and for all characters whose
|
||||||
greater than U+007F. For lower valued characters with only one other
|
code points are greater than U+007F. For lower valued characters with
|
||||||
case, a lookup table is used for speed. When PCRE2_UTF is not set, a
|
only one other case, a lookup table is used for speed. When neither
|
||||||
lookup table is used for all code points less than 256, and higher code
|
PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used for all code
|
||||||
points (available only in 16-bit or 32-bit mode) are treated as not
|
points less than 256, and higher code points (available only in 16-bit
|
||||||
having another case.
|
or 32-bit mode) are treated as not having another case.
|
||||||
|
|
||||||
PCRE2_DOLLAR_ENDONLY
|
PCRE2_DOLLAR_ENDONLY
|
||||||
|
|
||||||
|
@ -1786,14 +1786,20 @@ COMPILING A PATTERN
|
||||||
|
|
||||||
PCRE2_UCP
|
PCRE2_UCP
|
||||||
|
|
||||||
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
|
This option has two effects. Firstly, it changes the way PCRE2 processes
|
||||||
\w, and some of the POSIX character classes. By default, only ASCII
|
\B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character
|
||||||
characters are recognized, but if PCRE2_UCP is set, Unicode properties
|
classes. By default, only ASCII characters are recognized, but if
|
||||||
are used instead to classify characters. More details are given in the
|
PCRE2_UCP is set, Unicode properties are used instead to classify char-
|
||||||
section on generic character types in the pcre2pattern page. If you set
|
acters. More details are given in the section on generic character
|
||||||
PCRE2_UCP, matching one of the items it affects takes much longer. The
|
types in the pcre2pattern page. If you set PCRE2_UCP, matching one of
|
||||||
option is available only if PCRE2 has been compiled with Unicode sup-
|
the items it affects takes much longer.
|
||||||
port (which is the default).
|
|
||||||
|
The second effect of PCRE2_UCP is to force the use of Unicode proper-
|
||||||
|
ties for upper/lower casing operations on characters with code points
|
||||||
|
greater than 127, even when PCRE2_UTF is not set. This makes it possi-
|
||||||
|
ble, for example, to process strings in the 16-bit UCS-2 code. This op-
|
||||||
|
tion is available only if PCRE2 has been compiled with Unicode support
|
||||||
|
(which is the default).
|
||||||
|
|
||||||
PCRE2_UNGREEDY
|
PCRE2_UNGREEDY
|
||||||
|
|
||||||
|
@ -1953,14 +1959,18 @@ LOCALE SUPPORT
|
||||||
letters, digits, or whatever, by reference to a set of tables, indexed
|
letters, digits, or whatever, by reference to a set of tables, indexed
|
||||||
by character code point. However, this applies only to characters whose
|
by character code point. However, this applies only to characters whose
|
||||||
code points are less than 256. By default, higher-valued code points
|
code points are less than 256. By default, higher-valued code points
|
||||||
never match escapes such as \w or \d. When PCRE2 is built with Unicode
|
never match escapes such as \w or \d.
|
||||||
support (the default), all characters can be tested with \p and \P, or,
|
|
||||||
alternatively, the PCRE2_UCP option can be set when a pattern is com-
|
When PCRE2 is built with Unicode support (the default), the Unicode
|
||||||
piled; this causes \w and friends to use Unicode property support in-
|
properties of all characters can be tested with \p and \P, or, alterna-
|
||||||
stead of the built-in tables.
|
tively, the PCRE2_UCP option can be set when a pattern is compiled;
|
||||||
|
this causes \w and friends to use Unicode property support instead of
|
||||||
|
the built-in tables. PCRE2_UCP also causes upper/lower casing opera-
|
||||||
|
tions on characters with code points greater than 127 to use Unicode
|
||||||
|
properties. These effects apply even when PCRE2_UTF is not set.
|
||||||
|
|
||||||
The use of locales with Unicode is discouraged. If you are handling
|
The use of locales with Unicode is discouraged. If you are handling
|
||||||
characters with code points greater than 128, you should either use
|
characters with code points greater than 127, you should either use
|
||||||
Unicode support, or use locales, but not try to mix the two.
|
Unicode support, or use locales, but not try to mix the two.
|
||||||
|
|
||||||
PCRE2 contains a built-in set of character tables that are used by de-
|
PCRE2 contains a built-in set of character tables that are used by de-
|
||||||
|
@ -3375,7 +3385,9 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
||||||
it is a letter) to upper or lower case, respectively, and then the
|
it is a letter) to upper or lower case, respectively, and then the
|
||||||
state automatically reverts to no case forcing. Case forcing applies to
|
state automatically reverts to no case forcing. Case forcing applies to
|
||||||
all inserted characters, including those from capture groups and let-
|
all inserted characters, including those from capture groups and let-
|
||||||
ters within \Q...\E quoted sequences.
|
ters within \Q...\E quoted sequences. If either PCRE2_UTF or PCRE2_UCP
|
||||||
|
was set when the pattern was compiled, Unicode properties are used for
|
||||||
|
case forcing characters whose code points are greater than 127.
|
||||||
|
|
||||||
Note that case forcing sequences such as \U...\E do not nest. For exam-
|
Note that case forcing sequences such as \U...\E do not nest. For exam-
|
||||||
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
|
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
|
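The non-nesting behaviour of the case forcing sequences in the hunk above can be sketched as a single-state scanner. This is a simplified, ASCII-only model under stated assumptions, not the PCRE2 substitution code: `apply_case_forcing` and `enum forcing` are hypothetical names, unrecognized escapes are copied through unchanged, and characters above 127 are not handled (which is where the Unicode properties mentioned in the hunk would come in).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical forcing states; there is one current state, never a stack. */
enum forcing { FORCE_NONE, FORCE_UPPER, FORCE_LOWER,
               FORCE_UPPER_ONCE, FORCE_LOWER_ONCE };

/* Copies `in` to `out`, applying \U \L \u \l \E. Returns the output length. */
size_t apply_case_forcing(const char *in, char *out, size_t outsize)
{
    enum forcing state = FORCE_NONE;
    size_t n = 0;

    if (outsize == 0)
        return 0;

    for (size_t i = 0; in[i] != '\0' && n + 1 < outsize; i++) {
        if (in[i] == '\\') {
            int recognized = 1;
            switch (in[i + 1]) {
            case 'U': state = FORCE_UPPER;      break;
            case 'L': state = FORCE_LOWER;      break;
            case 'u': state = FORCE_UPPER_ONCE; break;
            case 'l': state = FORCE_LOWER_ONCE; break;
            case 'E': state = FORCE_NONE;       break; /* resets; does not pop */
            default:  recognized = 0;           break;
            }
            if (recognized) {
                i++;            /* skip the escape letter */
                continue;
            }
        }

        unsigned char c = (unsigned char)in[i];
        if (state == FORCE_UPPER || state == FORCE_UPPER_ONCE)
            c = (unsigned char)toupper(c);
        else if (state == FORCE_LOWER || state == FORCE_LOWER_ONCE)
            c = (unsigned char)tolower(c);

        /* \u and \l affect one character, then revert to no forcing. */
        if (state == FORCE_UPPER_ONCE || state == FORCE_LOWER_ONCE)
            state = FORCE_NONE;

        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

Because each sequence overwrites the single state, the documented input `\Uaa\LBB\Ecc\E` yields `AAbbcc` in this model: the first \E already returns to no forcing, so the final \E has nothing left to undo.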
||||||
|
@ -3761,7 +3773,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 16 February 2020
|
Last updated: 24 February 2020
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2020 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -6145,7 +6157,9 @@ SPECIAL START-OF-PATTERN ITEMS
|
||||||
(*UCP). This has the same effect as setting the PCRE2_UCP option: it
|
(*UCP). This has the same effect as setting the PCRE2_UCP option: it
|
||||||
causes sequences such as \d and \w to use Unicode properties to deter-
|
causes sequences such as \d and \w to use Unicode properties to deter-
|
||||||
mine character types, instead of recognizing only characters with codes
|
mine character types, instead of recognizing only characters with codes
|
||||||
less than 256 via a lookup table.
|
less than 256 via a lookup table. It also causes upper/lower casing
|
||||||
|
operations to use Unicode properties for characters with code points
|
||||||
|
greater than 127, even when UTF is not set.
|
||||||
|
|
||||||
Some applications that allow their users to supply patterns may wish to
|
Some applications that allow their users to supply patterns may wish to
|
||||||
restrict them for security reasons. If the PCRE2_NEVER_UCP option is
|
restrict them for security reasons. If the PCRE2_NEVER_UCP option is
|
||||||
|
@ -9502,7 +9516,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 27 January 2020
|
Last updated: 24 February 2020
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2020 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -10878,7 +10892,7 @@ UNICODE AND UTF SUPPORT
|
||||||
PCRE2 is normally built with Unicode support, though if you do not need
|
PCRE2 is normally built with Unicode support, though if you do not need
|
||||||
it, you can build it without, in which case the library will be
|
it, you can build it without, in which case the library will be
|
||||||
smaller. With Unicode support, PCRE2 has knowledge of Unicode character
|
smaller. With Unicode support, PCRE2 has knowledge of Unicode character
|
||||||
properties and can process text strings in UTF-8, UTF-16, or UTF-32
|
properties and can process strings of text in UTF-8, UTF-16, and UTF-32
|
||||||
format (depending on the code unit width), but this is not the default.
|
format (depending on the code unit width), but this is not the default.
|
||||||
Unless specifically requested, PCRE2 treats each code unit in a string
|
Unless specifically requested, PCRE2 treats each code unit in a string
|
||||||
as one character.
|
as one character.
|
||||||
|
@ -10974,14 +10988,16 @@ WIDE CHARACTERS AND UTF MODES
|
||||||
ters, whether or not PCRE2_UCP is set.
|
ters, whether or not PCRE2_UCP is set.
|
||||||
|
|
||||||
|
|
||||||
CASE-EQUIVALENCE IN UTF MODE
|
UNICODE CASE-EQUIVALENCE
|
||||||
|
|
||||||
Case-insensitive matching in UTF mode makes use of Unicode properties
|
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing
|
||||||
except for characters whose code points are less than 128 and that have
|
makes use of Unicode properties except for characters whose code points
|
||||||
at most two case-equivalent values. For these, a direct table lookup is
|
are less than 128 and that have at most two case-equivalent values. For
|
||||||
used for speed. A few Unicode characters such as Greek sigma have more
|
these, a direct table lookup is used for speed. A few Unicode charac-
|
||||||
than two code points that are case-equivalent, and these are treated
|
ters such as Greek sigma have more than two code points that are case-
|
||||||
specially.
|
equivalent, and these are treated specially. Setting PCRE2_UCP without
|
||||||
|
PCRE2_UTF allows Unicode-style case processing for non-UTF character
|
||||||
|
encodings such as UCS-2.
|
||||||
|
|
||||||
|
|
||||||
SCRIPT RUNS
|
SCRIPT RUNS
|
||||||
|
@ -11294,8 +11310,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 24 May 2019
|
Last updated: 23 February 2020
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2020 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "16 February 2020" "PCRE2 10.35"
|
.TH PCRE2API 3 "24 February 2020" "PCRE2 10.35"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -1420,13 +1420,13 @@ documentation.
|
||||||
.sp
|
.sp
|
||||||
If this bit is set, letters in the pattern match both upper and lower case
|
If this bit is set, letters in the pattern match both upper and lower case
|
||||||
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
||||||
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
|
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
|
||||||
properties are used for all characters with more than one other case, and for
|
PCRE2_UCP is set, Unicode properties are used for all characters with more than
|
||||||
all characters whose code points are greater than U+007F. For lower valued
|
one other case, and for all characters whose code points are greater than
|
||||||
characters with only one other case, a lookup table is used for speed. When
|
U+007F. For lower valued characters with only one other case, a lookup table is
|
||||||
PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
|
used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
|
||||||
and higher code points (available only in 16-bit or 32-bit mode) are treated as
|
used for all code points less than 256, and higher code points (available only
|
||||||
not having another case.
|
in 16-bit or 32-bit mode) are treated as not having another case.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_DOLLAR_ENDONLY
|
PCRE2_DOLLAR_ENDONLY
|
||||||
.sp
|
.sp
|
||||||
|
@ -1769,10 +1769,11 @@ are not representable in UTF-16.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_UCP
|
PCRE2_UCP
|
||||||
.sp
|
.sp
|
||||||
This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
|
This option has two effects. Firstly, it changes the way PCRE2 processes \eB,
|
||||||
\ew, and some of the POSIX character classes. By default, only ASCII characters
|
\eb, \eD, \ed, \eS, \es, \eW, \ew, and some of the POSIX character classes. By
|
||||||
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
|
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
|
||||||
classify characters. More details are given in the section on
|
properties are used instead to classify characters. More details are given in
|
||||||
|
the section on
|
||||||
.\" HTML <a href="pcre2pattern.html#genericchartypes">
|
.\" HTML <a href="pcre2pattern.html#genericchartypes">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
generic character types
|
generic character types
|
||||||
|
@ -1782,8 +1783,13 @@ in the
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
||||||
longer. The option is available only if PCRE2 has been compiled with Unicode
|
longer.
|
||||||
support (which is the default).
|
.P
|
||||||
|
The second effect of PCRE2_UCP is to force the use of Unicode properties for
|
||||||
|
upper/lower casing operations on characters with code points greater than 127,
|
||||||
|
even when PCRE2_UTF is not set. This makes it possible, for example, to process
|
||||||
|
strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
|
||||||
|
been compiled with Unicode support (which is the default).
|
||||||
.sp
|
.sp
|
||||||
PCRE2_UNGREEDY
|
PCRE2_UNGREEDY
|
||||||
.sp
|
.sp
|
||||||
|
@ -1957,13 +1963,18 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
|
||||||
digits, or whatever, by reference to a set of tables, indexed by character code
|
digits, or whatever, by reference to a set of tables, indexed by character code
|
||||||
point. However, this applies only to characters whose code points are less than
|
point. However, this applies only to characters whose code points are less than
|
||||||
256. By default, higher-valued code points never match escapes such as \ew or
|
256. By default, higher-valued code points never match escapes such as \ew or
|
||||||
\ed. When PCRE2 is built with Unicode support (the default), all characters can
|
\ed.
|
||||||
be tested with \ep and \eP, or, alternatively, the PCRE2_UCP option can be set
|
.P
|
||||||
when a pattern is compiled; this causes \ew and friends to use Unicode property
|
When PCRE2 is built with Unicode support (the default), the Unicode properties
|
||||||
support instead of the built-in tables.
|
of all characters can be tested with \ep and \eP, or, alternatively, the
|
||||||
|
PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and
|
||||||
|
friends to use Unicode property support instead of the built-in tables.
|
||||||
|
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
||||||
|
points greater than 127 to use Unicode properties. These effects apply even
|
||||||
|
when PCRE2_UTF is not set.
|
||||||
.P
|
.P
|
||||||
The use of locales with Unicode is discouraged. If you are handling characters
|
The use of locales with Unicode is discouraged. If you are handling characters
|
||||||
with code points greater than 128, you should either use Unicode support, or
|
with code points greater than 127, you should either use Unicode support, or
|
||||||
use locales, but not try to mix the two.
|
use locales, but not try to mix the two.
|
||||||
.P
|
.P
|
||||||
PCRE2 contains a built-in set of character tables that are used by default.
|
PCRE2 contains a built-in set of character tables that are used by default.
|
||||||
|
@ -3495,7 +3506,10 @@ terminating a \eQ quoted sequence) reverts to no case forcing. The sequences
|
||||||
\eu and \el force the next character (if it is a letter) to upper or lower
|
\eu and \el force the next character (if it is a letter) to upper or lower
|
||||||
case, respectively, and then the state automatically reverts to no case
|
case, respectively, and then the state automatically reverts to no case
|
||||||
forcing. Case forcing applies to all inserted characters, including those from
|
forcing. Case forcing applies to all inserted characters, including those from
|
||||||
capture groups and letters within \eQ...\eE quoted sequences.
|
capture groups and letters within \eQ...\eE quoted sequences. If either
|
||||||
|
PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
|
||||||
|
properties are used for case forcing characters whose code points are greater
|
||||||
|
than 127.
|
||||||
.P
|
.P
|
||||||
Note that case forcing sequences such as \eU...\eE do not nest. For example,
|
Note that case forcing sequences such as \eU...\eE do not nest. For example,
|
||||||
the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
|
the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
|
||||||
|
@ -3923,6 +3937,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 16 February 2020
|
Last updated: 24 February 2020
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2020 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "27 January 2020" "PCRE2 10.35"
|
.TH PCRE2PATTERN 3 "24 February 2020" "PCRE2 10.35"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -75,7 +75,8 @@ Another special sequence that may appear at the start of a pattern is (*UCP).
|
||||||
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
||||||
such as \ed and \ew to use Unicode properties to determine character types,
|
such as \ed and \ew to use Unicode properties to determine character types,
|
||||||
instead of recognizing only characters with codes less than 256 via a lookup
|
instead of recognizing only characters with codes less than 256 via a lookup
|
||||||
table.
|
table. It also causes upper/lower casing operations to use Unicode properties
|
||||||
|
for characters with code points greater than 127, even when UTF is not set.
|
||||||
.P
|
.P
|
||||||
Some applications that allow their users to supply patterns may wish to
|
Some applications that allow their users to supply patterns may wish to
|
||||||
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
|
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
|
||||||
|
@ -3876,6 +3877,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 27 January 2020
|
Last updated: 24 February 2020
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2020 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34"
|
.TH PCRE2UNICODE 3 "23 February 2020" "PCRE2 10.35"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE - Perl-compatible regular expressions (revised API)
|
PCRE - Perl-compatible regular expressions (revised API)
|
||||||
.SH "UNICODE AND UTF SUPPORT"
|
.SH "UNICODE AND UTF SUPPORT"
|
||||||
|
@ -7,7 +7,7 @@ PCRE - Perl-compatible regular expressions (revised API)
|
||||||
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
||||||
can build it without, in which case the library will be smaller. With Unicode
|
can build it without, in which case the library will be smaller. With Unicode
|
||||||
support, PCRE2 has knowledge of Unicode character properties and can process
|
support, PCRE2 has knowledge of Unicode character properties and can process
|
||||||
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
|
strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
|
||||||
width), but this is not the default. Unless specifically requested, PCRE2
|
width), but this is not the default. Unless specifically requested, PCRE2
|
||||||
treats each code unit in a string as one character.
|
treats each code unit in a string as one character.
|
||||||
.P
|
.P
|
||||||
|
@ -126,14 +126,16 @@ However, the special horizontal and vertical white space matching escapes (\eh,
|
||||||
not PCRE2_UCP is set.
|
not PCRE2_UCP is set.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "CASE-EQUIVALENCE IN UTF MODE"
|
.SH "UNICODE CASE-EQUIVALENCE"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Case-insensitive matching in UTF mode makes use of Unicode properties except
|
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
|
||||||
for characters whose code points are less than 128 and that have at most two
|
of Unicode properties except for characters whose code points are less than 128
|
||||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
and that have at most two case-equivalent values. For these, a direct table
|
||||||
few Unicode characters such as Greek sigma have more than two code points that
|
lookup is used for speed. A few Unicode characters such as Greek sigma have
|
||||||
are case-equivalent, and these are treated specially.
|
more than two code points that are case-equivalent, and these are treated
|
||||||
|
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
|
||||||
|
processing for non-UTF character encodings such as UCS-2.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="scriptruns"></a>
|
.\" HTML <a name="scriptruns"></a>
|
||||||
|
@ -455,6 +457,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 24 May 2019
|
Last updated: 23 February 2020
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2020 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|