Documentation for PCRE2_UCP handling of upper/lower casing.

Philip.Hazel 2020-02-24 16:35:15 +00:00
parent f50ee03f5d
commit 4e8f13cbd6
7 changed files with 153 additions and 101 deletions

@ -1481,13 +1481,13 @@ documentation.
</pre> </pre>
If this bit is set, letters in the pattern match both upper and lower case If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be letters in the subject. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
properties are used for all characters with more than one other case, and for PCRE2_UCP is set, Unicode properties are used for all characters with more than
all characters whose code points are greater than U+007F. For lower valued one other case, and for all characters whose code points are greater than
characters with only one other case, a lookup table is used for speed. When U+007F. For lower valued characters with only one other case, a lookup table is
PCRE2_UTF is not set, a lookup table is used for all code points less than 256, used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
and higher code points (available only in 16-bit or 32-bit mode) are treated as used for all code points less than 256, and higher code points (available only
not having another case. in 16-bit or 32-bit mode) are treated as not having another case.
<pre> <pre>
PCRE2_DOLLAR_ENDONLY PCRE2_DOLLAR_ENDONLY
</pre> </pre>
@ -1820,16 +1820,23 @@ are not representable in UTF-16.
<pre> <pre>
PCRE2_UCP PCRE2_UCP
</pre> </pre>
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W, This option has two effects. Firstly, it changes the way PCRE2 processes \B,
\w, and some of the POSIX character classes. By default, only ASCII characters \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
classify characters. More details are given in the section on properties are used instead to classify characters. More details are given in
the section on
<a href="pcre2pattern.html#genericchartypes">generic character types</a> <a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with Unicode longer.
support (which is the default). </P>
<P>
The second effect of PCRE2_UCP is to force the use of Unicode properties for
upper/lower casing operations on characters with code points greater than 127,
even when PCRE2_UTF is not set. This makes it possible, for example, to process
strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
been compiled with Unicode support (which is the default).
<pre> <pre>
PCRE2_UNGREEDY PCRE2_UNGREEDY
</pre> </pre>
@ -1997,14 +2004,20 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code digits, or whatever, by reference to a set of tables, indexed by character code
point. However, this applies only to characters whose code points are less than point. However, this applies only to characters whose code points are less than
256. By default, higher-valued code points never match escapes such as \w or 256. By default, higher-valued code points never match escapes such as \w or
\d. When PCRE2 is built with Unicode support (the default), all characters can \d.
be tested with \p and \P, or, alternatively, the PCRE2_UCP option can be set </P>
when a pattern is compiled; this causes \w and friends to use Unicode property <P>
support instead of the built-in tables. When PCRE2 is built with Unicode support (the default), the Unicode properties
of all characters can be tested with \p and \P, or, alternatively, the
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
friends to use Unicode property support instead of the built-in tables.
PCRE2_UCP also causes upper/lower casing operations on characters with code
points greater than 127 to use Unicode properties. These effects apply even
when PCRE2_UTF is not set.
</P> </P>
<P> <P>
The use of locales with Unicode is discouraged. If you are handling characters The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 128, you should either use Unicode support, or with code points greater than 127, you should either use Unicode support, or
use locales, but not try to mix the two. use locales, but not try to mix the two.
</P> </P>
<P> <P>
@ -3494,7 +3507,10 @@ terminating a \Q quoted sequence) reverts to no case forcing. The sequences
\u and \l force the next character (if it is a letter) to upper or lower \u and \l force the next character (if it is a letter) to upper or lower
case, respectively, and then the state automatically reverts to no case case, respectively, and then the state automatically reverts to no case
forcing. Case forcing applies to all inserted characters, including those from forcing. Case forcing applies to all inserted characters, including those from
capture groups and letters within \Q...\E quoted sequences. capture groups and letters within \Q...\E quoted sequences. If either
PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
properties are used for case forcing characters whose code points are greater
than 127.
</P> </P>
<P> <P>
Note that case forcing sequences such as \U...\E do not nest. For example, Note that case forcing sequences such as \U...\E do not nest. For example,
@ -3915,7 +3931,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 16 February 2020 Last updated: 24 February 2020
<br> <br>
Copyright &copy; 1997-2020 University of Cambridge. Copyright &copy; 1997-2020 University of Cambridge.
<br> <br>
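Note: the revised PCRE2_CASELESS text above amounts to a small decision rule. The following is a minimal sketch of that rule as worded in the documentation, not PCRE2's actual implementation; the names `CasingPath` and `casing_path` and the `other_cases` parameter are hypothetical and exist only for illustration.

```python
from enum import Enum


class CasingPath(Enum):
    """Hypothetical labels for the three behaviours the text describes."""
    LOOKUP_TABLE = "lookup table"
    UNICODE_PROPERTIES = "Unicode properties"
    NO_OTHER_CASE = "treated as having no other case"


def casing_path(code_point: int, other_cases: int, utf: bool, ucp: bool) -> CasingPath:
    """Model which mechanism the documentation says is used for caseless matching.

    other_cases is the number of other case-equivalent code points the
    character has (an illustrative input; PCRE2 derives this internally).
    """
    if utf or ucp:
        # Either option set: Unicode properties for every code point above
        # U+007F and for any character with more than one other case.
        if code_point > 0x7F or other_cases > 1:
            return CasingPath.UNICODE_PROPERTIES
        # Low code points with at most one other case use the table for speed.
        return CasingPath.LOOKUP_TABLE
    # Neither option set: the table covers code points below 256 only.
    if code_point < 256:
        return CasingPath.LOOKUP_TABLE
    return CasingPath.NO_OTHER_CASE


if __name__ == "__main__":
    # U+00E9 with PCRE2_UCP but without PCRE2_UTF: the case this commit documents.
    print(casing_path(0xE9, 1, utf=False, ucp=True).value)
    # The same character with neither option set.
    print(casing_path(0xE9, 1, utf=False, ucp=False).value)
```

The only input that changes with this commit's wording is `ucp`: before it, the first branch was documented as depending on PCRE2_UTF alone.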

@ -114,7 +114,8 @@ Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types, such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 256 via a lookup instead of recognizing only characters with codes less than 256 via a lookup
table. table. It also causes upper/lower casing operations to use Unicode properties
for characters with code points greater than 127, even when UTF is not set.
</P> </P>
<P> <P>
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
@ -3833,7 +3834,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 27 January 2020 Last updated: 24 February 2020
<br> <br>
Copyright &copy; 1997-2020 University of Cambridge. Copyright &copy; 1997-2020 University of Cambridge.
<br> <br>

@ -19,7 +19,7 @@ UNICODE AND UTF SUPPORT
PCRE2 is normally built with Unicode support, though if you do not need it, you PCRE2 is normally built with Unicode support, though if you do not need it, you
can build it without, in which case the library will be smaller. With Unicode can build it without, in which case the library will be smaller. With Unicode
support, PCRE2 has knowledge of Unicode character properties and can process support, PCRE2 has knowledge of Unicode character properties and can process
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
width), but this is not the default. Unless specifically requested, PCRE2 width), but this is not the default. Unless specifically requested, PCRE2
treats each code unit in a string as one character. treats each code unit in a string as one character.
</P> </P>
@ -134,14 +134,16 @@ However, the special horizontal and vertical white space matching escapes (\h,
not PCRE2_UCP is set. not PCRE2_UCP is set.
</P> </P>
<br><b> <br><b>
CASE-EQUIVALENCE IN UTF MODE UNICODE CASE-EQUIVALENCE
</b><br> </b><br>
<P> <P>
Case-insensitive matching in UTF mode makes use of Unicode properties except If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
for characters whose code points are less than 128 and that have at most two of Unicode properties except for characters whose code points are less than 128
case-equivalent values. For these, a direct table lookup is used for speed. A and that have at most two case-equivalent values. For these, a direct table
few Unicode characters such as Greek sigma have more than two code points that lookup is used for speed. A few Unicode characters such as Greek sigma have
are case-equivalent, and these are treated specially. more than two code points that are case-equivalent, and these are treated
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
processing for non-UTF character encodings such as UCS-2.
<a name="scriptruns"></a></P> <a name="scriptruns"></a></P>
<br><b> <br><b>
SCRIPT RUNS SCRIPT RUNS
@ -484,9 +486,9 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 24 May 2019 Last updated: 23 February 2020
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2020 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -1454,14 +1454,14 @@ COMPILING A PATTERN
If this bit is set, letters in the pattern match both upper and lower If this bit is set, letters in the pattern match both upper and lower
case letters in the subject. It is equivalent to Perl's /i option, and case letters in the subject. It is equivalent to Perl's /i option, and
it can be changed within a pattern by a (?i) option setting. If it can be changed within a pattern by a (?i) option setting. If either
PCRE2_UTF is set, Unicode properties are used for all characters with PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
more than one other case, and for all characters whose code points are characters with more than one other case, and for all characters whose
greater than U+007F. For lower valued characters with only one other code points are greater than U+007F. For lower valued characters with
case, a lookup table is used for speed. When PCRE2_UTF is not set, a only one other case, a lookup table is used for speed. When neither
lookup table is used for all code points less than 256, and higher code PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used for all code
points (available only in 16-bit or 32-bit mode) are treated as not points less than 256, and higher code points (available only in 16-bit
having another case. or 32-bit mode) are treated as not having another case.
PCRE2_DOLLAR_ENDONLY PCRE2_DOLLAR_ENDONLY
@ -1786,14 +1786,20 @@ COMPILING A PATTERN
PCRE2_UCP PCRE2_UCP
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W, This option has two effects. Firstly, it changes the way PCRE2 processes
\w, and some of the POSIX character classes. By default, only ASCII \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character
characters are recognized, but if PCRE2_UCP is set, Unicode properties classes. By default, only ASCII characters are recognized, but if
are used instead to classify characters. More details are given in the PCRE2_UCP is set, Unicode properties are used instead to classify char-
section on generic character types in the pcre2pattern page. If you set acters. More details are given in the section on generic character
PCRE2_UCP, matching one of the items it affects takes much longer. The types in the pcre2pattern page. If you set PCRE2_UCP, matching one of
option is available only if PCRE2 has been compiled with Unicode sup- the items it affects takes much longer.
port (which is the default).
The second effect of PCRE2_UCP is to force the use of Unicode proper-
ties for upper/lower casing operations on characters with code points
greater than 127, even when PCRE2_UTF is not set. This makes it possi-
ble, for example, to process strings in the 16-bit UCS-2 code. This op-
tion is available only if PCRE2 has been compiled with Unicode support
(which is the default).
PCRE2_UNGREEDY PCRE2_UNGREEDY
@ -1953,14 +1959,18 @@ LOCALE SUPPORT
letters, digits, or whatever, by reference to a set of tables, indexed letters, digits, or whatever, by reference to a set of tables, indexed
by character code point. However, this applies only to characters whose by character code point. However, this applies only to characters whose
code points are less than 256. By default, higher-valued code points code points are less than 256. By default, higher-valued code points
never match escapes such as \w or \d. When PCRE2 is built with Unicode never match escapes such as \w or \d.
support (the default), all characters can be tested with \p and \P, or,
alternatively, the PCRE2_UCP option can be set when a pattern is com- When PCRE2 is built with Unicode support (the default), the Unicode
piled; this causes \w and friends to use Unicode property support in- properties of all characters can be tested with \p and \P, or, alterna-
stead of the built-in tables. tively, the PCRE2_UCP option can be set when a pattern is compiled;
this causes \w and friends to use Unicode property support instead of
the built-in tables. PCRE2_UCP also causes upper/lower casing opera-
tions on characters with code points greater than 127 to use Unicode
properties. These effects apply even when PCRE2_UTF is not set.
The use of locales with Unicode is discouraged. If you are handling The use of locales with Unicode is discouraged. If you are handling
characters with code points greater than 128, you should either use characters with code points greater than 127, you should either use
Unicode support, or use locales, but not try to mix the two. Unicode support, or use locales, but not try to mix the two.
PCRE2 contains a built-in set of character tables that are used by de- PCRE2 contains a built-in set of character tables that are used by de-
@ -3375,7 +3385,9 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
it is a letter) to upper or lower case, respectively, and then the it is a letter) to upper or lower case, respectively, and then the
state automatically reverts to no case forcing. Case forcing applies to state automatically reverts to no case forcing. Case forcing applies to
all inserted characters, including those from capture groups and let- all inserted characters, including those from capture groups and let-
ters within \Q...\E quoted sequences. ters within \Q...\E quoted sequences. If either PCRE2_UTF or PCRE2_UCP
was set when the pattern was compiled, Unicode properties are used for
case forcing characters whose code points are greater than 127.
Note that case forcing sequences such as \U...\E do not nest. For exam- Note that case forcing sequences such as \U...\E do not nest. For exam-
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
@ -3761,7 +3773,7 @@ AUTHOR
REVISION REVISION
Last updated: 16 February 2020 Last updated: 24 February 2020
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -6145,7 +6157,9 @@ SPECIAL START-OF-PATTERN ITEMS
(*UCP). This has the same effect as setting the PCRE2_UCP option: it (*UCP). This has the same effect as setting the PCRE2_UCP option: it
causes sequences such as \d and \w to use Unicode properties to deter- causes sequences such as \d and \w to use Unicode properties to deter-
mine character types, instead of recognizing only characters with codes mine character types, instead of recognizing only characters with codes
less than 256 via a lookup table. less than 256 via a lookup table. It also causes upper/lower casing
operations to use Unicode properties for characters with code points
greater than 127, even when UTF is not set.
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is restrict them for security reasons. If the PCRE2_NEVER_UCP option is
@ -9502,7 +9516,7 @@ AUTHOR
REVISION REVISION
Last updated: 27 January 2020 Last updated: 24 February 2020
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -10878,7 +10892,7 @@ UNICODE AND UTF SUPPORT
PCRE2 is normally built with Unicode support, though if you do not need PCRE2 is normally built with Unicode support, though if you do not need
it, you can build it without, in which case the library will be it, you can build it without, in which case the library will be
smaller. With Unicode support, PCRE2 has knowledge of Unicode character smaller. With Unicode support, PCRE2 has knowledge of Unicode character
properties and can process text strings in UTF-8, UTF-16, or UTF-32 properties and can process strings of text in UTF-8, UTF-16, and UTF-32
format (depending on the code unit width), but this is not the default. format (depending on the code unit width), but this is not the default.
Unless specifically requested, PCRE2 treats each code unit in a string Unless specifically requested, PCRE2 treats each code unit in a string
as one character. as one character.
@ -10974,14 +10988,16 @@ WIDE CHARACTERS AND UTF MODES
ters, whether or not PCRE2_UCP is set. ters, whether or not PCRE2_UCP is set.
CASE-EQUIVALENCE IN UTF MODE UNICODE CASE-EQUIVALENCE
Case-insensitive matching in UTF mode makes use of Unicode properties If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing
except for characters whose code points are less than 128 and that have makes use of Unicode properties except for characters whose code points
at most two case-equivalent values. For these, a direct table lookup is are less than 128 and that have at most two case-equivalent values. For
used for speed. A few Unicode characters such as Greek sigma have more these, a direct table lookup is used for speed. A few Unicode charac-
than two code points that are case-equivalent, and these are treated ters such as Greek sigma have more than two code points that are case-
specially. equivalent, and these are treated specially. Setting PCRE2_UCP without
PCRE2_UTF allows Unicode-style case processing for non-UTF character
encodings such as UCS-2.
SCRIPT RUNS SCRIPT RUNS
@ -11294,8 +11310,8 @@ AUTHOR
REVISION REVISION
Last updated: 24 May 2019 Last updated: 23 February 2020
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2API 3 "16 February 2020" "PCRE2 10.35" .TH PCRE2API 3 "24 February 2020" "PCRE2 10.35"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -1420,13 +1420,13 @@ documentation.
.sp .sp
If this bit is set, letters in the pattern match both upper and lower case If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be letters in the subject. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
properties are used for all characters with more than one other case, and for PCRE2_UCP is set, Unicode properties are used for all characters with more than
all characters whose code points are greater than U+007F. For lower valued one other case, and for all characters whose code points are greater than
characters with only one other case, a lookup table is used for speed. When U+007F. For lower valued characters with only one other case, a lookup table is
PCRE2_UTF is not set, a lookup table is used for all code points less than 256, used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
and higher code points (available only in 16-bit or 32-bit mode) are treated as used for all code points less than 256, and higher code points (available only
not having another case. in 16-bit or 32-bit mode) are treated as not having another case.
.sp .sp
PCRE2_DOLLAR_ENDONLY PCRE2_DOLLAR_ENDONLY
.sp .sp
@ -1769,10 +1769,11 @@ are not representable in UTF-16.
.sp .sp
PCRE2_UCP PCRE2_UCP
.sp .sp
This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW, This option has two effects. Firstly, it changes the way PCRE2 processes \eB,
\ew, and some of the POSIX character classes. By default, only ASCII characters \eb, \eD, \ed, \eS, \es, \eW, \ew, and some of the POSIX character classes. By
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
classify characters. More details are given in the section on properties are used instead to classify characters. More details are given in
the section on
.\" HTML <a href="pcre2pattern.html#genericchartypes"> .\" HTML <a href="pcre2pattern.html#genericchartypes">
.\" </a> .\" </a>
generic character types generic character types
@ -1782,8 +1783,13 @@ in the
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
page. If you set PCRE2_UCP, matching one of the items it affects takes much page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with Unicode longer.
support (which is the default). .P
The second effect of PCRE2_UCP is to force the use of Unicode properties for
upper/lower casing operations on characters with code points greater than 127,
even when PCRE2_UTF is not set. This makes it possible, for example, to process
strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
been compiled with Unicode support (which is the default).
.sp .sp
PCRE2_UNGREEDY PCRE2_UNGREEDY
.sp .sp
@ -1957,13 +1963,18 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code digits, or whatever, by reference to a set of tables, indexed by character code
point. However, this applies only to characters whose code points are less than point. However, this applies only to characters whose code points are less than
256. By default, higher-valued code points never match escapes such as \ew or 256. By default, higher-valued code points never match escapes such as \ew or
\ed. When PCRE2 is built with Unicode support (the default), all characters can \ed.
be tested with \ep and \eP, or, alternatively, the PCRE2_UCP option can be set .P
when a pattern is compiled; this causes \ew and friends to use Unicode property When PCRE2 is built with Unicode support (the default), the Unicode properties
support instead of the built-in tables. of all characters can be tested with \ep and \eP, or, alternatively, the
PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and
friends to use Unicode property support instead of the built-in tables.
PCRE2_UCP also causes upper/lower casing operations on characters with code
points greater than 127 to use Unicode properties. These effects apply even
when PCRE2_UTF is not set.
.P .P
The use of locales with Unicode is discouraged. If you are handling characters The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 128, you should either use Unicode support, or with code points greater than 127, you should either use Unicode support, or
use locales, but not try to mix the two. use locales, but not try to mix the two.
.P .P
PCRE2 contains a built-in set of character tables that are used by default. PCRE2 contains a built-in set of character tables that are used by default.
@ -3495,7 +3506,10 @@ terminating a \eQ quoted sequence) reverts to no case forcing. The sequences
\eu and \el force the next character (if it is a letter) to upper or lower \eu and \el force the next character (if it is a letter) to upper or lower
case, respectively, and then the state automatically reverts to no case case, respectively, and then the state automatically reverts to no case
forcing. Case forcing applies to all inserted characters, including those from forcing. Case forcing applies to all inserted characters, including those from
capture groups and letters within \eQ...\eE quoted sequences. capture groups and letters within \eQ...\eE quoted sequences. If either
PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
properties are used for case forcing characters whose code points are greater
than 127.
.P .P
Note that case forcing sequences such as \eU...\eE do not nest. For example, Note that case forcing sequences such as \eU...\eE do not nest. For example,
the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
@ -3923,6 +3937,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 16 February 2020 Last updated: 24 February 2020
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "27 January 2020" "PCRE2 10.35" .TH PCRE2PATTERN 3 "24 February 2020" "PCRE2 10.35"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -75,7 +75,8 @@ Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types, such as \ed and \ew to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 256 via a lookup instead of recognizing only characters with codes less than 256 via a lookup
table. table. It also causes upper/lower casing operations to use Unicode properties
for characters with code points greater than 127, even when UTF is not set.
.P .P
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
@ -3876,6 +3877,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 27 January 2020 Last updated: 24 February 2020
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34" .TH PCRE2UNICODE 3 "23 February 2020" "PCRE2 10.35"
.SH NAME .SH NAME
PCRE - Perl-compatible regular expressions (revised API) PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT" .SH "UNICODE AND UTF SUPPORT"
@ -7,7 +7,7 @@ PCRE - Perl-compatible regular expressions (revised API)
PCRE2 is normally built with Unicode support, though if you do not need it, you PCRE2 is normally built with Unicode support, though if you do not need it, you
can build it without, in which case the library will be smaller. With Unicode can build it without, in which case the library will be smaller. With Unicode
support, PCRE2 has knowledge of Unicode character properties and can process support, PCRE2 has knowledge of Unicode character properties and can process
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
width), but this is not the default. Unless specifically requested, PCRE2 width), but this is not the default. Unless specifically requested, PCRE2
treats each code unit in a string as one character. treats each code unit in a string as one character.
.P .P
@ -126,14 +126,16 @@ However, the special horizontal and vertical white space matching escapes (\eh,
not PCRE2_UCP is set. not PCRE2_UCP is set.
. .
. .
.SH "CASE-EQUIVALENCE IN UTF MODE" .SH "UNICODE CASE-EQUIVALENCE"
.rs .rs
.sp .sp
Case-insensitive matching in UTF mode makes use of Unicode properties except If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
for characters whose code points are less than 128 and that have at most two of Unicode properties except for characters whose code points are less than 128
case-equivalent values. For these, a direct table lookup is used for speed. A and that have at most two case-equivalent values. For these, a direct table
few Unicode characters such as Greek sigma have more than two code points that lookup is used for speed. A few Unicode characters such as Greek sigma have
are case-equivalent, and these are treated specially. more than two code points that are case-equivalent, and these are treated
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
processing for non-UTF character encodings such as UCS-2.
. .
. .
.\" HTML <a name="scriptruns"></a> .\" HTML <a name="scriptruns"></a>
@ -455,6 +457,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 24 May 2019 Last updated: 23 February 2020
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
.fi .fi