Documentation for PCRE2_UCP handling of upper/lower casing.
Parent: f50ee03f5d
Commit: 4e8f13cbd6
|
@ -1481,13 +1481,13 @@ documentation.
|
|||
</pre>
|
||||
If this bit is set, letters in the pattern match both upper and lower case
|
||||
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
||||
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
|
||||
properties are used for all characters with more than one other case, and for
|
||||
all characters whose code points are greater than U+007F. For lower valued
|
||||
characters with only one other case, a lookup table is used for speed. When
|
||||
PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
|
||||
and higher code points (available only in 16-bit or 32-bit mode) are treated as
|
||||
not having another case.
|
||||
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
|
||||
PCRE2_UCP is set, Unicode properties are used for all characters with more than
|
||||
one other case, and for all characters whose code points are greater than
|
||||
U+007F. For lower valued characters with only one other case, a lookup table is
|
||||
used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
|
||||
used for all code points less than 256, and higher code points (available only
|
||||
in 16-bit or 32-bit mode) are treated as not having another case.
|
||||
<pre>
|
||||
PCRE2_DOLLAR_ENDONLY
|
||||
</pre>
|
||||
|
@ -1820,16 +1820,23 @@ are not representable in UTF-16.
|
|||
<pre>
|
||||
PCRE2_UCP
|
||||
</pre>
|
||||
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
|
||||
\w, and some of the POSIX character classes. By default, only ASCII characters
|
||||
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
|
||||
classify characters. More details are given in the section on
|
||||
This option has two effects. Firstly, it changes the way PCRE2 processes \B,
|
||||
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
|
||||
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
|
||||
properties are used instead to classify characters. More details are given in
|
||||
the section on
|
||||
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
||||
longer. The option is available only if PCRE2 has been compiled with Unicode
|
||||
support (which is the default).
|
||||
longer.
|
||||
</P>
|
||||
<P>
|
||||
The second effect of PCRE2_UCP is to force the use of Unicode properties for
|
||||
upper/lower casing operations on characters with code points greater than 127,
|
||||
even when PCRE2_UTF is not set. This makes it possible, for example, to process
|
||||
strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
|
||||
been compiled with Unicode support (which is the default).
|
||||
<pre>
|
||||
PCRE2_UNGREEDY
|
||||
</pre>
|
||||
|
@ -1997,14 +2004,20 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
|
|||
digits, or whatever, by reference to a set of tables, indexed by character code
|
||||
point. However, this applies only to characters whose code points are less than
|
||||
256. By default, higher-valued code points never match escapes such as \w or
|
||||
\d. When PCRE2 is built with Unicode support (the default), all characters can
|
||||
be tested with \p and \P, or, alternatively, the PCRE2_UCP option can be set
|
||||
when a pattern is compiled; this causes \w and friends to use Unicode property
|
||||
support instead of the built-in tables.
|
||||
\d.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2 is built with Unicode support (the default), the Unicode properties
|
||||
of all characters can be tested with \p and \P, or, alternatively, the
|
||||
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
|
||||
friends to use Unicode property support instead of the built-in tables.
|
||||
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
||||
points greater than 127 to use Unicode properties. These effects apply even
|
||||
when PCRE2_UTF is not set.
|
||||
</P>
|
||||
<P>
|
||||
The use of locales with Unicode is discouraged. If you are handling characters
|
||||
with code points greater than 128, you should either use Unicode support, or
|
||||
with code points greater than 127, you should either use Unicode support, or
|
||||
use locales, but not try to mix the two.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -3494,7 +3507,10 @@ terminating a \Q quoted sequence) reverts to no case forcing. The sequences
|
|||
\u and \l force the next character (if it is a letter) to upper or lower
|
||||
case, respectively, and then the state automatically reverts to no case
|
||||
forcing. Case forcing applies to all inserted characters, including those from
|
||||
capture groups and letters within \Q...\E quoted sequences.
|
||||
capture groups and letters within \Q...\E quoted sequences. If either
|
||||
PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
|
||||
properties are used for case forcing characters whose code points are greater
|
||||
than 127.
|
||||
</P>
|
||||
<P>
|
||||
Note that case forcing sequences such as \U...\E do not nest. For example,
|
||||
|
@ -3915,7 +3931,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 16 February 2020
|
||||
Last updated: 24 February 2020
|
||||
<br>
|
||||
Copyright © 1997-2020 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -114,7 +114,8 @@ Another special sequence that may appear at the start of a pattern is (*UCP).
|
|||
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
||||
such as \d and \w to use Unicode properties to determine character types,
|
||||
instead of recognizing only characters with codes less than 256 via a lookup
|
||||
table.
|
||||
table. It also causes upper/lower casing operations to use Unicode properties
|
||||
for characters with code points greater than 127, even when UTF is not set.
|
||||
</P>
|
||||
<P>
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
|
@ -3833,7 +3834,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 27 January 2020
|
||||
Last updated: 24 February 2020
|
||||
<br>
|
||||
Copyright © 1997-2020 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -19,7 +19,7 @@ UNICODE AND UTF SUPPORT
|
|||
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
||||
can build it without, in which case the library will be smaller. With Unicode
|
||||
support, PCRE2 has knowledge of Unicode character properties and can process
|
||||
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
|
||||
strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
|
||||
width), but this is not the default. Unless specifically requested, PCRE2
|
||||
treats each code unit in a string as one character.
|
||||
</P>
|
||||
|
@ -134,14 +134,16 @@ However, the special horizontal and vertical white space matching escapes (\h,
|
|||
not PCRE2_UCP is set.
|
||||
</P>
|
||||
<br><b>
|
||||
CASE-EQUIVALENCE IN UTF MODE
|
||||
UNICODE CASE-EQUIVALENCE
|
||||
</b><br>
|
||||
<P>
|
||||
Case-insensitive matching in UTF mode makes use of Unicode properties except
|
||||
for characters whose code points are less than 128 and that have at most two
|
||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||
few Unicode characters such as Greek sigma have more than two code points that
|
||||
are case-equivalent, and these are treated specially.
|
||||
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
|
||||
of Unicode properties except for characters whose code points are less than 128
|
||||
and that have at most two case-equivalent values. For these, a direct table
|
||||
lookup is used for speed. A few Unicode characters such as Greek sigma have
|
||||
more than two code points that are case-equivalent, and these are treated
|
||||
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
|
||||
processing for non-UTF character encodings such as UCS-2.
|
||||
<a name="scriptruns"></a></P>
|
||||
<br><b>
|
||||
SCRIPT RUNS
|
||||
|
@ -484,9 +486,9 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 24 May 2019
|
||||
Last updated: 23 February 2020
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
Copyright © 1997-2020 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -1454,14 +1454,14 @@ COMPILING A PATTERN
|
|||
|
||||
If this bit is set, letters in the pattern match both upper and lower
|
||||
case letters in the subject. It is equivalent to Perl's /i option, and
|
||||
it can be changed within a pattern by a (?i) option setting. If
|
||||
PCRE2_UTF is set, Unicode properties are used for all characters with
|
||||
more than one other case, and for all characters whose code points are
|
||||
greater than U+007F. For lower valued characters with only one other
|
||||
case, a lookup table is used for speed. When PCRE2_UTF is not set, a
|
||||
lookup table is used for all code points less than 256, and higher code
|
||||
points (available only in 16-bit or 32-bit mode) are treated as not
|
||||
having another case.
|
||||
it can be changed within a pattern by a (?i) option setting. If either
|
||||
PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
|
||||
characters with more than one other case, and for all characters whose
|
||||
code points are greater than U+007F. For lower valued characters with
|
||||
only one other case, a lookup table is used for speed. When neither
|
||||
PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used for all code
|
||||
points less than 256, and higher code points (available only in 16-bit
|
||||
or 32-bit mode) are treated as not having another case.
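The selection rule described in the paragraph above can be summarized as a small decision function. This is only an illustrative sketch in Python, not PCRE2's implementation: the function name `case_lookup_method`, its parameters, and the returned labels are all hypothetical, and `other_case_count` stands for the number of other cases a character has according to Unicode.

```python
def case_lookup_method(code_point, utf=False, ucp=False, other_case_count=1):
    """Model of the documented rule for caseless matching.

    Hypothetical helper: it restates the prose above and does not
    call or reproduce any PCRE2 code.
    """
    if utf or ucp:
        # Unicode properties are used above U+007F, and for any
        # character that has more than one other case.
        if code_point > 0x7F or other_case_count > 1:
            return "unicode-properties"
        # Low-valued characters with only one other case use the table.
        return "lookup-table"
    # Neither PCRE2_UTF nor PCRE2_UCP: table below 256, nothing above.
    if code_point < 256:
        return "lookup-table"
    return "no-other-case"


# Capital sigma (U+03A3) in 16-bit mode, without and with PCRE2_UCP.
print(case_lookup_method(0x03A3))            # no-other-case
print(case_lookup_method(0x03A3, ucp=True))  # unicode-properties
```

The letter "k" is an example of a low-valued character with more than one other case ("K" and the Kelvin sign U+212A), which is why the sketch checks `other_case_count` before falling back to the table.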
|
||||
|
||||
PCRE2_DOLLAR_ENDONLY
|
||||
|
||||
|
@ -1786,14 +1786,20 @@ COMPILING A PATTERN
|
|||
|
||||
PCRE2_UCP
|
||||
|
||||
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
|
||||
\w, and some of the POSIX character classes. By default, only ASCII
|
||||
characters are recognized, but if PCRE2_UCP is set, Unicode properties
|
||||
are used instead to classify characters. More details are given in the
|
||||
section on generic character types in the pcre2pattern page. If you set
|
||||
PCRE2_UCP, matching one of the items it affects takes much longer. The
|
||||
option is available only if PCRE2 has been compiled with Unicode sup-
|
||||
port (which is the default).
|
||||
This option has two effects. Firstly, it changes the way PCRE2 processes
|
||||
\B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character
|
||||
classes. By default, only ASCII characters are recognized, but if
|
||||
PCRE2_UCP is set, Unicode properties are used instead to classify char-
|
||||
acters. More details are given in the section on generic character
|
||||
types in the pcre2pattern page. If you set PCRE2_UCP, matching one of
|
||||
the items it affects takes much longer.
|
||||
|
||||
The second effect of PCRE2_UCP is to force the use of Unicode proper-
|
||||
ties for upper/lower casing operations on characters with code points
|
||||
greater than 127, even when PCRE2_UTF is not set. This makes it possi-
|
||||
ble, for example, to process strings in the 16-bit UCS-2 code. This op-
|
||||
tion is available only if PCRE2 has been compiled with Unicode support
|
||||
(which is the default).
|
||||
|
||||
PCRE2_UNGREEDY
|
||||
|
||||
|
@ -1953,14 +1959,18 @@ LOCALE SUPPORT
|
|||
letters, digits, or whatever, by reference to a set of tables, indexed
|
||||
by character code point. However, this applies only to characters whose
|
||||
code points are less than 256. By default, higher-valued code points
|
||||
never match escapes such as \w or \d. When PCRE2 is built with Unicode
|
||||
support (the default), all characters can be tested with \p and \P, or,
|
||||
alternatively, the PCRE2_UCP option can be set when a pattern is com-
|
||||
piled; this causes \w and friends to use Unicode property support in-
|
||||
stead of the built-in tables.
|
||||
never match escapes such as \w or \d.
|
||||
|
||||
When PCRE2 is built with Unicode support (the default), the Unicode
|
||||
properties of all characters can be tested with \p and \P, or, alterna-
|
||||
tively, the PCRE2_UCP option can be set when a pattern is compiled;
|
||||
this causes \w and friends to use Unicode property support instead of
|
||||
the built-in tables. PCRE2_UCP also causes upper/lower casing opera-
|
||||
tions on characters with code points greater than 127 to use Unicode
|
||||
properties. These effects apply even when PCRE2_UTF is not set.
|
||||
|
||||
The use of locales with Unicode is discouraged. If you are handling
|
||||
characters with code points greater than 128, you should either use
|
||||
characters with code points greater than 127, you should either use
|
||||
Unicode support, or use locales, but not try to mix the two.
|
||||
|
||||
PCRE2 contains a built-in set of character tables that are used by de-
|
||||
|
@ -3375,7 +3385,9 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
|||
it is a letter) to upper or lower case, respectively, and then the
|
||||
state automatically reverts to no case forcing. Case forcing applies to
|
||||
all inserted characters, including those from capture groups and let-
|
||||
ters within \Q...\E quoted sequences.
|
||||
ters within \Q...\E quoted sequences. If either PCRE2_UTF or PCRE2_UCP
|
||||
was set when the pattern was compiled, Unicode properties are used for
|
||||
case forcing characters whose code points are greater than 127.
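The case forcing rules described above can be modelled with a short interpreter. This is a simplified sketch, not the PCRE2 substitution code: the function name `apply_case_forcing` is hypothetical, it handles only the \U, \L, \E, \u and \l sequences in a literal replacement string, and it relies on Python's own `str.upper` and `str.lower`, whose results may differ from PCRE2's for some characters.

```python
def apply_case_forcing(template):
    """Apply \\U, \\L, \\E, \\u and \\l to a literal replacement string.

    Hypothetical model of the documented behaviour; capture group
    insertion and \\Q...\\E quoting are deliberately left out.
    """
    out = []
    mode = None   # persistent forcing: "U", "L", or None
    once = None   # single-character forcing: "u", "l", or None
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template) and template[i + 1] in "ULEul":
            code = template[i + 1]
            if code in "UL":
                mode, once = code, None
            elif code == "E":
                mode, once = None, None
            else:
                # \u and \l affect one character, after which the state
                # reverts to no case forcing, as the text above states.
                mode, once = None, code
            i += 2
            continue
        if once == "u":
            ch, once = ch.upper(), None
        elif once == "l":
            ch, once = ch.lower(), None
        elif mode == "U":
            ch = ch.upper()
        elif mode == "L":
            ch = ch.lower()
        out.append(ch)
        i += 1
    return "".join(out)


print(apply_case_forcing(r"\uhello world"))
```

Because each \U or \L simply replaces the current state instead of being pushed onto a stack, the sequences do not nest in this model either.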
|
||||
|
||||
Note that case forcing sequences such as \U...\E do not nest. For exam-
|
||||
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
|
||||
|
@ -3761,7 +3773,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 16 February 2020
|
||||
Last updated: 24 February 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -6145,7 +6157,9 @@ SPECIAL START-OF-PATTERN ITEMS
|
|||
(*UCP). This has the same effect as setting the PCRE2_UCP option: it
|
||||
causes sequences such as \d and \w to use Unicode properties to deter-
|
||||
mine character types, instead of recognizing only characters with codes
|
||||
less than 256 via a lookup table.
|
||||
less than 256 via a lookup table. It also causes upper/lower casing
|
||||
operations to use Unicode properties for characters with code points
|
||||
greater than 127, even when UTF is not set.
|
||||
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them for security reasons. If the PCRE2_NEVER_UCP option is
|
||||
|
@ -9502,7 +9516,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 27 January 2020
|
||||
Last updated: 24 February 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -10878,7 +10892,7 @@ UNICODE AND UTF SUPPORT
|
|||
PCRE2 is normally built with Unicode support, though if you do not need
|
||||
it, you can build it without, in which case the library will be
|
||||
smaller. With Unicode support, PCRE2 has knowledge of Unicode character
|
||||
properties and can process text strings in UTF-8, UTF-16, or UTF-32
|
||||
properties and can process strings of text in UTF-8, UTF-16, and UTF-32
|
||||
format (depending on the code unit width), but this is not the default.
|
||||
Unless specifically requested, PCRE2 treats each code unit in a string
|
||||
as one character.
|
||||
|
@ -10974,14 +10988,16 @@ WIDE CHARACTERS AND UTF MODES
|
|||
ters, whether or not PCRE2_UCP is set.
|
||||
|
||||
|
||||
CASE-EQUIVALENCE IN UTF MODE
|
||||
UNICODE CASE-EQUIVALENCE
|
||||
|
||||
Case-insensitive matching in UTF mode makes use of Unicode properties
|
||||
except for characters whose code points are less than 128 and that have
|
||||
at most two case-equivalent values. For these, a direct table lookup is
|
||||
used for speed. A few Unicode characters such as Greek sigma have more
|
||||
than two code points that are case-equivalent, and these are treated
|
||||
specially.
|
||||
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing
|
||||
makes use of Unicode properties except for characters whose code points
|
||||
are less than 128 and that have at most two case-equivalent values. For
|
||||
these, a direct table lookup is used for speed. A few Unicode charac-
|
||||
ters such as Greek sigma have more than two code points that are case-
|
||||
equivalent, and these are treated specially. Setting PCRE2_UCP without
|
||||
PCRE2_UTF allows Unicode-style case processing for non-UTF character
|
||||
encodings such as UCS-2.
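The Greek sigma case mentioned above can be checked against Unicode data independently of PCRE2. The sketch below uses Python's built-in `str.casefold` as a stand-in for Unicode case folding; the helper name `case_equivalents` and the decision to scan only the Greek and Coptic block (U+0370 to U+03FF) are assumptions made for illustration, and nothing here reflects how PCRE2 stores its own tables.

```python
def case_equivalents(ch, first=0x0370, last=0x03FF):
    """Return the characters in [first, last] that case-fold like ch.

    Hypothetical helper based on Python's Unicode database, not on
    PCRE2's internal property tables.
    """
    folded = ch.casefold()
    return {chr(cp) for cp in range(first, last + 1)
            if chr(cp).casefold() == folded}


# Small sigma has three case-equivalent code points, not two.
sigma_set = case_equivalents("\u03c3")
print(sorted(f"U+{ord(c):04X}" for c in sigma_set))
```

The three code points are capital sigma (U+03A3), final small sigma (U+03C2) and small sigma (U+03C3), which is why a simple two-way upper/lower table is not sufficient for such characters.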
|
||||
|
||||
|
||||
SCRIPT RUNS
|
||||
|
@ -11294,8 +11310,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 24 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
Last updated: 23 February 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "16 February 2020" "PCRE2 10.35"
|
||||
.TH PCRE2API 3 "24 February 2020" "PCRE2 10.35"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -1420,13 +1420,13 @@ documentation.
|
|||
.sp
|
||||
If this bit is set, letters in the pattern match both upper and lower case
|
||||
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
||||
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
|
||||
properties are used for all characters with more than one other case, and for
|
||||
all characters whose code points are greater than U+007F. For lower valued
|
||||
characters with only one other case, a lookup table is used for speed. When
|
||||
PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
|
||||
and higher code points (available only in 16-bit or 32-bit mode) are treated as
|
||||
not having another case.
|
||||
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
|
||||
PCRE2_UCP is set, Unicode properties are used for all characters with more than
|
||||
one other case, and for all characters whose code points are greater than
|
||||
U+007F. For lower valued characters with only one other case, a lookup table is
|
||||
used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
|
||||
used for all code points less than 256, and higher code points (available only
|
||||
in 16-bit or 32-bit mode) are treated as not having another case.
|
||||
.sp
|
||||
PCRE2_DOLLAR_ENDONLY
|
||||
.sp
|
||||
|
@ -1769,10 +1769,11 @@ are not representable in UTF-16.
|
|||
.sp
|
||||
PCRE2_UCP
|
||||
.sp
|
||||
This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
|
||||
\ew, and some of the POSIX character classes. By default, only ASCII characters
|
||||
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
|
||||
classify characters. More details are given in the section on
|
||||
This option has two effects. Firstly, it changes the way PCRE2 processes \eB,
|
||||
\eb, \eD, \ed, \eS, \es, \eW, \ew, and some of the POSIX character classes. By
|
||||
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
|
||||
properties are used instead to classify characters. More details are given in
|
||||
the section on
|
||||
.\" HTML <a href="pcre2pattern.html#genericchartypes">
|
||||
.\" </a>
|
||||
generic character types
|
||||
|
@ -1782,8 +1783,13 @@ in the
|
|||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
||||
longer. The option is available only if PCRE2 has been compiled with Unicode
|
||||
support (which is the default).
|
||||
longer.
|
||||
.P
|
||||
The second effect of PCRE2_UCP is to force the use of Unicode properties for
|
||||
upper/lower casing operations on characters with code points greater than 127,
|
||||
even when PCRE2_UTF is not set. This makes it possible, for example, to process
|
||||
strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
|
||||
been compiled with Unicode support (which is the default).
|
||||
.sp
|
||||
PCRE2_UNGREEDY
|
||||
.sp
|
||||
|
@ -1957,13 +1963,18 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
|
|||
digits, or whatever, by reference to a set of tables, indexed by character code
|
||||
point. However, this applies only to characters whose code points are less than
|
||||
256. By default, higher-valued code points never match escapes such as \ew or
|
||||
\ed. When PCRE2 is built with Unicode support (the default), all characters can
|
||||
be tested with \ep and \eP, or, alternatively, the PCRE2_UCP option can be set
|
||||
when a pattern is compiled; this causes \ew and friends to use Unicode property
|
||||
support instead of the built-in tables.
|
||||
\ed.
|
||||
.P
|
||||
When PCRE2 is built with Unicode support (the default), the Unicode properties
|
||||
of all characters can be tested with \ep and \eP, or, alternatively, the
|
||||
PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and
|
||||
friends to use Unicode property support instead of the built-in tables.
|
||||
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
||||
points greater than 127 to use Unicode properties. These effects apply even
|
||||
when PCRE2_UTF is not set.
|
||||
.P
|
||||
The use of locales with Unicode is discouraged. If you are handling characters
|
||||
with code points greater than 128, you should either use Unicode support, or
|
||||
with code points greater than 127, you should either use Unicode support, or
|
||||
use locales, but not try to mix the two.
|
||||
.P
|
||||
PCRE2 contains a built-in set of character tables that are used by default.
|
||||
|
@ -3495,7 +3506,10 @@ terminating a \eQ quoted sequence) reverts to no case forcing. The sequences
|
|||
\eu and \el force the next character (if it is a letter) to upper or lower
|
||||
case, respectively, and then the state automatically reverts to no case
|
||||
forcing. Case forcing applies to all inserted characters, including those from
|
||||
capture groups and letters within \eQ...\eE quoted sequences.
|
||||
capture groups and letters within \eQ...\eE quoted sequences. If either
|
||||
PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
|
||||
properties are used for case forcing characters whose code points are greater
|
||||
than 127.
|
||||
.P
|
||||
Note that case forcing sequences such as \eU...\eE do not nest. For example,
|
||||
the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
|
||||
|
@ -3923,6 +3937,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 16 February 2020
|
||||
Last updated: 24 February 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "27 January 2020" "PCRE2 10.35"
|
||||
.TH PCRE2PATTERN 3 "24 February 2020" "PCRE2 10.35"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -75,7 +75,8 @@ Another special sequence that may appear at the start of a pattern is (*UCP).
|
|||
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
||||
such as \ed and \ew to use Unicode properties to determine character types,
|
||||
instead of recognizing only characters with codes less than 256 via a lookup
|
||||
table.
|
||||
table. It also causes upper/lower casing operations to use Unicode properties
|
||||
for characters with code points greater than 127, even when UTF is not set.
|
||||
.P
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
|
||||
|
@ -3876,6 +3877,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 27 January 2020
|
||||
Last updated: 24 February 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34"
|
||||
.TH PCRE2UNICODE 3 "23 February 2020" "PCRE2 10.35"
|
||||
.SH NAME
|
||||
PCRE - Perl-compatible regular expressions (revised API)
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
|
@ -7,7 +7,7 @@ PCRE - Perl-compatible regular expressions (revised API)
|
|||
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
||||
can build it without, in which case the library will be smaller. With Unicode
|
||||
support, PCRE2 has knowledge of Unicode character properties and can process
|
||||
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
|
||||
strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
|
||||
width), but this is not the default. Unless specifically requested, PCRE2
|
||||
treats each code unit in a string as one character.
|
||||
.P
|
||||
|
@ -126,14 +126,16 @@ However, the special horizontal and vertical white space matching escapes (\eh,
|
|||
not PCRE2_UCP is set.
|
||||
.
|
||||
.
|
||||
.SH "CASE-EQUIVALENCE IN UTF MODE"
|
||||
.SH "UNICODE CASE-EQUIVALENCE"
|
||||
.rs
|
||||
.sp
|
||||
Case-insensitive matching in UTF mode makes use of Unicode properties except
|
||||
for characters whose code points are less than 128 and that have at most two
|
||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||
few Unicode characters such as Greek sigma have more than two code points that
|
||||
are case-equivalent, and these are treated specially.
|
||||
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
|
||||
of Unicode properties except for characters whose code points are less than 128
|
||||
and that have at most two case-equivalent values. For these, a direct table
|
||||
lookup is used for speed. A few Unicode characters such as Greek sigma have
|
||||
more than two code points that are case-equivalent, and these are treated
|
||||
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
|
||||
processing for non-UTF character encodings such as UCS-2.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="scriptruns"></a>
|
||||
|
@ -455,6 +457,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 24 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
Last updated: 23 February 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
.fi
|
||||
|
|