Documentation for Bidi_Control and Bidi_Class
parent 0246c6bf64
commit 30abd0ac8d
|
@ -39,6 +39,8 @@ pcre2_substitute(), and the replacement argument of the latter, if the pointer
|
||||||
is NULL and the length is zero, treat as an empty string. Apparently a number
|
is NULL and the length is zero, treat as an empty string. Apparently a number
|
||||||
of applications treat NULL/0 in this way.
|
of applications treat NULL/0 in this way.
|
||||||
|
|
||||||
|
14. Added support for Bidi_Class and Bidi_Control Unicode properties.
|
||||||
|
|
||||||
|
|
||||||
Version 10.39 29-October-2021
|
Version 10.39 29-October-2021
|
||||||
-----------------------------
|
-----------------------------
|
||||||
|
|
|
@ -2055,8 +2055,8 @@ point. However, this applies only to characters whose code points are less than
|
||||||
\d.
|
\d.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When PCRE2 is built with Unicode support (the default), the Unicode properties
|
When PCRE2 is built with Unicode support (the default), certain Unicode
|
||||||
of all characters can be tested with \p and \P, or, alternatively, the
|
character properties can be tested with \p and \P, or, alternatively, the
|
||||||
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
|
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
|
||||||
friends to use Unicode property support instead of the built-in tables.
|
friends to use Unicode property support instead of the built-in tables.
|
||||||
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
||||||
|
@ -4018,7 +4018,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 30 November 2021
|
Last updated: 08 December 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2021 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -142,8 +142,9 @@ locked this out by setting PCRE2_NEVER_UTF.
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
0x10ffff in the strings that they handle. Unicode support also gives access to
|
0x10ffff in the strings that they handle. Unicode support also gives access to
|
||||||
the Unicode properties of characters, using pattern escapes such as \P, \p,
|
the Unicode properties of characters, using pattern escapes such as \P, \p,
|
||||||
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
|
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i>,
|
||||||
supported. Details are given in the
|
script names, and some bi-directional properties are supported. Details are
|
||||||
|
given in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation.
|
documentation.
|
||||||
</P>
|
</P>
|
||||||
|
@ -615,9 +616,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC26" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC26" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 20 March 2020
|
Last updated: 08 December 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2020 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -66,9 +66,9 @@ interprets them.
|
||||||
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
|
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
|
||||||
built with Unicode support (the default). The properties that can be tested
|
built with Unicode support (the default). The properties that can be tested
|
||||||
with \p and \P are limited to the general category properties such as Lu and
|
with \p and \P are limited to the general category properties such as Lu and
|
||||||
Nd, script names such as Greek or Han, and the derived properties Any and L&.
|
Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
|
||||||
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
|
derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
|
||||||
is limited. See the
|
(surrogate) property, but in PCRE2 its use is limited. See the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation for details. The long synonyms for property names that Perl
|
documentation for details. The long synonyms for property names that Perl
|
||||||
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
|
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
|
||||||
|
@ -257,7 +257,7 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 01 December 2021
|
Last updated: 08 December 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2021 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -783,11 +783,13 @@ escape sequences are:
|
||||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||||
\X a Unicode extended grapheme cluster
|
\X a Unicode extended grapheme cluster
|
||||||
</pre>
|
</pre>
|
||||||
The property names represented by <i>xx</i> above are case-sensitive. There is
|
The property names represented by <i>xx</i> above are not case-sensitive, and in
|
||||||
support for Unicode script names, Unicode general category properties, "Any",
|
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
|
||||||
which matches any character (including newline), and some special PCRE2
|
underscores are ignored. There is support for Unicode script names, Unicode
|
||||||
properties (described in the
|
general category properties, "Any", which matches any character (including
|
||||||
<a href="#extraprops">next section).</a>
|
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
|
||||||
|
(described
|
||||||
|
<a href="#extraprops">below).</a>
|
||||||
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
|
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
|
||||||
Note that \P{Any} does not match any characters, so always causes a match
|
Note that \P{Any} does not match any characters, so always causes a match
|
||||||
failure.
|
failure.
|
||||||
|
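The loose-matching rule in the revised text above (case is ignored, and spaces, hyphens, and underscores are dropped) can be pictured with a small sketch. This is not PCRE2's implementation; `loose_key` is a hypothetical helper written only to show which spellings of a property name would be treated as the same name under that rule.

```python
def loose_key(name):
    """Fold a property name the way the revised text describes:
    lowercase it and drop spaces, hyphens, and underscores."""
    return "".join(c for c in name.lower() if c not in " -_")


# Hypothetical spellings that all fold to the same key.
spellings = ["Bidi_Control", "bidi control", "BIDI-CONTROL", "bidicontrol"]
keys = {loose_key(s) for s in spellings}
print(keys)
```

Under this sketch every spelling in the list collapses to a single key, which is the behaviour the new wording promises for names inside \p{...} and \P{...}.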
@ -1030,9 +1032,9 @@ The following general category property codes are supported:
|
||||||
Zp Paragraph separator
|
Zp Paragraph separator
|
||||||
Zs Space separator
|
Zs Space separator
|
||||||
</pre>
|
</pre>
|
||||||
The special property L& is also supported: it matches a character that has
|
The special property LC, which has the synonym L&, is also supported: it
|
||||||
the Lu, Ll, or Lt property, in other words, a letter that is not classified as
|
matches a character that has the Lu, Ll, or Lt property, in other words, a
|
||||||
a modifier or "other".
|
letter that is not classified as a modifier or "other".
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The Cs (Surrogate) property applies only to characters whose code points are in
|
The Cs (Surrogate) property applies only to characters whose code points are in
|
||||||
|
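As a rough illustration of what LC (synonym L&) covers, the sketch below uses Python's standard `unicodedata` module rather than PCRE2 itself. The function name `is_cased_letter` is an assumption made for this example; it simply checks whether a character's general category is Lu, Ll, or Lt, as the text above describes.

```python
import unicodedata


def is_cased_letter(ch):
    """True if ch has general category Lu, Ll, or Lt (the set the text
    above describes for the LC / L& property)."""
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt"}


# U+01C5 is a titlecase letter (Lt); U+02B0 is a modifier letter (Lm);
# U+05D0 (Hebrew alef) is an "other" letter (Lo).
for ch in ["A", "a", "\u01c5", "\u02b0", "\u05d0", "1"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_cased_letter(ch))
```

Modifier letters and "other" letters are still letters, but they fall outside LC, which is the distinction the paragraph draws.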
@ -1067,6 +1069,45 @@ properties in PCRE2 by default, though you can make them do so by setting the
|
||||||
PCRE2_UCP option or by starting the pattern with (*UCP).
|
PCRE2_UCP option or by starting the pattern with (*UCP).
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
Bi-directional properties for \p and \P
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Two properties relating to bi-directional text are supported:
|
||||||
|
<pre>
|
||||||
|
\p{Bidi_Control} matches a Bidi control character
|
||||||
|
\p{Bidi_Class:<class>} matches a character with the given class
|
||||||
|
</pre>
|
||||||
|
The recognized classes are:
|
||||||
|
<pre>
|
||||||
|
AL Arabic letter
|
||||||
|
AN Arabic number
|
||||||
|
B paragraph separator
|
||||||
|
BN boundary neutral
|
||||||
|
CS common separator
|
||||||
|
EN European number
|
||||||
|
ES European separator
|
||||||
|
ET European terminator
|
||||||
|
FSI first strong isolate
|
||||||
|
L left-to-right
|
||||||
|
LRE left-to-right embedding
|
||||||
|
LRI left-to-right isolate
|
||||||
|
LRO left-to-right override
|
||||||
|
NSM non-spacing mark
|
||||||
|
ON other neutral
|
||||||
|
PDF pop directional format
|
||||||
|
PDI pop directional isolate
|
||||||
|
R right-to-left
|
||||||
|
RLE right-to-left embedding
|
||||||
|
RLI right-to-left isolate
|
||||||
|
RLO right-to-left override
|
||||||
|
S segment separator
|
||||||
|
WS white space
|
||||||
|
</pre>
|
||||||
|
For Bidi_Class, an equals sign may be used instead of a colon. The class names
|
||||||
|
are case-insensitive. As for other properties, only the short names are
|
||||||
|
recognized.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
Extended grapheme clusters
|
Extended grapheme clusters
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
|
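To get a feel for which characters the two bi-directional properties select, the following sketch uses Python's standard `unicodedata.bidirectional()`, which returns the same short class names listed in the hunk above. It is an illustration only and does not call PCRE2. The helper `bidi_matches` and the `BIDI_CONTROL` set are assumptions made for this example; the set is taken from the Bidi_Control entries in Unicode's PropList.txt as I understand them, and results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Assumed Bidi_Control code points (from Unicode PropList.txt):
# U+061C, U+200E..U+200F, U+202A..U+202E, U+2066..U+2069.
BIDI_CONTROL = (
    {0x061C, 0x200E, 0x200F}
    | set(range(0x202A, 0x202F))
    | set(range(0x2066, 0x206A))
)


def bidi_matches(prop, ch):
    """Approximate whether ch would satisfy the property named by prop.

    Accepts 'Bidi_Control' or 'Bidi_Class:<class>'; an equals sign may
    replace the colon, and names are compared case-insensitively.
    """
    name, _, value = prop.replace("=", ":", 1).partition(":")
    name = "".join(c for c in name.lower() if c not in " -_")
    if name == "bidicontrol" and not value:
        return ord(ch) in BIDI_CONTROL
    if name == "bidiclass" and value:
        return unicodedata.bidirectional(ch) == value.strip().upper()
    raise ValueError(f"unsupported property in this sketch: {prop!r}")


print(bidi_matches("Bidi_Class:AL", "\u0627"))  # Arabic letter alef
print(bidi_matches("bidi_class=r", "\u05d0"))   # Hebrew letter alef
print(bidi_matches("Bidi_Control", "\u200e"))   # left-to-right mark
```

Note that a character can have a Bidi class without being a Bidi control character: an ordinary Latin letter has class L but is not in the Bidi_Control set.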
@ -3861,7 +3902,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 01 December 2021
|
Last updated: 08 December 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2021 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -20,28 +20,29 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||||
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||||
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
|
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
|
||||||
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
|
<li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
|
||||||
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
|
<li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
|
||||||
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
|
<li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
|
||||||
<li><a name="TOC11" href="#SEC11">REPORTED MATCH POINT SETTING</a>
|
<li><a name="TOC11" href="#SEC11">ANCHORS AND SIMPLE ASSERTIONS</a>
|
||||||
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
|
<li><a name="TOC12" href="#SEC12">REPORTED MATCH POINT SETTING</a>
|
||||||
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
|
<li><a name="TOC13" href="#SEC13">ALTERNATION</a>
|
||||||
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
|
<li><a name="TOC14" href="#SEC14">CAPTURING</a>
|
||||||
<li><a name="TOC15" href="#SEC15">COMMENT</a>
|
<li><a name="TOC15" href="#SEC15">ATOMIC GROUPS</a>
|
||||||
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
|
<li><a name="TOC16" href="#SEC16">COMMENT</a>
|
||||||
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
<li><a name="TOC17" href="#SEC17">OPTION SETTING</a>
|
||||||
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
<li><a name="TOC18" href="#SEC18">NEWLINE CONVENTION</a>
|
||||||
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
<li><a name="TOC19" href="#SEC19">WHAT \R MATCHES</a>
|
||||||
<li><a name="TOC20" href="#SEC20">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
|
<li><a name="TOC20" href="#SEC20">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||||
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
|
<li><a name="TOC21" href="#SEC21">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
|
||||||
<li><a name="TOC22" href="#SEC22">BACKREFERENCES</a>
|
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
|
||||||
<li><a name="TOC23" href="#SEC23">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
<li><a name="TOC23" href="#SEC23">BACKREFERENCES</a>
|
||||||
<li><a name="TOC24" href="#SEC24">CONDITIONAL PATTERNS</a>
|
<li><a name="TOC24" href="#SEC24">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||||
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
|
<li><a name="TOC25" href="#SEC25">CONDITIONAL PATTERNS</a>
|
||||||
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
|
<li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a>
|
||||||
<li><a name="TOC27" href="#SEC27">SEE ALSO</a>
|
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
|
||||||
<li><a name="TOC28" href="#SEC28">AUTHOR</a>
|
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
|
||||||
<li><a name="TOC29" href="#SEC29">REVISION</a>
|
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
|
||||||
|
<li><a name="TOC30" href="#SEC30">REVISION</a>
|
||||||
</ul>
|
</ul>
|
||||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -362,7 +363,40 @@ Yezidi,
|
||||||
Yi,
|
Yi,
|
||||||
Zanabazar_Square.
|
Zanabazar_Square.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
|
<br><a name="SEC8" href="#TOC1">BIDI_PROPERTIES FOR \p AND \P</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
\p{Bidi_Control} matches a Bidi control character
|
||||||
|
\p{Bidi_Class:<class>} matches a character with the given class
|
||||||
|
</pre>
|
||||||
|
The recognized classes are:
|
||||||
|
<pre>
|
||||||
|
AL Arabic letter
|
||||||
|
AN Arabic number
|
||||||
|
B paragraph separator
|
||||||
|
BN boundary neutral
|
||||||
|
CS common separator
|
||||||
|
EN European number
|
||||||
|
ES European separator
|
||||||
|
ET European terminator
|
||||||
|
FSI first strong isolate
|
||||||
|
L left-to-right
|
||||||
|
LRE left-to-right embedding
|
||||||
|
LRI left-to-right isolate
|
||||||
|
LRO left-to-right override
|
||||||
|
NSM non-spacing mark
|
||||||
|
ON other neutral
|
||||||
|
PDF pop directional format
|
||||||
|
PDI pop directional isolate
|
||||||
|
R right-to-left
|
||||||
|
RLE right-to-left embedding
|
||||||
|
RLI right-to-left isolate
|
||||||
|
RLO right-to-left override
|
||||||
|
S segment separator
|
||||||
|
WS white space
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC9" href="#TOC1">CHARACTER CLASSES</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
[...] positive character class
|
[...] positive character class
|
||||||
|
@ -390,7 +424,7 @@ In PCRE2, POSIX character set names recognize only ASCII characters by default,
|
||||||
but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||||
\Q...\E inside a character class.
|
\Q...\E inside a character class.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
|
<br><a name="SEC10" href="#TOC1">QUANTIFIERS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
? 0 or 1, greedy
|
? 0 or 1, greedy
|
||||||
|
@ -411,7 +445,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||||
{n,}? n or more, lazy
|
{n,}? n or more, lazy
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
|
<br><a name="SEC11" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
\b word boundary
|
\b word boundary
|
||||||
|
@ -429,7 +463,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||||
\G first matching position in subject
|
\G first matching position in subject
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC11" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
|
<br><a name="SEC12" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
\K set reported start of match
|
\K set reported start of match
|
||||||
|
@ -439,13 +473,13 @@ for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
|
||||||
option is set, the previous behaviour is re-enabled. When this option is set,
|
option is set, the previous behaviour is re-enabled. When this option is set,
|
||||||
\K is honoured in positive assertions, but ignored in negative ones.
|
\K is honoured in positive assertions, but ignored in negative ones.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
|
<br><a name="SEC13" href="#TOC1">ALTERNATION</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
expr|expr|expr...
|
expr|expr|expr...
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
|
<br><a name="SEC14" href="#TOC1">CAPTURING</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(...) capture group
|
(...) capture group
|
||||||
|
@ -460,20 +494,20 @@ In non-UTF modes, names may contain underscores and ASCII letters and digits;
|
||||||
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
|
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
|
||||||
both cases, a name must not start with a digit.
|
both cases, a name must not start with a digit.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
|
<br><a name="SEC15" href="#TOC1">ATOMIC GROUPS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?>...) atomic non-capture group
|
(?>...) atomic non-capture group
|
||||||
(*atomic:...) atomic non-capture group
|
(*atomic:...) atomic non-capture group
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
|
<br><a name="SEC16" href="#TOC1">COMMENT</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?#....) comment (not nestable)
|
(?#....) comment (not nestable)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
|
<br><a name="SEC17" href="#TOC1">OPTION SETTING</a><br>
|
||||||
<P>
|
<P>
|
||||||
Changes of these options within a group are automatically cancelled at the end
|
Changes of these options within a group are automatically cancelled at the end
|
||||||
of the group.
|
of the group.
|
||||||
|
@ -518,7 +552,7 @@ not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
|
||||||
application can lock out the use of (*UTF) and (*UCP) by setting the
|
application can lock out the use of (*UTF) and (*UCP) by setting the
|
||||||
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
|
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
<br><a name="SEC18" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||||
<P>
|
<P>
|
||||||
These are recognized only at the very start of the pattern or after option
|
These are recognized only at the very start of the pattern or after option
|
||||||
settings with a similar syntax.
|
settings with a similar syntax.
|
||||||
|
@ -531,7 +565,7 @@ settings with a similar syntax.
|
||||||
(*NUL) the NUL character (binary zero)
|
(*NUL) the NUL character (binary zero)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
|
<br><a name="SEC19" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||||
<P>
|
<P>
|
||||||
These are recognized only at the very start of the pattern or after option
|
These are recognized only at the very start of the pattern or after option
|
||||||
setting with a similar syntax.
|
setting with a similar syntax.
|
||||||
|
@ -540,7 +574,7 @@ setting with a similar syntax.
|
||||||
(*BSR_UNICODE) any Unicode newline sequence
|
(*BSR_UNICODE) any Unicode newline sequence
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
|
<br><a name="SEC20" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?=...) )
|
(?=...) )
|
||||||
|
@ -561,7 +595,7 @@ setting with a similar syntax.
|
||||||
</pre>
|
</pre>
|
||||||
Each top-level branch of a lookbehind must be of a fixed length.
|
Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC20" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
|
<br><a name="SEC21" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
|
||||||
<P>
|
<P>
|
||||||
These assertions are specific to PCRE2 and are not Perl-compatible.
|
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -574,7 +608,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
(*non_atomic_positive_lookbehind:...) )
|
(*non_atomic_positive_lookbehind:...) )
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
|
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(*script_run:...) ) script run, can be backtracked into
|
(*script_run:...) ) script run, can be backtracked into
|
||||||
|
@ -584,7 +618,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
(*asr:...) )
|
(*asr:...) )
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC22" href="#TOC1">BACKREFERENCES</a><br>
|
<br><a name="SEC23" href="#TOC1">BACKREFERENCES</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
\n reference by number (can be ambiguous)
|
\n reference by number (can be ambiguous)
|
||||||
|
@ -601,7 +635,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
(?P=name) reference by name (Python)
|
(?P=name) reference by name (Python)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC23" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
<br><a name="SEC24" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?R) recurse whole pattern
|
(?R) recurse whole pattern
|
||||||
|
@ -620,7 +654,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
\g'-n' call subroutine by relative number (PCRE2 extension)
|
\g'-n' call subroutine by relative number (PCRE2 extension)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC24" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
<br><a name="SEC25" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?(condition)yes-pattern)
|
(?(condition)yes-pattern)
|
||||||
|
@ -643,7 +677,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||||
conditions or recursion tests. Such a condition is interpreted as a reference
|
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||||
condition if the relevant named group exists.
|
condition if the relevant named group exists.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
<br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
<P>
|
<P>
|
||||||
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
||||||
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
||||||
|
@ -670,7 +704,7 @@ pattern is not anchored.
|
||||||
The effect of one of these verbs in a group called as a subroutine is confined
|
The effect of one of these verbs in a group called as a subroutine is confined
|
||||||
to the subroutine call.
|
to the subroutine call.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
|
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?C) callout (assumed number 0)
|
(?C) callout (assumed number 0)
|
||||||
|
@ -681,12 +715,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
||||||
start and the end), and the starting delimiter { matched with the ending
|
start and the end), and the starting delimiter { matched with the ending
|
||||||
delimiter }. To encode the ending delimiter within the string, double it.
|
delimiter }. To encode the ending delimiter within the string, double it.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||||
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC28" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
|
@ -695,9 +729,9 @@ Retired from University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC29" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 30 August 2021
|
Last updated: 08 December 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2021 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -52,13 +52,13 @@ When PCRE2 is built with Unicode support, the escape sequences \p{..},
|
||||||
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
|
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
|
||||||
The Unicode properties that can be tested are limited to the general category
|
The Unicode properties that can be tested are limited to the general category
|
||||||
properties such as Lu for an upper case letter or Nd for a decimal number, the
|
properties such as Lu for an upper case letter or Nd for a decimal number, the
|
||||||
Unicode script names such as Arabic or Han, and the derived properties Any and
|
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
|
||||||
L&. Full lists are given in the
|
derived properties Any and LC (synonym L&). Full lists are given in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
and
|
and
|
||||||
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
|
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
|
||||||
documentation. Only the short names for properties are supported. For example,
|
documentation. Only the short names for properties are supported. For example,
|
||||||
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported.
|
\p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported.
|
||||||
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
|
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
|
||||||
compatibility with Perl 5.6. PCRE2 does not support this.
|
compatibility with Perl 5.6. PCRE2 does not support this.
|
||||||
</P>
|
</P>
|
||||||
|
@ -486,9 +486,9 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 February 2020
|
Last updated: 08 December 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2020 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
150
doc/pcre2.txt
|
@ -2012,13 +2012,13 @@ LOCALE SUPPORT
|
||||||
code points are less than 256. By default, higher-valued code points
|
code points are less than 256. By default, higher-valued code points
|
||||||
never match escapes such as \w or \d.
|
never match escapes such as \w or \d.
|
||||||
|
|
||||||
When PCRE2 is built with Unicode support (the default), the Unicode
|
When PCRE2 is built with Unicode support (the default), certain Unicode
|
||||||
properties of all characters can be tested with \p and \P, or, alterna-
|
character properties can be tested with \p and \P, or, alternatively,
|
||||||
tively, the PCRE2_UCP option can be set when a pattern is compiled;
|
the PCRE2_UCP option can be set when a pattern is compiled; this causes
|
||||||
this causes \w and friends to use Unicode property support instead of
|
\w and friends to use Unicode property support instead of the built-in
|
||||||
the built-in tables. PCRE2_UCP also causes upper/lower casing opera-
|
tables. PCRE2_UCP also causes upper/lower casing operations on charac-
|
||||||
tions on characters with code points greater than 127 to use Unicode
|
ters with code points greater than 127 to use Unicode properties. These
|
||||||
properties. These effects apply even when PCRE2_UTF is not set.
|
effects apply even when PCRE2_UTF is not set.
|
||||||
|
|
||||||
The use of locales with Unicode is discouraged. If you are handling
|
The use of locales with Unicode is discouraged. If you are handling
|
||||||
characters with code points greater than 127, you should either use
|
characters with code points greater than 127, you should either use
|
||||||
|
@ -3857,7 +3857,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 30 November 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -3970,8 +3970,8 @@ UNICODE AND UTF SUPPORT
|
||||||
0x10ffff in the strings that they handle. Unicode support also gives
|
0x10ffff in the strings that they handle. Unicode support also gives
|
||||||
access to the Unicode properties of characters, using pattern escapes
|
access to the Unicode properties of characters, using pattern escapes
|
||||||
such as \P, \p, and \X. Only the general category properties such as Lu
|
such as \P, \p, and \X. Only the general category properties such as Lu
|
||||||
and Nd are supported. Details are given in the pcre2pattern documenta-
|
and Nd, script names, and some bi-directional properties are supported.
|
||||||
tion.
|
Details are given in the pcre2pattern documentation.
|
||||||
|
|
||||||
Pattern escapes such as \d and \w do not by default make use of Unicode
|
Pattern escapes such as \d and \w do not by default make use of Unicode
|
||||||
properties. The application can request that they do by setting the
|
properties. The application can request that they do by setting the
|
||||||
|
@ -4453,8 +4453,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 20 March 2020
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -4941,12 +4941,13 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
||||||
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
|
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
|
||||||
is built with Unicode support (the default). The properties that can be
|
is built with Unicode support (the default). The properties that can be
|
||||||
tested with \p and \P are limited to the general category properties
|
tested with \p and \P are limited to the general category properties
|
||||||
such as Lu and Nd, script names such as Greek or Han, and the derived
|
such as Lu and Nd, script names such as Greek or Han, Bidi_Class,
|
||||||
properties Any and L&. Both PCRE2 and Perl support the Cs (surrogate)
|
Bidi_Control, and the derived properties Any and LC (synonym L&). Both
|
||||||
property, but in PCRE2 its use is limited. See the pcre2pattern docu-
|
PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its
|
||||||
mentation for details. The long synonyms for property names that Perl
|
use is limited. See the pcre2pattern documentation for details. The
|
||||||
supports (such as \p{Letter}) are not supported by PCRE2, nor is it
|
long synonyms for property names that Perl supports (such as \p{Let-
|
||||||
permitted to prefix any of these properties with "Is".
|
ter}) are not supported by PCRE2, nor is it permitted to prefix any of
|
||||||
|
these properties with "Is".
|
||||||
|
|
||||||
7. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
|
7. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
|
||||||
in between are treated as literals. However, this is slightly different
|
in between are treated as literals. However, this is slightly different
|
||||||
|
@ -5105,7 +5106,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 01 December 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -6894,13 +6895,14 @@ BACKSLASH
|
||||||
\P{xx} a character without the xx property
|
\P{xx} a character without the xx property
|
||||||
\X a Unicode extended grapheme cluster
|
\X a Unicode extended grapheme cluster
|
||||||
|
|
||||||
The property names represented by xx above are case-sensitive. There is
|
The property names represented by xx above are not case-sensitive, and
|
||||||
support for Unicode script names, Unicode general category properties,
|
in accordance with Unicode's "loose matching" rules, spaces, hyphens,
|
||||||
"Any", which matches any character (including newline), and some spe-
|
and underscores are ignored. There is support for Unicode script names,
|
||||||
cial PCRE2 properties (described in the next section). Other Perl
|
Unicode general category properties, "Any", which matches any character
|
||||||
properties such as "InMusicalSymbols" are not supported by PCRE2. Note
|
(including newline), Bidi_Control, Bidi_Class, and some special PCRE2
|
||||||
that \P{Any} does not match any characters, so always causes a match
|
properties (described below). Other Perl properties such as "InMusi-
|
||||||
failure.
|
calSymbols" are not supported by PCRE2. Note that \P{Any} does not
|
||||||
|
match any characters, so always causes a match failure.
|
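A minimal sketch of the name comparison described above, assuming only what the text states (case is ignored, and spaces, hyphens, and underscores are dropped). The helper names loose_key and same_property are hypothetical and are not part of PCRE2; the sketch does not reproduce PCRE2's actual lookup code.

```python
def loose_key(name: str) -> str:
    """Reduce a property name to the form used for comparison:
    lower-cased, with spaces, hyphens, and underscores removed."""
    return "".join(c for c in name.lower() if c not in " -_")


def same_property(a: str, b: str) -> bool:
    """True if two spellings name the same property under the rule above."""
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    # Different spellings of the same property name compare equal.
    print(same_property("Bidi_Class", "bidi class"))
    print(same_property("Bidi_Control", "BIDI-CONTROL"))
    # Different properties still compare unequal.
    print(same_property("Bidi_Class", "Bidi_Control"))
```

Under this rule, a pattern author can write \p{bidiclass:al} or \p{Bidi_Class:AL} interchangeably.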
||||||
|
|
||||||
Sets of Unicode characters are defined as belonging to certain scripts.
|
Sets of Unicode characters are defined as belonging to certain scripts.
|
||||||
A character from one of these sets can be matched using a script name.
|
A character from one of these sets can be matched using a script name.
|
||||||
|
@ -7000,9 +7002,9 @@ BACKSLASH
|
||||||
Zp Paragraph separator
|
Zp Paragraph separator
|
||||||
Zs Space separator
|
Zs Space separator
|
||||||
|
|
||||||
The special property L& is also supported: it matches a character that
|
The special property LC, which has the synonym L&, is also supported:
|
||||||
has the Lu, Ll, or Lt property, in other words, a letter that is not
|
it matches a character that has the Lu, Ll, or Lt property, in other
|
||||||
classified as a modifier or "other".
|
words, a letter that is not classified as a modifier or "other".
|
||||||
|
|
||||||
The Cs (Surrogate) property applies only to characters whose code
|
The Cs (Surrogate) property applies only to characters whose code
|
||||||
points are in the range U+D800 to U+DFFF. These characters are no dif-
|
points are in the range U+D800 to U+DFFF. These characters are no dif-
|
||||||
|
@ -7031,6 +7033,43 @@ BACKSLASH
|
||||||
them do so by setting the PCRE2_UCP option or by starting the pattern
|
them do so by setting the PCRE2_UCP option or by starting the pattern
|
||||||
with (*UCP).
|
with (*UCP).
|
||||||
|
|
||||||
|
Bi-directional properties for \p and \P
|
||||||
|
|
||||||
|
Two properties relating to bi-directional text are supported:
|
||||||
|
|
||||||
|
\p{Bidi_Control} matches a Bidi control character
|
||||||
|
\p{Bidi_Class:<class>} matches a character with the given class
|
||||||
|
|
||||||
|
The recognized classes are:
|
||||||
|
|
||||||
|
AL Arabic letter
|
||||||
|
AN Arabic number
|
||||||
|
B paragraph separator
|
||||||
|
BN boundary neutral
|
||||||
|
CS common separator
|
||||||
|
EN European number
|
||||||
|
ES European separator
|
||||||
|
ET European terminator
|
||||||
|
FSI first strong isolate
|
||||||
|
L left-to-right
|
||||||
|
LRE left-to-right embedding
|
||||||
|
LRI left-to-right isolate
|
||||||
|
LRO left-to-right override
|
||||||
|
NSM non-spacing mark
|
||||||
|
ON other neutral
|
||||||
|
PDF pop directional format
|
||||||
|
PDI pop directional isolate
|
||||||
|
R right-to-left
|
||||||
|
RLE right-to-left embedding
|
||||||
|
RLI right-to-left isolate
|
||||||
|
RLO right-to-left override
|
||||||
|
S segment separator
|
||||||
|
         WS    white space
|
||||||
|
|
||||||
|
For Bidi_Class, an equals sign may be used instead of a colon. The
|
||||||
|
class names are case-insensitive. As for other properties, only the
|
||||||
|
short names are recognized.
|
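To make the class names above concrete, here is a small sketch that looks up the Bidi_Class of individual characters. It uses Python's standard unicodedata module, not PCRE2, so it only illustrates which characters a test such as \p{Bidi_Class:AL} is meant to select. The helper names are hypothetical, and the Bidi_Control set is written out by hand from the Unicode property data (U+061C, U+200E to U+200F, U+202A to U+202E, U+2066 to U+2069).

```python
import unicodedata

# Code points with the Bidi_Control property, listed by hand (an assumption
# of this sketch; PCRE2 derives its tables from the Unicode data files).
BIDI_CONTROL = (
    {0x061C, 0x200E, 0x200F}
    | set(range(0x202A, 0x202F))  # LRE, RLE, PDF, LRO, RLO
    | set(range(0x2066, 0x206A))  # LRI, RLI, FSI, PDI
)


def bidi_class(ch: str) -> str:
    """Return the short Bidi_Class name (e.g. 'L', 'AL', 'WS') of one character."""
    return unicodedata.bidirectional(ch)


def has_bidi_class(ch: str, cls: str) -> bool:
    """Rough analogue of \\p{Bidi_Class:<class>}; the class name is case-insensitive."""
    return bidi_class(ch) == cls.upper()


def is_bidi_control(ch: str) -> bool:
    """Rough analogue of \\p{Bidi_Control}."""
    return ord(ch) in BIDI_CONTROL


if __name__ == "__main__":
    for ch in ["A", "\u05d0", "\u0627", "1", "\u0661", " ", "\u202e"]:
        print(f"U+{ord(ch):04X}", bidi_class(ch), is_bidi_control(ch))
```

Note that the two properties are independent: U+202E is both a Bidi control character and a member of class RLO, while an ordinary Arabic letter has class AL but is not a Bidi control.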
||||||
|
|
||||||
Extended grapheme clusters
|
Extended grapheme clusters
|
||||||
|
|
||||||
The \X escape matches any number of Unicode characters that form an
|
The \X escape matches any number of Unicode characters that form an
|
||||||
|
@ -9659,7 +9698,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 01 December 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -10698,6 +10737,38 @@ SCRIPT NAMES FOR \p AND \P
|
||||||
Warang_Citi, Yezidi, Yi, Zanabazar_Square.
|
Warang_Citi, Yezidi, Yi, Zanabazar_Square.
|
||||||
|
|
||||||
|
|
||||||
|
BIDI_PROPERTIES FOR \p AND \P
|
||||||
|
|
||||||
|
\p{Bidi_Control} matches a Bidi control character
|
||||||
|
\p{Bidi_Class:<class>} matches a character with the given class
|
||||||
|
|
||||||
|
The recognized classes are:
|
||||||
|
|
||||||
|
AL Arabic letter
|
||||||
|
AN Arabic number
|
||||||
|
B paragraph separator
|
||||||
|
BN boundary neutral
|
||||||
|
CS common separator
|
||||||
|
EN European number
|
||||||
|
ES European separator
|
||||||
|
ET European terminator
|
||||||
|
FSI first strong isolate
|
||||||
|
L left-to-right
|
||||||
|
LRE left-to-right embedding
|
||||||
|
LRI left-to-right isolate
|
||||||
|
LRO left-to-right override
|
||||||
|
NSM non-spacing mark
|
||||||
|
ON other neutral
|
||||||
|
PDF pop directional format
|
||||||
|
PDI pop directional isolate
|
||||||
|
R right-to-left
|
||||||
|
RLE right-to-left embedding
|
||||||
|
RLI right-to-left isolate
|
||||||
|
RLO right-to-left override
|
||||||
|
S segment separator
|
||||||
|
         WS    white space
|
||||||
|
|
||||||
|
|
||||||
CHARACTER CLASSES
|
CHARACTER CLASSES
|
||||||
|
|
||||||
[...] positive character class
|
[...] positive character class
|
||||||
|
@ -11027,7 +11098,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 30 August 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -11073,12 +11144,13 @@ UNICODE PROPERTY SUPPORT
|
||||||
ting. The Unicode properties that can be tested are limited to the
|
ting. The Unicode properties that can be tested are limited to the
|
||||||
general category properties such as Lu for an upper case letter or Nd
|
general category properties such as Lu for an upper case letter or Nd
|
||||||
for a decimal number, the Unicode script names such as Arabic or Han,
|
for a decimal number, the Unicode script names such as Arabic or Han,
|
||||||
and the derived properties Any and L&. Full lists are given in the
|
Bidi_Class, Bidi_Control, and the derived properties Any and LC (syn-
|
||||||
pcre2pattern and pcre2syntax documentation. Only the short names for
|
onym L&). Full lists are given in the pcre2pattern and pcre2syntax doc-
|
||||||
properties are supported. For example, \p{L} matches a letter. Its Perl
|
umentation. Only the short names for properties are supported. For ex-
|
||||||
synonym, \p{Letter}, is not supported. Furthermore, in Perl, many
|
ample, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
|
||||||
properties may optionally be prefixed by "Is", for compatibility with
|
supported. Furthermore, in Perl, many properties may optionally be
|
||||||
Perl 5.6. PCRE2 does not support this.
|
prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
|
||||||
|
port this.
|
||||||
|
|
||||||
|
|
||||||
WIDE CHARACTERS AND UTF MODES
|
WIDE CHARACTERS AND UTF MODES
|
||||||
|
@ -11462,8 +11534,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 23 February 2020
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "30 November 2021" "PCRE2 10.40"
|
.TH PCRE2API 3 "08 December 2021" "PCRE2 10.40"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -2015,8 +2015,8 @@ point. However, this applies only to characters whose code points are less than
|
||||||
256. By default, higher-valued code points never match escapes such as \ew or
|
256. By default, higher-valued code points never match escapes such as \ew or
|
||||||
\ed.
|
\ed.
|
||||||
.P
|
.P
|
||||||
When PCRE2 is built with Unicode support (the default), the Unicode properties
|
When PCRE2 is built with Unicode support (the default), certain Unicode
|
||||||
of all characters can be tested with \ep and \eP, or, alternatively, the
|
character properties can be tested with \ep and \eP, or, alternatively, the
|
||||||
PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and
|
PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and
|
||||||
friends to use Unicode property support instead of the built-in tables.
|
friends to use Unicode property support instead of the built-in tables.
|
||||||
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
||||||
|
@ -4025,6 +4025,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 30 November 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2BUILD 3 "20 March 2020" "PCRE2 10.35"
|
.TH PCRE2BUILD 3 "08 December 2021" "PCRE2 10.40"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.
|
.
|
||||||
|
@ -122,8 +122,9 @@ locked this out by setting PCRE2_NEVER_UTF.
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
0x10ffff in the strings that they handle. Unicode support also gives access to
|
0x10ffff in the strings that they handle. Unicode support also gives access to
|
||||||
the Unicode properties of characters, using pattern escapes such as \eP, \ep,
|
the Unicode properties of characters, using pattern escapes such as \eP, \ep,
|
||||||
and \eX. Only the general category properties such as \fILu\fP and \fINd\fP are
|
and \eX. Only the general category properties such as \fILu\fP and \fINd\fP,
|
||||||
supported. Details are given in the
|
script names, and some bi-directional properties are supported. Details are
|
||||||
|
given in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -633,6 +634,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 20 March 2020
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2COMPAT 3 "01 December 2021" "PCRE2 10.40"
|
.TH PCRE2COMPAT 3 "08 December 2021" "PCRE2 10.40"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
|
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
|
||||||
|
@ -50,9 +50,9 @@ interprets them.
|
||||||
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
|
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
|
||||||
built with Unicode support (the default). The properties that can be tested
|
built with Unicode support (the default). The properties that can be tested
|
||||||
with \ep and \eP are limited to the general category properties such as Lu and
|
with \ep and \eP are limited to the general category properties such as Lu and
|
||||||
Nd, script names such as Greek or Han, and the derived properties Any and L&.
|
Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
|
||||||
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
|
derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
|
||||||
is limited. See the
|
(surrogate) property, but in PCRE2 its use is limited. See the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -222,6 +222,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 01 December 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "01 December 2021" "PCRE2 10.40"
|
.TH PCRE2PATTERN 3 "08 December 2021" "PCRE2 10.40"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -779,13 +779,15 @@ escape sequences are:
|
||||||
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
||||||
\eX a Unicode extended grapheme cluster
|
\eX a Unicode extended grapheme cluster
|
||||||
.sp
|
.sp
|
||||||
The property names represented by \fIxx\fP above are case-sensitive. There is
|
The property names represented by \fIxx\fP above are not case-sensitive, and in
|
||||||
support for Unicode script names, Unicode general category properties, "Any",
|
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
|
||||||
which matches any character (including newline), and some special PCRE2
|
underscores are ignored. There is support for Unicode script names, Unicode
|
||||||
properties (described in the
|
general category properties, "Any", which matches any character (including
|
||||||
|
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
|
||||||
|
(described
|
||||||
.\" HTML <a href="#extraprops">
|
.\" HTML <a href="#extraprops">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
next section).
|
below).
|
||||||
.\"
|
.\"
|
||||||
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
|
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
|
||||||
Note that \eP{Any} does not match any characters, so always causes a match
|
Note that \eP{Any} does not match any characters, so always causes a match
|
||||||
|
@ -1025,9 +1027,9 @@ The following general category property codes are supported:
|
||||||
Zp Paragraph separator
|
Zp Paragraph separator
|
||||||
Zs Space separator
|
Zs Space separator
|
||||||
.sp
|
.sp
|
||||||
The special property L& is also supported: it matches a character that has
|
The special property LC, which has the synonym L&, is also supported: it
|
||||||
the Lu, Ll, or Lt property, in other words, a letter that is not classified as
|
matches a character that has the Lu, Ll, or Lt property, in other words, a
|
||||||
a modifier or "other".
|
letter that is not classified as a modifier or "other".
|
||||||
.P
|
.P
|
||||||
The Cs (Surrogate) property applies only to characters whose code points are in
|
The Cs (Surrogate) property applies only to characters whose code points are in
|
||||||
the range U+D800 to U+DFFF. These characters are no different to any other
|
the range U+D800 to U+DFFF. These characters are no different to any other
|
||||||
|
@ -1059,6 +1061,45 @@ properties in PCRE2 by default, though you can make them do so by setting the
|
||||||
PCRE2_UCP option or by starting the pattern with (*UCP).
|
PCRE2_UCP option or by starting the pattern with (*UCP).
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SS "Bi-directional properties for \ep and \eP"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Two properties relating to bi-directional text are supported:
|
||||||
|
.sp
|
||||||
|
\ep{Bidi_Control} matches a Bidi control character
|
||||||
|
\ep{Bidi_Class:<class>} matches a character with the given class
|
||||||
|
.sp
|
||||||
|
The recognized classes are:
|
||||||
|
.sp
|
||||||
|
AL Arabic letter
|
||||||
|
AN Arabic number
|
||||||
|
B paragraph separator
|
||||||
|
BN boundary neutral
|
||||||
|
CS common separator
|
||||||
|
EN European number
|
||||||
|
ES European separator
|
||||||
|
ET European terminator
|
||||||
|
FSI first strong isolate
|
||||||
|
L left-to-right
|
||||||
|
LRE left-to-right embedding
|
||||||
|
LRI left-to-right isolate
|
||||||
|
LRO left-to-right override
|
||||||
|
NSM non-spacing mark
|
||||||
|
ON other neutral
|
||||||
|
PDF pop directional format
|
||||||
|
PDI pop directional isolate
|
||||||
|
R right-to-left
|
||||||
|
RLE right-to-left embedding
|
||||||
|
RLI right-to-left isolate
|
||||||
|
RLO right-to-left override
|
||||||
|
S segment separator
|
||||||
|
  WS    white space
|
||||||
|
.sp
|
||||||
|
For Bidi_Class, an equals sign may be used instead of a colon. The class names
|
||||||
|
are case-insensitive. As for other properties, only the short names are
|
||||||
|
recognized.
|
||||||
|
.
|
||||||
|
.
|
||||||
.SS Extended grapheme clusters
|
.SS Extended grapheme clusters
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -3909,6 +3950,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 01 December 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "30 August 2021" "PCRE2 10.38"
|
.TH PCRE2SYNTAX 3 "08 December 2021" "PCRE2 10.40"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -333,6 +333,39 @@ Yi,
|
||||||
Zanabazar_Square.
|
Zanabazar_Square.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SH "BIDI_PROPERTIES FOR \ep AND \eP"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
\ep{Bidi_Control} matches a Bidi control character
|
||||||
|
\ep{Bidi_Class:<class>} matches a character with the given class
|
||||||
|
.sp
|
||||||
|
The recognized classes are:
|
||||||
|
.sp
|
||||||
|
AL Arabic letter
|
||||||
|
AN Arabic number
|
||||||
|
B paragraph separator
|
||||||
|
BN boundary neutral
|
||||||
|
CS common separator
|
||||||
|
EN European number
|
||||||
|
ES European separator
|
||||||
|
ET European terminator
|
||||||
|
FSI first strong isolate
|
||||||
|
L left-to-right
|
||||||
|
LRE left-to-right embedding
|
||||||
|
LRI left-to-right isolate
|
||||||
|
LRO left-to-right override
|
||||||
|
NSM non-spacing mark
|
||||||
|
ON other neutral
|
||||||
|
PDF pop directional format
|
||||||
|
PDI pop directional isolate
|
||||||
|
R right-to-left
|
||||||
|
RLE right-to-left embedding
|
||||||
|
RLI right-to-left isolate
|
||||||
|
RLO right-to-left override
|
||||||
|
S segment separator
|
||||||
|
  WS    white space
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "CHARACTER CLASSES"
|
.SH "CHARACTER CLASSES"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -684,6 +717,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 30 August 2021
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2021 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2UNICODE 3 "23 February 2020" "PCRE2 10.35"
|
.TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "UNICODE AND UTF SUPPORT"
|
.SH "UNICODE AND UTF SUPPORT"
|
||||||
|
@ -42,8 +42,8 @@ When PCRE2 is built with Unicode support, the escape sequences \ep{..},
|
||||||
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
|
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
|
||||||
The Unicode properties that can be tested are limited to the general category
|
The Unicode properties that can be tested are limited to the general category
|
||||||
properties such as Lu for an upper case letter or Nd for a decimal number, the
|
properties such as Lu for an upper case letter or Nd for a decimal number, the
|
||||||
Unicode script names such as Arabic or Han, and the derived properties Any and
|
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
|
||||||
L&. Full lists are given in the
|
derived properties Any and LC (synonym L&). Full lists are given in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -52,7 +52,7 @@ and
|
||||||
\fBpcre2syntax\fP
|
\fBpcre2syntax\fP
|
||||||
.\"
|
.\"
|
||||||
documentation. Only the short names for properties are supported. For example,
|
documentation. Only the short names for properties are supported. For example,
|
||||||
\ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported.
|
\ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported.
|
||||||
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
|
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
|
||||||
compatibility with Perl 5.6. PCRE2 does not support this.
|
compatibility with Perl 5.6. PCRE2 does not support this.
|
||||||
.
|
.
|
||||||
|
@ -457,6 +457,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 23 February 2020
|
Last updated: 08 December 2021
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|