Documentation for Bidi_Control and Bidi_Class

Philip Hazel 2021-12-08 16:37:34 +00:00
parent 0246c6bf64
commit 30abd0ac8d
14 changed files with 1673 additions and 1448 deletions

@ -39,6 +39,8 @@ pcre2_substitute(), and the replacement argument of the latter, if the pointer
is NULL and the length is zero, treat as an empty string. Apparently a number
of applications treat NULL/0 in this way.
14. Added support for Bidi_Class and Bidi_Control Unicode properties.
Version 10.39 29-October-2021
-----------------------------

@ -2055,8 +2055,8 @@ point. However, this applies only to characters whose code points are less than
\d. \d.
</P> </P>
<P> <P>
When PCRE2 is built with Unicode support (the default), the Unicode properties When PCRE2 is built with Unicode support (the default), certain Unicode
of all characters can be tested with \p and \P, or, alternatively, the character properties can be tested with \p and \P, or, alternatively, the
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
friends to use Unicode property support instead of the built-in tables. friends to use Unicode property support instead of the built-in tables.
PCRE2_UCP also causes upper/lower casing operations on characters with code PCRE2_UCP also causes upper/lower casing operations on characters with code
@ -4018,7 +4018,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 30 November 2021 Last updated: 08 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>

@ -142,8 +142,9 @@ locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. Unicode support also gives access to 0x10ffff in the strings that they handle. Unicode support also gives access to
the Unicode properties of characters, using pattern escapes such as \P, \p, the Unicode properties of characters, using pattern escapes such as \P, \p,
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i> are and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i>,
supported. Details are given in the script names, and some bi-directional properties are supported. Details are
given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. documentation.
</P> </P>
@ -615,9 +616,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC26" href="#TOC1">REVISION</a><br> <br><a name="SEC26" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 20 March 2020 Last updated: 08 December 2021
<br> <br>
Copyright &copy; 1997-2020 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -66,9 +66,9 @@ interprets them.
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested built with Unicode support (the default). The properties that can be tested
with \p and \P are limited to the general category properties such as Lu and with \p and \P are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&. Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
is limited. See the (surrogate) property, but in PCRE2 its use is limited. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation for details. The long synonyms for property names that Perl documentation for details. The long synonyms for property names that Perl
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
@ -257,7 +257,7 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 01 December 2021 Last updated: 08 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
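
The derived property LC (synonym L&) named in item 6 is defined by the pcre2pattern documentation in this commit as "a character that has the Lu, Ll, or Lt property". A minimal sketch of that definition follows, using Python's standard `unicodedata` module rather than PCRE2 itself. The helper name `is_lc` is hypothetical, and the results come from the Unicode tables bundled with Python, not from PCRE2's tables.

```python
import unicodedata

# General categories that make up LC (L&), per the documentation text:
# uppercase, lowercase, and titlecase letters.
CASED_LETTER = {"Lu", "Ll", "Lt"}

def is_lc(ch: str) -> bool:
    """Return True if the single character ch is a cased letter (Lu, Ll, or Lt)."""
    return unicodedata.category(ch) in CASED_LETTER

for ch in ["A", "a", "\u01C5", "\u02B0", "\u05D0", "1"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} LC={is_lc(ch)}")
```

Modifier letters (Lm) and "other" letters (Lo) are letters but fall outside LC, which is the distinction the documentation draws.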

@ -783,11 +783,13 @@ escape sequences are:
\P{<i>xx</i>} a character without the <i>xx</i> property \P{<i>xx</i>} a character without the <i>xx</i> property
\X a Unicode extended grapheme cluster \X a Unicode extended grapheme cluster
</pre> </pre>
The property names represented by <i>xx</i> above are case-sensitive. There is The property names represented by <i>xx</i> above are not case-sensitive, and in
support for Unicode script names, Unicode general category properties, "Any", accordance with Unicode's "loose matching" rules, spaces, hyphens, and
which matches any character (including newline), and some special PCRE2 underscores are ignored. There is support for Unicode script names, Unicode
properties (described in the general category properties, "Any", which matches any character (including
<a href="#extraprops">next section).</a> newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
(described
<a href="#extraprops">below).</a>
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2. Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
Note that \P{Any} does not match any characters, so always causes a match Note that \P{Any} does not match any characters, so always causes a match
failure. failure.
@ -1030,9 +1032,9 @@ The following general category property codes are supported:
Zp Paragraph separator Zp Paragraph separator
Zs Space separator Zs Space separator
</pre> </pre>
The special property L& is also supported: it matches a character that has The special property LC, which has the synonym L&, is also supported: it
the Lu, Ll, or Lt property, in other words, a letter that is not classified as matches a character that has the Lu, Ll, or Lt property, in other words, a
a modifier or "other". letter that is not classified as a modifier or "other".
</P> </P>
<P> <P>
The Cs (Surrogate) property applies only to characters whose code points are in The Cs (Surrogate) property applies only to characters whose code points are in
@ -1067,6 +1069,45 @@ properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP). PCRE2_UCP option or by starting the pattern with (*UCP).
</P> </P>
<br><b> <br><b>
Bi-directional properties for \p and \P
</b><br>
<P>
Two properties relating to bi-directional text are supported:
<pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
</pre>
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are
recognized.
</P>
<br><b>
Extended grapheme clusters Extended grapheme clusters
</b><br> </b><br>
<P> <P>
@ -3861,7 +3902,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 01 December 2021 Last updated: 08 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
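
The pcre2pattern change above says that property names are no longer case-sensitive and that spaces, hyphens, and underscores are ignored. The following Python sketch shows which spellings that rule treats as equivalent. It is not PCRE2's implementation: `loose_key` is a hypothetical helper written only for illustration.

```python
def loose_key(name: str) -> str:
    """Fold a property name as the loose-matching rule describes:
    ignore case and drop spaces, hyphens, and underscores."""
    return "".join(ch for ch in name.lower() if ch not in " -_")

# Different spellings of the same property name fold to one key.
spellings = ["Bidi_Class", "bidi_class", "BIDI-CLASS", "Bidi Class", "bidiclass"]
keys = {loose_key(s) for s in spellings}
print(keys)
```

Under this rule a pattern author could presumably write the property name in any of the spellings listed, while names that differ in their letters, such as Bidi_Control and Bidi_Class, remain distinct.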

@ -20,28 +20,29 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a> <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a> <li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a> <li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a> <li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
<li><a name="TOC11" href="#SEC11">REPORTED MATCH POINT SETTING</a> <li><a name="TOC11" href="#SEC11">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a> <li><a name="TOC12" href="#SEC12">REPORTED MATCH POINT SETTING</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a> <li><a name="TOC13" href="#SEC13">ALTERNATION</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a> <li><a name="TOC14" href="#SEC14">CAPTURING</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a> <li><a name="TOC15" href="#SEC15">ATOMIC GROUPS</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a> <li><a name="TOC16" href="#SEC16">COMMENT</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a> <li><a name="TOC17" href="#SEC17">OPTION SETTING</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a> <li><a name="TOC18" href="#SEC18">NEWLINE CONVENTION</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a> <li><a name="TOC19" href="#SEC19">WHAT \R MATCHES</a>
<li><a name="TOC20" href="#SEC20">NON-ATOMIC LOOKAROUND ASSERTIONS</a> <li><a name="TOC20" href="#SEC20">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a> <li><a name="TOC21" href="#SEC21">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
<li><a name="TOC22" href="#SEC22">BACKREFERENCES</a> <li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
<li><a name="TOC23" href="#SEC23">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a> <li><a name="TOC23" href="#SEC23">BACKREFERENCES</a>
<li><a name="TOC24" href="#SEC24">CONDITIONAL PATTERNS</a> <li><a name="TOC24" href="#SEC24">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a> <li><a name="TOC25" href="#SEC25">CONDITIONAL PATTERNS</a>
<li><a name="TOC26" href="#SEC26">CALLOUTS</a> <li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a>
<li><a name="TOC27" href="#SEC27">SEE ALSO</a> <li><a name="TOC27" href="#SEC27">CALLOUTS</a>
<li><a name="TOC28" href="#SEC28">AUTHOR</a> <li><a name="TOC28" href="#SEC28">SEE ALSO</a>
<li><a name="TOC29" href="#SEC29">REVISION</a> <li><a name="TOC29" href="#SEC29">AUTHOR</a>
<li><a name="TOC30" href="#SEC30">REVISION</a>
</ul> </ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br> <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P> <P>
@ -362,7 +363,40 @@ Yezidi,
Yi, Yi,
Zanabazar_Square. Zanabazar_Square.
</P> </P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br> <br><a name="SEC8" href="#TOC1">BIDI_PROPERTIES FOR \p AND \P</a><br>
<P>
<pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
</PRE>
</P>
<br><a name="SEC9" href="#TOC1">CHARACTER CLASSES</a><br>
<P> <P>
<pre> <pre>
[...] positive character class [...] positive character class
@ -390,7 +424,7 @@ In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class. \Q...\E inside a character class.
</P> </P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br> <br><a name="SEC10" href="#TOC1">QUANTIFIERS</a><br>
<P> <P>
<pre> <pre>
? 0 or 1, greedy ? 0 or 1, greedy
@ -411,7 +445,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
{n,}? n or more, lazy {n,}? n or more, lazy
</PRE> </PRE>
</P> </P>
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br> <br><a name="SEC11" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P> <P>
<pre> <pre>
\b word boundary \b word boundary
@ -429,7 +463,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\G first matching position in subject \G first matching position in subject
</PRE> </PRE>
</P> </P>
<br><a name="SEC11" href="#TOC1">REPORTED MATCH POINT SETTING</a><br> <br><a name="SEC12" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<P> <P>
<pre> <pre>
\K set reported start of match \K set reported start of match
@ -439,13 +473,13 @@ for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
option is set, the previous behaviour is re-enabled. When this option is set, option is set, the previous behaviour is re-enabled. When this option is set,
\K is honoured in positive assertions, but ignored in negative ones. \K is honoured in positive assertions, but ignored in negative ones.
</P> </P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br> <br><a name="SEC13" href="#TOC1">ALTERNATION</a><br>
<P> <P>
<pre> <pre>
expr|expr|expr... expr|expr|expr...
</PRE> </PRE>
</P> </P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br> <br><a name="SEC14" href="#TOC1">CAPTURING</a><br>
<P> <P>
<pre> <pre>
(...) capture group (...) capture group
@ -460,20 +494,20 @@ In non-UTF modes, names may contain underscores and ASCII letters and digits;
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
both cases, a name must not start with a digit. both cases, a name must not start with a digit.
</P> </P>
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br> <br><a name="SEC15" href="#TOC1">ATOMIC GROUPS</a><br>
<P> <P>
<pre> <pre>
(?&#62;...) atomic non-capture group (?&#62;...) atomic non-capture group
(*atomic:...) atomic non-capture group (*atomic:...) atomic non-capture group
</PRE> </PRE>
</P> </P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br> <br><a name="SEC16" href="#TOC1">COMMENT</a><br>
<P> <P>
<pre> <pre>
(?#....) comment (not nestable) (?#....) comment (not nestable)
</PRE> </PRE>
</P> </P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br> <br><a name="SEC17" href="#TOC1">OPTION SETTING</a><br>
<P> <P>
Changes of these options within a group are automatically cancelled at the end Changes of these options within a group are automatically cancelled at the end
of the group. of the group.
@ -518,7 +552,7 @@ not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
application can lock out the use of (*UTF) and (*UCP) by setting the application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time. PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
</P> </P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br> <br><a name="SEC18" href="#TOC1">NEWLINE CONVENTION</a><br>
<P> <P>
These are recognized only at the very start of the pattern or after option These are recognized only at the very start of the pattern or after option
settings with a similar syntax. settings with a similar syntax.
@ -531,7 +565,7 @@ settings with a similar syntax.
(*NUL) the NUL character (binary zero) (*NUL) the NUL character (binary zero)
</PRE> </PRE>
</P> </P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br> <br><a name="SEC19" href="#TOC1">WHAT \R MATCHES</a><br>
<P> <P>
These are recognized only at the very start of the pattern or after option These are recognized only at the very start of the pattern or after option
setting with a similar syntax. setting with a similar syntax.
@ -540,7 +574,7 @@ setting with a similar syntax.
(*BSR_UNICODE) any Unicode newline sequence (*BSR_UNICODE) any Unicode newline sequence
</PRE> </PRE>
</P> </P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br> <br><a name="SEC20" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P> <P>
<pre> <pre>
(?=...) ) (?=...) )
@ -561,7 +595,7 @@ setting with a similar syntax.
</pre> </pre>
Each top-level branch of a lookbehind must be of a fixed length. Each top-level branch of a lookbehind must be of a fixed length.
</P> </P>
<br><a name="SEC20" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br> <br><a name="SEC21" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P> <P>
These assertions are specific to PCRE2 and are not Perl-compatible. These assertions are specific to PCRE2 and are not Perl-compatible.
<pre> <pre>
@ -574,7 +608,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(*non_atomic_positive_lookbehind:...) ) (*non_atomic_positive_lookbehind:...) )
</PRE> </PRE>
</P> </P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br> <br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
<P> <P>
<pre> <pre>
(*script_run:...) ) script run, can be backtracked into (*script_run:...) ) script run, can be backtracked into
@ -584,7 +618,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(*asr:...) ) (*asr:...) )
</PRE> </PRE>
</P> </P>
<br><a name="SEC22" href="#TOC1">BACKREFERENCES</a><br> <br><a name="SEC23" href="#TOC1">BACKREFERENCES</a><br>
<P> <P>
<pre> <pre>
\n reference by number (can be ambiguous) \n reference by number (can be ambiguous)
@ -601,7 +635,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(?P=name) reference by name (Python) (?P=name) reference by name (Python)
</PRE> </PRE>
</P> </P>
<br><a name="SEC23" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br> <br><a name="SEC24" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P> <P>
<pre> <pre>
(?R) recurse whole pattern (?R) recurse whole pattern
@ -620,7 +654,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
\g'-n' call subroutine by relative number (PCRE2 extension) \g'-n' call subroutine by relative number (PCRE2 extension)
</PRE> </PRE>
</P> </P>
<br><a name="SEC24" href="#TOC1">CONDITIONAL PATTERNS</a><br> <br><a name="SEC25" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P> <P>
<pre> <pre>
(?(condition)yes-pattern) (?(condition)yes-pattern)
@ -643,7 +677,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists. condition if the relevant named group exists.
</P> </P>
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br> <br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P> <P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@ -670,7 +704,7 @@ pattern is not anchored.
The effect of one of these verbs in a group called as a subroutine is confined The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call. to the subroutine call.
</P> </P>
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br> <br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
<P> <P>
<pre> <pre>
(?C) callout (assumed number 0) (?C) callout (assumed number 0)
@ -681,12 +715,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it. delimiter }. To encode the ending delimiter within the string, double it.
</P> </P>
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3). <b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P> </P>
<br><a name="SEC28" href="#TOC1">AUTHOR</a><br> <br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
@ -695,9 +729,9 @@ Retired from University Computing Service
Cambridge, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC29" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 30 August 2021 Last updated: 08 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
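
The short class names in the Bidi_Class table above are the ones Unicode itself defines, so they can be explored without PCRE2. The sketch below uses Python's standard `unicodedata.bidirectional()` to print the class of a few sample characters. It is an independent illustration based on the Unicode data bundled with Python, not a demonstration of PCRE2's matching.

```python
import unicodedata

# Sample characters chosen to cover several of the listed classes.
samples = {
    "A": "Latin capital letter A",
    "\u05D0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "1": "digit one",
    "\u0661": "Arabic-Indic digit one",
    " ": "space",
    "\u202E": "right-to-left override",
}

# Map each character to its Bidi_Class short name.
classes = {ch: unicodedata.bidirectional(ch) for ch in samples}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {classes[ch]}")
```

Per the documentation added in this commit, the equivalent PCRE2 tests would be written as \p{Bidi_Class:AL} or, with an equals sign, \p{Bidi_Class=AL}.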

@ -52,13 +52,13 @@ When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting. \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category The Unicode properties that can be tested are limited to the general category
properties such as Lu for an upper case letter or Nd for a decimal number, the properties such as Lu for an upper case letter or Nd for a decimal number, the
Unicode script names such as Arabic or Han, and the derived properties Any and Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
L&. Full lists are given in the derived properties Any and LC (synonym L&). Full lists are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
and and
<a href="pcre2syntax.html"><b>pcre2syntax</b></a> <a href="pcre2syntax.html"><b>pcre2syntax</b></a>
documentation. Only the short names for properties are supported. For example, documentation. Only the short names for properties are supported. For example,
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported. \p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE2 does not support this. compatibility with Perl 5.6. PCRE2 does not support this.
</P> </P>
@ -486,9 +486,9 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 23 February 2020 Last updated: 08 December 2021
<br> <br>
Copyright &copy; 1997-2020 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

File diff suppressed because it is too large

@ -1,4 +1,4 @@
.TH PCRE2API 3 "30 November 2021" "PCRE2 10.40" .TH PCRE2API 3 "08 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -2015,8 +2015,8 @@ point. However, this applies only to characters whose code points are less than
256. By default, higher-valued code points never match escapes such as \ew or 256. By default, higher-valued code points never match escapes such as \ew or
\ed. \ed.
.P .P
When PCRE2 is built with Unicode support (the default), the Unicode properties When PCRE2 is built with Unicode support (the default), certain Unicode
of all characters can be tested with \ep and \eP, or, alternatively, the character properties can be tested with \ep and \eP, or, alternatively, the
PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and
friends to use Unicode property support instead of the built-in tables. friends to use Unicode property support instead of the built-in tables.
PCRE2_UCP also causes upper/lower casing operations on characters with code PCRE2_UCP also causes upper/lower casing operations on characters with code
@ -4025,6 +4025,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 30 November 2021 Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2BUILD 3 "20 March 2020" "PCRE2 10.35" .TH PCRE2BUILD 3 "08 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
. .
@ -122,8 +122,9 @@ locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. Unicode support also gives access to 0x10ffff in the strings that they handle. Unicode support also gives access to
the Unicode properties of characters, using pattern escapes such as \eP, \ep, the Unicode properties of characters, using pattern escapes such as \eP, \ep,
and \eX. Only the general category properties such as \fILu\fP and \fINd\fP are and \eX. Only the general category properties such as \fILu\fP and \fINd\fP,
supported. Details are given in the script names, and some bi-directional properties are supported. Details are
given in the
.\" HREF .\" HREF
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
@ -633,6 +634,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 20 March 2020 Last updated: 08 December 2021
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "01 December 2021" "PCRE2 10.40" .TH PCRE2COMPAT 3 "08 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL" .SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -50,9 +50,9 @@ interprets them.
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is 6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested built with Unicode support (the default). The properties that can be tested
with \ep and \eP are limited to the general category properties such as Lu and with \ep and \eP are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&. Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
is limited. See the (surrogate) property, but in PCRE2 its use is limited. See the
.\" HREF .\" HREF
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
@ -222,6 +222,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 01 December 2021 Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "01 December 2021" "PCRE2 10.40" .TH PCRE2PATTERN 3 "08 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -779,13 +779,15 @@ escape sequences are:
\eP{\fIxx\fP} a character without the \fIxx\fP property \eP{\fIxx\fP} a character without the \fIxx\fP property
\eX a Unicode extended grapheme cluster \eX a Unicode extended grapheme cluster
.sp .sp
The property names represented by \fIxx\fP above are case-sensitive. There is The property names represented by \fIxx\fP above are not case-sensitive, and in
support for Unicode script names, Unicode general category properties, "Any", accordance with Unicode's "loose matching" rules, spaces, hyphens, and
which matches any character (including newline), and some special PCRE2 underscores are ignored. There is support for Unicode script names, Unicode
properties (described in the general category properties, "Any", which matches any character (including
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
(described
.\" HTML <a href="#extraprops"> .\" HTML <a href="#extraprops">
.\" </a> .\" </a>
next section). below).
.\" .\"
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2. Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
Note that \eP{Any} does not match any characters, so always causes a match Note that \eP{Any} does not match any characters, so always causes a match
@ -1025,9 +1027,9 @@ The following general category property codes are supported:
Zp Paragraph separator Zp Paragraph separator
Zs Space separator Zs Space separator
.sp .sp
The special property L& is also supported: it matches a character that has The special property LC, which has the synonym L&, is also supported: it
the Lu, Ll, or Lt property, in other words, a letter that is not classified as matches a character that has the Lu, Ll, or Lt property, in other words, a
a modifier or "other". letter that is not classified as a modifier or "other".
.P .P
The Cs (Surrogate) property applies only to characters whose code points are in The Cs (Surrogate) property applies only to characters whose code points are in
the range U+D800 to U+DFFF. These characters are no different to any other the range U+D800 to U+DFFF. These characters are no different to any other
@ -1059,6 +1061,45 @@ properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP). PCRE2_UCP option or by starting the pattern with (*UCP).
. .
. .
.SS "Bi-directional properties for \ep and \eP"
.rs
.sp
Two properties relating to bi-directional text are supported:
.sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class
.sp
The recognized classes are:
.sp
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
.sp
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are
recognized.
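
The matching rules above can be illustrated with a rough sketch. It is written in Python rather than against the PCRE2 API, and it is not how PCRE2 implements these properties: the helper name matches_property is hypothetical, the Bidi class of each character is taken from Python's standard unicodedata module (whose Unicode version may differ from the one PCRE2 was built with), and the set of Bidi_Control code points is hard-coded from the Unicode PropList.txt data file. Treating the property name itself as case-sensitive is an assumption of the sketch.

```python
import unicodedata

# Bidi_Control code points, hard-coded from Unicode's PropList.txt:
# U+061C, U+200E..U+200F, U+202A..U+202E, U+2066..U+2069.
BIDI_CONTROL = {0x061C, 0x200E, 0x200F,
                *range(0x202A, 0x202F), *range(0x2066, 0x206A)}

# The short class names listed in the table above.
BIDI_CLASSES = {
    "AL", "AN", "B", "BN", "CS", "EN", "ES", "ET", "FSI", "L", "LRE", "LRI",
    "LRO", "NSM", "ON", "PDF", "PDI", "R", "RLE", "RLI", "RLO", "S", "WS",
}


def matches_property(spec, ch):
    """Hypothetical helper: would \\p{spec} match the single character ch?"""
    if spec == "Bidi_Control":
        return ord(ch) in BIDI_CONTROL
    # Either a colon or an equals sign may separate the name from the class.
    for sep in (":", "="):
        name, found, value = spec.partition(sep)
        if found and name == "Bidi_Class":
            cls = value.upper()  # class names are case-insensitive
            if cls not in BIDI_CLASSES:
                raise ValueError("unknown Bidi class: " + value)
            return unicodedata.bidirectional(ch) == cls
    raise ValueError("unsupported property: " + spec)


if __name__ == "__main__":
    for spec, ch in [("Bidi_Class:al", "\u0627"),   # ARABIC LETTER ALEF
                     ("Bidi_Class=R", "\u05d0"),    # HEBREW LETTER ALEF
                     ("Bidi_Control", "\u202e")]:   # RIGHT-TO-LEFT OVERRIDE
        print(spec, "U+%04X" % ord(ch), matches_property(spec, ch))
```

In this sketch a long name such as "Bidi_Class:Arabic_Letter" raises an error, mirroring the rule that only the short names are recognized.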
.
.
.SS Extended grapheme clusters .SS Extended grapheme clusters
.rs .rs
.sp .sp
@ -3909,6 +3950,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 01 December 2021 Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "30 August 2021" "PCRE2 10.38" .TH PCRE2SYNTAX 3 "08 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -333,6 +333,39 @@ Yi,
Zanabazar_Square. Zanabazar_Square.
. .
. .
.SH "BIDI PROPERTIES FOR \ep AND \eP"
.rs
.sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class
.sp
The recognized classes are:
.sp
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
.
.
.SH "CHARACTER CLASSES" .SH "CHARACTER CLASSES"
.rs .rs
.sp .sp
@ -684,6 +717,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 30 August 2021 Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "23 February 2020" "PCRE2 10.35" .TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE - Perl-compatible regular expressions (revised API) PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT" .SH "UNICODE AND UTF SUPPORT"
@ -42,8 +42,8 @@ When PCRE2 is built with Unicode support, the escape sequences \ep{..},
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting. \eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category The Unicode properties that can be tested are limited to the general category
properties such as Lu for an upper case letter or Nd for a decimal number, the properties such as Lu for an upper case letter or Nd for a decimal number, the
Unicode script names such as Arabic or Han, and the derived properties Any and Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
L&. Full lists are given in the derived properties Any and LC (synonym L&). Full lists are given in the
.\" HREF .\" HREF
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
@ -52,7 +52,7 @@ and
\fBpcre2syntax\fP \fBpcre2syntax\fP
.\" .\"
documentation. Only the short names for properties are supported. For example, documentation. Only the short names for properties are supported. For example,
\ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported. \ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE2 does not support this. compatibility with Perl 5.6. PCRE2 does not support this.
. .
@ -457,6 +457,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 23 February 2020 Last updated: 08 December 2021
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi