Documentation for Bidi_Control and Bidi_Class

Philip Hazel 2021-12-08 16:37:34 +00:00
parent 0246c6bf64
commit 30abd0ac8d
14 changed files with 1673 additions and 1448 deletions


@ -39,6 +39,8 @@ pcre2_substitute(), and the replacement argument of the latter, if the pointer
is NULL and the length is zero, treat as an empty string. Apparently a number
of applications treat NULL/0 in this way.
14. Added support for Bidi_Class and Bidi_Control Unicode properties.
Version 10.39 29-October-2021
-----------------------------


@ -2055,8 +2055,8 @@ point. However, this applies only to characters whose code points are less than
\d.
</P>
<P>
When PCRE2 is built with Unicode support (the default), the Unicode properties
of all characters can be tested with \p and \P, or, alternatively, the
When PCRE2 is built with Unicode support (the default), certain Unicode
character properties can be tested with \p and \P, or, alternatively, the
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
friends to use Unicode property support instead of the built-in tables.
PCRE2_UCP also causes upper/lower casing operations on characters with code
@ -4018,7 +4018,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 30 November 2021
Last updated: 08 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>


@ -142,8 +142,9 @@ locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. Unicode support also gives access to
the Unicode properties of characters, using pattern escapes such as \P, \p,
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
supported. Details are given in the
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i>,
script names, and some bi-directional properties are supported. Details are
given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
@ -615,9 +616,9 @@ Cambridge, England.
</P>
<br><a name="SEC26" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 March 2020
Last updated: 08 December 2021
<br>
Copyright &copy; 1997-2020 University of Cambridge.
Copyright &copy; 1997-2021 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -66,9 +66,9 @@ interprets them.
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested
with \p and \P are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&.
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
is limited. See the
Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
(surrogate) property, but in PCRE2 its use is limited. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation for details. The long synonyms for property names that Perl
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
@ -257,7 +257,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 01 December 2021
Last updated: 08 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>


@ -783,11 +783,13 @@ escape sequences are:
\P{<i>xx</i>} a character without the <i>xx</i> property
\X a Unicode extended grapheme cluster
</pre>
The property names represented by <i>xx</i> above are case-sensitive. There is
support for Unicode script names, Unicode general category properties, "Any",
which matches any character (including newline), and some special PCRE2
properties (described in the
<a href="#extraprops">next section).</a>
The property names represented by <i>xx</i> above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
(described
<a href="#extraprops">below).</a>
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
Note that \P{Any} does not match any characters, so always causes a match
failure.
@ -1030,9 +1032,9 @@ The following general category property codes are supported:
Zp Paragraph separator
Zs Space separator
</pre>
The special property L& is also supported: it matches a character that has
the Lu, Ll, or Lt property, in other words, a letter that is not classified as
a modifier or "other".
The special property LC, which has the synonym L&, is also supported: it
matches a character that has the Lu, Ll, or Lt property, in other words, a
letter that is not classified as a modifier or "other".
</P>
<P>
The Cs (Surrogate) property applies only to characters whose code points are in
@ -1067,6 +1069,45 @@ properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
</P>
<br><b>
Bi-directional properties for \p and \P
</b><br>
<P>
Two properties relating to bi-directional text are supported:
<pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
</pre>
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are
recognized.
</P>
<br><b>
Extended grapheme clusters
</b><br>
<P>
@ -3861,7 +3902,7 @@ Cambridge, England.
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 01 December 2021
Last updated: 08 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>
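
As a rough illustration of what the Bidi_Class values documented in the hunk above mean, the sketch below looks up the class of a few sample characters with Python's standard `unicodedata.bidirectional()`. This is not PCRE2 code: it assumes that Python's bundled Unicode database agrees with PCRE2's tables for these long-stable characters, and the helper name `bidi_class` is made up for the example.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the short Bidi_Class name (e.g. 'L', 'R', 'AL') of one character."""
    return unicodedata.bidirectional(ch)


# Sample characters paired with the class that a pattern item such as
# \p{Bidi_Class:AL} would be expected to test for (illustrative only).
SAMPLES = [
    ("A", "L"),         # Latin letter: left-to-right
    ("\u05D0", "R"),    # HEBREW LETTER ALEF: right-to-left
    ("\u0627", "AL"),   # ARABIC LETTER ALEF: Arabic letter
    ("1", "EN"),        # ASCII digit: European number
    ("\u0661", "AN"),   # ARABIC-INDIC DIGIT ONE: Arabic number
    (",", "CS"),        # comma: common separator
    (" ", "WS"),        # space: white space
    ("\t", "S"),        # tab: segment separator
    ("\u202E", "RLO"),  # RIGHT-TO-LEFT OVERRIDE
]

for ch, expected in SAMPLES:
    print(f"U+{ord(ch):04X} {bidi_class(ch):<3} (expected {expected})")
```

The short names returned by Python are the same abbreviations that appear in the list of recognized classes, which is why only the short forms need to be remembered when writing a pattern.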


@ -20,28 +20,29 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">REPORTED MATCH POINT SETTING</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
<li><a name="TOC22" href="#SEC22">BACKREFERENCES</a>
<li><a name="TOC23" href="#SEC23">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC24" href="#SEC24">CONDITIONAL PATTERNS</a>
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
<li><a name="TOC27" href="#SEC27">SEE ALSO</a>
<li><a name="TOC28" href="#SEC28">AUTHOR</a>
<li><a name="TOC29" href="#SEC29">REVISION</a>
<li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
<li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
<li><a name="TOC11" href="#SEC11">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC12" href="#SEC12">REPORTED MATCH POINT SETTING</a>
<li><a name="TOC13" href="#SEC13">ALTERNATION</a>
<li><a name="TOC14" href="#SEC14">CAPTURING</a>
<li><a name="TOC15" href="#SEC15">ATOMIC GROUPS</a>
<li><a name="TOC16" href="#SEC16">COMMENT</a>
<li><a name="TOC17" href="#SEC17">OPTION SETTING</a>
<li><a name="TOC18" href="#SEC18">NEWLINE CONVENTION</a>
<li><a name="TOC19" href="#SEC19">WHAT \R MATCHES</a>
<li><a name="TOC20" href="#SEC20">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
<li><a name="TOC23" href="#SEC23">BACKREFERENCES</a>
<li><a name="TOC24" href="#SEC24">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC25" href="#SEC25">CONDITIONAL PATTERNS</a>
<li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a>
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
<li><a name="TOC30" href="#SEC30">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
@ -362,7 +363,40 @@ Yezidi,
Yi,
Zanabazar_Square.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<br><a name="SEC8" href="#TOC1">BIDI_PROPERTIES FOR \p AND \P</a><br>
<P>
<pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
</PRE>
</P>
<br><a name="SEC9" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
[...] positive character class
@ -390,7 +424,7 @@ In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<br><a name="SEC10" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
? 0 or 1, greedy
@ -411,7 +445,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
{n,}? n or more, lazy
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<br><a name="SEC11" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
\b word boundary
@ -429,7 +463,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\G first matching position in subject
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<br><a name="SEC12" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<P>
<pre>
\K set reported start of match
@ -439,13 +473,13 @@ for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
option is set, the previous behaviour is re-enabled. When this option is set,
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<br><a name="SEC13" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<br><a name="SEC14" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
(...) capture group
@ -460,20 +494,20 @@ In non-UTF modes, names may contain underscores and ASCII letters and digits;
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
both cases, a name must not start with a digit.
</P>
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<br><a name="SEC15" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
(?&#62;...) atomic non-capture group
(*atomic:...) atomic non-capture group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<br><a name="SEC16" href="#TOC1">COMMENT</a><br>
<P>
<pre>
(?#....) comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<br><a name="SEC17" href="#TOC1">OPTION SETTING</a><br>
<P>
Changes of these options within a group are automatically cancelled at the end
of the group.
@ -518,7 +552,7 @@ not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<br><a name="SEC18" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
@ -531,7 +565,7 @@ settings with a similar syntax.
(*NUL) the NUL character (binary zero)
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<br><a name="SEC19" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
setting with a similar syntax.
@ -540,7 +574,7 @@ setting with a similar syntax.
(*BSR_UNICODE) any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<br><a name="SEC20" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
(?=...) )
@ -561,7 +595,7 @@ setting with a similar syntax.
</pre>
Each top-level branch of a lookbehind must be of a fixed length.
</P>
<br><a name="SEC20" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<br><a name="SEC21" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P>
These assertions are specific to PCRE2 and are not Perl-compatible.
<pre>
@ -574,7 +608,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(*non_atomic_positive_lookbehind:...) )
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
<P>
<pre>
(*script_run:...) ) script run, can be backtracked into
@ -584,7 +618,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(*asr:...) )
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">BACKREFERENCES</a><br>
<br><a name="SEC23" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
@ -601,7 +635,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(?P=name) reference by name (Python)
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<br><a name="SEC24" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
@ -620,7 +654,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
\g'-n' call subroutine by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<br><a name="SEC25" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
@ -643,7 +677,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
<br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@ -670,7 +704,7 @@ pattern is not anchored.
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
</P>
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
(?C) callout (assumed number 0)
@ -681,12 +715,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC28" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -695,9 +729,9 @@ Retired from University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC29" href="#TOC1">REVISION</a><br>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 30 August 2021
Last updated: 08 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>


@ -52,13 +52,13 @@ When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category
properties such as Lu for an upper case letter or Nd for a decimal number, the
Unicode script names such as Arabic or Han, and the derived properties Any and
L&. Full lists are given in the
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
derived properties Any and LC (synonym L&). Full lists are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
and
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
documentation. Only the short names for properties are supported. For example,
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported.
\p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE2 does not support this.
</P>
@ -486,9 +486,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 23 February 2020
Last updated: 08 December 2021
<br>
Copyright &copy; 1997-2020 University of Cambridge.
Copyright &copy; 1997-2021 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

File diff suppressed because it is too large


@ -1,4 +1,4 @@
.TH PCRE2API 3 "30 November 2021" "PCRE2 10.40"
.TH PCRE2API 3 "08 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -2015,8 +2015,8 @@ point. However, this applies only to characters whose code points are less than
256. By default, higher-valued code points never match escapes such as \ew or
\ed.
.P
When PCRE2 is built with Unicode support (the default), the Unicode properties
of all characters can be tested with \ep and \eP, or, alternatively, the
When PCRE2 is built with Unicode support (the default), certain Unicode
character properties can be tested with \ep and \eP, or, alternatively, the
PCRE2_UCP option can be set when a pattern is compiled; this causes \ew and
friends to use Unicode property support instead of the built-in tables.
PCRE2_UCP also causes upper/lower casing operations on characters with code
@ -4025,6 +4025,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 30 November 2021
Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2BUILD 3 "20 March 2020" "PCRE2 10.35"
.TH PCRE2BUILD 3 "08 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@ -122,8 +122,9 @@ locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. Unicode support also gives access to
the Unicode properties of characters, using pattern escapes such as \eP, \ep,
and \eX. Only the general category properties such as \fILu\fP and \fINd\fP are
supported. Details are given in the
and \eX. Only the general category properties such as \fILu\fP and \fINd\fP,
script names, and some bi-directional properties are supported. Details are
given in the
.\" HREF
\fBpcre2pattern\fP
.\"
@ -633,6 +634,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 March 2020
Copyright (c) 1997-2020 University of Cambridge.
Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "01 December 2021" "PCRE2 10.40"
.TH PCRE2COMPAT 3 "08 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -50,9 +50,9 @@ interprets them.
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested
with \ep and \eP are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&.
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
is limited. See the
Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
(surrogate) property, but in PCRE2 its use is limited. See the
.\" HREF
\fBpcre2pattern\fP
.\"
@ -222,6 +222,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 01 December 2021
Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "01 December 2021" "PCRE2 10.40"
.TH PCRE2PATTERN 3 "08 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -779,13 +779,15 @@ escape sequences are:
\eP{\fIxx\fP} a character without the \fIxx\fP property
\eX a Unicode extended grapheme cluster
.sp
The property names represented by \fIxx\fP above are case-sensitive. There is
support for Unicode script names, Unicode general category properties, "Any",
which matches any character (including newline), and some special PCRE2
properties (described in the
The property names represented by \fIxx\fP above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
(described
.\" HTML <a href="#extraprops">
.\" </a>
next section).
below).
.\"
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
Note that \eP{Any} does not match any characters, so always causes a match
@ -1025,9 +1027,9 @@ The following general category property codes are supported:
Zp Paragraph separator
Zs Space separator
.sp
The special property L& is also supported: it matches a character that has
the Lu, Ll, or Lt property, in other words, a letter that is not classified as
a modifier or "other".
The special property LC, which has the synonym L&, is also supported: it
matches a character that has the Lu, Ll, or Lt property, in other words, a
letter that is not classified as a modifier or "other".
.P
The Cs (Surrogate) property applies only to characters whose code points are in
the range U+D800 to U+DFFF. These characters are no different to any other
@ -1059,6 +1061,45 @@ properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
.
.
.SS "Bi-directional properties for \ep and \eP"
.rs
.sp
Two properties relating to bi-directional text are supported:
.sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class
.sp
The recognized classes are:
.sp
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
.sp
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are
recognized.
.
.
.SS Extended grapheme clusters
.rs
.sp
@ -3909,6 +3950,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 01 December 2021
Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi
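
The revised wording in this file says that property names are not case-sensitive and that spaces, hyphens, and underscores are ignored, and that Bidi_Class accepts either a colon or an equals sign before the class name. The following is a minimal sketch of that rule as stated, not PCRE2's actual parser; the function names `normalize_property_name` and `parse_property` are hypothetical, and applying the same folding to the class value is an assumption made for simplicity.

```python
import re


def normalize_property_name(name: str) -> str:
    """Fold a name as the text describes: drop spaces, hyphens, and
    underscores, then ignore case by lower-casing."""
    return re.sub(r"[ _-]", "", name).lower()


def parse_property(item: str):
    """Split the text between the braces of \\p{...} into (property, value).

    Either ':' or '=' may separate the property from its value; the value
    is None when no separator is present (for example 'Lu').
    """
    match = re.fullmatch(r"([^:=]+)(?:[:=](.+))?", item)
    if match is None:
        raise ValueError(f"cannot parse property item: {item!r}")
    prop = normalize_property_name(match.group(1))
    value = match.group(2)
    return prop, normalize_property_name(value) if value is not None else None


# All of these spellings fold to the same (property, value) pair.
for spelling in ("Bidi_Class:AL", "bidi class=al", "BIDI-CLASS:Al"):
    print(spelling, "->", parse_property(spelling))
```

Folding names before lookup means a single table keyed on the folded form can serve every accepted spelling.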


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "30 August 2021" "PCRE2 10.38"
.TH PCRE2SYNTAX 3 "08 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -333,6 +333,39 @@ Yi,
Zanabazar_Square.
.
.
.SH "BIDI_PROPERTIES FOR \ep AND \eP"
.rs
.sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class
.sp
The recognized classes are:
.sp
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
@ -684,6 +717,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 30 August 2021
Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi
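
To make the Bidi_Control entry in the syntax summary above more concrete, here is a small sketch that scans a string for Bidi control characters without using PCRE2 at all. The code point set is copied from the Unicode Character Database (PropList.txt) and hard-coded as an assumption, because Python's `unicodedata` module does not expose this property; the names `BIDI_CONTROL` and `find_bidi_controls` are invented for the example.

```python
# Code points carrying the Bidi_Control property, hard-coded from the
# Unicode Character Database (an assumption of this sketch, not PCRE2 data).
BIDI_CONTROL = frozenset(
    [0x061C, 0x200E, 0x200F]        # ALM, LRM, RLM
    + list(range(0x202A, 0x202F))   # LRE, RLE, PDF, LRO, RLO
    + list(range(0x2066, 0x206A))   # LRI, RLI, FSI, PDI
)


def find_bidi_controls(text: str):
    """Return (index, code point) pairs for each Bidi control character."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in BIDI_CONTROL]


# A string with an embedded RIGHT-TO-LEFT OVERRIDE and a
# POP DIRECTIONAL ISOLATE, as might appear in obfuscated source text.
sample = "abc\u202Edef\u2069"
for index, cp in find_bidi_controls(sample):
    print(f"offset {index}: U+{cp:04X}")
```

A pattern using the Bidi_Control property would be expected to match at the same positions this scan reports, assuming both use the same Unicode data.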


@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "23 February 2020" "PCRE2 10.35"
.TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@ -42,8 +42,8 @@ When PCRE2 is built with Unicode support, the escape sequences \ep{..},
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category
properties such as Lu for an upper case letter or Nd for a decimal number, the
Unicode script names such as Arabic or Han, and the derived properties Any and
L&. Full lists are given in the
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
derived properties Any and LC (synonym L&). Full lists are given in the
.\" HREF
\fBpcre2pattern\fP
.\"
@ -52,7 +52,7 @@ and
\fBpcre2syntax\fP
.\"
documentation. Only the short names for properties are supported. For example,
\ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported.
\ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE2 does not support this.
.
@ -457,6 +457,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 February 2020
Copyright (c) 1997-2020 University of Cambridge.
Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi