Documentation update for binary property support

Philip Hazel 2022-01-12 15:30:22 +00:00
parent bf35c0518c
commit 7f7d3e8521
7 changed files with 292 additions and 917 deletions


@ -776,8 +776,17 @@ can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
sequences are of course limited to testing characters whose code points are sequences are of course limited to testing characters whose code points are
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
greater than 0x10ffff (the Unicode limit) may be encountered. These are all greater than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Unknown script and with an unassigned type. The extra treated as being in the Unknown script and with an unassigned type.
escape sequences are: </P>
<P>
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \d and \w do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
</P>
<P>
The extra escape sequences that provide property support are:
<pre> <pre>
\p{<i>xx</i>} a character with the <i>xx</i> property \p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property \P{<i>xx</i>} a character without the <i>xx</i> property
@ -787,17 +796,20 @@ The property names represented by <i>xx</i> above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including general category properties, "Any", which matches any character (including
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties newline), Bidi_Class, a number of binary (yes/no) properties, and some special
(described PCRE2 properties (described
<a href="#extraprops">below).</a> <a href="#extraprops">below).</a>
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2. Certain other Perl properties such as "InMusicalSymbols" are not supported by
Note that \P{Any} does not match any characters, so always causes a match PCRE2. Note that \P{Any} does not match any characters, so always causes a
failure. match failure.
</P> </P>
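<P>
The loose matching rule described above (case is not significant; spaces,
hyphens, and underscores are ignored) can be illustrated with a short Python
sketch. This is not PCRE2's own code: the function names
normalize_property_name and same_property are hypothetical and exist only for
this illustration.
</P>

```python
import re


def normalize_property_name(name: str) -> str:
    """Fold case and drop spaces, hyphens, and underscores.

    Hypothetical helper: it mirrors the rule stated in the text and is
    not taken from the PCRE2 sources.
    """
    return re.sub(r"[ \-_]", "", name).lower()


def same_property(a: str, b: str) -> bool:
    """Return True if two spellings name the same property under the rule."""
    return normalize_property_name(a) == normalize_property_name(b)


if __name__ == "__main__":
    for spelling in ("Bidi_Class", "bidi class", "BIDI-CLASS"):
        print(spelling, "->", normalize_property_name(spelling))
```

<P>
Under this sketch, "Bidi_Class", "bidi class", and "BIDI-CLASS" all reduce to
the same key, so a lookup table keyed on the normalized form would need only
one entry per property.
</P>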
<br><b>
Script properties for \p and \P
</b><br>
<P> <P>
There are three different syntax forms for matching a script. Each Unicode There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script character has a basic script and, optionally, a list of other scripts ("Script
Extentions") with which it is commonly used. Using the Adlam script as an Extensions") with which it is commonly used. Using the Adlam script as an
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
\p{scx:Adlam} matches, in addition, characters that have Adlam in their \p{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the extensions list. The full names "script" and "script extensions" for the
@ -809,172 +821,17 @@ interpretation at release 5.26 and PCRE2 changed at release 10.40.
<P> <P>
Unassigned characters (and in non-UTF 32-bit mode, characters with code points Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list part of an identified script are lumped together as "Common". The current list
of script names and their 4-letter abbreviations is: of recognized script names and their 4-character abbreviations can be obtained
</P> by running this command:
<P> <pre>
Adlam (Adlm), pcre2test -LS
Ahom (Ahom),
Anatolian_Hieroglyphs (Hluw), </PRE>
Arabic (Arab),
Armenian (Armn),
Avestan (Avst),
Balinese (Bali),
Bamum (Bamu),
Bassa_Vah (Bass),
Batak (Batk),
Bengali (Beng),
Bhaiksuki (Bhks),
Bopomofo (Bopo),
Brahmi (Brah),
Braille (Brai),
Buginese (Bugi),
Buhid (Buhd),
Canadian_Aboriginal (Cans),
Carian (Cari),
Caucasian_Albanian (Aghb),
Chakma (Cakm),
Cham (Cham),
Cherokee (Cher),
Chorasmian (Chrs),
Common (Zyyy),
Coptic (Copt),
Cuneiform (Xsux),
Cypriot (Cprt),
Cypro_Minoan (Cpmn),
Cyrillic (Cyrl),
Deseret (Dsrt),
Devanagari (Deva),
Dives_Akuru (Diak),
Dogra (Dogr),
Duployan (Dupl),
Egyptian_Hieroglyphs (Egyp),
Elbasan (Elba),
Elymaic (Elym),
Ethiopic (Ethi),
Georgian (Geor),
Glagolitic (Glag),
Gothic (Goth),
Grantha (Gran),
Greek (Grek),
Gujarati (Gujr),
Gunjala_Gondi (Gong),
Gurmukhi (Guru),
Han (Hani),
Hangul (Hang),
Hanifi_Rohingya (Rohg),
Hanunoo (Hano),
Hatran (Hatr),
Hebrew (Hebr),
Hiragana (Hira),
Imperial_Aramaic (Armi),
Inherited (Zinh),
Inscriptional_Pahlavi (Phli),
Inscriptional_Parthian (Prti),
Javanese (Java),
Kaithi (Kthi),
Kannada (Knda),
Katakana (Kana),
Kayah_Li (Kali),
Kharoshthi (Khar),
Khitan_Small_Script (Kits),
Khmer (Khmr),
Khojki (Khoj),
Khudawadi (Sind),
Lao (Laoo),
Latin (Latn),
Lepcha (Lepc),
Limbu (Limb),
Linear_A (Lina),
Linear_B (Linb),
Lisu (Lisu),
Lycian (Lyci),
Lydian (Lydi),
Mahajani (Majh),
Makasar (Maka),
Malayalam (Mlym),
Mandaic (Mand),
Manichaean (Mani),
Marchen (Marc),
Masaram_Gondi (Gonm),
Medefaidrin (Medf),
Meetei_Mayek (Mtei),
Mende_Kikakui (Mend),
Meroitic_Cursive (Merc),
Meroitic_Hieroglyphs (Mero),
Miao (Miao),
Modi (Modi),
Mongolian (Mong),
Mro (Mroo),
Multani (Mult),
Myanmar (Mymr),
Nabataean (Nbar),
Nandinagari (Nand),
New_Tai_Lue (Talu),
Newa (Newa),
Nko (Nkoo),
Nushu (Nshu),
Nyiakeng_Puachue_Hmong (Hmnp),
Ogham (Ogam),
Ol_Chiki (Olck),
Old_Hungarian (Hung),
Old_Italic (Olck),
Old_North_Arabian (Narb),
Old_Permic (Perm),
Old_Persian (Orkh),
Old_Sogdian (Sogo),
Old_South_Arabian (Sarb),
Old_Turkic (Orkh),
Old_Uyghur (Ougr),
Oriya (Orya),
Osage (Osge),
Osmanya (Osma),
Pahawh_Hmong (Hmng),
Palmyrene (Palm),
Pau_Cin_Hau (Pauc),
Phags_Pa (Phag),
Phoenician (Phnx),
Psalter_Pahlavi (Phli),
Rejang (Rjng),
Runic (Runr),
Samaritan (Samr),
Saurashtra (Saur),
Sharada (Shrd),
Shavian (Shaw),
Siddham (Sidd),
SignWriting (Sgnw),
Sinhala (Sinh),
Sogdian (Sogd),
Sora_Sompeng (Sora),
Soyombo (Soyo),
Sundanese (Sund),
Syloti_Nagri (Sylo),
Syriac (Syrc),
Tagalog (Tglg),
Tagbanwa (Tagb),
Tai_Le (Tale),
Tai_Tham (Lana),
Tai_Viet (Tavt),
Takri (Takr),
Tamil (Taml),
Tangsa (Tngs),
Tangut (Tang),
Telugu (Telu),
Thaana (Thaa),
Thai (Thai),
Tibetan (Tibt),
Tifinagh (Tfng),
Tirhuta (Tirh),
Toto (Toto),
Ugaritic (Ugar),
Vai (Vaii),
Vithkuqi (Vith),
Wancho (Wcho),
Warang_Citi (Wara),
Yezidi (Yezi),
Yi (Yiii),
Zanabazar_Square (Zanb).
</P> </P>
<br><b>
The general category property for \p and \P
</b><br>
<P> <P>
Each character has exactly one Unicode general category property, specified by Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be a two-letter abbreviation. For compatibility with Perl, negation can be
@ -1065,22 +922,23 @@ Specifying caseless matching does not affect these escape sequences. For
example, \p{Lu} always matches only upper case letters. This is different from example, \p{Lu} always matches only upper case letters. This is different from
the behaviour of current versions of Perl. the behaviour of current versions of Perl.
</P> </P>
<P>
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \d and \w do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
</P>
<br><b> <br><b>
Bi-directional properties for \p and \P Binary (yes/no) properties for \p and \P
</b><br>
<P>
Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\p and \P, along with their abbreviations, by running this command:
<pre>
pcre2test -LP
</PRE>
</P>
<br><b>
The Bidi_Class property for \p and \P
</b><br> </b><br>
<P> <P>
Two properties relating to bi-directional text (each with a shorter synonym)
are supported:
<pre> <pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class \p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class \p{BC:&#60;class&#62;} matches a character with the given class
</pre> </pre>
@ -1110,8 +968,8 @@ The recognized classes are:
S segment separator S segment separator
WS white space WS white space
</pre> </pre>
For Bidi_Class, an equals sign may be used instead of a colon. The class names An equals sign may be used instead of a colon. The class names are
are case-insensitive; only the short names listed above are recognized. case-insensitive; only the short names listed above are recognized.
</P> </P>
<br><b> <br><b>
Extended grapheme clusters Extended grapheme clusters
@ -3908,9 +3766,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 December 2021 Last updated: 12 January 2022
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2022 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -19,30 +19,31 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a> <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT MATCHING WITH \p AND \P</a> <li><a name="TOC7" href="#SEC7">BINARY PROPERTIES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a> <li><a name="TOC8" href="#SEC8">SCRIPT MATCHING WITH \p AND \P</a>
<li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a> <li><a name="TOC9" href="#SEC9">THE BIDI_CLASS PROPERTY FOR \p AND \P</a>
<li><a name="TOC10" href="#SEC10">QUANTIFIERS</a> <li><a name="TOC10" href="#SEC10">CHARACTER CLASSES</a>
<li><a name="TOC11" href="#SEC11">ANCHORS AND SIMPLE ASSERTIONS</a> <li><a name="TOC11" href="#SEC11">QUANTIFIERS</a>
<li><a name="TOC12" href="#SEC12">REPORTED MATCH POINT SETTING</a> <li><a name="TOC12" href="#SEC12">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC13" href="#SEC13">ALTERNATION</a> <li><a name="TOC13" href="#SEC13">REPORTED MATCH POINT SETTING</a>
<li><a name="TOC14" href="#SEC14">CAPTURING</a> <li><a name="TOC14" href="#SEC14">ALTERNATION</a>
<li><a name="TOC15" href="#SEC15">ATOMIC GROUPS</a> <li><a name="TOC15" href="#SEC15">CAPTURING</a>
<li><a name="TOC16" href="#SEC16">COMMENT</a> <li><a name="TOC16" href="#SEC16">ATOMIC GROUPS</a>
<li><a name="TOC17" href="#SEC17">OPTION SETTING</a> <li><a name="TOC17" href="#SEC17">COMMENT</a>
<li><a name="TOC18" href="#SEC18">NEWLINE CONVENTION</a> <li><a name="TOC18" href="#SEC18">OPTION SETTING</a>
<li><a name="TOC19" href="#SEC19">WHAT \R MATCHES</a> <li><a name="TOC19" href="#SEC19">NEWLINE CONVENTION</a>
<li><a name="TOC20" href="#SEC20">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a> <li><a name="TOC20" href="#SEC20">WHAT \R MATCHES</a>
<li><a name="TOC21" href="#SEC21">NON-ATOMIC LOOKAROUND ASSERTIONS</a> <li><a name="TOC21" href="#SEC21">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a> <li><a name="TOC22" href="#SEC22">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
<li><a name="TOC23" href="#SEC23">BACKREFERENCES</a> <li><a name="TOC23" href="#SEC23">SCRIPT RUNS</a>
<li><a name="TOC24" href="#SEC24">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a> <li><a name="TOC24" href="#SEC24">BACKREFERENCES</a>
<li><a name="TOC25" href="#SEC25">CONDITIONAL PATTERNS</a> <li><a name="TOC25" href="#SEC25">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a> <li><a name="TOC26" href="#SEC26">CONDITIONAL PATTERNS</a>
<li><a name="TOC27" href="#SEC27">CALLOUTS</a> <li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
<li><a name="TOC28" href="#SEC28">SEE ALSO</a> <li><a name="TOC28" href="#SEC28">CALLOUTS</a>
<li><a name="TOC29" href="#SEC29">AUTHOR</a> <li><a name="TOC29" href="#SEC29">SEE ALSO</a>
<li><a name="TOC30" href="#SEC30">REVISION</a> <li><a name="TOC30" href="#SEC30">AUTHOR</a>
<li><a name="TOC31" href="#SEC31">REVISION</a>
</ul> </ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br> <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P> <P>
@ -205,180 +206,27 @@ matching" rules.
Perl and POSIX space are now the same. Perl added VT to its space character set Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18. at release 5.18.
</P> </P>
<br><a name="SEC7" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br> <br><a name="SEC7" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a><br>
<P> <P>
The following script names and their 4-letter abbreviations are recognized in Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\p and \P, along with their abbreviations, by running this command:
<pre>
pcre2test -LP
</PRE>
</P>
<br><a name="SEC8" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
<P>
Many script names and their 4-letter abbreviations are recognized in
\p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of
course): course). You can obtain a list of these scripts by running this command:
<pre>
pcre2test -LS
</PRE>
</P> </P>
<P> <br><a name="SEC9" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a><br>
Adlam (Adlm),
Ahom (Ahom),
Anatolian_Hieroglyphs (Hluw),
Arabic (Arab),
Armenian (Armn),
Avestan (Avst),
Balinese (Bali),
Bamum (Bamu),
Bassa_Vah (Bass),
Batak (Batk),
Bengali (Beng),
Bhaiksuki (Bhks),
Bopomofo (Bopo),
Brahmi (Brah),
Braille (Brai),
Buginese (Bugi),
Buhid (Buhd),
Canadian_Aboriginal (Cans),
Carian (Cari),
Caucasian_Albanian (Aghb),
Chakma (Cakm),
Cham (Cham),
Cherokee (Cher),
Chorasmian (Chrs),
Common (Zyyy),
Coptic (Copt),
Cuneiform (Xsux),
Cypriot (Cprt),
Cypro_Minoan (Cpmn),
Cyrillic (Cyrl),
Deseret (Dsrt),
Devanagari (Deva),
Dives_Akuru (Diak),
Dogra (Dogr),
Duployan (Dupl),
Egyptian_Hieroglyphs (Egyp),
Elbasan (Elba),
Elymaic (Elym),
Ethiopic (Ethi),
Georgian (Geor),
Glagolitic (Glag),
Gothic (Goth),
Grantha (Gran),
Greek (Grek),
Gujarati (Gujr),
Gunjala_Gondi (Gong),
Gurmukhi (Guru),
Han (Hani),
Hangul (Hang),
Hanifi_Rohingya (Rohg),
Hanunoo (Hano),
Hatran (Hatr),
Hebrew (Hebr),
Hiragana (Hira),
Imperial_Aramaic (Armi),
Inherited (Zinh),
Inscriptional_Pahlavi (Phli),
Inscriptional_Parthian (Prti),
Javanese (Java),
Kaithi (Kthi),
Kannada (Knda),
Katakana (Kana),
Kayah_Li (Kali),
Kharoshthi (Khar),
Khitan_Small_Script (Kits),
Khmer (Khmr),
Khojki (Khoj),
Khudawadi (Sind),
Lao (Laoo),
Latin (Latn),
Lepcha (Lepc),
Limbu (Limb),
Linear_A (Lina),
Linear_B (Linb),
Lisu (Lisu),
Lycian (Lyci),
Lydian (Lydi),
Mahajani (Majh),
Makasar (Maka),
Malayalam (Mlym),
Mandaic (Mand),
Manichaean (Mani),
Marchen (Marc),
Masaram_Gondi (Gonm),
Medefaidrin (Medf),
Meetei_Mayek (Mtei),
Mende_Kikakui (Mend),
Meroitic_Cursive (Merc),
Meroitic_Hieroglyphs (Mero),
Miao (Miao),
Modi (Modi),
Mongolian (Mong),
Mro (Mroo),
Multani (Mult),
Myanmar (Mymr),
Nabataean (Nbar),
Nandinagari (Nand),
New_Tai_Lue (Talu),
Newa (Newa),
Nko (Nkoo),
Nushu (Nshu),
Nyiakeng_Puachue_Hmong (Hmnp),
Ogham (Ogam),
Ol_Chiki (Olck),
Old_Hungarian (Hung),
Old_Italic (Olck),
Old_North_Arabian (Narb),
Old_Permic (Perm),
Old_Persian (Orkh),
Old_Sogdian (Sogo),
Old_South_Arabian (Sarb),
Old_Turkic (Orkh),
Old_Uyghur (Ougr),
Oriya (Orya),
Osage (Osge),
Osmanya (Osma),
Pahawh_Hmong (Hmng),
Palmyrene (Palm),
Pau_Cin_Hau (Pauc),
Phags_Pa (Phag),
Phoenician (Phnx),
Psalter_Pahlavi (Phli),
Rejang (Rjng),
Runic (Runr),
Samaritan (Samr),
Saurashtra (Saur),
Sharada (Shrd),
Shavian (Shaw),
Siddham (Sidd),
SignWriting (Sgnw),
Sinhala (Sinh),
Sogdian (Sogd),
Sora_Sompeng (Sora),
Soyombo (Soyo),
Sundanese (Sund),
Syloti_Nagri (Sylo),
Syriac (Syrc),
Tagalog (Tglg),
Tagbanwa (Tagb),
Tai_Le (Tale),
Tai_Tham (Lana),
Tai_Viet (Tavt),
Takri (Takr),
Tamil (Taml),
Tangsa (Tngs),
Tangut (Tang),
Telugu (Telu),
Thaana (Thaa),
Thai (Thai),
Tibetan (Tibt),
Tifinagh (Tfng),
Tirhuta (Tirh),
Toto (Toto),
Ugaritic (Ugar),
Vai (Vaii),
Vithkuqi (Vith),
Wancho (Wcho),
Warang_Citi (Wara),
Yezidi (Yezi),
Yi (Yiii),
Zanabazar_Square (Zanb).
</P>
<br><a name="SEC8" href="#TOC1">BIDI_PROPERTIES FOR \p AND \P</a><br>
<P> <P>
<pre> <pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class \p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class \p{BC:&#60;class&#62;} matches a character with the given class
</pre> </pre>
@ -409,7 +257,7 @@ The recognized classes are:
WS white space WS white space
</PRE> </PRE>
</P> </P>
<br><a name="SEC9" href="#TOC1">CHARACTER CLASSES</a><br> <br><a name="SEC10" href="#TOC1">CHARACTER CLASSES</a><br>
<P> <P>
<pre> <pre>
[...] positive character class [...] positive character class
@ -437,7 +285,7 @@ In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class. \Q...\E inside a character class.
</P> </P>
<br><a name="SEC10" href="#TOC1">QUANTIFIERS</a><br> <br><a name="SEC11" href="#TOC1">QUANTIFIERS</a><br>
<P> <P>
<pre> <pre>
? 0 or 1, greedy ? 0 or 1, greedy
@ -458,7 +306,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
{n,}? n or more, lazy {n,}? n or more, lazy
</PRE> </PRE>
</P> </P>
<br><a name="SEC11" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br> <br><a name="SEC12" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P> <P>
<pre> <pre>
\b word boundary \b word boundary
@ -476,7 +324,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\G first matching position in subject \G first matching position in subject
</PRE> </PRE>
</P> </P>
<br><a name="SEC12" href="#TOC1">REPORTED MATCH POINT SETTING</a><br> <br><a name="SEC13" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<P> <P>
<pre> <pre>
\K set reported start of match \K set reported start of match
@ -486,13 +334,13 @@ for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
option is set, the previous behaviour is re-enabled. When this option is set, option is set, the previous behaviour is re-enabled. When this option is set,
\K is honoured in positive assertions, but ignored in negative ones. \K is honoured in positive assertions, but ignored in negative ones.
</P> </P>
<br><a name="SEC13" href="#TOC1">ALTERNATION</a><br> <br><a name="SEC14" href="#TOC1">ALTERNATION</a><br>
<P> <P>
<pre> <pre>
expr|expr|expr... expr|expr|expr...
</PRE> </PRE>
</P> </P>
<br><a name="SEC14" href="#TOC1">CAPTURING</a><br> <br><a name="SEC15" href="#TOC1">CAPTURING</a><br>
<P> <P>
<pre> <pre>
(...) capture group (...) capture group
@ -507,20 +355,20 @@ In non-UTF modes, names may contain underscores and ASCII letters and digits;
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
both cases, a name must not start with a digit. both cases, a name must not start with a digit.
</P> </P>
<br><a name="SEC15" href="#TOC1">ATOMIC GROUPS</a><br> <br><a name="SEC16" href="#TOC1">ATOMIC GROUPS</a><br>
<P> <P>
<pre> <pre>
(?&#62;...) atomic non-capture group (?&#62;...) atomic non-capture group
(*atomic:...) atomic non-capture group (*atomic:...) atomic non-capture group
</PRE> </PRE>
</P> </P>
<br><a name="SEC16" href="#TOC1">COMMENT</a><br> <br><a name="SEC17" href="#TOC1">COMMENT</a><br>
<P> <P>
<pre> <pre>
(?#....) comment (not nestable) (?#....) comment (not nestable)
</PRE> </PRE>
</P> </P>
<br><a name="SEC17" href="#TOC1">OPTION SETTING</a><br> <br><a name="SEC18" href="#TOC1">OPTION SETTING</a><br>
<P> <P>
Changes of these options within a group are automatically cancelled at the end Changes of these options within a group are automatically cancelled at the end
of the group. of the group.
@ -565,7 +413,7 @@ not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
application can lock out the use of (*UTF) and (*UCP) by setting the application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time. PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
</P> </P>
<br><a name="SEC18" href="#TOC1">NEWLINE CONVENTION</a><br> <br><a name="SEC19" href="#TOC1">NEWLINE CONVENTION</a><br>
<P> <P>
These are recognized only at the very start of the pattern or after option These are recognized only at the very start of the pattern or after option
settings with a similar syntax. settings with a similar syntax.
@ -578,7 +426,7 @@ settings with a similar syntax.
(*NUL) the NUL character (binary zero) (*NUL) the NUL character (binary zero)
</PRE> </PRE>
</P> </P>
<br><a name="SEC19" href="#TOC1">WHAT \R MATCHES</a><br> <br><a name="SEC20" href="#TOC1">WHAT \R MATCHES</a><br>
<P> <P>
These are recognized only at the very start of the pattern or after option These are recognized only at the very start of the pattern or after option
setting with a similar syntax. setting with a similar syntax.
@ -587,7 +435,7 @@ setting with a similar syntax.
(*BSR_UNICODE) any Unicode newline sequence (*BSR_UNICODE) any Unicode newline sequence
</PRE> </PRE>
</P> </P>
<br><a name="SEC20" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br> <br><a name="SEC21" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P> <P>
<pre> <pre>
(?=...) ) (?=...) )
@ -608,7 +456,7 @@ setting with a similar syntax.
</pre> </pre>
Each top-level branch of a lookbehind must be of a fixed length. Each top-level branch of a lookbehind must be of a fixed length.
</P> </P>
<br><a name="SEC21" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br> <br><a name="SEC22" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P> <P>
These assertions are specific to PCRE2 and are not Perl-compatible. These assertions are specific to PCRE2 and are not Perl-compatible.
<pre> <pre>
@ -621,7 +469,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(*non_atomic_positive_lookbehind:...) ) (*non_atomic_positive_lookbehind:...) )
</PRE> </PRE>
</P> </P>
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br> <br><a name="SEC23" href="#TOC1">SCRIPT RUNS</a><br>
<P> <P>
<pre> <pre>
(*script_run:...) ) script run, can be backtracked into (*script_run:...) ) script run, can be backtracked into
@ -631,7 +479,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(*asr:...) ) (*asr:...) )
</PRE> </PRE>
</P> </P>
<br><a name="SEC23" href="#TOC1">BACKREFERENCES</a><br> <br><a name="SEC24" href="#TOC1">BACKREFERENCES</a><br>
<P> <P>
<pre> <pre>
\n reference by number (can be ambiguous) \n reference by number (can be ambiguous)
@ -648,7 +496,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
(?P=name) reference by name (Python) (?P=name) reference by name (Python)
</PRE> </PRE>
</P> </P>
<br><a name="SEC24" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br> <br><a name="SEC25" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P> <P>
<pre> <pre>
(?R) recurse whole pattern (?R) recurse whole pattern
@ -667,7 +515,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
\g'-n' call subroutine by relative number (PCRE2 extension) \g'-n' call subroutine by relative number (PCRE2 extension)
</PRE> </PRE>
</P> </P>
<br><a name="SEC25" href="#TOC1">CONDITIONAL PATTERNS</a><br> <br><a name="SEC26" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P> <P>
<pre> <pre>
(?(condition)yes-pattern) (?(condition)yes-pattern)
@ -690,7 +538,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists. condition if the relevant named group exists.
</P> </P>
<br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br> <br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P> <P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@ -717,7 +565,7 @@ pattern is not anchored.
The effect of one of these verbs in a group called as a subroutine is confined The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call. to the subroutine call.
</P> </P>
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br> <br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
<P> <P>
<pre> <pre>
(?C) callout (assumed number 0) (?C) callout (assumed number 0)
@ -728,12 +576,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it. delimiter }. To encode the ending delimiter within the string, double it.
</P> </P>
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3). <b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P> </P>
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br> <br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
@ -742,11 +590,11 @@ Retired from University Computing Service
Cambridge, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC31" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 December 2021 Last updated: 12 January 2022
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2022 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -253,7 +253,19 @@ available, and the use of JIT for matching is verified.
<b>-LM</b> <b>-LM</b>
List modifiers: write a list of available pattern and subject modifiers to the List modifiers: write a list of available pattern and subject modifiers to the
standard output, then exit with zero exit code. All other options are ignored. standard output, then exit with zero exit code. All other options are ignored.
If both -C and -LM are present, whichever is first is recognized. If both -C and any -Lx options are present, whichever is first is recognized.
</P>
<P>
<b>-LP</b>
List properties: write a list of recognized Unicode properties to the standard
output, then exit with zero exit code. All other options are ignored. If both
-C and any -Lx options are present, whichever is first is recognized.
</P>
<P>
<b>-LS</b>
List scripts: write a list of recognized Unicode script names to the standard
output, then exit with zero exit code. All other options are ignored. If both
-C and any -Lx options are present, whichever is first is recognized.
</P> </P>
<P> <P>
<b>-pattern</b> <i>modifier-list</i> <b>-pattern</b> <i>modifier-list</i>
@ -2129,9 +2141,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 November 2021 Last updated: 12 January 2022
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2022 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -6889,8 +6889,16 @@ BACKSLASH
ters whose code points are less than U+0100 and U+10000, respectively. ters whose code points are less than U+0100 and U+10000, respectively.
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
limit) may be encountered. These are all treated as being in the Un- limit) may be encountered. These are all treated as being in the Un-
known script and with an unassigned type. The extra escape sequences known script and with an unassigned type.
are:
Matching characters by Unicode property is not fast, because PCRE2 has
to do a multistage table lookup in order to find a character's prop-
erty. That is why the traditional escape sequences such as \d and \w do
not use Unicode properties in PCRE2 by default, though you can make
them do so by setting the PCRE2_UCP option or by starting the pattern
with (*UCP).
The extra escape sequences that provide property support are:
\p{xx} a character with the xx property \p{xx} a character with the xx property
\P{xx} a character without the xx property \P{xx} a character without the xx property
@ -6900,71 +6908,36 @@ BACKSLASH
in accordance with Unicode's "loose matching" rules, spaces, hyphens, in accordance with Unicode's "loose matching" rules, spaces, hyphens,
and underscores are ignored. There is support for Unicode script names, and underscores are ignored. There is support for Unicode script names,
Unicode general category properties, "Any", which matches any character Unicode general category properties, "Any", which matches any character
(including newline), Bidi_Control, Bidi_Class, and some special PCRE2 (including newline), Bidi_Class, a number of binary (yes/no) proper-
properties (described below). Other Perl properties such as "InMusi- ties, and some special PCRE2 properties (described below). Certain
calSymbols" are not supported by PCRE2. Note that \P{Any} does not other Perl properties such as "InMusicalSymbols" are not supported by
match any characters, so always causes a match failure. PCRE2. Note that \P{Any} does not match any characters, so always
causes a match failure.
Script properties for \p and \P
There are three different syntax forms for matching a script. Each Uni- There are three different syntax forms for matching a script. Each Uni-
code character has a basic script and, optionally, a list of other code character has a basic script and, optionally, a list of other
scripts ("Script Extentions") with which it is commonly used. Using the scripts ("Script Extensions") with which it is commonly used. Using the
Adlam script as an example, \p{sc:Adlam} matches characters whose basic Adlam script as an example, \p{sc:Adlam} matches characters whose basic
script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
that have Adlam in their extensions list. The full names "script" and that have Adlam in their extensions list. The full names "script" and
"script extensions" for the property types are recognized, and a equals "script extensions" for the property types are recognized, and a equals
sign is an alternative to the colon. If a script name is given without sign is an alternative to the colon. If a script name is given without
a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad- a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
lam}. Perl changed to this interpretation at release 5.26 and PCRE2 lam}. Perl changed to this interpretation at release 5.26 and PCRE2
changed at release 10.40. changed at release 10.40.
Unassigned characters (and in non-UTF 32-bit mode, characters with code Unassigned characters (and in non-UTF 32-bit mode, characters with code
points greater than 0x10FFFF) are assigned the "Unknown" script. Others points greater than 0x10FFFF) are assigned the "Unknown" script. Others
that are not part of an identified script are lumped together as "Com- that are not part of an identified script are lumped together as "Com-
mon". The current list of script names and their 4-letter abbreviations mon". The current list of recognized script names and their 4-character
is: abbreviations can be obtained by running this command:
Adlam (Adlm), Ahom (Ahom), Anatolian_Hieroglyphs (Hluw), Arabic (Arab), pcre2test -LS
Armenian (Armn), Avestan (Avst), Balinese (Bali), Bamum (Bamu),
Bassa_Vah (Bass), Batak (Batk), Bengali (Beng), Bhaiksuki (Bhks), Bopo-
mofo (Bopo), Brahmi (Brah), Braille (Brai), Buginese (Bugi), Buhid The general category property for \p and \P
(Buhd), Canadian_Aboriginal (Cans), Carian (Cari), Caucasian_Albanian
(Aghb), Chakma (Cakm), Cham (Cham), Cherokee (Cher), Chorasmian (Chrs),
Common (Zyyy), Coptic (Copt), Cuneiform (Xsux), Cypriot (Cprt),
Cypro_Minoan (Cpmn), Cyrillic (Cyrl), Deseret (Dsrt), Devanagari
(Deva), Dives_Akuru (Diak), Dogra (Dogr), Duployan (Dupl), Egyptian_Hi-
eroglyphs (Egyp), Elbasan (Elba), Elymaic (Elym), Ethiopic (Ethi),
Georgian (Geor), Glagolitic (Glag), Gothic (Goth), Grantha (Gran),
Greek (Grek), Gujarati (Gujr), Gunjala_Gondi (Gong), Gurmukhi (Guru),
Han (Hani), Hangul (Hang), Hanifi_Rohingya (Rohg), Hanunoo (Hano), Ha-
tran (Hatr), Hebrew (Hebr), Hiragana (Hira), Imperial_Aramaic (Armi),
Inherited (Zinh), Inscriptional_Pahlavi (Phli), Inscriptional_Parthian
(Prti), Javanese (Java), Kaithi (Kthi), Kannada (Knda), Katakana
(Kana), Kayah_Li (Kali), Kharoshthi (Khar), Khitan_Small_Script (Kits),
Khmer (Khmr), Khojki (Khoj), Khudawadi (Sind), Lao (Laoo), Latin
(Latn), Lepcha (Lepc), Limbu (Limb), Linear_A (Lina), Linear_B (Linb),
Lisu (Lisu), Lycian (Lyci), Lydian (Lydi), Mahajani (Majh), Makasar
(Maka), Malayalam (Mlym), Mandaic (Mand), Manichaean (Mani), Marchen
(Marc), Masaram_Gondi (Gonm), Medefaidrin (Medf), Meetei_Mayek (Mtei),
Mende_Kikakui (Mend), Meroitic_Cursive (Merc), Meroitic_Hieroglyphs
(Mero), Miao (Miao), Modi (Modi), Mongolian (Mong), Mro (Mroo), Multani
(Mult), Myanmar (Mymr), Nabataean (Nbar), Nandinagari (Nand),
New_Tai_Lue (Talu), Newa (Newa), Nko (Nkoo), Nushu (Nshu), Nyiak-
eng_Puachue_Hmong (Hmnp), Ogham (Ogam), Ol_Chiki (Olck), Old_Hungarian
(Hung), Old_Italic (Olck), Old_North_Arabian (Narb), Old_Permic (Perm),
Old_Persian (Orkh), Old_Sogdian (Sogo), Old_South_Arabian (Sarb),
Old_Turkic (Orkh), Old_Uyghur (Ougr), Oriya (Orya), Osage (Osge), Os-
manya (Osma), Pahawh_Hmong (Hmng), Palmyrene (Palm), Pau_Cin_Hau
(Pauc), Phags_Pa (Phag), Phoenician (Phnx), Psalter_Pahlavi (Phli), Re-
jang (Rjng), Runic (Runr), Samaritan (Samr), Saurashtra (Saur), Sharada
(Shrd), Shavian (Shaw), Siddham (Sidd), SignWriting (Sgnw), Sinhala
(Sinh), Sogdian (Sogd), Sora_Sompeng (Sora), Soyombo (Soyo), Sundanese
(Sund), Syloti_Nagri (Sylo), Syriac (Syrc), Tagalog (Tglg), Tagbanwa
(Tagb), Tai_Le (Tale), Tai_Tham (Lana), Tai_Viet (Tavt), Takri (Takr),
Tamil (Taml), Tangsa (Tngs), Tangut (Tang), Telugu (Telu), Thaana
(Thaa), Thai (Thai), Tibetan (Tibt), Tifinagh (Tfng), Tirhuta (Tirh),
Toto (Toto), Ugaritic (Ugar), Vai (Vaii), Vithkuqi (Vith), Wancho
(Wcho), Warang_Citi (Wara), Yezidi (Yezi), Yi (Yiii), Zanabazar_Square
(Zanb).
Each character has exactly one Unicode general category property, spec- Each character has exactly one Unicode general category property, spec-
ified by a two-letter abbreviation. For compatibility with Perl, nega- ified by a two-letter abbreviation. For compatibility with Perl, nega-
@ -7050,20 +7023,18 @@ BACKSLASH
For example, \p{Lu} always matches only upper case letters. This is For example, \p{Lu} always matches only upper case letters. This is
different from the behaviour of current versions of Perl. different from the behaviour of current versions of Perl.
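To see which general category a character has, and why \p{Lu} cannot match a lower case letter, the categories can be inspected with Python's standard unicodedata module. This is a sketch outside PCRE2: the helper has_general_category is hypothetical, and it assumes that a single-letter name such as L stands for every category beginning with that letter.

```python
import unicodedata

def has_general_category(ch: str, prop: str) -> bool:
    """Return True if ch's general category is prop (e.g. "Lu"), or starts
    with prop when prop is a single letter (e.g. "L"). Hypothetical helper;
    prop is expected in its usual capitalisation."""
    return unicodedata.category(ch).startswith(prop)

# Each character has exactly one general category.
for ch in ("A", "a", "5"):
    print(repr(ch), unicodedata.category(ch))

# "a" is Ll, not Lu, so a test for Lu fails for it whatever the case
# sensitivity of the surrounding pattern.
print(has_general_category("a", "Lu"), has_general_category("a", "L"))
```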
Matching characters by Unicode property is not fast, because PCRE2 has Binary (yes/no) properties for \p and \P
to do a multistage table lookup in order to find a character's prop-
erty. That is why the traditional escape sequences such as \d and \w do
not use Unicode properties in PCRE2 by default, though you can make
them do so by setting the PCRE2_UCP option or by starting the pattern
with (*UCP).
Bi-directional properties for \p and \P Unicode defines a number of binary properties, that is, properties
whose only values are true or false. You can obtain a list of those
that are recognized by \p and \P, along with their abbreviations, by
running this command:
Two properties relating to bi-directional text (each with a shorter pcre2test -LP
synonym) are supported:
The Bidi_Class property for \p and \P
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:<class>} matches a character with the given class \p{Bidi_Class:<class>} matches a character with the given class
\p{BC:<class>} matches a character with the given class \p{BC:<class>} matches a character with the given class
@ -7093,9 +7064,8 @@ BACKSLASH
S segment separator S segment separator
WS which space WS which space
For Bidi_Class, an equals sign may be used instead of a colon. The An equals sign may be used instead of a colon. The class names are
class names are case-insensitive; only the short names listed above are case-insensitive; only the short names listed above are recognized.
recognized.
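The short class names used with Bidi_Class are the standard Unicode bidirectional class abbreviations, which can be looked up per character with Python's unicodedata module. The sketch below is an illustration outside PCRE2. The helper has_bidi_class is hypothetical, and it upper-cases its argument to mirror the case-insensitive class names described above.

```python
import unicodedata

def has_bidi_class(ch: str, cls: str) -> bool:
    """Return True if ch's Unicode bidirectional class is cls.
    The comparison ignores case. (Hypothetical helper.)"""
    return unicodedata.bidirectional(ch) == cls.upper()

samples = {
    "A": "Latin capital letter A",
    "\u05d0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    " ": "space",
}
for ch, description in samples.items():
    print(description, unicodedata.bidirectional(ch))

print(has_bidi_class(" ", "ws"))
```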
Extended grapheme clusters Extended grapheme clusters
@ -9725,8 +9695,8 @@ AUTHOR
REVISION REVISION
Last updated: 28 December 2021 Last updated: 12 January 2022
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2022 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -10739,60 +10709,28 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
acter set at release 5.18. acter set at release 5.18.
BINARY PROPERTIES FOR \p AND \P
Unicode defines a number of binary properties, that is, properties
whose only values are true or false. You can obtain a list of those
that are recognized by \p and \P, along with their abbreviations, by
running this command:
pcre2test -LP
SCRIPT MATCHING WITH \p AND \P SCRIPT MATCHING WITH \p AND \P
The following script names and their 4-letter abbreviations are recog- Many script names and their 4-letter abbreviations are recognized in
nized in \p{sc:...} or \p{scx:...} items, or on their own with \p (and \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P
also \P of course): of course). You can obtain a list of these scripts by running this com-
mand:
Adlam (Adlm), Ahom (Ahom), Anatolian_Hieroglyphs (Hluw), Arabic (Arab), pcre2test -LS
Armenian (Armn), Avestan (Avst), Balinese (Bali), Bamum (Bamu),
Bassa_Vah (Bass), Batak (Batk), Bengali (Beng), Bhaiksuki (Bhks), Bopo-
mofo (Bopo), Brahmi (Brah), Braille (Brai), Buginese (Bugi), Buhid
(Buhd), Canadian_Aboriginal (Cans), Carian (Cari), Caucasian_Albanian
(Aghb), Chakma (Cakm), Cham (Cham), Cherokee (Cher), Chorasmian (Chrs),
Common (Zyyy), Coptic (Copt), Cuneiform (Xsux), Cypriot (Cprt),
Cypro_Minoan (Cpmn), Cyrillic (Cyrl), Deseret (Dsrt), Devanagari
(Deva), Dives_Akuru (Diak), Dogra (Dogr), Duployan (Dupl), Egyptian_Hi-
eroglyphs (Egyp), Elbasan (Elba), Elymaic (Elym), Ethiopic (Ethi),
Georgian (Geor), Glagolitic (Glag), Gothic (Goth), Grantha (Gran),
Greek (Grek), Gujarati (Gujr), Gunjala_Gondi (Gong), Gurmukhi (Guru),
Han (Hani), Hangul (Hang), Hanifi_Rohingya (Rohg), Hanunoo (Hano), Ha-
tran (Hatr), Hebrew (Hebr), Hiragana (Hira), Imperial_Aramaic (Armi),
Inherited (Zinh), Inscriptional_Pahlavi (Phli), Inscriptional_Parthian
(Prti), Javanese (Java), Kaithi (Kthi), Kannada (Knda), Katakana
(Kana), Kayah_Li (Kali), Kharoshthi (Khar), Khitan_Small_Script (Kits),
Khmer (Khmr), Khojki (Khoj), Khudawadi (Sind), Lao (Laoo), Latin
(Latn), Lepcha (Lepc), Limbu (Limb), Linear_A (Lina), Linear_B (Linb),
Lisu (Lisu), Lycian (Lyci), Lydian (Lydi), Mahajani (Majh), Makasar
(Maka), Malayalam (Mlym), Mandaic (Mand), Manichaean (Mani), Marchen
(Marc), Masaram_Gondi (Gonm), Medefaidrin (Medf), Meetei_Mayek (Mtei),
Mende_Kikakui (Mend), Meroitic_Cursive (Merc), Meroitic_Hieroglyphs
(Mero), Miao (Miao), Modi (Modi), Mongolian (Mong), Mro (Mroo), Multani
(Mult), Myanmar (Mymr), Nabataean (Nbar), Nandinagari (Nand),
New_Tai_Lue (Talu), Newa (Newa), Nko (Nkoo), Nushu (Nshu), Nyiak-
eng_Puachue_Hmong (Hmnp), Ogham (Ogam), Ol_Chiki (Olck), Old_Hungarian
(Hung), Old_Italic (Olck), Old_North_Arabian (Narb), Old_Permic (Perm),
Old_Persian (Orkh), Old_Sogdian (Sogo), Old_South_Arabian (Sarb),
Old_Turkic (Orkh), Old_Uyghur (Ougr), Oriya (Orya), Osage (Osge), Os-
manya (Osma), Pahawh_Hmong (Hmng), Palmyrene (Palm), Pau_Cin_Hau
(Pauc), Phags_Pa (Phag), Phoenician (Phnx), Psalter_Pahlavi (Phli), Re-
jang (Rjng), Runic (Runr), Samaritan (Samr), Saurashtra (Saur), Sharada
(Shrd), Shavian (Shaw), Siddham (Sidd), SignWriting (Sgnw), Sinhala
(Sinh), Sogdian (Sogd), Sora_Sompeng (Sora), Soyombo (Soyo), Sundanese
(Sund), Syloti_Nagri (Sylo), Syriac (Syrc), Tagalog (Tglg), Tagbanwa
(Tagb), Tai_Le (Tale), Tai_Tham (Lana), Tai_Viet (Tavt), Takri (Takr),
Tamil (Taml), Tangsa (Tngs), Tangut (Tang), Telugu (Telu), Thaana
(Thaa), Thai (Thai), Tibetan (Tibt), Tifinagh (Tfng), Tirhuta (Tirh),
Toto (Toto), Ugaritic (Ugar), Vai (Vaii), Vithkuqi (Vith), Wancho
(Wcho), Warang_Citi (Wara), Yezidi (Yezi), Yi (Yiii), Zanabazar_Square
(Zanb).
BIDI_PROPERTIES FOR \p AND \P THE BIDI_CLASS PROPERTY FOR \p AND \P
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:<class>} matches a character with the given class \p{Bidi_Class:<class>} matches a character with the given class
\p{BC:<class>} matches a character with the given class \p{BC:<class>} matches a character with the given class
@ -10846,8 +10784,8 @@ CHARACTER CLASSES
word same as \w word same as \w
xdigit hexadecimal digit xdigit hexadecimal digit
In PCRE2, POSIX character set names recognize only ASCII characters by In PCRE2, POSIX character set names recognize only ASCII characters by
default, but some of them use Unicode properties if PCRE2_UCP is set. default, but some of them use Unicode properties if PCRE2_UCP is set.
You can use \Q...\E inside a character class. You can use \Q...\E inside a character class.
@ -10892,8 +10830,8 @@ REPORTED MATCH POINT SETTING
\K set reported start of match \K set reported start of match
From release 10.38 \K is not permitted by default in lookaround asser- From release 10.38 \K is not permitted by default in lookaround asser-
tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL- tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL-
LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled. LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
When this option is set, \K is honoured in positive assertions, but ig- When this option is set, \K is honoured in positive assertions, but ig-
nored in negative ones. nored in negative ones.
@ -10914,8 +10852,8 @@ CAPTURING
(?|...) non-capture group; reset group numbers for (?|...) non-capture group; reset group numbers for
capture groups in each alternative capture groups in each alternative
In non-UTF modes, names may contain underscores and ASCII letters and In non-UTF modes, names may contain underscores and ASCII letters and
digits; in UTF modes, any Unicode letters and Unicode decimal digits digits; in UTF modes, any Unicode letters and Unicode decimal digits
are permitted. In both cases, a name must not start with a digit. are permitted. In both cases, a name must not start with a digit.
@ -10931,7 +10869,7 @@ COMMENT
OPTION SETTING OPTION SETTING
Changes of these options within a group are automatically cancelled at Changes of these options within a group are automatically cancelled at
the end of the group. the end of the group.
(?i) caseless (?i) caseless
@ -10945,7 +10883,7 @@ OPTION SETTING
(?-...) unset option(s) (?-...) unset option(s)
(?^) unset imnsx options (?^) unset imnsx options
Unsetting x or xx unsets both. Several options may be set at once, and Unsetting x or xx unsets both. Several options may be set at once, and
a mixture of setting and unsetting such as (?i-x) is allowed, but there a mixture of setting and unsetting such as (?i-x) is allowed, but there
may be only one hyphen. Setting (but no unsetting) is allowed after (?^ may be only one hyphen. Setting (but no unsetting) is allowed after (?^
for example (?^in). An option setting may appear at the start of a non- for example (?^in). An option setting may appear at the start of a non-
@ -10967,11 +10905,11 @@ OPTION SETTING
(*UTF) set appropriate UTF mode for the library in use (*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc) (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
value of the limits set by the caller of pcre2_match() or value of the limits set by the caller of pcre2_match() or
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF) synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
respectively, at compile time. respectively, at compile time.
@ -11092,16 +11030,16 @@ CONDITIONAL PATTERNS
(?(VERSION[>]=n.m) test PCRE2 version (?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition (?(assert) assertion condition
Note the ambiguity of (?(R) and (?(Rn) which might be named reference Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a conditions or recursion tests. Such a condition is interpreted as a
reference condition if the relevant named group exists. reference condition if the relevant named group exists.
BACKTRACKING CONTROL BACKTRACKING CONTROL
All backtracking control verbs may be in the form (*VERB:NAME). For All backtracking control verbs may be in the form (*VERB:NAME). For
(*MARK) the name is mandatory, for the others it is optional. (*SKIP) (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
changes its behaviour if :NAME is present. The others just set a name changes its behaviour if :NAME is present. The others just set a name
for passing back to the caller, but this is not a name that (*SKIP) can for passing back to the caller, but this is not a name that (*SKIP) can
see. The following act immediately they are reached: see. The following act immediately they are reached:
@ -11109,7 +11047,7 @@ BACKTRACKING CONTROL
(*FAIL) force backtrack; synonym (*F) (*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME) (*MARK:NAME) set name to be passed back; synonym (*:NAME)
The following act only when a subsequent match failure causes a back- The following act only when a subsequent match failure causes a back-
track to reach them. They all force a match failure, but they differ in track to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored. so only if the pattern is not anchored.
@ -11121,7 +11059,7 @@ BACKTRACKING CONTROL
(*MARK:NAME); if not found, the (*SKIP) is ignored (*MARK:NAME); if not found, the (*SKIP) is ignored
(*THEN) local failure, backtrack to next alternation (*THEN) local failure, backtrack to next alternation
The effect of one of these verbs in a group called as a subroutine is The effect of one of these verbs in a group called as a subroutine is
confined to the subroutine call. confined to the subroutine call.
@ -11132,14 +11070,14 @@ CALLOUTS
(?C"text") callout with string data (?C"text") callout with string data
The allowed string delimiters are ` ' " ^ % # $ (which are the same for The allowed string delimiters are ` ' " ^ % # $ (which are the same for
the start and the end), and the starting delimiter { matched with the the start and the end), and the starting delimiter { matched with the
ending delimiter }. To encode the ending delimiter within the string, ending delimiter }. To encode the ending delimiter within the string,
double it. double it.
SEE ALSO SEE ALSO
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2(3). pcre2(3).
@ -11152,8 +11090,8 @@ AUTHOR
REVISION REVISION
Last updated: 28 December 2021 Last updated: 12 January 2022
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2022 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "28 December 2021" "PCRE2 10.40" .TH PCRE2PATTERN 3 "12 January 2022" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -772,8 +772,15 @@ can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
sequences are of course limited to testing characters whose code points are sequences are of course limited to testing characters whose code points are
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
greater than 0x10ffff (the Unicode limit) may be encountered. These are all greater than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Unknown script and with an unassigned type. The extra treated as being in the Unknown script and with an unassigned type.
escape sequences are: .P
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \ed and \ew do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
.P
The extra escape sequences that provide property support are:
.sp .sp
\ep{\fIxx\fP} a character with the \fIxx\fP property \ep{\fIxx\fP} a character with the \fIxx\fP property
\eP{\fIxx\fP} a character without the \fIxx\fP property \eP{\fIxx\fP} a character without the \fIxx\fP property
@ -783,19 +790,24 @@ The property names represented by \fIxx\fP above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including general category properties, "Any", which matches any character (including
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties newline), Bidi_Class, a number of binary (yes/no) properties, and some special
(described PCRE2 properties (described
.\" HTML <a href="#extraprops"> .\" HTML <a href="#extraprops">
.\" </a> .\" </a>
below). below).
.\" .\"
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2. Certain other Perl properties such as "InMusicalSymbols" are not supported by
Note that \eP{Any} does not match any characters, so always causes a match PCRE2. Note that \eP{Any} does not match any characters, so always causes a
failure. match failure.
.P .
.
.
.SS "Script properties for \ep and \eP"
.rs
.sp
There are three different syntax forms for matching a script. Each Unicode There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script character has a basic script and, optionally, a list of other scripts ("Script
Extentions") with which it is commonly used. Using the Adlam script as an Extensions") with which it is commonly used. Using the Adlam script as an
example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
\ep{scx:Adlam} matches, in addition, characters that have Adlam in their \ep{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the extensions list. The full names "script" and "script extensions" for the
@ -806,171 +818,18 @@ interpretation at release 5.26 and PCRE2 changed at release 10.40.
.P .P
Unassigned characters (and in non-UTF 32-bit mode, characters with code points Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list part of an identified script are lumped together as "Common". The current list
of script names and their 4-letter abbreviations is: of recognized script names and their 4-character abbreviations can be obtained
.P by running this command:
Adlam (Adlm), .sp
Ahom (Ahom), pcre2test -LS
Anatolian_Hieroglyphs (Hluw), .sp
Arabic (Arab), .
Armenian (Armn), .
Avestan (Avst), .
Balinese (Bali), .SS "The general category property for \ep and \eP"
Bamum (Bamu), .rs
Bassa_Vah (Bass), .sp
Batak (Batk),
Bengali (Beng),
Bhaiksuki (Bhks),
Bopomofo (Bopo),
Brahmi (Brah),
Braille (Brai),
Buginese (Bugi),
Buhid (Buhd),
Canadian_Aboriginal (Cans),
Carian (Cari),
Caucasian_Albanian (Aghb),
Chakma (Cakm),
Cham (Cham),
Cherokee (Cher),
Chorasmian (Chrs),
Common (Zyyy),
Coptic (Copt),
Cuneiform (Xsux),
Cypriot (Cprt),
Cypro_Minoan (Cpmn),
Cyrillic (Cyrl),
Deseret (Dsrt),
Devanagari (Deva),
Dives_Akuru (Diak),
Dogra (Dogr),
Duployan (Dupl),
Egyptian_Hieroglyphs (Egyp),
Elbasan (Elba),
Elymaic (Elym),
Ethiopic (Ethi),
Georgian (Geor),
Glagolitic (Glag),
Gothic (Goth),
Grantha (Gran),
Greek (Grek),
Gujarati (Gujr),
Gunjala_Gondi (Gong),
Gurmukhi (Guru),
Han (Hani),
Hangul (Hang),
Hanifi_Rohingya (Rohg),
Hanunoo (Hano),
Hatran (Hatr),
Hebrew (Hebr),
Hiragana (Hira),
Imperial_Aramaic (Armi),
Inherited (Zinh),
Inscriptional_Pahlavi (Phli),
Inscriptional_Parthian (Prti),
Javanese (Java),
Kaithi (Kthi),
Kannada (Knda),
Katakana (Kana),
Kayah_Li (Kali),
Kharoshthi (Khar),
Khitan_Small_Script (Kits),
Khmer (Khmr),
Khojki (Khoj),
Khudawadi (Sind),
Lao (Laoo),
Latin (Latn),
Lepcha (Lepc),
Limbu (Limb),
Linear_A (Lina),
Linear_B (Linb),
Lisu (Lisu),
Lycian (Lyci),
Lydian (Lydi),
Mahajani (Majh),
Makasar (Maka),
Malayalam (Mlym),
Mandaic (Mand),
Manichaean (Mani),
Marchen (Marc),
Masaram_Gondi (Gonm),
Medefaidrin (Medf),
Meetei_Mayek (Mtei),
Mende_Kikakui (Mend),
Meroitic_Cursive (Merc),
Meroitic_Hieroglyphs (Mero),
Miao (Miao),
Modi (Modi),
Mongolian (Mong),
Mro (Mroo),
Multani (Mult),
Myanmar (Mymr),
Nabataean (Nbar),
Nandinagari (Nand),
New_Tai_Lue (Talu),
Newa (Newa),
Nko (Nkoo),
Nushu (Nshu),
Nyiakeng_Puachue_Hmong (Hmnp),
Ogham (Ogam),
Ol_Chiki (Olck),
Old_Hungarian (Hung),
Old_Italic (Olck),
Old_North_Arabian (Narb),
Old_Permic (Perm),
Old_Persian (Orkh),
Old_Sogdian (Sogo),
Old_South_Arabian (Sarb),
Old_Turkic (Orkh),
Old_Uyghur (Ougr),
Oriya (Orya),
Osage (Osge),
Osmanya (Osma),
Pahawh_Hmong (Hmng),
Palmyrene (Palm),
Pau_Cin_Hau (Pauc),
Phags_Pa (Phag),
Phoenician (Phnx),
Psalter_Pahlavi (Phli),
Rejang (Rjng),
Runic (Runr),
Samaritan (Samr),
Saurashtra (Saur),
Sharada (Shrd),
Shavian (Shaw),
Siddham (Sidd),
SignWriting (Sgnw),
Sinhala (Sinh),
Sogdian (Sogd),
Sora_Sompeng (Sora),
Soyombo (Soyo),
Sundanese (Sund),
Syloti_Nagri (Sylo),
Syriac (Syrc),
Tagalog (Tglg),
Tagbanwa (Tagb),
Tai_Le (Tale),
Tai_Tham (Lana),
Tai_Viet (Tavt),
Takri (Takr),
Tamil (Taml),
Tangsa (Tngs),
Tangut (Tang),
Telugu (Telu),
Thaana (Thaa),
Thai (Thai),
Tibetan (Tibt),
Tifinagh (Tfng),
Tirhuta (Tirh),
Toto (Toto),
Ugaritic (Ugar),
Vai (Vaii),
Vithkuqi (Vith),
Wancho (Wcho),
Warang_Citi (Wara),
Yezidi (Yezi),
Yi (Yiii),
Zanabazar_Square (Zanb).
.P
Each character has exactly one Unicode general category property, specified by Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the property specified by including a circumflex between the opening brace and the property
@ -1056,22 +915,22 @@ Unicode table.
Specifying caseless matching does not affect these escape sequences. For Specifying caseless matching does not affect these escape sequences. For
example, \ep{Lu} always matches only upper case letters. This is different from example, \ep{Lu} always matches only upper case letters. This is different from
the behaviour of current versions of Perl. the behaviour of current versions of Perl.
.P
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \ed and \ew do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
. .
. .
.SS "Bi-directional properties for \ep and \eP" .SS "Binary (yes/no) properties for \ep and \eP"
.rs .rs
.sp .sp
Two properties relating to bi-directional text (each with a shorter synonym) Unicode defines a number of binary properties, that is, properties whose only
are supported: values are true or false. You can obtain a list of those that are recognized by
\ep and \eP, along with their abbreviations, by running this command:
.sp
pcre2test -LP
.sp
.
.
.SS "The Bidi_Class property for \ep and \eP"
.rs
.sp .sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class \ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class \ep{BC:<class>} matches a character with the given class
.sp .sp
@ -1101,8 +960,8 @@ The recognized classes are:
S segment separator S segment separator
WS which space WS which space
.sp .sp
For Bidi_Class, an equals sign may be used instead of a colon. The class names An equals sign may be used instead of a colon. The class names are
are case-insensitive; only the short names listed above are recognized. case-insensitive; only the short names listed above are recognized.
. .
. .
.SS Extended grapheme clusters .SS Extended grapheme clusters
@ -3955,6 +3814,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 28 December 2021 Last updated: 12 January 2022
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2022 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "28 December 2021" "PCRE2 10.40" .TH PCRE2SYNTAX 3 "12 January 2022" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -172,181 +172,31 @@ Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18. at release 5.18.
. .
. .
.SH "BINARY PROPERTIES FOR \ep AND \eP"
.rs
.sp
Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\ep and \eP, along with their abbreviations, by running this command:
.sp
pcre2test -LP
.
.
.
.SH "SCRIPT MATCHING WITH \ep AND \eP" .SH "SCRIPT MATCHING WITH \ep AND \eP"
.rs .rs
.sp .sp
The following script names and their 4-letter abbreviations are recognized in Many script names and their 4-letter abbreviations are recognized in
\ep{sc:...} or \ep{scx:...} items, or on their own with \ep (and also \eP of \ep{sc:...} or \ep{scx:...} items, or on their own with \ep (and also \eP of
course): course). You can obtain a list of these scripts by running this command:
.P .sp
Adlam (Adlm), pcre2test -LS
Ahom (Ahom),
Anatolian_Hieroglyphs (Hluw),
Arabic (Arab),
Armenian (Armn),
Avestan (Avst),
Balinese (Bali),
Bamum (Bamu),
Bassa_Vah (Bass),
Batak (Batk),
Bengali (Beng),
Bhaiksuki (Bhks),
Bopomofo (Bopo),
Brahmi (Brah),
Braille (Brai),
Buginese (Bugi),
Buhid (Buhd),
Canadian_Aboriginal (Cans),
Carian (Cari),
Caucasian_Albanian (Aghb),
Chakma (Cakm),
Cham (Cham),
Cherokee (Cher),
Chorasmian (Chrs),
Common (Zyyy),
Coptic (Copt),
Cuneiform (Xsux),
Cypriot (Cprt),
Cypro_Minoan (Cpmn),
Cyrillic (Cyrl),
Deseret (Dsrt),
Devanagari (Deva),
Dives_Akuru (Diak),
Dogra (Dogr),
Duployan (Dupl),
Egyptian_Hieroglyphs (Egyp),
Elbasan (Elba),
Elymaic (Elym),
Ethiopic (Ethi),
Georgian (Geor),
Glagolitic (Glag),
Gothic (Goth),
Grantha (Gran),
Greek (Grek),
Gujarati (Gujr),
Gunjala_Gondi (Gong),
Gurmukhi (Guru),
Han (Hani),
Hangul (Hang),
Hanifi_Rohingya (Rohg),
Hanunoo (Hano),
Hatran (Hatr),
Hebrew (Hebr),
Hiragana (Hira),
Imperial_Aramaic (Armi),
Inherited (Zinh),
Inscriptional_Pahlavi (Phli),
Inscriptional_Parthian (Prti),
Javanese (Java),
Kaithi (Kthi),
Kannada (Knda),
Katakana (Kana),
Kayah_Li (Kali),
Kharoshthi (Khar),
Khitan_Small_Script (Kits),
Khmer (Khmr),
Khojki (Khoj),
Khudawadi (Sind),
Lao (Laoo),
Latin (Latn),
Lepcha (Lepc),
Limbu (Limb),
Linear_A (Lina),
Linear_B (Linb),
Lisu (Lisu),
Lycian (Lyci),
Lydian (Lydi),
Mahajani (Majh),
Makasar (Maka),
Malayalam (Mlym),
Mandaic (Mand),
Manichaean (Mani),
Marchen (Marc),
Masaram_Gondi (Gonm),
Medefaidrin (Medf),
Meetei_Mayek (Mtei),
Mende_Kikakui (Mend),
Meroitic_Cursive (Merc),
Meroitic_Hieroglyphs (Mero),
Miao (Plrd),
Modi (Modi),
Mongolian (Mong),
Mro (Mroo),
Multani (Mult),
Myanmar (Mymr),
Nabataean (Nbat),
Nandinagari (Nand),
New_Tai_Lue (Talu),
Newa (Newa),
Nko (Nkoo),
Nushu (Nshu),
Nyiakeng_Puachue_Hmong (Hmnp),
Ogham (Ogam),
Ol_Chiki (Olck),
Old_Hungarian (Hung),
Old_Italic (Ital),
Old_North_Arabian (Narb),
Old_Permic (Perm),
Old_Persian (Xpeo),
Old_Sogdian (Sogo),
Old_South_Arabian (Sarb),
Old_Turkic (Orkh),
Old_Uyghur (Ougr),
Oriya (Orya),
Osage (Osge),
Osmanya (Osma),
Pahawh_Hmong (Hmng),
Palmyrene (Palm),
Pau_Cin_Hau (Pauc),
Phags_Pa (Phag),
Phoenician (Phnx),
Psalter_Pahlavi (Phlp),
Rejang (Rjng),
Runic (Runr),
Samaritan (Samr),
Saurashtra (Saur),
Sharada (Shrd),
Shavian (Shaw),
Siddham (Sidd),
SignWriting (Sgnw),
Sinhala (Sinh),
Sogdian (Sogd),
Sora_Sompeng (Sora),
Soyombo (Soyo),
Sundanese (Sund),
Syloti_Nagri (Sylo),
Syriac (Syrc),
Tagalog (Tglg),
Tagbanwa (Tagb),
Tai_Le (Tale),
Tai_Tham (Lana),
Tai_Viet (Tavt),
Takri (Takr),
Tamil (Taml),
Tangsa (Tnsa),
Tangut (Tang),
Telugu (Telu),
Thaana (Thaa),
Thai (Thai),
Tibetan (Tibt),
Tifinagh (Tfng),
Tirhuta (Tirh),
Toto (Toto),
Ugaritic (Ugar),
Vai (Vaii),
Vithkuqi (Vith),
Wancho (Wcho),
Warang_Citi (Wara),
Yezidi (Yezi),
Yi (Yiii),
Zanabazar_Square (Zanb).
. .
. .
.SH "BIDI_PROPERTIES FOR \ep AND \eP" .
.SH "THE BIDI_CLASS PROPERTY FOR \ep AND \eP"
.rs .rs
.sp .sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class \ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class \ep{BC:<class>} matches a character with the given class
.sp .sp
@ -728,6 +578,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 28 December 2021 Last updated: 12 January 2022
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2022 University of Cambridge.
.fi .fi

@ -197,7 +197,17 @@ COMMAND LINE OPTIONS
-LM List modifiers: write a list of available pattern and subject -LM List modifiers: write a list of available pattern and subject
modifiers to the standard output, then exit with zero exit modifiers to the standard output, then exit with zero exit
code. All other options are ignored. If both -C and -LM are code. All other options are ignored. If both -C and any -Lx
options are present, whichever is first is recognized.
-LP List properties: write a list of recognized Unicode proper-
ties to the standard output, then exit with zero exit code.
All other options are ignored. If both -C and any -Lx options
are present, whichever is first is recognized.
-LS List scripts: write a list of recognized Unicode script names
to the standard output, then exit with zero exit code. All
other options are ignored. If both -C and any -Lx options are
present, whichever is first is recognized. present, whichever is first is recognized.
-pattern modifier-list -pattern modifier-list
@ -1939,5 +1949,5 @@ AUTHOR
REVISION REVISION
Last updated: 28 November 2021 Last updated: 12 January 2022
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2022 University of Cambridge.