Documentation update for binary property support
parent bf35c0518c
commit 7f7d3e8521
|
@ -776,8 +776,17 @@ can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
|
|||
sequences are of course limited to testing characters whose code points are
|
||||
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
|
||||
greater than 0x10ffff (the Unicode limit) may be encountered. These are all
|
||||
treated as being in the Unknown script and with an unassigned type. The extra
|
||||
escape sequences are:
|
||||
treated as being in the Unknown script and with an unassigned type.
|
||||
</P>
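The limits described above can be summarized in a small sketch. It is illustrative only and is not taken from PCRE2: the names NON_UTF_LIMIT and describe are hypothetical, and the three result strings are just labels for the cases the text distinguishes.

```python
UNICODE_MAX = 0x10FFFF  # the Unicode limit named in the text

# Exclusive upper bound on the code points a character can have in the
# 8-bit and 16-bit non-UTF modes, keyed by code unit width in bits.
NON_UTF_LIMIT = {8: 0x100, 16: 0x10000}


def describe(code_point: int, width: int) -> str:
    """Label how a non-UTF subject character relates to property tests."""
    if width in NON_UTF_LIMIT and code_point >= NON_UTF_LIMIT[width]:
        # Such a value cannot occur in a subject of this width.
        return "not representable"
    if code_point > UNICODE_MAX:
        # Only reachable in 32-bit non-UTF mode.
        return "Unknown script, unassigned type"
    return "looked up in the Unicode tables"
```

For example, describe(0x110000, 32) gives the "Unknown script, unassigned type" label, while describe(0x100, 8) reports that the value cannot occur in an 8-bit subject.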
|
||||
<P>
|
||||
Matching characters by Unicode property is not fast, because PCRE2 has to do a
|
||||
multistage table lookup in order to find a character's property. That is why
|
||||
the traditional escape sequences such as \d and \w do not use Unicode
|
||||
properties in PCRE2 by default, though you can make them do so by setting the
|
||||
PCRE2_UCP option or by starting the pattern with (*UCP).
|
||||
</P>
|
||||
<P>
|
||||
The extra escape sequences that provide property support are:
|
||||
<pre>
|
||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||
|
@ -787,17 +796,20 @@ The property names represented by <i>xx</i> above are not case-sensitive, and in
|
|||
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
|
||||
underscores are ignored. There is support for Unicode script names, Unicode
|
||||
general category properties, "Any", which matches any character (including
|
||||
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
|
||||
(described
|
||||
newline), Bidi_Class, a number of binary (yes/no) properties, and some special
|
||||
PCRE2 properties (described
|
||||
<a href="#extraprops">below).</a>
|
||||
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
|
||||
Note that \P{Any} does not match any characters, so always causes a match
|
||||
failure.
|
||||
Certain other Perl properties such as "InMusicalSymbols" are not supported by
|
||||
PCRE2. Note that \P{Any} does not match any characters, so always causes a
|
||||
match failure.
|
||||
</P>
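The loose-matching rule for property names can be illustrated with a short sketch. This is not PCRE2's implementation: the function names normalize_property_name and same_property are hypothetical, and the sketch assumes that folding case and dropping spaces, hyphens, and underscores is all that is needed to compare two spellings of a name.

```python
def normalize_property_name(name: str) -> str:
    """Fold case and drop the characters that loose matching ignores."""
    ignored = {" ", "-", "_"}
    return "".join(ch for ch in name.lower() if ch not in ignored)


def same_property(a: str, b: str) -> bool:
    """Return True if two spellings name the same property."""
    return normalize_property_name(a) == normalize_property_name(b)
```

Under these assumptions, "Script_Extensions", "script extensions", and "SCRIPT-EXTENSIONS" all compare equal.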
|
||||
<br><b>
|
||||
Script properties for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
There are three different syntax forms for matching a script. Each Unicode
|
||||
character has a basic script and, optionally, a list of other scripts ("Script
|
||||
Extentions") with which it is commonly used. Using the Adlam script as an
|
||||
Extensions") with which it is commonly used. Using the Adlam script as an
|
||||
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
|
||||
\p{scx:Adlam} matches, in addition, characters that have Adlam in their
|
||||
extensions list. The full names "script" and "script extensions" for the
|
||||
|
@ -809,172 +821,17 @@ interpretation at release 5.26 and PCRE2 changed at release 10.40.
|
|||
<P>
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
of script names and their 4-letter abbreviations is:
|
||||
</P>
|
||||
<P>
|
||||
Adlam (Adlm),
|
||||
Ahom (Ahom),
|
||||
Anatolian_Hieroglyphs (Hluw),
|
||||
Arabic (Arab),
|
||||
Armenian (Armn),
|
||||
Avestan (Avst),
|
||||
Balinese (Bali),
|
||||
Bamum (Bamu),
|
||||
Bassa_Vah (Bass),
|
||||
Batak (Batk),
|
||||
Bengali (Beng),
|
||||
Bhaiksuki (Bhks),
|
||||
Bopomofo (Bopo),
|
||||
Brahmi (Brah),
|
||||
Braille (Brai),
|
||||
Buginese (Bugi),
|
||||
Buhid (Buhd),
|
||||
Canadian_Aboriginal (Cans),
|
||||
Carian (Cari),
|
||||
Caucasian_Albanian (Aghb),
|
||||
Chakma (Cakm),
|
||||
Cham (Cham),
|
||||
Cherokee (Cher),
|
||||
Chorasmian (Chrs),
|
||||
Common (Zyyy),
|
||||
Coptic (Copt),
|
||||
Cuneiform (Xsux),
|
||||
Cypriot (Cprt),
|
||||
Cypro_Minoan (Cpmn),
|
||||
Cyrillic (Cyrl),
|
||||
Deseret (Dsrt),
|
||||
Devanagari (Deva),
|
||||
Dives_Akuru (Diak),
|
||||
Dogra (Dogr),
|
||||
Duployan (Dupl),
|
||||
Egyptian_Hieroglyphs (Egyp),
|
||||
Elbasan (Elba),
|
||||
Elymaic (Elym),
|
||||
Ethiopic (Ethi),
|
||||
Georgian (Geor),
|
||||
Glagolitic (Glag),
|
||||
Gothic (Goth),
|
||||
Grantha (Gran),
|
||||
Greek (Grek),
|
||||
Gujarati (Gujr),
|
||||
Gunjala_Gondi (Gong),
|
||||
Gurmukhi (Guru),
|
||||
Han (Hani),
|
||||
Hangul (Hang),
|
||||
Hanifi_Rohingya (Rohg),
|
||||
Hanunoo (Hano),
|
||||
Hatran (Hatr),
|
||||
Hebrew (Hebr),
|
||||
Hiragana (Hira),
|
||||
Imperial_Aramaic (Armi),
|
||||
Inherited (Zinh),
|
||||
Inscriptional_Pahlavi (Phli),
|
||||
Inscriptional_Parthian (Prti),
|
||||
Javanese (Java),
|
||||
Kaithi (Kthi),
|
||||
Kannada (Knda),
|
||||
Katakana (Kana),
|
||||
Kayah_Li (Kali),
|
||||
Kharoshthi (Khar),
|
||||
Khitan_Small_Script (Kits),
|
||||
Khmer (Khmr),
|
||||
Khojki (Khoj),
|
||||
Khudawadi (Sind),
|
||||
Lao (Laoo),
|
||||
Latin (Latn),
|
||||
Lepcha (Lepc),
|
||||
Limbu (Limb),
|
||||
Linear_A (Lina),
|
||||
Linear_B (Linb),
|
||||
Lisu (Lisu),
|
||||
Lycian (Lyci),
|
||||
Lydian (Lydi),
|
||||
Mahajani (Majh),
|
||||
Makasar (Maka),
|
||||
Malayalam (Mlym),
|
||||
Mandaic (Mand),
|
||||
Manichaean (Mani),
|
||||
Marchen (Marc),
|
||||
Masaram_Gondi (Gonm),
|
||||
Medefaidrin (Medf),
|
||||
Meetei_Mayek (Mtei),
|
||||
Mende_Kikakui (Mend),
|
||||
Meroitic_Cursive (Merc),
|
||||
Meroitic_Hieroglyphs (Mero),
|
||||
Miao (Miao),
|
||||
Modi (Modi),
|
||||
Mongolian (Mong),
|
||||
Mro (Mroo),
|
||||
Multani (Mult),
|
||||
Myanmar (Mymr),
|
||||
Nabataean (Nbar),
|
||||
Nandinagari (Nand),
|
||||
New_Tai_Lue (Talu),
|
||||
Newa (Newa),
|
||||
Nko (Nkoo),
|
||||
Nushu (Nshu),
|
||||
Nyiakeng_Puachue_Hmong (Hmnp),
|
||||
Ogham (Ogam),
|
||||
Ol_Chiki (Olck),
|
||||
Old_Hungarian (Hung),
|
||||
Old_Italic (Olck),
|
||||
Old_North_Arabian (Narb),
|
||||
Old_Permic (Perm),
|
||||
Old_Persian (Orkh),
|
||||
Old_Sogdian (Sogo),
|
||||
Old_South_Arabian (Sarb),
|
||||
Old_Turkic (Orkh),
|
||||
Old_Uyghur (Ougr),
|
||||
Oriya (Orya),
|
||||
Osage (Osge),
|
||||
Osmanya (Osma),
|
||||
Pahawh_Hmong (Hmng),
|
||||
Palmyrene (Palm),
|
||||
Pau_Cin_Hau (Pauc),
|
||||
Phags_Pa (Phag),
|
||||
Phoenician (Phnx),
|
||||
Psalter_Pahlavi (Phli),
|
||||
Rejang (Rjng),
|
||||
Runic (Runr),
|
||||
Samaritan (Samr),
|
||||
Saurashtra (Saur),
|
||||
Sharada (Shrd),
|
||||
Shavian (Shaw),
|
||||
Siddham (Sidd),
|
||||
SignWriting (Sgnw),
|
||||
Sinhala (Sinh),
|
||||
Sogdian (Sogd),
|
||||
Sora_Sompeng (Sora),
|
||||
Soyombo (Soyo),
|
||||
Sundanese (Sund),
|
||||
Syloti_Nagri (Sylo),
|
||||
Syriac (Syrc),
|
||||
Tagalog (Tglg),
|
||||
Tagbanwa (Tagb),
|
||||
Tai_Le (Tale),
|
||||
Tai_Tham (Lana),
|
||||
Tai_Viet (Tavt),
|
||||
Takri (Takr),
|
||||
Tamil (Taml),
|
||||
Tangsa (Tngs),
|
||||
Tangut (Tang),
|
||||
Telugu (Telu),
|
||||
Thaana (Thaa),
|
||||
Thai (Thai),
|
||||
Tibetan (Tibt),
|
||||
Tifinagh (Tfng),
|
||||
Tirhuta (Tirh),
|
||||
Toto (Toto),
|
||||
Ugaritic (Ugar),
|
||||
Vai (Vaii),
|
||||
Vithkuqi (Vith),
|
||||
Wancho (Wcho),
|
||||
Warang_Citi (Wara),
|
||||
Yezidi (Yezi),
|
||||
Yi (Yiii),
|
||||
Zanabazar_Square (Zanb).
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
of recognized script names and their 4-character abbreviations can be obtained
|
||||
by running this command:
|
||||
<pre>
|
||||
pcre2test -LS
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
The general category property for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
Each character has exactly one Unicode general category property, specified by
|
||||
a two-letter abbreviation. For compatibility with Perl, negation can be
|
||||
|
@ -1065,22 +922,23 @@ Specifying caseless matching does not affect these escape sequences. For
|
|||
example, \p{Lu} always matches only upper case letters. This is different from
|
||||
the behaviour of current versions of Perl.
|
||||
</P>
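As a rough illustration of the statement about \p{Lu}, the following sketch emulates the test with Python's standard unicodedata module rather than with PCRE2 itself. The helper name matches_lu is hypothetical, and Python's Unicode tables may correspond to a different Unicode version from the one PCRE2 was built with.

```python
import unicodedata


def matches_lu(ch: str) -> bool:
    r"""Emulate \p{Lu} for a single character.

    The result depends only on the character's own general category,
    so there is no caseless flag for the test to consult.
    """
    return unicodedata.category(ch) == "Lu"
```

With this helper, "A" is accepted and "a" is rejected, which mirrors the behaviour described above: a lower case letter does not satisfy \p{Lu} even when caseless matching is requested.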
|
||||
<P>
|
||||
Matching characters by Unicode property is not fast, because PCRE2 has to do a
|
||||
multistage table lookup in order to find a character's property. That is why
|
||||
the traditional escape sequences such as \d and \w do not use Unicode
|
||||
properties in PCRE2 by default, though you can make them do so by setting the
|
||||
PCRE2_UCP option or by starting the pattern with (*UCP).
|
||||
</P>
|
||||
<br><b>
|
||||
Bi-directional properties for \p and \P
|
||||
Binary (yes/no) properties for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
Unicode defines a number of binary properties, that is, properties whose only
|
||||
values are true or false. You can obtain a list of those that are recognized by
|
||||
\p and \P, along with their abbreviations, by running this command:
|
||||
<pre>
|
||||
pcre2test -LP
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
The Bidi_Class property for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
Two properties relating to bi-directional text (each with a shorter synonym)
|
||||
are supported:
|
||||
<pre>
|
||||
\p{Bidi_Control} matches a Bidi control character
|
||||
\p{Bidi_C} matches a Bidi control character
|
||||
\p{Bidi_Class:<class>} matches a character with the given class
|
||||
\p{BC:<class>} matches a character with the given class
|
||||
</pre>
|
||||
|
@ -1110,8 +968,8 @@ The recognized classes are:
|
|||
S segment separator
|
||||
WS white space
|
||||
</pre>
|
||||
For Bidi_Class, an equals sign may be used instead of a colon. The class names
|
||||
are case-insensitive; only the short names listed above are recognized.
|
||||
An equals sign may be used instead of a colon. The class names are
|
||||
case-insensitive; only the short names listed above are recognized.
|
||||
</P>
|
||||
<br><b>
|
||||
Extended grapheme clusters
|
||||
|
@ -3908,9 +3766,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 December 2021
|
||||
Last updated: 12 January 2022
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
Copyright © 1997-2022 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -19,30 +19,31 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
|
||||
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC7" href="#SEC7">SCRIPT MATCHING WITH \p AND \P</a>
|
||||
<li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
|
||||
<li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
|
||||
<li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
|
||||
<li><a name="TOC11" href="#SEC11">ANCHORS AND SIMPLE ASSERTIONS</a>
|
||||
<li><a name="TOC12" href="#SEC12">REPORTED MATCH POINT SETTING</a>
|
||||
<li><a name="TOC13" href="#SEC13">ALTERNATION</a>
|
||||
<li><a name="TOC14" href="#SEC14">CAPTURING</a>
|
||||
<li><a name="TOC15" href="#SEC15">ATOMIC GROUPS</a>
|
||||
<li><a name="TOC16" href="#SEC16">COMMENT</a>
|
||||
<li><a name="TOC17" href="#SEC17">OPTION SETTING</a>
|
||||
<li><a name="TOC18" href="#SEC18">NEWLINE CONVENTION</a>
|
||||
<li><a name="TOC19" href="#SEC19">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC20" href="#SEC20">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||
<li><a name="TOC21" href="#SEC21">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
|
||||
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
|
||||
<li><a name="TOC23" href="#SEC23">BACKREFERENCES</a>
|
||||
<li><a name="TOC24" href="#SEC24">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC25" href="#SEC25">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
|
||||
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
|
||||
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
|
||||
<li><a name="TOC30" href="#SEC30">REVISION</a>
|
||||
<li><a name="TOC7" href="#SEC7">BINARY PROPERTIES FOR \p AND \P</a>
|
||||
<li><a name="TOC8" href="#SEC8">SCRIPT MATCHING WITH \p AND \P</a>
|
||||
<li><a name="TOC9" href="#SEC9">THE BIDI_CLASS PROPERTY FOR \p AND \P</a>
|
||||
<li><a name="TOC10" href="#SEC10">CHARACTER CLASSES</a>
|
||||
<li><a name="TOC11" href="#SEC11">QUANTIFIERS</a>
|
||||
<li><a name="TOC12" href="#SEC12">ANCHORS AND SIMPLE ASSERTIONS</a>
|
||||
<li><a name="TOC13" href="#SEC13">REPORTED MATCH POINT SETTING</a>
|
||||
<li><a name="TOC14" href="#SEC14">ALTERNATION</a>
|
||||
<li><a name="TOC15" href="#SEC15">CAPTURING</a>
|
||||
<li><a name="TOC16" href="#SEC16">ATOMIC GROUPS</a>
|
||||
<li><a name="TOC17" href="#SEC17">COMMENT</a>
|
||||
<li><a name="TOC18" href="#SEC18">OPTION SETTING</a>
|
||||
<li><a name="TOC19" href="#SEC19">NEWLINE CONVENTION</a>
|
||||
<li><a name="TOC20" href="#SEC20">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC21" href="#SEC21">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||
<li><a name="TOC22" href="#SEC22">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
|
||||
<li><a name="TOC23" href="#SEC23">SCRIPT RUNS</a>
|
||||
<li><a name="TOC24" href="#SEC24">BACKREFERENCES</a>
|
||||
<li><a name="TOC25" href="#SEC25">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC26" href="#SEC26">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC28" href="#SEC28">CALLOUTS</a>
|
||||
<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
|
||||
<li><a name="TOC30" href="#SEC30">AUTHOR</a>
|
||||
<li><a name="TOC31" href="#SEC31">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||
<P>
|
||||
|
@ -205,180 +206,27 @@ matching" rules.
|
|||
Perl and POSIX space are now the same. Perl added VT to its space character set
|
||||
at release 5.18.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
|
||||
<br><a name="SEC7" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a><br>
|
||||
<P>
|
||||
The following script names and their 4-letter abbreviations are recognized in
|
||||
Unicode defines a number of binary properties, that is, properties whose only
|
||||
values are true or false. You can obtain a list of those that are recognized by
|
||||
\p and \P, along with their abbreviations, by running this command:
|
||||
<pre>
|
||||
pcre2test -LP
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
|
||||
<P>
|
||||
Many script names and their 4-letter abbreviations are recognized in
|
||||
\p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of
|
||||
course):
|
||||
course). You can obtain a list of these scripts by running this command:
|
||||
<pre>
|
||||
pcre2test -LS
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
Adlam (Adlm),
|
||||
Ahom (Ahom),
|
||||
Anatolian_Hieroglyphs (Hluw),
|
||||
Arabic (Arab),
|
||||
Armenian (Armn),
|
||||
Avestan (Avst),
|
||||
Balinese (Bali),
|
||||
Bamum (Bamu),
|
||||
Bassa_Vah (Bass),
|
||||
Batak (Batk),
|
||||
Bengali (Beng),
|
||||
Bhaiksuki (Bhks),
|
||||
Bopomofo (Bopo),
|
||||
Brahmi (Brah),
|
||||
Braille (Brai),
|
||||
Buginese (Bugi),
|
||||
Buhid (Buhd),
|
||||
Canadian_Aboriginal (Cans),
|
||||
Carian (Cari),
|
||||
Caucasian_Albanian (Aghb),
|
||||
Chakma (Cakm),
|
||||
Cham (Cham),
|
||||
Cherokee (Cher),
|
||||
Chorasmian (Chrs),
|
||||
Common (Zyyy),
|
||||
Coptic (Copt),
|
||||
Cuneiform (Xsux),
|
||||
Cypriot (Cprt),
|
||||
Cypro_Minoan (Cpmn),
|
||||
Cyrillic (Cyrl),
|
||||
Deseret (Dsrt),
|
||||
Devanagari (Deva),
|
||||
Dives_Akuru (Diak),
|
||||
Dogra (Dogr),
|
||||
Duployan (Dupl),
|
||||
Egyptian_Hieroglyphs (Egyp),
|
||||
Elbasan (Elba),
|
||||
Elymaic (Elym),
|
||||
Ethiopic (Ethi),
|
||||
Georgian (Geor),
|
||||
Glagolitic (Glag),
|
||||
Gothic (Goth),
|
||||
Grantha (Gran),
|
||||
Greek (Grek),
|
||||
Gujarati (Gujr),
|
||||
Gunjala_Gondi (Gong),
|
||||
Gurmukhi (Guru),
|
||||
Han (Hani),
|
||||
Hangul (Hang),
|
||||
Hanifi_Rohingya (Rohg),
|
||||
Hanunoo (Hano),
|
||||
Hatran (Hatr),
|
||||
Hebrew (Hebr),
|
||||
Hiragana (Hira),
|
||||
Imperial_Aramaic (Armi),
|
||||
Inherited (Zinh),
|
||||
Inscriptional_Pahlavi (Phli),
|
||||
Inscriptional_Parthian (Prti),
|
||||
Javanese (Java),
|
||||
Kaithi (Kthi),
|
||||
Kannada (Knda),
|
||||
Katakana (Kana),
|
||||
Kayah_Li (Kali),
|
||||
Kharoshthi (Khar),
|
||||
Khitan_Small_Script (Kits),
|
||||
Khmer (Khmr),
|
||||
Khojki (Khoj),
|
||||
Khudawadi (Sind),
|
||||
Lao (Laoo),
|
||||
Latin (Latn),
|
||||
Lepcha (Lepc),
|
||||
Limbu (Limb),
|
||||
Linear_A (Lina),
|
||||
Linear_B (Linb),
|
||||
Lisu (Lisu),
|
||||
Lycian (Lyci),
|
||||
Lydian (Lydi),
|
||||
Mahajani (Majh),
|
||||
Makasar (Maka),
|
||||
Malayalam (Mlym),
|
||||
Mandaic (Mand),
|
||||
Manichaean (Mani),
|
||||
Marchen (Marc),
|
||||
Masaram_Gondi (Gonm),
|
||||
Medefaidrin (Medf),
|
||||
Meetei_Mayek (Mtei),
|
||||
Mende_Kikakui (Mend),
|
||||
Meroitic_Cursive (Merc),
|
||||
Meroitic_Hieroglyphs (Mero),
|
||||
Miao (Miao),
|
||||
Modi (Modi),
|
||||
Mongolian (Mong),
|
||||
Mro (Mroo),
|
||||
Multani (Mult),
|
||||
Myanmar (Mymr),
|
||||
Nabataean (Nbar),
|
||||
Nandinagari (Nand),
|
||||
New_Tai_Lue (Talu),
|
||||
Newa (Newa),
|
||||
Nko (Nkoo),
|
||||
Nushu (Nshu),
|
||||
Nyiakeng_Puachue_Hmong (Hmnp),
|
||||
Ogham (Ogam),
|
||||
Ol_Chiki (Olck),
|
||||
Old_Hungarian (Hung),
|
||||
Old_Italic (Olck),
|
||||
Old_North_Arabian (Narb),
|
||||
Old_Permic (Perm),
|
||||
Old_Persian (Orkh),
|
||||
Old_Sogdian (Sogo),
|
||||
Old_South_Arabian (Sarb),
|
||||
Old_Turkic (Orkh),
|
||||
Old_Uyghur (Ougr),
|
||||
Oriya (Orya),
|
||||
Osage (Osge),
|
||||
Osmanya (Osma),
|
||||
Pahawh_Hmong (Hmng),
|
||||
Palmyrene (Palm),
|
||||
Pau_Cin_Hau (Pauc),
|
||||
Phags_Pa (Phag),
|
||||
Phoenician (Phnx),
|
||||
Psalter_Pahlavi (Phli),
|
||||
Rejang (Rjng),
|
||||
Runic (Runr),
|
||||
Samaritan (Samr),
|
||||
Saurashtra (Saur),
|
||||
Sharada (Shrd),
|
||||
Shavian (Shaw),
|
||||
Siddham (Sidd),
|
||||
SignWriting (Sgnw),
|
||||
Sinhala (Sinh),
|
||||
Sogdian (Sogd),
|
||||
Sora_Sompeng (Sora),
|
||||
Soyombo (Soyo),
|
||||
Sundanese (Sund),
|
||||
Syloti_Nagri (Sylo),
|
||||
Syriac (Syrc),
|
||||
Tagalog (Tglg),
|
||||
Tagbanwa (Tagb),
|
||||
Tai_Le (Tale),
|
||||
Tai_Tham (Lana),
|
||||
Tai_Viet (Tavt),
|
||||
Takri (Takr),
|
||||
Tamil (Taml),
|
||||
Tangsa (Tngs),
|
||||
Tangut (Tang),
|
||||
Telugu (Telu),
|
||||
Thaana (Thaa),
|
||||
Thai (Thai),
|
||||
Tibetan (Tibt),
|
||||
Tifinagh (Tfng),
|
||||
Tirhuta (Tirh),
|
||||
Toto (Toto),
|
||||
Ugaritic (Ugar),
|
||||
Vai (Vaii),
|
||||
Vithkuqi (Vith),
|
||||
Wancho (Wcho),
|
||||
Warang_Citi (Wara),
|
||||
Yezidi (Yezi),
|
||||
Yi (Yiii),
|
||||
Zanabazar_Square (Zanb).
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">BIDI_PROPERTIES FOR \p AND \P</a><br>
|
||||
<br><a name="SEC9" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\p{Bidi_Control} matches a Bidi control character
|
||||
\p{Bidi_C} matches a Bidi control character
|
||||
\p{Bidi_Class:<class>} matches a character with the given class
|
||||
\p{BC:<class>} matches a character with the given class
|
||||
</pre>
|
||||
|
@ -409,7 +257,7 @@ The recognized classes are:
|
|||
WS white space
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">CHARACTER CLASSES</a><br>
|
||||
<br><a name="SEC10" href="#TOC1">CHARACTER CLASSES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
[...] positive character class
|
||||
|
@ -437,7 +285,7 @@ In PCRE2, POSIX character set names recognize only ASCII characters by default,
|
|||
but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||
\Q...\E inside a character class.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">QUANTIFIERS</a><br>
|
||||
<br><a name="SEC11" href="#TOC1">QUANTIFIERS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
? 0 or 1, greedy
|
||||
|
@ -458,7 +306,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
|||
{n,}? n or more, lazy
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
|
||||
<br><a name="SEC12" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\b word boundary
|
||||
|
@ -476,7 +324,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
|||
\G first matching position in subject
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
|
||||
<br><a name="SEC13" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\K set reported start of match
|
||||
|
@ -486,13 +334,13 @@ for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
|
|||
option is set, the previous behaviour is re-enabled. When this option is set,
|
||||
\K is honoured in positive assertions, but ignored in negative ones.
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">ALTERNATION</a><br>
|
||||
<br><a name="SEC14" href="#TOC1">ALTERNATION</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
expr|expr|expr...
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">CAPTURING</a><br>
|
||||
<br><a name="SEC15" href="#TOC1">CAPTURING</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(...) capture group
|
||||
|
@ -507,20 +355,20 @@ In non-UTF modes, names may contain underscores and ASCII letters and digits;
|
|||
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
|
||||
both cases, a name must not start with a digit.
|
||||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">ATOMIC GROUPS</a><br>
|
||||
<br><a name="SEC16" href="#TOC1">ATOMIC GROUPS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?>...) atomic non-capture group
|
||||
(*atomic:...) atomic non-capture group
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC16" href="#TOC1">COMMENT</a><br>
|
||||
<br><a name="SEC17" href="#TOC1">COMMENT</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?#....) comment (not nestable)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">OPTION SETTING</a><br>
|
||||
<br><a name="SEC18" href="#TOC1">OPTION SETTING</a><br>
|
||||
<P>
|
||||
Changes of these options within a group are automatically cancelled at the end
|
||||
of the group.
|
||||
|
@ -565,7 +413,7 @@ not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
|
|||
application can lock out the use of (*UTF) and (*UCP) by setting the
|
||||
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||
<br><a name="SEC19" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||
<P>
|
||||
These are recognized only at the very start of the pattern or after option
|
||||
settings with a similar syntax.
|
||||
|
@ -578,7 +426,7 @@ settings with a similar syntax.
|
|||
(*NUL) the NUL character (binary zero)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||
<br><a name="SEC20" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||
<P>
|
||||
These are recognized only at the very start of the pattern or after option
|
||||
setting with a similar syntax.
|
||||
|
@ -587,7 +435,7 @@ setting with a similar syntax.
|
|||
(*BSR_UNICODE) any Unicode newline sequence
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
|
||||
<br><a name="SEC21" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?=...) )
|
||||
|
@ -608,7 +456,7 @@ setting with a similar syntax.
|
|||
</pre>
|
||||
Each top-level branch of a lookbehind must be of a fixed length.
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
|
||||
<br><a name="SEC22" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
|
||||
<P>
|
||||
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||
<pre>
|
||||
|
@ -621,7 +469,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
|||
(*non_atomic_positive_lookbehind:...) )
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<br><a name="SEC23" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(*script_run:...) ) script run, can be backtracked into
|
||||
|
@ -631,7 +479,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
|||
(*asr:...) )
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC23" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<br><a name="SEC24" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\n reference by number (can be ambiguous)
|
||||
|
@ -648,7 +496,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
|||
(?P=name) reference by name (Python)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC24" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||
<br><a name="SEC25" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?R) recurse whole pattern
|
||||
|
@ -667,7 +515,7 @@ These assertions are specific to PCRE2 and are not Perl-compatible.
|
|||
\g'-n' call subroutine by relative number (PCRE2 extension)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC25" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||
<br><a name="SEC26" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?(condition)yes-pattern)
|
||||
|
@ -690,7 +538,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
|||
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||
condition if the relevant named group exists.
|
||||
</P>
|
||||
<br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
||||
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
||||
|
@ -717,7 +565,7 @@ pattern is not anchored.
|
|||
The effect of one of these verbs in a group called as a subroutine is confined
|
||||
to the subroutine call.
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
|
||||
<br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?C) callout (assumed number 0)
|
||||
|
@ -728,12 +576,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
|||
start and the end), and the starting delimiter { matched with the ending
|
||||
delimiter }. To encode the ending delimiter within the string, double it.
|
||||
</P>
|
||||
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -742,11 +590,11 @@ Retired from University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 December 2021
|
||||
Last updated: 12 January 2022
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
Copyright © 1997-2022 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -253,7 +253,19 @@ available, and the use of JIT for matching is verified.
|
|||
<b>-LM</b>
|
||||
List modifiers: write a list of available pattern and subject modifiers to the
|
||||
standard output, then exit with zero exit code. All other options are ignored.
|
||||
If both -C and -LM are present, whichever is first is recognized.
|
||||
If both -C and any -Lx options are present, whichever is first is recognized.
|
||||
</P>
|
||||
<P>
|
||||
<b>-LP</b>
|
||||
List properties: write a list of recognized Unicode properties to the standard
|
||||
output, then exit with zero exit code. All other options are ignored. If both
|
||||
-C and any -Lx options are present, whichever is first is recognized.
|
||||
</P>
|
||||
<P>
|
||||
<b>-LS</b>
|
||||
List scripts: write a list of recognized Unicode script names to the standard
|
||||
output, then exit with zero exit code. All other options are ignored. If both
|
||||
-C and any -Lx options are present, whichever is first is recognized.
|
||||
</P>
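The precedence rule stated for -C and the -Lx options can be sketched as a scan for the first such option. This is not pcre2test's argument parser: the names LISTING_OPTIONS and first_listing_option are hypothetical, the sketch does not model any argument that may follow -C, and it assumes that the first option also wins when two different -Lx options are given, which the text does not state.

```python
LISTING_OPTIONS = ("-C", "-LM", "-LP", "-LS")


def first_listing_option(args):
    """Return the first -C or -Lx option in args, or None if there is none."""
    for arg in args:
        if arg in LISTING_OPTIONS:
            return arg
    return None
```

Under these assumptions, the argument list ["-LS", "-C"] selects -LS and ["-C", "-LP"] selects -C.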
|
||||
<P>
|
||||
<b>-pattern</b> <i>modifier-list</i>
|
||||
|
@ -2129,9 +2141,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 November 2021
|
||||
Last updated: 12 January 2022
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
Copyright © 1997-2022 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
228
doc/pcre2.txt
|
@ -6889,8 +6889,16 @@ BACKSLASH
|
|||
ters whose code points are less than U+0100 and U+10000, respectively.
|
||||
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
|
||||
limit) may be encountered. These are all treated as being in the Un-
|
||||
known script and with an unassigned type. The extra escape sequences
|
||||
are:
|
||||
known script and with an unassigned type.
|
||||
|
||||
Matching characters by Unicode property is not fast, because PCRE2 has
|
||||
to do a multistage table lookup in order to find a character's prop-
|
||||
erty. That is why the traditional escape sequences such as \d and \w do
|
||||
not use Unicode properties in PCRE2 by default, though you can make
|
||||
them do so by setting the PCRE2_UCP option or by starting the pattern
|
||||
with (*UCP).
|
||||
|
||||
The extra escape sequences that provide property support are:
|
||||
|
||||
\p{xx} a character with the xx property
|
||||
\P{xx} a character without the xx property
|
||||
|
@ -6900,71 +6908,36 @@ BACKSLASH
|
|||
in accordance with Unicode's "loose matching" rules, spaces, hyphens,
|
||||
and underscores are ignored. There is support for Unicode script names,
|
||||
Unicode general category properties, "Any", which matches any character
|
||||
(including newline), Bidi_Control, Bidi_Class, and some special PCRE2
|
||||
properties (described below). Other Perl properties such as "InMusi-
|
||||
calSymbols" are not supported by PCRE2. Note that \P{Any} does not
|
||||
match any characters, so always causes a match failure.
|
||||
(including newline), Bidi_Class, a number of binary (yes/no) proper-
|
||||
ties, and some special PCRE2 properties (described below). Certain
|
||||
other Perl properties such as "InMusicalSymbols" are not supported by
|
||||
PCRE2. Note that \P{Any} does not match any characters, so always
|
||||
causes a match failure.
|
||||
|
||||
Script properties for \p and \P
|
||||
|
||||
There are three different syntax forms for matching a script. Each Uni-
|
||||
code character has a basic script and, optionally, a list of other
|
||||
scripts ("Script Extentions") with which it is commonly used. Using the
|
||||
code character has a basic script and, optionally, a list of other
|
||||
scripts ("Script Extensions") with which it is commonly used. Using the
|
||||
Adlam script as an example, \p{sc:Adlam} matches characters whose basic
|
||||
script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
|
||||
that have Adlam in their extensions list. The full names "script" and
|
||||
that have Adlam in their extensions list. The full names "script" and
|
||||
"script extensions" for the property types are recognized, and an equals
|
||||
sign is an alternative to the colon. If a script name is given without
|
||||
a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
|
||||
lam}. Perl changed to this interpretation at release 5.26 and PCRE2
|
||||
sign is an alternative to the colon. If a script name is given without
|
||||
a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
|
||||
lam}. Perl changed to this interpretation at release 5.26 and PCRE2
|
||||
changed at release 10.40.
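
The three syntax forms can be summarized with a small Python sketch that splits the text between the braces into a property type and a script name. This is an assumption-laden illustration rather than PCRE2's parser: the function name `parse_script_item` is hypothetical, and it does not check that the script name is one PCRE2 actually recognizes.

```python
import re
from typing import Tuple

# Property-type spellings named in the text, keyed by their loosely
# matched form (lower case, no spaces, hyphens, or underscores).
_SCRIPT_TYPES = {
    "sc": "sc",
    "script": "sc",
    "scx": "scx",
    "scriptextensions": "scx",
}


def _loose(name: str) -> str:
    """Lowercase and drop spaces, hyphens, and underscores."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


def parse_script_item(item: str) -> Tuple[str, str]:
    """Split the text inside the braces into (property_type, script_name).

    A colon or an equals sign separates the type from the name. A bare
    script name is treated as "scx", the interpretation described above.
    """
    parts = re.split(r"[:=]", item, maxsplit=1)
    if len(parts) == 1:
        return ("scx", parts[0].strip())
    prop, script = parts
    key = _loose(prop)
    if key not in _SCRIPT_TYPES:
        raise ValueError("not a script property type: " + prop)
    return (_SCRIPT_TYPES[key], script.strip())
```

For example, `parse_script_item("Adlam")` and `parse_script_item("scx:Adlam")` both yield `("scx", "Adlam")`, whereas `parse_script_item("script=Adlam")` yields `("sc", "Adlam")`.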
|
||||
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code
|
||||
points greater than 0x10FFFF) are assigned the "Unknown" script. Others
|
||||
that are not part of an identified script are lumped together as "Com-
|
||||
mon". The current list of script names and their 4-letter abbreviations
|
||||
is:
|
||||
that are not part of an identified script are lumped together as "Com-
|
||||
mon". The current list of recognized script names and their 4-character
|
||||
abbreviations can be obtained by running this command:
|
||||
|
||||
Adlam (Adlm), Ahom (Ahom), Anatolian_Hieroglyphs (Hluw), Arabic (Arab),
|
||||
Armenian (Armn), Avestan (Avst), Balinese (Bali), Bamum (Bamu),
|
||||
Bassa_Vah (Bass), Batak (Batk), Bengali (Beng), Bhaiksuki (Bhks), Bopo-
|
||||
mofo (Bopo), Brahmi (Brah), Braille (Brai), Buginese (Bugi), Buhid
|
||||
(Buhd), Canadian_Aboriginal (Cans), Carian (Cari), Caucasian_Albanian
|
||||
(Aghb), Chakma (Cakm), Cham (Cham), Cherokee (Cher), Chorasmian (Chrs),
|
||||
Common (Zyyy), Coptic (Copt), Cuneiform (Xsux), Cypriot (Cprt),
|
||||
Cypro_Minoan (Cpmn), Cyrillic (Cyrl), Deseret (Dsrt), Devanagari
|
||||
(Deva), Dives_Akuru (Diak), Dogra (Dogr), Duployan (Dupl), Egyptian_Hi-
|
||||
eroglyphs (Egyp), Elbasan (Elba), Elymaic (Elym), Ethiopic (Ethi),
|
||||
Georgian (Geor), Glagolitic (Glag), Gothic (Goth), Grantha (Gran),
|
||||
Greek (Grek), Gujarati (Gujr), Gunjala_Gondi (Gong), Gurmukhi (Guru),
|
||||
Han (Hani), Hangul (Hang), Hanifi_Rohingya (Rohg), Hanunoo (Hano), Ha-
|
||||
tran (Hatr), Hebrew (Hebr), Hiragana (Hira), Imperial_Aramaic (Armi),
|
||||
Inherited (Zinh), Inscriptional_Pahlavi (Phli), Inscriptional_Parthian
|
||||
(Prti), Javanese (Java), Kaithi (Kthi), Kannada (Knda), Katakana
|
||||
(Kana), Kayah_Li (Kali), Kharoshthi (Khar), Khitan_Small_Script (Kits),
|
||||
Khmer (Khmr), Khojki (Khoj), Khudawadi (Sind), Lao (Laoo), Latin
|
||||
(Latn), Lepcha (Lepc), Limbu (Limb), Linear_A (Lina), Linear_B (Linb),
|
||||
Lisu (Lisu), Lycian (Lyci), Lydian (Lydi), Mahajani (Majh), Makasar
|
||||
(Maka), Malayalam (Mlym), Mandaic (Mand), Manichaean (Mani), Marchen
|
||||
(Marc), Masaram_Gondi (Gonm), Medefaidrin (Medf), Meetei_Mayek (Mtei),
|
||||
Mende_Kikakui (Mend), Meroitic_Cursive (Merc), Meroitic_Hieroglyphs
|
||||
(Mero), Miao (Miao), Modi (Modi), Mongolian (Mong), Mro (Mroo), Multani
|
||||
(Mult), Myanmar (Mymr), Nabataean (Nbar), Nandinagari (Nand),
|
||||
New_Tai_Lue (Talu), Newa (Newa), Nko (Nkoo), Nushu (Nshu), Nyiak-
|
||||
eng_Puachue_Hmong (Hmnp), Ogham (Ogam), Ol_Chiki (Olck), Old_Hungarian
|
||||
(Hung), Old_Italic (Olck), Old_North_Arabian (Narb), Old_Permic (Perm),
|
||||
Old_Persian (Orkh), Old_Sogdian (Sogo), Old_South_Arabian (Sarb),
|
||||
Old_Turkic (Orkh), Old_Uyghur (Ougr), Oriya (Orya), Osage (Osge), Os-
|
||||
manya (Osma), Pahawh_Hmong (Hmng), Palmyrene (Palm), Pau_Cin_Hau
|
||||
(Pauc), Phags_Pa (Phag), Phoenician (Phnx), Psalter_Pahlavi (Phli), Re-
|
||||
jang (Rjng), Runic (Runr), Samaritan (Samr), Saurashtra (Saur), Sharada
|
||||
(Shrd), Shavian (Shaw), Siddham (Sidd), SignWriting (Sgnw), Sinhala
|
||||
(Sinh), Sogdian (Sogd), Sora_Sompeng (Sora), Soyombo (Soyo), Sundanese
|
||||
(Sund), Syloti_Nagri (Sylo), Syriac (Syrc), Tagalog (Tglg), Tagbanwa
|
||||
(Tagb), Tai_Le (Tale), Tai_Tham (Lana), Tai_Viet (Tavt), Takri (Takr),
|
||||
Tamil (Taml), Tangsa (Tngs), Tangut (Tang), Telugu (Telu), Thaana
|
||||
(Thaa), Thai (Thai), Tibetan (Tibt), Tifinagh (Tfng), Tirhuta (Tirh),
|
||||
Toto (Toto), Ugaritic (Ugar), Vai (Vaii), Vithkuqi (Vith), Wancho
|
||||
(Wcho), Warang_Citi (Wara), Yezidi (Yezi), Yi (Yiii), Zanabazar_Square
|
||||
(Zanb).
|
||||
pcre2test -LS
|
||||
|
||||
|
||||
The general category property for \p and \P
|
||||
|
||||
Each character has exactly one Unicode general category property, spec-
|
||||
ified by a two-letter abbreviation. For compatibility with Perl, nega-
|
||||
|
@ -7050,20 +7023,18 @@ BACKSLASH
|
|||
For example, \p{Lu} always matches only upper case letters. This is
|
||||
different from the behaviour of current versions of Perl.
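
To make the general category idea concrete, here is a minimal Python sketch that uses the standard `unicodedata` module, which exposes the same two-letter Unicode general category values. It only approximates what a two-letter general category test checks; it is not PCRE2 code, the function name `in_general_category` is hypothetical, and the Unicode version bundled with Python may differ from the one PCRE2 was built with.

```python
import unicodedata


def in_general_category(ch: str, abbrev: str) -> bool:
    """True if the character's general category equals the abbreviation.

    The abbreviation is compared without regard to case, because property
    names are not case-sensitive. The character itself is never case
    folded, mirroring the statement that caseless matching does not
    affect these tests.
    """
    return unicodedata.category(ch).lower() == abbrev.lower()
```

Here `in_general_category("A", "Lu")` is true and `in_general_category("a", "Lu")` is false, because each character has exactly one general category.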
|
||||
|
||||
Matching characters by Unicode property is not fast, because PCRE2 has
|
||||
to do a multistage table lookup in order to find a character's prop-
|
||||
erty. That is why the traditional escape sequences such as \d and \w do
|
||||
not use Unicode properties in PCRE2 by default, though you can make
|
||||
them do so by setting the PCRE2_UCP option or by starting the pattern
|
||||
with (*UCP).
|
||||
Binary (yes/no) properties for \p and \P
|
||||
|
||||
Bi-directional properties for \p and \P
|
||||
Unicode defines a number of binary properties, that is, properties
|
||||
whose only values are true or false. You can obtain a list of those
|
||||
that are recognized by \p and \P, along with their abbreviations, by
|
||||
running this command:
|
||||
|
||||
Two properties relating to bi-directional text (each with a shorter
|
||||
synonym) are supported:
|
||||
pcre2test -LP
|
||||
|
||||
|
||||
The Bidi_Class property for \p and \P
|
||||
|
||||
\p{Bidi_Control} matches a Bidi control character
|
||||
\p{Bidi_C} matches a Bidi control character
|
||||
\p{Bidi_Class:<class>} matches a character with the given class
|
||||
\p{BC:<class>} matches a character with the given class
|
||||
|
||||
|
@ -7093,9 +7064,8 @@ BACKSLASH
|
|||
S segment separator
|
||||
WS white space
|
||||
|
||||
For Bidi_Class, an equals sign may be used instead of a colon. The
|
||||
class names are case-insensitive; only the short names listed above are
|
||||
recognized.
|
||||
An equals sign may be used instead of a colon. The class names are
|
||||
case-insensitive; only the short names listed above are recognized.
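
As an illustration of what a Bidi_Class test checks, the following Python sketch looks up a character's bidirectional class with the standard `unicodedata` module and compares it with a short class name, ignoring case. It is a sketch under stated assumptions, not PCRE2's implementation: the function name `has_bidi_class` is hypothetical, and Python's Unicode tables may be a different version from PCRE2's.

```python
import unicodedata


def has_bidi_class(ch: str, cls: str) -> bool:
    """True if the character's Bidi_Class short name equals cls.

    Short class names such as L, R, EN, and WS are upper case in the
    Unicode data, so the requested name is upper-cased before comparing.
    """
    return unicodedata.bidirectional(ch) == cls.upper()
```

For instance, `has_bidi_class("A", "L")` is true (left-to-right), as is `has_bidi_class("\u05d0", "r")` for the Hebrew letter alef (right-to-left).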
|
||||
|
||||
Extended grapheme clusters
|
||||
|
||||
|
@ -9725,8 +9695,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 28 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
Last updated: 12 January 2022
|
||||
Copyright (c) 1997-2022 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@ -10739,60 +10709,28 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
|
|||
acter set at release 5.18.
|
||||
|
||||
|
||||
BINARY PROPERTIES FOR \p AND \P
|
||||
|
||||
Unicode defines a number of binary properties, that is, properties
|
||||
whose only values are true or false. You can obtain a list of those
|
||||
that are recognized by \p and \P, along with their abbreviations, by
|
||||
running this command:
|
||||
|
||||
pcre2test -LP
|
||||
|
||||
|
||||
SCRIPT MATCHING WITH \p AND \P
|
||||
|
||||
The following script names and their 4-letter abbreviations are recog-
|
||||
nized in \p{sc:...} or \p{scx:...} items, or on their own with \p (and
|
||||
also \P of course):
|
||||
Many script names and their 4-letter abbreviations are recognized in
|
||||
\p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P
|
||||
of course). You can obtain a list of these scripts by running this com-
|
||||
mand:
|
||||
|
||||
Adlam (Adlm), Ahom (Ahom), Anatolian_Hieroglyphs (Hluw), Arabic (Arab),
|
||||
Armenian (Armn), Avestan (Avst), Balinese (Bali), Bamum (Bamu),
|
||||
Bassa_Vah (Bass), Batak (Batk), Bengali (Beng), Bhaiksuki (Bhks), Bopo-
|
||||
mofo (Bopo), Brahmi (Brah), Braille (Brai), Buginese (Bugi), Buhid
|
||||
(Buhd), Canadian_Aboriginal (Cans), Carian (Cari), Caucasian_Albanian
|
||||
(Aghb), Chakma (Cakm), Cham (Cham), Cherokee (Cher), Chorasmian (Chrs),
|
||||
Common (Zyyy), Coptic (Copt), Cuneiform (Xsux), Cypriot (Cprt),
|
||||
Cypro_Minoan (Cpmn), Cyrillic (Cyrl), Deseret (Dsrt), Devanagari
|
||||
(Deva), Dives_Akuru (Diak), Dogra (Dogr), Duployan (Dupl), Egyptian_Hi-
|
||||
eroglyphs (Egyp), Elbasan (Elba), Elymaic (Elym), Ethiopic (Ethi),
|
||||
Georgian (Geor), Glagolitic (Glag), Gothic (Goth), Grantha (Gran),
|
||||
Greek (Grek), Gujarati (Gujr), Gunjala_Gondi (Gong), Gurmukhi (Guru),
|
||||
Han (Hani), Hangul (Hang), Hanifi_Rohingya (Rohg), Hanunoo (Hano), Ha-
|
||||
tran (Hatr), Hebrew (Hebr), Hiragana (Hira), Imperial_Aramaic (Armi),
|
||||
Inherited (Zinh), Inscriptional_Pahlavi (Phli), Inscriptional_Parthian
|
||||
(Prti), Javanese (Java), Kaithi (Kthi), Kannada (Knda), Katakana
|
||||
(Kana), Kayah_Li (Kali), Kharoshthi (Khar), Khitan_Small_Script (Kits),
|
||||
Khmer (Khmr), Khojki (Khoj), Khudawadi (Sind), Lao (Laoo), Latin
|
||||
(Latn), Lepcha (Lepc), Limbu (Limb), Linear_A (Lina), Linear_B (Linb),
|
||||
Lisu (Lisu), Lycian (Lyci), Lydian (Lydi), Mahajani (Majh), Makasar
|
||||
(Maka), Malayalam (Mlym), Mandaic (Mand), Manichaean (Mani), Marchen
|
||||
(Marc), Masaram_Gondi (Gonm), Medefaidrin (Medf), Meetei_Mayek (Mtei),
|
||||
Mende_Kikakui (Mend), Meroitic_Cursive (Merc), Meroitic_Hieroglyphs
|
||||
(Mero), Miao (Miao), Modi (Modi), Mongolian (Mong), Mro (Mroo), Multani
|
||||
(Mult), Myanmar (Mymr), Nabataean (Nbar), Nandinagari (Nand),
|
||||
New_Tai_Lue (Talu), Newa (Newa), Nko (Nkoo), Nushu (Nshu), Nyiak-
|
||||
eng_Puachue_Hmong (Hmnp), Ogham (Ogam), Ol_Chiki (Olck), Old_Hungarian
|
||||
(Hung), Old_Italic (Olck), Old_North_Arabian (Narb), Old_Permic (Perm),
|
||||
Old_Persian (Orkh), Old_Sogdian (Sogo), Old_South_Arabian (Sarb),
|
||||
Old_Turkic (Orkh), Old_Uyghur (Ougr), Oriya (Orya), Osage (Osge), Os-
|
||||
manya (Osma), Pahawh_Hmong (Hmng), Palmyrene (Palm), Pau_Cin_Hau
|
||||
(Pauc), Phags_Pa (Phag), Phoenician (Phnx), Psalter_Pahlavi (Phli), Re-
|
||||
jang (Rjng), Runic (Runr), Samaritan (Samr), Saurashtra (Saur), Sharada
|
||||
(Shrd), Shavian (Shaw), Siddham (Sidd), SignWriting (Sgnw), Sinhala
|
||||
(Sinh), Sogdian (Sogd), Sora_Sompeng (Sora), Soyombo (Soyo), Sundanese
|
||||
(Sund), Syloti_Nagri (Sylo), Syriac (Syrc), Tagalog (Tglg), Tagbanwa
|
||||
(Tagb), Tai_Le (Tale), Tai_Tham (Lana), Tai_Viet (Tavt), Takri (Takr),
|
||||
Tamil (Taml), Tangsa (Tngs), Tangut (Tang), Telugu (Telu), Thaana
|
||||
(Thaa), Thai (Thai), Tibetan (Tibt), Tifinagh (Tfng), Tirhuta (Tirh),
|
||||
Toto (Toto), Ugaritic (Ugar), Vai (Vaii), Vithkuqi (Vith), Wancho
|
||||
(Wcho), Warang_Citi (Wara), Yezidi (Yezi), Yi (Yiii), Zanabazar_Square
|
||||
(Zanb).
|
||||
pcre2test -LS
|
||||
|
||||
|
||||
BIDI_PROPERTIES FOR \p AND \P
|
||||
THE BIDI_CLASS PROPERTY FOR \p AND \P
|
||||
|
||||
\p{Bidi_Control} matches a Bidi control character
|
||||
\p{Bidi_C} matches a Bidi control character
|
||||
\p{Bidi_Class:<class>} matches a character with the given class
|
||||
\p{BC:<class>} matches a character with the given class
|
||||
|
||||
|
@ -10846,8 +10784,8 @@ CHARACTER CLASSES
|
|||
word same as \w
|
||||
xdigit hexadecimal digit
|
||||
|
||||
In PCRE2, POSIX character set names recognize only ASCII characters by
|
||||
default, but some of them use Unicode properties if PCRE2_UCP is set.
|
||||
In PCRE2, POSIX character set names recognize only ASCII characters by
|
||||
default, but some of them use Unicode properties if PCRE2_UCP is set.
|
||||
You can use \Q...\E inside a character class.
|
||||
|
||||
|
||||
|
@ -10892,8 +10830,8 @@ REPORTED MATCH POINT SETTING
|
|||
|
||||
\K set reported start of match
|
||||
|
||||
From release 10.38 \K is not permitted by default in lookaround asser-
|
||||
tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL-
|
||||
From release 10.38 \K is not permitted by default in lookaround asser-
|
||||
tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL-
|
||||
LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
|
||||
When this option is set, \K is honoured in positive assertions, but ig-
|
||||
nored in negative ones.
|
||||
|
@ -10914,8 +10852,8 @@ CAPTURING
|
|||
(?|...) non-capture group; reset group numbers for
|
||||
capture groups in each alternative
|
||||
|
||||
In non-UTF modes, names may contain underscores and ASCII letters and
|
||||
digits; in UTF modes, any Unicode letters and Unicode decimal digits
|
||||
In non-UTF modes, names may contain underscores and ASCII letters and
|
||||
digits; in UTF modes, any Unicode letters and Unicode decimal digits
|
||||
are permitted. In both cases, a name must not start with a digit.
|
||||
|
||||
|
||||
|
@ -10931,7 +10869,7 @@ COMMENT
|
|||
|
||||
|
||||
OPTION SETTING
|
||||
Changes of these options within a group are automatically cancelled at
|
||||
Changes of these options within a group are automatically cancelled at
|
||||
the end of the group.
|
||||
|
||||
(?i) caseless
|
||||
|
@ -10945,7 +10883,7 @@ OPTION SETTING
|
|||
(?-...) unset option(s)
|
||||
(?^) unset imnsx options
|
||||
|
||||
Unsetting x or xx unsets both. Several options may be set at once, and
|
||||
Unsetting x or xx unsets both. Several options may be set at once, and
|
||||
a mixture of setting and unsetting such as (?i-x) is allowed, but there
|
||||
may be only one hyphen. Setting (but no unsetting) is allowed after (?^
|
||||
for example (?^in). An option setting may appear at the start of a non-
|
||||
|
@ -10967,11 +10905,11 @@ OPTION SETTING
|
|||
(*UTF) set appropriate UTF mode for the library in use
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
|
||||
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
|
||||
value of the limits set by the caller of pcre2_match() or
|
||||
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
|
||||
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
|
||||
value of the limits set by the caller of pcre2_match() or
|
||||
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
|
||||
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
|
||||
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
|
||||
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
|
||||
respectively, at compile time.
|
||||
|
||||
|
||||
|
@ -11092,16 +11030,16 @@ CONDITIONAL PATTERNS
|
|||
(?(VERSION[>]=n.m) test PCRE2 version
|
||||
(?(assert) assertion condition
|
||||
|
||||
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||
conditions or recursion tests. Such a condition is interpreted as a
|
||||
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||
conditions or recursion tests. Such a condition is interpreted as a
|
||||
reference condition if the relevant named group exists.
|
||||
|
||||
|
||||
BACKTRACKING CONTROL
|
||||
|
||||
All backtracking control verbs may be in the form (*VERB:NAME). For
|
||||
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
|
||||
changes its behaviour if :NAME is present. The others just set a name
|
||||
All backtracking control verbs may be in the form (*VERB:NAME). For
|
||||
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
|
||||
changes its behaviour if :NAME is present. The others just set a name
|
||||
for passing back to the caller, but this is not a name that (*SKIP) can
|
||||
see. The following act immediately they are reached:
|
||||
|
||||
|
@ -11109,7 +11047,7 @@ BACKTRACKING CONTROL
|
|||
(*FAIL) force backtrack; synonym (*F)
|
||||
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
||||
|
||||
The following act only when a subsequent match failure causes a back-
|
||||
The following act only when a subsequent match failure causes a back-
|
||||
track to reach them. They all force a match failure, but they differ in
|
||||
what happens afterwards. Those that advance the start-of-match point do
|
||||
so only if the pattern is not anchored.
|
||||
|
@ -11121,7 +11059,7 @@ BACKTRACKING CONTROL
|
|||
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
||||
(*THEN) local failure, backtrack to next alternation
|
||||
|
||||
The effect of one of these verbs in a group called as a subroutine is
|
||||
The effect of one of these verbs in a group called as a subroutine is
|
||||
confined to the subroutine call.
|
||||
|
||||
|
||||
|
@ -11132,14 +11070,14 @@ CALLOUTS
|
|||
(?C"text") callout with string data
|
||||
|
||||
The allowed string delimiters are ` ' " ^ % # $ (which are the same for
|
||||
the start and the end), and the starting delimiter { matched with the
|
||||
ending delimiter }. To encode the ending delimiter within the string,
|
||||
the start and the end), and the starting delimiter { matched with the
|
||||
ending delimiter }. To encode the ending delimiter within the string,
|
||||
double it.
|
||||
|
||||
|
||||
SEE ALSO
|
||||
|
||||
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
|
||||
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
|
||||
pcre2(3).
|
||||
|
||||
|
||||
|
@ -11152,8 +11090,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 28 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
Last updated: 12 January 2022
|
||||
Copyright (c) 1997-2022 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "28 December 2021" "PCRE2 10.40"
|
||||
.TH PCRE2PATTERN 3 "12 January 2022" "PCRE2 10.40"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -772,8 +772,15 @@ can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
|
|||
sequences are of course limited to testing characters whose code points are
|
||||
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
|
||||
greater than 0x10ffff (the Unicode limit) may be encountered. These are all
|
||||
treated as being in the Unknown script and with an unassigned type. The extra
|
||||
escape sequences are:
|
||||
treated as being in the Unknown script and with an unassigned type.
|
||||
.P
|
||||
Matching characters by Unicode property is not fast, because PCRE2 has to do a
|
||||
multistage table lookup in order to find a character's property. That is why
|
||||
the traditional escape sequences such as \ed and \ew do not use Unicode
|
||||
properties in PCRE2 by default, though you can make them do so by setting the
|
||||
PCRE2_UCP option or by starting the pattern with (*UCP).
|
||||
.P
|
||||
The extra escape sequences that provide property support are:
|
||||
.sp
|
||||
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
||||
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
||||
|
@ -783,19 +790,24 @@ The property names represented by \fIxx\fP above are not case-sensitive, and in
|
|||
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
|
||||
underscores are ignored. There is support for Unicode script names, Unicode
|
||||
general category properties, "Any", which matches any character (including
|
||||
newline), Bidi_Control, Bidi_Class, and some special PCRE2 properties
|
||||
(described
|
||||
newline), Bidi_Class, a number of binary (yes/no) properties, and some special
|
||||
PCRE2 properties (described
|
||||
.\" HTML <a href="#extraprops">
|
||||
.\" </a>
|
||||
below).
|
||||
.\"
|
||||
Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
|
||||
Note that \eP{Any} does not match any characters, so always causes a match
|
||||
failure.
|
||||
.P
|
||||
Certain other Perl properties such as "InMusicalSymbols" are not supported by
|
||||
PCRE2. Note that \eP{Any} does not match any characters, so always causes a
|
||||
match failure.
|
||||
.
|
||||
.
|
||||
.
|
||||
.SS "Script properties for \ep and \eP"
|
||||
.rs
|
||||
.sp
|
||||
There are three different syntax forms for matching a script. Each Unicode
|
||||
character has a basic script and, optionally, a list of other scripts ("Script
|
||||
Extentions") with which it is commonly used. Using the Adlam script as an
|
||||
Extensions") with which it is commonly used. Using the Adlam script as an
|
||||
example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
|
||||
\ep{scx:Adlam} matches, in addition, characters that have Adlam in their
|
||||
extensions list. The full names "script" and "script extensions" for the
|
||||
|
@ -806,171 +818,18 @@ interpretation at release 5.26 and PCRE2 changed at release 10.40.
|
|||
.P
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
of script names and their 4-letter abbreviations is:
|
||||
.P
|
||||
Adlam (Adlm),
|
||||
Ahom (Ahom),
|
||||
Anatolian_Hieroglyphs (Hluw),
|
||||
Arabic (Arab),
|
||||
Armenian (Armn),
|
||||
Avestan (Avst),
|
||||
Balinese (Bali),
|
||||
Bamum (Bamu),
|
||||
Bassa_Vah (Bass),
|
||||
Batak (Batk),
|
||||
Bengali (Beng),
|
||||
Bhaiksuki (Bhks),
|
||||
Bopomofo (Bopo),
|
||||
Brahmi (Brah),
|
||||
Braille (Brai),
|
||||
Buginese (Bugi),
|
||||
Buhid (Buhd),
|
||||
Canadian_Aboriginal (Cans),
|
||||
Carian (Cari),
|
||||
Caucasian_Albanian (Aghb),
|
||||
Chakma (Cakm),
|
||||
Cham (Cham),
|
||||
Cherokee (Cher),
|
||||
Chorasmian (Chrs),
|
||||
Common (Zyyy),
|
||||
Coptic (Copt),
|
||||
Cuneiform (Xsux),
|
||||
Cypriot (Cprt),
|
||||
Cypro_Minoan (Cpmn),
|
||||
Cyrillic (Cyrl),
|
||||
Deseret (Dsrt),
|
||||
Devanagari (Deva),
|
||||
Dives_Akuru (Diak),
|
||||
Dogra (Dogr),
|
||||
Duployan (Dupl),
|
||||
Egyptian_Hieroglyphs (Egyp),
|
||||
Elbasan (Elba),
|
||||
Elymaic (Elym),
|
||||
Ethiopic (Ethi),
|
||||
Georgian (Geor),
|
||||
Glagolitic (Glag),
|
||||
Gothic (Goth),
|
||||
Grantha (Gran),
|
||||
Greek (Grek),
|
||||
Gujarati (Gujr),
|
||||
Gunjala_Gondi (Gong),
|
||||
Gurmukhi (Guru),
|
||||
Han (Hani),
|
||||
Hangul (Hang),
|
||||
Hanifi_Rohingya (Rohg),
|
||||
Hanunoo (Hano),
|
||||
Hatran (Hatr),
|
||||
Hebrew (Hebr),
|
||||
Hiragana (Hira),
|
||||
Imperial_Aramaic (Armi),
|
||||
Inherited (Zinh),
|
||||
Inscriptional_Pahlavi (Phli),
|
||||
Inscriptional_Parthian (Prti),
|
||||
Javanese (Java),
|
||||
Kaithi (Kthi),
|
||||
Kannada (Knda),
|
||||
Katakana (Kana),
|
||||
Kayah_Li (Kali),
|
||||
Kharoshthi (Khar),
|
||||
Khitan_Small_Script (Kits),
|
||||
Khmer (Khmr),
|
||||
Khojki (Khoj),
|
||||
Khudawadi (Sind),
|
||||
Lao (Laoo),
|
||||
Latin (Latn),
|
||||
Lepcha (Lepc),
|
||||
Limbu (Limb),
|
||||
Linear_A (Lina),
|
||||
Linear_B (Linb),
|
||||
Lisu (Lisu),
|
||||
Lycian (Lyci),
|
||||
Lydian (Lydi),
|
||||
Mahajani (Majh),
|
||||
Makasar (Maka),
|
||||
Malayalam (Mlym),
|
||||
Mandaic (Mand),
|
||||
Manichaean (Mani),
|
||||
Marchen (Marc),
|
||||
Masaram_Gondi (Gonm),
|
||||
Medefaidrin (Medf),
|
||||
Meetei_Mayek (Mtei),
|
||||
Mende_Kikakui (Mend),
|
||||
Meroitic_Cursive (Merc),
|
||||
Meroitic_Hieroglyphs (Mero),
|
||||
Miao (Miao),
|
||||
Modi (Modi),
|
||||
Mongolian (Mong),
|
||||
Mro (Mroo),
|
||||
Multani (Mult),
|
||||
Myanmar (Mymr),
|
||||
Nabataean (Nbar),
|
||||
Nandinagari (Nand),
|
||||
New_Tai_Lue (Talu),
|
||||
Newa (Newa),
|
||||
Nko (Nkoo),
|
||||
Nushu (Nshu),
|
||||
Nyiakeng_Puachue_Hmong (Hmnp),
|
||||
Ogham (Ogam),
|
||||
Ol_Chiki (Olck),
|
||||
Old_Hungarian (Hung),
|
||||
Old_Italic (Olck),
|
||||
Old_North_Arabian (Narb),
|
||||
Old_Permic (Perm),
|
||||
Old_Persian (Orkh),
|
||||
Old_Sogdian (Sogo),
|
||||
Old_South_Arabian (Sarb),
|
||||
Old_Turkic (Orkh),
|
||||
Old_Uyghur (Ougr),
|
||||
Oriya (Orya),
|
||||
Osage (Osge),
|
||||
Osmanya (Osma),
|
||||
Pahawh_Hmong (Hmng),
|
||||
Palmyrene (Palm),
|
||||
Pau_Cin_Hau (Pauc),
|
||||
Phags_Pa (Phag),
|
||||
Phoenician (Phnx),
|
||||
Psalter_Pahlavi (Phli),
|
||||
Rejang (Rjng),
|
||||
Runic (Runr),
|
||||
Samaritan (Samr),
|
||||
Saurashtra (Saur),
|
||||
Sharada (Shrd),
|
||||
Shavian (Shaw),
|
||||
Siddham (Sidd),
|
||||
SignWriting (Sgnw),
|
||||
Sinhala (Sinh),
|
||||
Sogdian (Sogd),
|
||||
Sora_Sompeng (Sora),
|
||||
Soyombo (Soyo),
|
||||
Sundanese (Sund),
|
||||
Syloti_Nagri (Sylo),
|
||||
Syriac (Syrc),
|
||||
Tagalog (Tglg),
|
||||
Tagbanwa (Tagb),
|
||||
Tai_Le (Tale),
|
||||
Tai_Tham (Lana),
|
||||
Tai_Viet (Tavt),
|
||||
Takri (Takr),
|
||||
Tamil (Taml),
|
||||
Tangsa (Tngs),
|
||||
Tangut (Tang),
|
||||
Telugu (Telu),
|
||||
Thaana (Thaa),
|
||||
Thai (Thai),
|
||||
Tibetan (Tibt),
|
||||
Tifinagh (Tfng),
|
||||
Tirhuta (Tirh),
|
||||
Toto (Toto),
|
||||
Ugaritic (Ugar),
|
||||
Vai (Vaii),
|
||||
Vithkuqi (Vith),
|
||||
Wancho (Wcho),
|
||||
Warang_Citi (Wara),
|
||||
Yezidi (Yezi),
|
||||
Yi (Yiii),
|
||||
Zanabazar_Square (Zanb).
|
||||
.P
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
of recognized script names and their 4-character abbreviations can be obtained
|
||||
by running this command:
|
||||
.sp
|
||||
pcre2test -LS
|
||||
.sp
|
||||
.
|
||||
.
|
||||
.
|
||||
.SS "The general category property for \ep and \eP"
|
||||
.rs
|
||||
.sp
|
||||
Each character has exactly one Unicode general category property, specified by
|
||||
a two-letter abbreviation. For compatibility with Perl, negation can be
|
||||
specified by including a circumflex between the opening brace and the property
|
||||
|
@ -1056,22 +915,22 @@ Unicode table.
|
|||
Specifying caseless matching does not affect these escape sequences. For
|
||||
example, \ep{Lu} always matches only upper case letters. This is different from
|
||||
the behaviour of current versions of Perl.
|
||||
.P
|
||||
Matching characters by Unicode property is not fast, because PCRE2 has to do a
|
||||
multistage table lookup in order to find a character's property. That is why
|
||||
the traditional escape sequences such as \ed and \ew do not use Unicode
|
||||
properties in PCRE2 by default, though you can make them do so by setting the
|
||||
PCRE2_UCP option or by starting the pattern with (*UCP).
|
||||
.
|
||||
.
|
||||
.SS "Bi-directional properties for \ep and \eP"
|
||||
.SS "Binary (yes/no) properties for \ep and \eP"
|
||||
.rs
|
||||
.sp
|
||||
Two properties relating to bi-directional text (each with a shorter synonym)
|
||||
are supported:
|
||||
Unicode defines a number of binary properties, that is, properties whose only
|
||||
values are true or false. You can obtain a list of those that are recognized by
|
||||
\ep and \eP, along with their abbreviations, by running this command:
|
||||
.sp
|
||||
pcre2test -LP
|
||||
.sp
|
||||
.
|
||||
.
|
||||
.SS "The Bidi_Class property for \ep and \eP"
|
||||
.rs
|
||||
.sp
|
||||
\ep{Bidi_Control} matches a Bidi control character
|
||||
\ep{Bidi_C} matches a Bidi control character
|
||||
\ep{Bidi_Class:<class>} matches a character with the given class
|
||||
\ep{BC:<class>} matches a character with the given class
|
||||
.sp
|
||||
|
@ -1101,8 +960,8 @@ The recognized classes are:
|
|||
S segment separator
|
||||
WS white space
|
||||
.sp
|
||||
For Bidi_Class, an equals sign may be used instead of a colon. The class names
|
||||
are case-insensitive; only the short names listed above are recognized.
|
||||
An equals sign may be used instead of a colon. The class names are
|
||||
case-insensitive; only the short names listed above are recognized.
|
||||
.
|
||||
.
|
||||
.SS Extended grapheme clusters
|
||||
|
@ -3955,6 +3814,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 28 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
Last updated: 12 January 2022
|
||||
Copyright (c) 1997-2022 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "28 December 2021" "PCRE2 10.40"
|
||||
.TH PCRE2SYNTAX 3 "12 January 2022" "PCRE2 10.40"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -172,181 +172,31 @@ Perl and POSIX space are now the same. Perl added VT to its space character set
|
|||
at release 5.18.
|
||||
.
|
||||
.
|
||||
.SH "BINARY PROPERTIES FOR \ep AND \eP"
|
||||
.rs
|
||||
.sp
|
||||
Unicode defines a number of binary properties, that is, properties whose only
|
||||
values are true or false. You can obtain a list of those that are recognized by
|
||||
\ep and \eP, along with their abbreviations, by running this command:
|
||||
.sp
|
||||
pcre2test -LP
|
||||
.
|
||||
.
|
||||
.
|
||||
.SH "SCRIPT MATCHING WITH \ep AND \eP"
|
||||
.rs
|
||||
.sp
|
||||
The following script names and their 4-letter abbreviations are recognized in
|
||||
Many script names and their 4-letter abbreviations are recognized in
|
||||
\ep{sc:...} or \ep{scx:...} items, or on their own with \ep (and also \eP of
|
||||
course):
|
||||
.P
|
||||
Adlam (Adlm),
|
||||
Ahom (Ahom),
|
||||
Anatolian_Hieroglyphs (Hluw),
|
||||
Arabic (Arab),
|
||||
Armenian (Armn),
|
||||
Avestan (Avst),
|
||||
Balinese (Bali),
|
||||
Bamum (Bamu),
|
||||
Bassa_Vah (Bass),
|
||||
Batak (Batk),
|
||||
Bengali (Beng),
|
||||
Bhaiksuki (Bhks),
|
||||
Bopomofo (Bopo),
|
||||
Brahmi (Brah),
|
||||
Braille (Brai),
|
||||
Buginese (Bugi),
|
||||
Buhid (Buhd),
|
||||
Canadian_Aboriginal (Cans),
|
||||
Carian (Cari),
|
||||
Caucasian_Albanian (Aghb),
|
||||
Chakma (Cakm),
|
||||
Cham (Cham),
|
||||
Cherokee (Cher),
|
||||
Chorasmian (Chrs),
|
||||
Common (Zyyy),
|
||||
Coptic (Copt),
|
||||
Cuneiform (Xsux),
|
||||
Cypriot (Cprt),
|
||||
Cypro_Minoan (Cpmn),
|
||||
Cyrillic (Cyrl),
|
||||
Deseret (Dsrt),
|
||||
Devanagari (Deva),
|
||||
Dives_Akuru (Diak),
|
||||
Dogra (Dogr),
|
||||
Duployan (Dupl),
|
||||
Egyptian_Hieroglyphs (Egyp),
|
||||
Elbasan (Elba),
|
||||
Elymaic (Elym),
|
||||
Ethiopic (Ethi),
|
||||
Georgian (Geor),
|
||||
Glagolitic (Glag),
|
||||
Gothic (Goth),
|
||||
Grantha (Gran),
|
||||
Greek (Grek),
|
||||
Gujarati (Gujr),
|
||||
Gunjala_Gondi (Gong),
|
||||
Gurmukhi (Guru),
|
||||
Han (Hani),
|
||||
Hangul (Hang),
|
||||
Hanifi_Rohingya (Rohg),
|
||||
Hanunoo (Hano),
|
||||
Hatran (Hatr),
|
||||
Hebrew (Hebr),
|
||||
Hiragana (Hira),
|
||||
Imperial_Aramaic (Armi),
|
||||
Inherited (Zinh),
|
||||
Inscriptional_Pahlavi (Phli),
|
||||
Inscriptional_Parthian (Prti),
|
||||
Javanese (Java),
|
||||
Kaithi (Kthi),
|
||||
Kannada (Knda),
|
||||
Katakana (Kana),
|
||||
Kayah_Li (Kali),
|
||||
Kharoshthi (Khar),
|
||||
Khitan_Small_Script (Kits),
|
||||
Khmer (Khmr),
|
||||
Khojki (Khoj),
|
||||
Khudawadi (Sind),
|
||||
Lao (Laoo),
|
||||
Latin (Latn),
|
||||
Lepcha (Lepc),
|
||||
Limbu (Limb),
|
||||
Linear_A (Lina),
|
||||
Linear_B (Linb),
|
||||
Lisu (Lisu),
|
||||
Lycian (Lyci),
|
||||
Lydian (Lydi),
|
||||
Mahajani (Majh),
|
||||
Makasar (Maka),
|
||||
Malayalam (Mlym),
|
||||
Mandaic (Mand),
|
||||
Manichaean (Mani),
|
||||
Marchen (Marc),
|
||||
Masaram_Gondi (Gonm),
|
||||
Medefaidrin (Medf),
|
||||
Meetei_Mayek (Mtei),
|
||||
Mende_Kikakui (Mend),
|
||||
Meroitic_Cursive (Merc),
|
||||
Meroitic_Hieroglyphs (Mero),
|
||||
Miao (Miao),
|
||||
Modi (Modi),
|
||||
Mongolian (Mong),
|
||||
Mro (Mroo),
|
||||
Multani (Mult),
|
||||
Myanmar (Mymr),
|
||||
Nabataean (Nbar),
|
||||
Nandinagari (Nand),
|
||||
New_Tai_Lue (Talu),
|
||||
Newa (Newa),
|
||||
Nko (Nkoo),
|
||||
Nushu (Nshu),
|
||||
Nyiakeng_Puachue_Hmong (Hmnp),
|
||||
Ogham (Ogam),
|
||||
Ol_Chiki (Olck),
|
||||
Old_Hungarian (Hung),
|
||||
Old_Italic (Olck),
|
||||
Old_North_Arabian (Narb),
|
||||
Old_Permic (Perm),
|
||||
Old_Persian (Xpeo),
|
||||
Old_Sogdian (Sogo),
|
||||
Old_South_Arabian (Sarb),
|
||||
Old_Turkic (Orkh),
|
||||
Old_Uyghur (Ougr),
|
||||
Oriya (Orya),
|
||||
Osage (Osge),
|
||||
Osmanya (Osma),
|
||||
Pahawh_Hmong (Hmng),
|
||||
Palmyrene (Palm),
|
||||
Pau_Cin_Hau (Pauc),
|
||||
Phags_Pa (Phag),
|
||||
Phoenician (Phnx),
|
||||
Psalter_Pahlavi (Phlp),
|
||||
Rejang (Rjng),
|
||||
Runic (Runr),
|
||||
Samaritan (Samr),
|
||||
Saurashtra (Saur),
|
||||
Sharada (Shrd),
|
||||
Shavian (Shaw),
|
||||
Siddham (Sidd),
|
||||
SignWriting (Sgnw),
|
||||
Sinhala (Sinh),
|
||||
Sogdian (Sogd),
|
||||
Sora_Sompeng (Sora),
|
||||
Soyombo (Soyo),
|
||||
Sundanese (Sund),
|
||||
Syloti_Nagri (Sylo),
|
||||
Syriac (Syrc),
|
||||
Tagalog (Tglg),
|
||||
Tagbanwa (Tagb),
|
||||
Tai_Le (Tale),
|
||||
Tai_Tham (Lana),
|
||||
Tai_Viet (Tavt),
|
||||
Takri (Takr),
|
||||
Tamil (Taml),
|
||||
Tangsa (Tngs),
|
||||
Tangut (Tang),
|
||||
Telugu (Telu),
|
||||
Thaana (Thaa),
|
||||
Thai (Thai),
|
||||
Tibetan (Tibt),
|
||||
Tifinagh (Tfng),
|
||||
Tirhuta (Tirh),
|
||||
Toto (Toto),
|
||||
Ugaritic (Ugar),
|
||||
Vai (Vaii),
|
||||
Vithkuqi (Vith),
|
||||
Wancho (Wcho),
|
||||
Warang_Citi (Wara),
|
||||
Yezidi (Yezi),
|
||||
Yi (Yiii),
|
||||
Zanabazar_Square (Zanb).
|
||||
course). You can obtain a list of these scripts by running this command:
|
||||
.sp
|
||||
pcre2test -LS
|
||||
.
|
||||
.
|
||||
.SH "BIDI_PROPERTIES FOR \ep AND \eP"
|
||||
.
|
||||
.SH "THE BIDI_CLASS PROPERTY FOR \ep AND \eP"
|
||||
.rs
|
||||
.sp
|
||||
\ep{Bidi_Control} matches a Bidi control character
|
||||
\ep{Bidi_C} matches a Bidi control character
|
||||
\ep{Bidi_Class:<class>} matches a character with the given class
|
||||
\ep{BC:<class>} matches a character with the given class
|
||||
.sp
|
||||
|
@ -728,6 +578,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 28 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
Last updated: 12 January 2022
|
||||
Copyright (c) 1997-2022 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -197,7 +197,17 @@ COMMAND LINE OPTIONS
|
|||
|
||||
-LM List modifiers: write a list of available pattern and subject
|
||||
modifiers to the standard output, then exit with zero exit
|
||||
code. All other options are ignored. If both -C and -LM are
|
||||
code. All other options are ignored. If both -C and any -Lx
|
||||
options are present, whichever is first is recognized.
|
||||
|
||||
-LP List properties: write a list of recognized Unicode proper-
|
||||
ties to the standard output, then exit with zero exit code.
|
||||
All other options are ignored. If both -C and any -Lx options
|
||||
are present, whichever is first is recognized.
|
||||
|
||||
-LS List scripts: write a list of recognized Unicode script names
|
||||
to the standard output, then exit with zero exit code. All
|
||||
other options are ignored. If both -C and any -Lx options are
|
||||
present, whichever is first is recognized.
|
||||
|
||||
-pattern modifier-list
|
||||
|
@ -1939,5 +1949,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 28 November 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
Last updated: 12 January 2022
|
||||
Copyright (c) 1997-2022 University of Cambridge.
|
||||
|
|