Documentation for script handling update
parent b29732063b
commit 944f0e10a1
|
@ -795,13 +795,18 @@ Note that \P{Any} does not match any characters, so always causes a match
|
|||
failure.
|
||||
</P>
|
||||
<P>
|
||||
Sets of Unicode characters are defined as belonging to certain scripts. A
|
||||
character from one of these sets can be matched using a script name. For
|
||||
example:
|
||||
<pre>
|
||||
\p{Greek}
|
||||
\P{Han}
|
||||
</pre>
|
||||
There are three different syntax forms for matching a script. Each Unicode
|
||||
character has a basic script and, optionally, a list of other scripts ("Script
|
||||
Extensions") with which it is commonly used. Using the Adlam script as an
|
||||
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
|
||||
\p{scx:Adlam} matches, in addition, characters that have Adlam in their
|
||||
extensions list. The full names "script" and "script extensions" for the
|
||||
property types are recognized, and an equals sign is an alternative to the
|
||||
colon. If a script name is given without a property type, for example,
|
||||
\p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
|
||||
interpretation at release 5.26 and PCRE2 changed at release 10.40.
|
||||
</P>
|
||||
<P>
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
|
@ -3904,7 +3909,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 10 December 2021
|
||||
Last updated: 22 December 2021
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -19,7 +19,7 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
|
||||
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
|
||||
<li><a name="TOC7" href="#SEC7">SCRIPT MATCHING WITH \p AND \P</a>
|
||||
<li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
|
||||
<li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
|
||||
<li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
|
||||
|
@ -158,6 +158,7 @@ matching" rules.
|
|||
Lo Other letter
|
||||
Lt Title case letter
|
||||
Lu Upper case letter
|
||||
Lc Ll, Lu, or Lt
|
||||
L& Ll, Lu, or Lt
|
||||
|
||||
M Mark
|
||||
|
@ -204,7 +205,11 @@ matching" rules.
|
|||
Perl and POSIX space are now the same. Perl added VT to its space character set
|
||||
at release 5.18.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
|
||||
<br><a name="SEC7" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
|
||||
<P>
|
||||
The following script names are recognized in \p{sc:...} or \p{scx:...} items,
|
||||
or on their own with \p (and also \P of course):
|
||||
</P>
|
||||
<P>
|
||||
Adlam,
|
||||
Ahom,
|
||||
|
@ -738,7 +743,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 10 December 2021
|
||||
Last updated: 22 December 2021
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -50,17 +50,18 @@ UNICODE PROPERTY SUPPORT
|
|||
<P>
|
||||
When PCRE2 is built with Unicode support, the escape sequences \p{..},
|
||||
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
|
||||
The Unicode properties that can be tested are limited to the general category
|
||||
properties such as Lu for an upper case letter or Nd for a decimal number, the
|
||||
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
|
||||
derived properties Any and LC (synonym L&). Full lists are given in the
|
||||
The Unicode properties that can be tested are a subset of those that Perl
|
||||
supports. Currently they are limited to the general category properties such as
|
||||
Lu for an upper case letter or Nd for a decimal number, the Unicode script
|
||||
names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
|
||||
properties Any and LC (synonym L&). Full lists are given in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
and
|
||||
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
|
||||
documentation. Only the short names for properties are supported. For example,
|
||||
\p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported.
|
||||
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
|
||||
compatibility with Perl 5.6. PCRE2 does not support this.
|
||||
documentation. In general, only the short names for properties are supported.
|
||||
For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
|
||||
supported. Furthermore, in Perl, many properties may optionally be prefixed by
|
||||
"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
|
||||
</P>
|
||||
<br><b>
|
||||
WIDE CHARACTERS AND UTF MODES
|
||||
|
@ -477,7 +478,7 @@ AUTHOR
|
|||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
|
@ -486,7 +487,7 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 08 December 2021
|
||||
Last updated: 22 December 2021
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -6905,12 +6905,17 @@ BACKSLASH
|
|||
calSymbols" are not supported by PCRE2. Note that \P{Any} does not
|
||||
match any characters, so always causes a match failure.
|
||||
|
||||
Sets of Unicode characters are defined as belonging to certain scripts.
|
||||
A character from one of these sets can be matched using a script name.
|
||||
For example:
|
||||
|
||||
\p{Greek}
|
||||
\P{Han}
|
||||
There are three different syntax forms for matching a script. Each Uni-
|
||||
code character has a basic script and, optionally, a list of other
|
||||
scripts ("Script Extensions") with which it is commonly used. Using the
|
||||
Adlam script as an example, \p{sc:Adlam} matches characters whose basic
|
||||
script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
|
||||
that have Adlam in their extensions list. The full names "script" and
|
||||
"script extensions" for the property types are recognized, and an equals
|
||||
sign is an alternative to the colon. If a script name is given without
|
||||
a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
|
||||
lam}. Perl changed to this interpretation at release 5.26 and PCRE2
|
||||
changed at release 10.40.
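
The three syntax forms described above can be illustrated with a small sketch. This is not PCRE2 code: normalize_script_item is a hypothetical helper, and the loose handling of case, spaces, and underscores in the property type is an assumption made for the example. It only shows how the body of a \p{...} script item maps to a property type and a script name.

```python
import re
from typing import Tuple

# Hypothetical helper, not part of PCRE2: maps the text inside \p{...}
# to a (property_type, script_name) pair using the rules in the paragraph above.
_ALIASES = {
    "sc": "sc",
    "script": "sc",
    "scx": "scx",
    "scriptextensions": "scx",
}

def normalize_script_item(body: str) -> Tuple[str, str]:
    """Return ('sc' or 'scx', script_name) for the body of a script item."""
    # Split on the first colon or equals sign, if there is one.
    parts = re.split(r"[:=]", body, maxsplit=1)
    if len(parts) == 1:
        # A bare script name is treated as Script Extensions (scx).
        return ("scx", parts[0].strip())
    prop, name = parts
    # Assumption: compare property types ignoring case, spaces, and underscores.
    key = prop.lower().replace(" ", "").replace("_", "")
    if key not in _ALIASES:
        raise ValueError(f"not a script property type: {prop!r}")
    return (_ALIASES[key], name.strip())
```

With this sketch, "sc:Adlam" and "script=Adlam" both map to the basic script test, while "scx:Adlam", "script extensions:Adlam", and a bare "Adlam" all map to the script extensions test.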
|
||||
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code
|
||||
points greater than 0x10FFFF) are assigned the "Unknown" script. Others
|
||||
|
@ -9702,7 +9707,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 10 December 2021
|
||||
Last updated: 22 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -10670,6 +10675,7 @@ GENERAL CATEGORY PROPERTIES FOR \p and \P
|
|||
Lo Other letter
|
||||
Lt Title case letter
|
||||
Lu Upper case letter
|
||||
Lc Ll, Lu, or Lt
|
||||
L& Ll, Lu, or Lt
|
||||
|
||||
M Mark
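
As a rough illustration of the derived property listed in this hunk, Lc and L& both stand for the union of Ll, Lu, and Lt. The sketch below uses Python's standard unicodedata module rather than PCRE2 itself, and is_cased_letter is a hypothetical name chosen for the example.

```python
import unicodedata

# Lc and L& are two spellings of the same derived property: Ll, Lu, or Lt.
CASED_LETTER = {"Ll", "Lu", "Lt"}

def is_cased_letter(ch: str) -> bool:
    """True if the character's general category is Ll, Lu, or Lt."""
    return unicodedata.category(ch) in CASED_LETTER
```

Under this definition a lower case, upper case, or title case letter qualifies, while a character in the Lo (other letter) or Nd (decimal number) category does not.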
|
||||
|
@ -10716,7 +10722,10 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
|
|||
acter set at release 5.18.
|
||||
|
||||
|
||||
SCRIPT NAMES FOR \p AND \P
|
||||
SCRIPT MATCHING WITH \p AND \P
|
||||
|
||||
The following script names are recognized in \p{sc:...} or \p{scx:...}
|
||||
items, or on their own with \p (and also \P of course):
|
||||
|
||||
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
|
||||
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
|
||||
|
@ -11108,7 +11117,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 10 December 2021
|
||||
Last updated: 22 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -11151,16 +11160,17 @@ UNICODE PROPERTY SUPPORT
|
|||
|
||||
When PCRE2 is built with Unicode support, the escape sequences \p{..},
|
||||
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
|
||||
ting. The Unicode properties that can be tested are limited to the
|
||||
general category properties such as Lu for an upper case letter or Nd
|
||||
for a decimal number, the Unicode script names such as Arabic or Han,
|
||||
Bidi_Class, Bidi_Control, and the derived properties Any and LC (syn-
|
||||
onym L&). Full lists are given in the pcre2pattern and pcre2syntax doc-
|
||||
umentation. Only the short names for properties are supported. For ex-
|
||||
ample, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
|
||||
supported. Furthermore, in Perl, many properties may optionally be
|
||||
prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
|
||||
port this.
|
||||
ting. The Unicode properties that can be tested are a subset of those
|
||||
that Perl supports. Currently they are limited to the general category
|
||||
properties such as Lu for an upper case letter or Nd for a decimal num-
|
||||
ber, the Unicode script names such as Arabic or Han, Bidi_Class,
|
||||
Bidi_Control, and the derived properties Any and LC (synonym L&). Full
|
||||
lists are given in the pcre2pattern and pcre2syntax documentation. In
|
||||
general, only the short names for properties are supported. For exam-
|
||||
ple, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
|
||||
supported. Furthermore, in Perl, many properties may optionally be pre-
|
||||
fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not support
|
||||
this.
|
||||
|
||||
|
||||
WIDE CHARACTERS AND UTF MODES
|
||||
|
@ -11538,13 +11548,13 @@ MATCHING IN INVALID UTF STRINGS
|
|||
AUTHOR
|
||||
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
Cambridge, England.
|
||||
|
||||
|
||||
REVISION
|
||||
|
||||
Last updated: 08 December 2021
|
||||
Last updated: 22 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
|
||||
.TH PCRE2PATTERN 3 "22 December 2021" "PCRE2 10.40"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -793,13 +793,17 @@ Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
|
|||
Note that \eP{Any} does not match any characters, so always causes a match
|
||||
failure.
|
||||
.P
|
||||
Sets of Unicode characters are defined as belonging to certain scripts. A
|
||||
character from one of these sets can be matched using a script name. For
|
||||
example:
|
||||
.sp
|
||||
\ep{Greek}
|
||||
\eP{Han}
|
||||
.sp
|
||||
There are three different syntax forms for matching a script. Each Unicode
|
||||
character has a basic script and, optionally, a list of other scripts ("Script
|
||||
Extensions") with which it is commonly used. Using the Adlam script as an
|
||||
example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
|
||||
\ep{scx:Adlam} matches, in addition, characters that have Adlam in their
|
||||
extensions list. The full names "script" and "script extensions" for the
|
||||
property types are recognized, and an equals sign is an alternative to the
|
||||
colon. If a script name is given without a property type, for example,
|
||||
\ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
|
||||
interpretation at release 5.26 and PCRE2 changed at release 10.40.
|
||||
.P
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
|
@ -3952,6 +3956,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 10 December 2021
|
||||
Last updated: 22 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
|
||||
.TH PCRE2SYNTAX 3 "22 December 2021" "PCRE2 10.40"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -124,6 +124,7 @@ matching" rules.
|
|||
Lo Other letter
|
||||
Lt Title case letter
|
||||
Lu Upper case letter
|
||||
Lc Ll, Lu, or Lt
|
||||
L& Ll, Lu, or Lt
|
||||
.sp
|
||||
M Mark
|
||||
|
@ -171,9 +172,12 @@ Perl and POSIX space are now the same. Perl added VT to its space character set
|
|||
at release 5.18.
|
||||
.
|
||||
.
|
||||
.SH "SCRIPT NAMES FOR \ep AND \eP"
|
||||
.SH "SCRIPT MATCHING WITH \ep AND \eP"
|
||||
.rs
|
||||
.sp
|
||||
The following script names are recognized in \ep{sc:...} or \ep{scx:...} items,
|
||||
or on their own with \ep (and also \eP of course):
|
||||
.P
|
||||
Adlam,
|
||||
Ahom,
|
||||
Anatolian_Hieroglyphs,
|
||||
|
@ -723,6 +727,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 10 December 2021
|
||||
Last updated: 22 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40"
|
||||
.TH PCRE2UNICODE 3 "22 December 2021" "PCRE2 10.40"
|
||||
.SH NAME
|
||||
PCRE - Perl-compatible regular expressions (revised API)
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
|
@ -40,10 +40,11 @@ handled, as documented below.
|
|||
.sp
|
||||
When PCRE2 is built with Unicode support, the escape sequences \ep{..},
|
||||
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
|
||||
The Unicode properties that can be tested are limited to the general category
|
||||
properties such as Lu for an upper case letter or Nd for a decimal number, the
|
||||
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
|
||||
derived properties Any and LC (synonym L&). Full lists are given in the
|
||||
The Unicode properties that can be tested are a subset of those that Perl
|
||||
supports. Currently they are limited to the general category properties such as
|
||||
Lu for an upper case letter or Nd for a decimal number, the Unicode script
|
||||
names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
|
||||
properties Any and LC (synonym L&). Full lists are given in the
|
||||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
|
@ -51,10 +52,10 @@ and
|
|||
.\" HREF
|
||||
\fBpcre2syntax\fP
|
||||
.\"
|
||||
documentation. Only the short names for properties are supported. For example,
|
||||
\ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported.
|
||||
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
|
||||
compatibility with Perl 5.6. PCRE2 does not support this.
|
||||
documentation. In general, only the short names for properties are supported.
|
||||
For example, \ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not
|
||||
supported. Furthermore, in Perl, many properties may optionally be prefixed by
|
||||
"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
|
||||
.
|
||||
.
|
||||
.SH "WIDE CHARACTERS AND UTF MODES"
|
||||
|
@ -448,7 +449,7 @@ can be useful when searching for UTF text in executable or other binary files.
|
|||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
Cambridge, England.
|
||||
.fi
|
||||
.
|
||||
|
@ -457,6 +458,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 08 December 2021
|
||||
Last updated: 22 December 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
.fi
|
||||
|
|