Documentation for script handling update

Philip Hazel 2021-12-22 15:02:26 +00:00
parent b29732063b
commit 944f0e10a1
7 changed files with 287 additions and 257 deletions
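
For readers of this commit, the sc/scx distinction that the updated text describes can be pictured with a small toy model. This is only a sketch of the rule as worded in the documentation, not PCRE2's implementation: the table, the function names `matches_sc` and `matches_scx`, and the choice of sample characters are all assumptions made for illustration, and the extensions list shown for U+0640 is deliberately partial.

```python
# Toy model of basic script (sc) versus script extensions (scx).
# SCRIPT_DATA is a hand-written, hypothetical table for three characters;
# it is NOT PCRE2's internal data, and the scx set for U+0640 is partial.
SCRIPT_DATA = {
    "\U0001E900": ("Adlam", set()),               # ADLAM CAPITAL LETTER ALIF
    "\u0640": ("Common", {"Adlam", "Arabic"}),    # ARABIC TATWEEL
    "A": ("Latin", set()),                        # LATIN CAPITAL LETTER A
}


def matches_sc(ch, script):
    """Model of \\p{sc:NAME}: true only if the basic script is NAME."""
    basic, _extensions = SCRIPT_DATA[ch]
    return basic == script


def matches_scx(ch, script):
    """Model of \\p{scx:NAME}: basic script is NAME, or NAME is in the
    character's extensions list."""
    basic, extensions = SCRIPT_DATA[ch]
    return basic == script or script in extensions


def matches_bare(ch, script):
    """Model of \\p{NAME} with no property type, which the updated
    documentation says is treated as \\p{scx:NAME}."""
    return matches_scx(ch, script)
```

In this model, the Adlam letter satisfies both forms, while the tatweel (basic script Common) satisfies only the scx form and the bare form.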

@ -795,13 +795,18 @@ Note that \P{Any} does not match any characters, so always causes a match
failure.
</P>
<P>
Sets of Unicode characters are defined as belonging to certain scripts. A
character from one of these sets can be matched using a script name. For
example:
<pre>
\p{Greek}
\P{Han}
</pre>
There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script
Extensions") with which it is commonly used. Using the Adlam script as an
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
\p{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the
property types are recognized, and an equals sign is an alternative to the
colon. If a script name is given without a property type, for example,
\p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
interpretation at release 5.26 and PCRE2 changed at release 10.40.
</P>
<P>
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
@ -3904,7 +3909,7 @@ Cambridge, England.
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 10 December 2021
Last updated: 22 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>

@ -19,7 +19,7 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT MATCHING WITH \p AND \P</a>
<li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
<li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
@ -158,6 +158,7 @@ matching" rules.
Lo Other letter
Lt Title case letter
Lu Upper case letter
Lc Ll, Lu, or Lt
L& Ll, Lu, or Lt
M Mark
@ -204,7 +205,11 @@ matching" rules.
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<br><a name="SEC7" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
<P>
The following script names are recognized in \p{sc:...} or \p{scx:...} items,
or on their own with \p (and also \P of course):
</P>
<P>
Adlam,
Ahom,
@ -738,7 +743,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 10 December 2021
Last updated: 22 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>

@ -50,17 +50,18 @@ UNICODE PROPERTY SUPPORT
<P>
When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category
properties such as Lu for an upper case letter or Nd for a decimal number, the
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
derived properties Any and LC (synonym L&). Full lists are given in the
The Unicode properties that can be tested are a subset of those that Perl
supports. Currently they are limited to the general category properties such as
Lu for an upper case letter or Nd for a decimal number, the Unicode script
names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
properties Any and LC (synonym L&). Full lists are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
and
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
documentation. Only the short names for properties are supported. For example,
\p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE2 does not support this.
documentation. In general, only the short names for properties are supported.
For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
supported. Furthermore, in Perl, many properties may optionally be prefixed by
"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
</P>
<br><b>
WIDE CHARACTERS AND UTF MODES
@ -477,7 +478,7 @@ AUTHOR
<P>
Philip Hazel
<br>
University Computing Service
Retired from University Computing Service
<br>
Cambridge, England.
<br>
@ -486,7 +487,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 08 December 2021
Last updated: 22 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>

@ -6905,12 +6905,17 @@ BACKSLASH
calSymbols" are not supported by PCRE2. Note that \P{Any} does not
match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.
For example:
\p{Greek}
\P{Han}
There are three different syntax forms for matching a script. Each Uni-
code character has a basic script and, optionally, a list of other
scripts ("Script Extensions") with which it is commonly used. Using the
Adlam script as an example, \p{sc:Adlam} matches characters whose basic
script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
that have Adlam in their extensions list. The full names "script" and
"script extensions" for the property types are recognized, and an equals
sign is an alternative to the colon. If a script name is given without
a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
lam}. Perl changed to this interpretation at release 5.26 and PCRE2
changed at release 10.40.
Unassigned characters (and in non-UTF 32-bit mode, characters with code
points greater than 0x10FFFF) are assigned the "Unknown" script. Others
@ -9702,7 +9707,7 @@ AUTHOR
REVISION
Last updated: 10 December 2021
Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------
@ -10670,6 +10675,7 @@ GENERAL CATEGORY PROPERTIES FOR \p and \P
Lo Other letter
Lt Title case letter
Lu Upper case letter
Lc Ll, Lu, or Lt
L& Ll, Lu, or Lt
M Mark
@ -10716,7 +10722,10 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
acter set at release 5.18.
SCRIPT NAMES FOR \p AND \P
SCRIPT MATCHING WITH \p AND \P
The following script names are recognized in \p{sc:...} or \p{scx:...}
items, or on their own with \p (and also \P of course):
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
@ -11108,7 +11117,7 @@ AUTHOR
REVISION
Last updated: 10 December 2021
Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------
@ -11151,16 +11160,17 @@ UNICODE PROPERTY SUPPORT
When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
ting. The Unicode properties that can be tested are limited to the
general category properties such as Lu for an upper case letter or Nd
for a decimal number, the Unicode script names such as Arabic or Han,
Bidi_Class, Bidi_Control, and the derived properties Any and LC (syn-
onym L&). Full lists are given in the pcre2pattern and pcre2syntax doc-
umentation. Only the short names for properties are supported. For ex-
ample, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
supported. Furthermore, in Perl, many properties may optionally be
prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
port this.
ting. The Unicode properties that can be tested are a subset of those
that Perl supports. Currently they are limited to the general category
properties such as Lu for an upper case letter or Nd for a decimal num-
ber, the Unicode script names such as Arabic or Han, Bidi_Class,
Bidi_Control, and the derived properties Any and LC (synonym L&). Full
lists are given in the pcre2pattern and pcre2syntax documentation. In
general, only the short names for properties are supported. For exam-
ple, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
supported. Furthermore, in Perl, many properties may optionally be pre-
fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not support
this.
WIDE CHARACTERS AND UTF MODES
@ -11538,13 +11548,13 @@ MATCHING IN INVALID UTF STRINGS
AUTHOR
Philip Hazel
University Computing Service
Retired from University Computing Service
Cambridge, England.
REVISION
Last updated: 08 December 2021
Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
.TH PCRE2PATTERN 3 "22 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -793,13 +793,17 @@ Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
Note that \eP{Any} does not match any characters, so always causes a match
failure.
.P
Sets of Unicode characters are defined as belonging to certain scripts. A
character from one of these sets can be matched using a script name. For
example:
.sp
\ep{Greek}
\eP{Han}
.sp
There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script
Extensions") with which it is commonly used. Using the Adlam script as an
example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
\ep{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the
property types are recognized, and an equals sign is an alternative to the
colon. If a script name is given without a property type, for example,
\ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
interpretation at release 5.26 and PCRE2 changed at release 10.40.
.P
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
@ -3952,6 +3956,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 10 December 2021
Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
.TH PCRE2SYNTAX 3 "22 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -124,6 +124,7 @@ matching" rules.
Lo Other letter
Lt Title case letter
Lu Upper case letter
Lc Ll, Lu, or Lt
L& Ll, Lu, or Lt
.sp
M Mark
@ -171,9 +172,12 @@ Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.SH "SCRIPT MATCHING WITH \ep AND \eP"
.rs
.sp
The following script names are recognized in \ep{sc:...} or \ep{scx:...} items,
or on their own with \ep (and also \eP of course):
.P
Adlam,
Ahom,
Anatolian_Hieroglyphs,
@ -723,6 +727,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 10 December 2021
Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40"
.TH PCRE2UNICODE 3 "22 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@ -40,10 +40,11 @@ handled, as documented below.
.sp
When PCRE2 is built with Unicode support, the escape sequences \ep{..},
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category
properties such as Lu for an upper case letter or Nd for a decimal number, the
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
derived properties Any and LC (synonym L&). Full lists are given in the
The Unicode properties that can be tested are a subset of those that Perl
supports. Currently they are limited to the general category properties such as
Lu for an upper case letter or Nd for a decimal number, the Unicode script
names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
properties Any and LC (synonym L&). Full lists are given in the
.\" HREF
\fBpcre2pattern\fP
.\"
@ -51,10 +52,10 @@ and
.\" HREF
\fBpcre2syntax\fP
.\"
documentation. Only the short names for properties are supported. For example,
\ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE2 does not support this.
documentation. In general, only the short names for properties are supported.
For example, \ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not
supported. Furthermore, in Perl, many properties may optionally be prefixed by
"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
.
.
.SH "WIDE CHARACTERS AND UTF MODES"
@ -448,7 +449,7 @@ can be useful when searching for UTF text in executable or other binary files.
.sp
.nf
Philip Hazel
University Computing Service
Retired from University Computing Service
Cambridge, England.
.fi
.
@ -457,6 +458,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 08 December 2021
Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi