Documentation for script handling update

Philip Hazel 2021-12-22 15:02:26 +00:00
parent b29732063b
commit 944f0e10a1
7 changed files with 287 additions and 257 deletions

@ -795,13 +795,18 @@ Note that \P{Any} does not match any characters, so always causes a match
failure. failure.
</P> </P>
<P> <P>
Sets of Unicode characters are defined as belonging to certain scripts. A There are three different syntax forms for matching a script. Each Unicode
character from one of these sets can be matched using a script name. For character has a basic script and, optionally, a list of other scripts ("Script
example: Extensions") with which it is commonly used. Using the Adlam script as an
<pre> example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
\p{Greek} \p{scx:Adlam} matches, in addition, characters that have Adlam in their
\P{Han} extensions list. The full names "script" and "script extensions" for the
</pre> property types are recognized, and an equals sign is an alternative to the
colon. If a script name is given without a property type, for example,
\p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
interpretation at release 5.26 and PCRE2 changed at release 10.40.
</P>
<P>
Unassigned characters (and in non-UTF 32-bit mode, characters with code points Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list part of an identified script are lumped together as "Common". The current list
@ -3904,7 +3909,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 10 December 2021 Last updated: 22 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
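
Note on the script-matching paragraph added in the hunk above: the three syntax forms and the sc/scx distinction can be sketched in a few lines of Python. This is only an illustration of the documented rules, not PCRE2's parser. The names `parse_script_item`, `matches` and `_SAMPLE` are hypothetical, prefix spelling is assumed to be exact and case-sensitive, and the two sample code points carry Script and Script_Extensions values as I understand them from the Unicode Character Database (the extensions list for U+0640 is abridged).

```python
import re

# Property-type spellings named in the paragraph, mapped to a canonical kind.
_PREFIXES = {
    "sc": "sc",
    "script": "sc",
    "scx": "scx",
    "script extensions": "scx",
}

# \p{...} or \P{...}, with an optional "prefix:" or "prefix=" before the name.
_ITEM = re.compile(r"\\([pP])\{(?:([^:=}]+)[:=])?([^:=}]+)\}")

# Hypothetical two-entry table: code point -> (basic script, extensions).
# The extensions set for U+0640 is abridged for illustration.
_SAMPLE = {
    "\U0001E900": ("Adlam", set()),               # ADLAM CAPITAL LETTER ALIF
    "\u0640": ("Common", {"Adlam", "Arabic"}),    # ARABIC TATWEEL
}


def parse_script_item(item):
    """Return (negated, kind, script_name) for a script item."""
    m = _ITEM.fullmatch(item)
    if m is None:
        raise ValueError(f"not a script item: {item!r}")
    escape, prefix, name = m.groups()
    if prefix is None:
        kind = "scx"  # a bare script name is treated as scx
    elif prefix in _PREFIXES:
        kind = _PREFIXES[prefix]
    else:
        raise ValueError(f"unknown property type: {prefix!r}")
    return escape == "P", kind, name


def matches(item, ch):
    """Apply the documented rule to one character from _SAMPLE."""
    negated, kind, name = parse_script_item(item)
    basic, extensions = _SAMPLE[ch]
    # sc: basic script only; scx: basic script or the extensions list.
    hit = name == basic or (kind == "scx" and name in extensions)
    return hit != negated
```

Under these assumptions, `matches(r"\p{sc:Adlam}", "\u0640")` is false because the basic script of U+0640 is Common, while `matches(r"\p{Adlam}", "\u0640")` is true because the bare name is read as scx.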

@ -19,7 +19,7 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a> <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a> <li><a name="TOC7" href="#SEC7">SCRIPT MATCHING WITH \p AND \P</a>
<li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a> <li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
<li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a> <li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">QUANTIFIERS</a> <li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
@ -158,6 +158,7 @@ matching" rules.
Lo Other letter Lo Other letter
Lt Title case letter Lt Title case letter
Lu Upper case letter Lu Upper case letter
Lc Ll, Lu, or Lt
L& Ll, Lu, or Lt L& Ll, Lu, or Lt
M Mark M Mark
@ -204,7 +205,11 @@ matching" rules.
Perl and POSIX space are now the same. Perl added VT to its space character set Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18. at release 5.18.
</P> </P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br> <br><a name="SEC7" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
<P>
The following script names are recognized in \p{sc:...} or \p{scx:...} items,
or on their own with \p (and also \P of course):
</P>
<P> <P>
Adlam, Adlam,
Ahom, Ahom,
@ -738,7 +743,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 10 December 2021 Last updated: 22 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>

@ -50,17 +50,18 @@ UNICODE PROPERTY SUPPORT
<P> <P>
When PCRE2 is built with Unicode support, the escape sequences \p{..}, When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting. \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category The Unicode properties that can be tested are a subset of those that Perl
properties such as Lu for an upper case letter or Nd for a decimal number, the supports. Currently they are limited to the general category properties such as
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the Lu for an upper case letter or Nd for a decimal number, the Unicode script
derived properties Any and LC (synonym L&). Full lists are given in the names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
properties Any and LC (synonym L&). Full lists are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
and and
<a href="pcre2syntax.html"><b>pcre2syntax</b></a> <a href="pcre2syntax.html"><b>pcre2syntax</b></a>
documentation. Only the short names for properties are supported. For example, documentation. In general, only the short names for properties are supported.
\p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported. For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for supported. Furthermore, in Perl, many properties may optionally be prefixed by
compatibility with Perl 5.6. PCRE2 does not support this. "Is", for compatibility with Perl 5.6. PCRE2 does not support this.
</P> </P>
<br><b> <br><b>
WIDE CHARACTERS AND UTF MODES WIDE CHARACTERS AND UTF MODES
@ -477,7 +478,7 @@ AUTHOR
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
University Computing Service Retired from University Computing Service
<br> <br>
Cambridge, England. Cambridge, England.
<br> <br>
@ -486,7 +487,7 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 08 December 2021 Last updated: 22 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
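
Note on the UNICODE PROPERTY SUPPORT paragraph in the hunk above: the general category tests it mentions (Lu, Nd, L, and LC with its synonym L&) can be approximated with Python's standard `unicodedata` module. This is a rough sketch of the semantics only, not of PCRE2 itself. The helper name `has_general_category` is hypothetical, and results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def has_general_category(ch, prop):
    """True if ch's general category matches a short property name.

    A one-letter name such as "L" covers every category that starts with
    that letter; "LC" and "L&" cover Ll, Lu and Lt.
    """
    cat = unicodedata.category(ch)
    if prop in ("LC", "L&"):
        return cat in ("Ll", "Lu", "Lt")
    if len(prop) == 1:
        return cat.startswith(prop)
    return cat == prop
```

For example, `has_general_category("A", "Lu")` and `has_general_category("5", "Nd")` are both true, while a letter without case such as U+05D0 (HEBREW LETTER ALEF) satisfies "L" but not "LC".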

@ -6905,12 +6905,17 @@ BACKSLASH
calSymbols" are not supported by PCRE2. Note that \P{Any} does not calSymbols" are not supported by PCRE2. Note that \P{Any} does not
match any characters, so always causes a match failure. match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts. There are three different syntax forms for matching a script. Each Uni-
A character from one of these sets can be matched using a script name. code character has a basic script and, optionally, a list of other
For example: scripts ("Script Extensions") with which it is commonly used. Using the
Adlam script as an example, \p{sc:Adlam} matches characters whose basic
\p{Greek} script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
\P{Han} that have Adlam in their extensions list. The full names "script" and
"script extensions" for the property types are recognized, and an equals
sign is an alternative to the colon. If a script name is given without
a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
lam}. Perl changed to this interpretation at release 5.26 and PCRE2
changed at release 10.40.
Unassigned characters (and in non-UTF 32-bit mode, characters with code Unassigned characters (and in non-UTF 32-bit mode, characters with code
points greater than 0x10FFFF) are assigned the "Unknown" script. Others points greater than 0x10FFFF) are assigned the "Unknown" script. Others
@ -9702,7 +9707,7 @@ AUTHOR
REVISION REVISION
Last updated: 10 December 2021 Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -10670,6 +10675,7 @@ GENERAL CATEGORY PROPERTIES FOR \p and \P
Lo Other letter Lo Other letter
Lt Title case letter Lt Title case letter
Lu Upper case letter Lu Upper case letter
Lc Ll, Lu, or Lt
L& Ll, Lu, or Lt L& Ll, Lu, or Lt
M Mark M Mark
@ -10716,7 +10722,10 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
acter set at release 5.18. acter set at release 5.18.
SCRIPT NAMES FOR \p AND \P SCRIPT MATCHING WITH \p AND \P
The following script names are recognized in \p{sc:...} or \p{scx:...}
items, or on their own with \p (and also \P of course):
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali- Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi, nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
@ -11108,7 +11117,7 @@ AUTHOR
REVISION REVISION
Last updated: 10 December 2021 Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -11151,16 +11160,17 @@ UNICODE PROPERTY SUPPORT
When PCRE2 is built with Unicode support, the escape sequences \p{..}, When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set- \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
ting. The Unicode properties that can be tested are limited to the ting. The Unicode properties that can be tested are a subset of those
general category properties such as Lu for an upper case letter or Nd that Perl supports. Currently they are limited to the general category
for a decimal number, the Unicode script names such as Arabic or Han, properties such as Lu for an upper case letter or Nd for a decimal num-
Bidi_Class, Bidi_Control, and the derived properties Any and LC (syn- ber, the Unicode script names such as Arabic or Han, Bidi_Class,
onym L&). Full lists are given in the pcre2pattern and pcre2syntax doc- Bidi_Control, and the derived properties Any and LC (synonym L&). Full
umentation. Only the short names for properties are supported. For ex- lists are given in the pcre2pattern and pcre2syntax documentation. In
ample, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not general, only the short names for properties are supported. For exam-
supported. Furthermore, in Perl, many properties may optionally be ple, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not sup- supported. Furthermore, in Perl, many properties may optionally be pre-
port this. fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not support
this.
WIDE CHARACTERS AND UTF MODES WIDE CHARACTERS AND UTF MODES
@ -11538,13 +11548,13 @@ MATCHING IN INVALID UTF STRINGS
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service Retired from University Computing Service
Cambridge, England. Cambridge, England.
REVISION REVISION
Last updated: 08 December 2021 Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40" .TH PCRE2PATTERN 3 "22 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -793,13 +793,17 @@ Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
Note that \eP{Any} does not match any characters, so always causes a match Note that \eP{Any} does not match any characters, so always causes a match
failure. failure.
.P .P
Sets of Unicode characters are defined as belonging to certain scripts. A There are three different syntax forms for matching a script. Each Unicode
character from one of these sets can be matched using a script name. For character has a basic script and, optionally, a list of other scripts ("Script
example: Extensions") with which it is commonly used. Using the Adlam script as an
.sp example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
\ep{Greek} \ep{scx:Adlam} matches, in addition, characters that have Adlam in their
\eP{Han} extensions list. The full names "script" and "script extensions" for the
.sp property types are recognized, and an equals sign is an alternative to the
colon. If a script name is given without a property type, for example,
\ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
interpretation at release 5.26 and PCRE2 changed at release 10.40.
.P
Unassigned characters (and in non-UTF 32-bit mode, characters with code points Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list part of an identified script are lumped together as "Common". The current list
@ -3952,6 +3956,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 10 December 2021 Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40" .TH PCRE2SYNTAX 3 "22 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -124,6 +124,7 @@ matching" rules.
Lo Other letter Lo Other letter
Lt Title case letter Lt Title case letter
Lu Upper case letter Lu Upper case letter
Lc Ll, Lu, or Lt
L& Ll, Lu, or Lt L& Ll, Lu, or Lt
.sp .sp
M Mark M Mark
@ -171,9 +172,12 @@ Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18. at release 5.18.
. .
. .
.SH "SCRIPT NAMES FOR \ep AND \eP" .SH "SCRIPT MATCHING WITH \ep AND \eP"
.rs .rs
.sp .sp
The following script names are recognized in \ep{sc:...} or \ep{scx:...} items,
or on their own with \ep (and also \eP of course):
.P
Adlam, Adlam,
Ahom, Ahom,
Anatolian_Hieroglyphs, Anatolian_Hieroglyphs,
@ -723,6 +727,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 10 December 2021 Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40" .TH PCRE2UNICODE 3 "22 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE - Perl-compatible regular expressions (revised API) PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT" .SH "UNICODE AND UTF SUPPORT"
@ -40,10 +40,11 @@ handled, as documented below.
.sp .sp
When PCRE2 is built with Unicode support, the escape sequences \ep{..}, When PCRE2 is built with Unicode support, the escape sequences \ep{..},
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting. \eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are limited to the general category The Unicode properties that can be tested are a subset of those that Perl
properties such as Lu for an upper case letter or Nd for a decimal number, the supports. Currently they are limited to the general category properties such as
Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the Lu for an upper case letter or Nd for a decimal number, the Unicode script
derived properties Any and LC (synonym L&). Full lists are given in the names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
properties Any and LC (synonym L&). Full lists are given in the
.\" HREF .\" HREF
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
@ -51,10 +52,10 @@ and
.\" HREF .\" HREF
\fBpcre2syntax\fP \fBpcre2syntax\fP
.\" .\"
documentation. Only the short names for properties are supported. For example, documentation. In general, only the short names for properties are supported.
\ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported. For example, \ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for supported. Furthermore, in Perl, many properties may optionally be prefixed by
compatibility with Perl 5.6. PCRE2 does not support this. "Is", for compatibility with Perl 5.6. PCRE2 does not support this.
. .
. .
.SH "WIDE CHARACTERS AND UTF MODES" .SH "WIDE CHARACTERS AND UTF MODES"
@ -448,7 +449,7 @@ can be useful when searching for UTF text in executable or other binary files.
.sp .sp
.nf .nf
Philip Hazel Philip Hazel
University Computing Service Retired from University Computing Service
Cambridge, England. Cambridge, England.
.fi .fi
. .
@ -457,6 +458,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 08 December 2021 Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi