Documentation for script handling update

2021-12-22 15:02:26 +00:00
parent b29732063b
commit 944f0e10a1
7 changed files with 287 additions and 257 deletions
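
The documentation change in this commit describes three ways to match a script: \p{sc:Name}, \p{scx:Name}, and a bare \p{Name} that is treated as scx. The following is a toy model of that distinction in Python, not PCRE2 code. The table `SCRIPT_DATA`, the helpers `parse_property` and `matches_script`, and the way property-type names are normalised are all assumptions of this sketch; the three table entries are hand-copied, partial excerpts of Unicode script data and should be checked against the Unicode Character Database before being relied on.

```python
# Toy model of the three script-matching forms; this is NOT PCRE2 code.
# Each entry maps a code point to (basic script, script extensions).
# The table is a hand-written, partial excerpt (an assumption of this sketch).
SCRIPT_DATA = {
    0x1E900: ("Adlam", {"Adlam"}),            # ADLAM CAPITAL LETTER ALIF
    0x0640: ("Common", {"Adlam", "Arabic"}),  # ARABIC TATWEEL (list truncated)
    0x0041: ("Latin", {"Latin"}),             # LATIN CAPITAL LETTER A
}


def parse_property(item):
    """Split the text inside \\p{...} into (property type, script name)."""
    for sep in (":", "="):
        if sep in item:
            prop, name = item.split(sep, 1)
            # Normalisation of the type name is an assumption of this sketch.
            key = prop.lower().replace(" ", "").replace("_", "")
            if key in ("sc", "script"):
                return "sc", name
            if key in ("scx", "scriptextensions"):
                return "scx", name
            raise ValueError(f"unknown property type: {prop!r}")
    # A bare script name is treated as scx, as the updated text describes.
    return "scx", item


def matches_script(ch, item):
    """Return True if character ch would satisfy \\p{item} in this model."""
    prop, name = parse_property(item)
    basic, extensions = SCRIPT_DATA[ord(ch)]
    if prop == "sc":
        return basic == name
    # scx: the basic script matches, or the name is in the extensions list.
    return basic == name or name in extensions
```

In this model, `matches_script("\u0640", "sc:Adlam")` is false because the basic script of that character is Common, while `matches_script("\u0640", "scx:Adlam")` and `matches_script("\u0640", "Adlam")` are both true because Adlam appears in its extensions list.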
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -795,13 +795,18 @@ Note that \P{Any} does not match any characters, so always causes a match
 failure.
 </P>
 <P>
-Sets of Unicode characters are defined as belonging to certain scripts. A
-character from one of these sets can be matched using a script name. For
-example:
-<pre>
-  \p{Greek}
-  \P{Han}
-</pre>
+There are three different syntax forms for matching a script. Each Unicode
+character has a basic script and, optionally, a list of other scripts ("Script
+Extensions") with which it is commonly used. Using the Adlam script as an
+example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
+\p{scx:Adlam} matches, in addition, characters that have Adlam in their
+extensions list. The full names "script" and "script extensions" for the
+property types are recognized, and an equals sign is an alternative to the
+colon. If a script name is given without a property type, for example,
+\p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
+interpretation at release 5.26 and PCRE2 changed at release 10.40.
+</P>
+<P>
 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
 greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
 part of an identified script are lumped together as "Common". The current list
@@ -3904,7 +3909,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC32" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 <br>
 Copyright &copy; 1997-2021 University of Cambridge.
 <br>
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -19,7 +19,7 @@ please consult the man page, in case the conversion went wrong.
 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
 <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
-<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
+<li><a name="TOC7" href="#SEC7">SCRIPT MATCHING WITH \p AND \P</a>
 <li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
 <li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
 <li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
@@ -158,6 +158,7 @@ matching" rules.
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
+  Lc         Ll, Lu, or Lt
  L&         Ll, Lu, or Lt

  M          Mark
@@ -204,7 +205,11 @@ matching" rules.
 Perl and POSIX space are now the same. Perl added VT to its space character set
 at release 5.18.
 </P>
-<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
+<br><a name="SEC7" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
+<P>
+The following script names are recognized in \p{sc:...} or \p{scx:...} items,
+or on their own with \p (and also \P of course):
+</P>
 <P>
 Adlam,
 Ahom,
@@ -738,7 +743,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 <br>
 Copyright &copy; 1997-2021 University of Cambridge.
 <br>
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@@ -50,17 +50,18 @@ UNICODE PROPERTY SUPPORT
 <P>
 When PCRE2 is built with Unicode support, the escape sequences \p{..},
 \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
-The Unicode properties that can be tested are limited to the general category
-properties such as Lu for an upper case letter or Nd for a decimal number, the
-Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
-derived properties Any and LC (synonym L&). Full lists are given in the
+The Unicode properties that can be tested are a subset of those that Perl
+supports. Currently they are limited to the general category properties such as
+Lu for an upper case letter or Nd for a decimal number, the Unicode script
+names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
+properties Any and LC (synonym L&). Full lists are given in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 and
 <a href="pcre2syntax.html"><b>pcre2syntax</b></a>
-documentation. Only the short names for properties are supported. For example,
-\p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported.
-Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
-compatibility with Perl 5.6. PCRE2 does not support this.
+documentation. In general, only the short names for properties are supported.
+For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
+supported. Furthermore, in Perl, many properties may optionally be prefixed by
+"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
 </P>
 <br><b>
 WIDE CHARACTERS AND UTF MODES
@@ -477,7 +478,7 @@ AUTHOR
 <P>
 Philip Hazel
 <br>
-University Computing Service
+Retired from University Computing Service
 <br>
 Cambridge, England.
 <br>
@@ -486,7 +487,7 @@ Cambridge, England.
 REVISION
 </b><br>
 <P>
-Last updated: 08 December 2021
+Last updated: 22 December 2021
 <br>
 Copyright &copy; 1997-2021 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -6905,12 +6905,17 @@ BACKSLASH
       calSymbols"  are  not  supported  by PCRE2.  Note that \P{Any} does not
       match any characters, so always causes a match failure.

-       Sets of Unicode characters are defined as belonging to certain scripts.
-       A  character from one of these sets can be matched using a script name.
-       For example:
-
-         \p{Greek}
-         \P{Han}
+       There are three different syntax forms for matching a script. Each Uni-
+       code  character  has  a  basic  script and, optionally, a list of other
+       scripts ("Script Extensions") with which it is commonly used. Using the
+       Adlam script as an example, \p{sc:Adlam} matches characters whose basic
+       script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
+       that  have  Adlam in their extensions list. The full names "script" and
+       "script extensions" for the property types are recognized, and an equals
+       sign  is an alternative to the colon. If a script name is given without
+       a property type, for example, \p{Adlam}, it is  treated  as  \p{scx:Ad-
+       lam}.  Perl  changed  to  this interpretation at release 5.26 and PCRE2
+       changed at release 10.40.

       Unassigned characters (and in non-UTF 32-bit mode, characters with code
       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
@@ -9702,7 +9707,7 @@ AUTHOR

 REVISION

-       Last updated: 10 December 2021
+       Last updated: 22 December 2021
       Copyright (c) 1997-2021 University of Cambridge.
 ------------------------------------------------------------------------------
 
@@ -10670,6 +10675,7 @@ GENERAL CATEGORY PROPERTIES FOR \p and \P
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
+         Lc         Ll, Lu, or Lt
         L&         Ll, Lu, or Lt

         M          Mark
@@ -10716,7 +10722,10 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
       acter set at release 5.18.


-SCRIPT NAMES FOR \p AND \P
+SCRIPT MATCHING WITH \p AND \P
+
+       The  following script names are recognized in \p{sc:...} or \p{scx:...}
+       items, or on their own with \p (and also \P of course):

       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali-
       nese,  Bamum,  Bassa_Vah,  Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
@@ -11108,7 +11117,7 @@ AUTHOR

 REVISION

-       Last updated: 10 December 2021
+       Last updated: 22 December 2021
       Copyright (c) 1997-2021 University of Cambridge.
 ------------------------------------------------------------------------------
 
@@ -11151,16 +11160,17 @@ UNICODE PROPERTY SUPPORT

       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
-       ting.   The  Unicode  properties  that can be tested are limited to the
-       general category properties such as Lu for an upper case letter  or  Nd
-       for  a  decimal number, the Unicode script names such as Arabic or Han,
-       Bidi_Class, Bidi_Control, and the derived properties Any and  LC  (syn-
-       onym L&). Full lists are given in the pcre2pattern and pcre2syntax doc-
-       umentation. Only the short names for properties are supported. For  ex-
-       ample,  \p{L}  matches a letter. Its longer synonym, \p{Letter}, is not
-       supported.  Furthermore, in Perl, many  properties  may  optionally  be
-       prefixed  by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
-       port this.
+       ting.   The Unicode properties that can be tested are a subset of those
+       that Perl supports. Currently they are limited to the general  category
+       properties such as Lu for an upper case letter or Nd for a decimal num-
+       ber, the Unicode script  names  such  as  Arabic  or  Han,  Bidi_Class,
+       Bidi_Control,  and the derived properties Any and LC (synonym L&). Full
+       lists are given in the pcre2pattern and pcre2syntax  documentation.  In
+       general,  only the short names for properties are supported.  For exam-
+       ple, \p{L} matches a letter. Its longer  synonym,  \p{Letter},  is  not
+       supported. Furthermore, in Perl, many properties may optionally be pre-
+       fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not  support
+       this.


 WIDE CHARACTERS AND UTF MODES
@@ -11538,13 +11548,13 @@ MATCHING IN INVALID UTF STRINGS
 AUTHOR

       Philip Hazel
-       University Computing Service
+       Retired from University Computing Service
       Cambridge, England.


 REVISION

-       Last updated: 08 December 2021
+       Last updated: 22 December 2021
       Copyright (c) 1997-2021 University of Cambridge.
 ------------------------------------------------------------------------------
 
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
+.TH PCRE2PATTERN 3 "22 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -793,13 +793,17 @@ Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
 Note that \eP{Any} does not match any characters, so always causes a match
 failure.
 .P
-Sets of Unicode characters are defined as belonging to certain scripts. A
-character from one of these sets can be matched using a script name. For
-example:
-.sp
-  \ep{Greek}
-  \eP{Han}
-.sp
+There are three different syntax forms for matching a script. Each Unicode
+character has a basic script and, optionally, a list of other scripts ("Script
+Extensions") with which it is commonly used. Using the Adlam script as an
+example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
+\ep{scx:Adlam} matches, in addition, characters that have Adlam in their
+extensions list. The full names "script" and "script extensions" for the
+property types are recognized, and an equals sign is an alternative to the
+colon. If a script name is given without a property type, for example,
+\ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
+interpretation at release 5.26 and PCRE2 changed at release 10.40.
+.P
 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
 greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
 part of an identified script are lumped together as "Common". The current list
@@ -3952,6 +3956,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
+.TH PCRE2SYNTAX 3 "22 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -124,6 +124,7 @@ matching" rules.
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
+  Lc         Ll, Lu, or Lt
  L&         Ll, Lu, or Lt
 .sp
  M          Mark
@@ -171,9 +172,12 @@ Perl and POSIX space are now the same. Perl added VT to its space character set
 at release 5.18.
 .
 .
-.SH "SCRIPT NAMES FOR \ep AND \eP"
+.SH "SCRIPT MATCHING WITH \ep AND \eP"
 .rs
 .sp
+The following script names are recognized in \ep{sc:...} or \ep{scx:...} items,
+or on their own with \ep (and also \eP of course):
+.P
 Adlam,
 Ahom,
 Anatolian_Hieroglyphs,
@@ -723,6 +727,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi
--- a/doc/pcre2unicode.3
+++ b/doc/pcre2unicode.3
@@ -1,4 +1,4 @@
-.TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40"
+.TH PCRE2UNICODE 3 "22 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE - Perl-compatible regular expressions (revised API)
 .SH "UNICODE AND UTF SUPPORT"
@@ -40,10 +40,11 @@ handled, as documented below.
 .sp
 When PCRE2 is built with Unicode support, the escape sequences \ep{..},
 \eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
-The Unicode properties that can be tested are limited to the general category
-properties such as Lu for an upper case letter or Nd for a decimal number, the
-Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
-derived properties Any and LC (synonym L&). Full lists are given in the
+The Unicode properties that can be tested are a subset of those that Perl
+supports. Currently they are limited to the general category properties such as
+Lu for an upper case letter or Nd for a decimal number, the Unicode script
+names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
+properties Any and LC (synonym L&). Full lists are given in the
 .\" HREF
 \fBpcre2pattern\fP
 .\"
@@ -51,10 +52,10 @@ and
 .\" HREF
 \fBpcre2syntax\fP
 .\"
-documentation. Only the short names for properties are supported. For example,
-\ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported.
-Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
-compatibility with Perl 5.6. PCRE2 does not support this.
+documentation. In general, only the short names for properties are supported.
+For example, \ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not
+supported. Furthermore, in Perl, many properties may optionally be prefixed by
+"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
 .
 .
 .SH "WIDE CHARACTERS AND UTF MODES"
@@ -448,7 +449,7 @@ can be useful when searching for UTF text in executable or other binary files.
 .sp
 .nf
 Philip Hazel
-University Computing Service
+Retired from University Computing Service
 Cambridge, England.
 .fi
 .
@@ -457,6 +458,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 08 December 2021
+Last updated: 22 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi