Documentation for script handling update

2021-12-22 15:02:26 +00:00 · 2021-12-22 15:02:26 +00:00 · 944f0e10a1
parent b29732063b
commit 944f0e10a1
7 changed files with 287 additions and 257 deletions
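
Illustration of the documented behaviour (not part of the commit): the sketch below models the three syntax forms that the updated text describes. It is a toy model in Python, not PCRE2 code. The table TOY_UCD and the helpers parse_item and item_matches are hypothetical names made up for this note, and the table holds only three hand-picked characters with an abbreviated extensions list. The matching rule follows the wording in the diff: sc tests the basic script only, scx also accepts characters that list the script in their extensions, and a bare script name is treated as scx.

```python
import re

# Toy data for illustration only. A real implementation would be driven by
# the Unicode Scripts.txt and ScriptExtensions.txt tables.
# Each entry maps a code point to (basic script, script extensions list).
TOY_UCD = {
    0x1E900: ("Adlam", set()),                           # ADLAM CAPITAL LETTER ALIF
    0x0640: ("Common", {"Adlam", "Arabic", "Syriac"}),   # ARABIC TATWEEL (list abbreviated)
    0x0041: ("Latin", set()),                            # LATIN CAPITAL LETTER A
}

# \p or \P, an optional property type followed by ':' or '=', then a script name.
ITEM = re.compile(r"\\([pP])\{(?:(sc|script|scx|script extensions)[:=])?(\w+)\}")


def parse_item(item):
    """Return (negated, property_type, script) for a \\p{...} or \\P{...} item."""
    m = ITEM.fullmatch(item)
    if m is None:
        raise ValueError(f"not a script property item: {item!r}")
    letter, ptype, script = m.groups()
    # A script name without a property type is treated as scx.
    if ptype in (None, "scx", "script extensions"):
        ptype = "scx"
    else:
        ptype = "sc"
    return letter == "P", ptype, script


def item_matches(item, ch):
    """Apply the sc/scx rule described in the documentation to one character."""
    negated, ptype, script = parse_item(item)
    basic, extensions = TOY_UCD[ord(ch)]
    hit = basic == script
    if ptype == "scx":
        hit = hit or script in extensions
    return hit != negated
```

With this toy table, item_matches(r"\p{sc:Adlam}", "\u0640") is False because the basic script of U+0640 is Common, while item_matches(r"\p{scx:Adlam}", "\u0640") and item_matches(r"\p{Adlam}", "\u0640") are both True because Adlam appears in its extensions list.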
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -795,13 +795,18 @@ Note that \P{Any} does not match any characters, so always causes a match
 failure.
 </P>
 <P>
-Sets of Unicode characters are defined as belonging to certain scripts. A
+There are three different syntax forms for matching a script. Each Unicode
-character from one of these sets can be matched using a script name. For
+character has a basic script and, optionally, a list of other scripts ("Script
-example:
+Extensions") with which it is commonly used. Using the Adlam script as an
-<pre>
+example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
-  \p{Greek}
+\p{scx:Adlam} matches, in addition, characters that have Adlam in their
-  \P{Han}
+extensions list. The full names "script" and "script extensions" for the
-</pre>
+property types are recognized, and an equals sign is an alternative to the
 colon. If a script name is given without a property type, for example,
 \p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
 interpretation at release 5.26 and PCRE2 changed at release 10.40.
 </P>
 <P>
 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
 greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
 part of an identified script are lumped together as "Common". The current list
@@ -3904,7 +3909,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC32" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 <br>
 Copyright &copy; 1997-2021 University of Cambridge.
 <br>
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -19,7 +19,7 @@ please consult the man page, in case the conversion went wrong.
 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
 <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
-<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
+<li><a name="TOC7" href="#SEC7">SCRIPT MATCHING WITH \p AND \P</a>
 <li><a name="TOC8" href="#SEC8">BIDI_PROPERTIES FOR \p AND \P</a>
 <li><a name="TOC9" href="#SEC9">CHARACTER CLASSES</a>
 <li><a name="TOC10" href="#SEC10">QUANTIFIERS</a>
@@ -158,6 +158,7 @@ matching" rules.
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  Lc         Ll, Lu, or Lt
  L&         Ll, Lu, or Lt
  M          Mark
@@ -204,7 +205,11 @@ matching" rules.
 Perl and POSIX space are now the same. Perl added VT to its space character set
 at release 5.18.
 </P>
-<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
+<br><a name="SEC7" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
 <P>
 The following script names are recognized in \p{sc:...} or \p{scx:...} items,
 or on their own with \p (and also \P of course):
 </P>
 <P>
 Adlam,
 Ahom,
@@ -738,7 +743,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 <br>
 Copyright &copy; 1997-2021 University of Cambridge.
 <br>
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@@ -50,17 +50,18 @@ UNICODE PROPERTY SUPPORT
 <P>
 When PCRE2 is built with Unicode support, the escape sequences \p{..},
 \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
-The Unicode properties that can be tested are limited to the general category
+The Unicode properties that can be tested are a subset of those that Perl
-properties such as Lu for an upper case letter or Nd for a decimal number, the
+supports. Currently they are limited to the general category properties such as
-Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
+Lu for an upper case letter or Nd for a decimal number, the Unicode script
-derived properties Any and LC (synonym L&). Full lists are given in the
+names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
 properties Any and LC (synonym L&). Full lists are given in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 and
 <a href="pcre2syntax.html"><b>pcre2syntax</b></a>
-documentation. Only the short names for properties are supported. For example,
+documentation. In general, only the short names for properties are supported.
-\p{L} matches a letter. Its longer synonym, \p{Letter}, is not supported.
+For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
-Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
+supported. Furthermore, in Perl, many properties may optionally be prefixed by
-compatibility with Perl 5.6. PCRE2 does not support this.
+"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
 </P>
 <br><b>
 WIDE CHARACTERS AND UTF MODES
@@ -477,7 +478,7 @@ AUTHOR
 <P>
 Philip Hazel
 <br>
-University Computing Service
+Retired from University Computing Service
 <br>
 Cambridge, England.
 <br>
@@ -486,7 +487,7 @@ Cambridge, England.
 REVISION
 </b><br>
 <P>
-Last updated: 08 December 2021
+Last updated: 22 December 2021
 <br>
 Copyright &copy; 1997-2021 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -6905,12 +6905,17 @@ BACKSLASH
       calSymbols"  are  not  supported  by PCRE2.  Note that \P{Any} does not
       match any characters, so always causes a match failure.
-       Sets of Unicode characters are defined as belonging to certain scripts.
+       There are three different syntax forms for matching a script. Each Uni-
-       A  character from one of these sets can be matched using a script name.
+       code  character  has  a  basic  script and, optionally, a list of other
-       For example:
+       scripts ("Script Extensions") with which it is commonly used. Using the
-
+       Adlam script as an example, \p{sc:Adlam} matches characters whose basic
-         \p{Greek}
+       script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
-         \P{Han}
+       that  have  Adlam in their extensions list. The full names "script" and
       "script extensions" for the property types are recognized, and an equals
       sign  is an alternative to the colon. If a script name is given without
       a property type, for example, \p{Adlam}, it is  treated  as  \p{scx:Ad-
       lam}.  Perl  changed  to  this interpretation at release 5.26 and PCRE2
       changed at release 10.40.
       Unassigned characters (and in non-UTF 32-bit mode, characters with code
       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
@@ -9702,7 +9707,7 @@ AUTHOR
 REVISION
-       Last updated: 10 December 2021
+       Last updated: 22 December 2021
       Copyright (c) 1997-2021 University of Cambridge.
 ------------------------------------------------------------------------------
@@ -10670,6 +10675,7 @@ GENERAL CATEGORY PROPERTIES FOR \p and \P
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         Lc         Ll, Lu, or Lt
         L&         Ll, Lu, or Lt
         M          Mark
@@ -10716,7 +10722,10 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
       acter set at release 5.18.
-SCRIPT NAMES FOR \p AND \P
+SCRIPT MATCHING WITH \p AND \P
       The  following script names are recognized in \p{sc:...} or \p{scx:...}
       items, or on their own with \p (and also \P of course):
       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali-
       nese,  Bamum,  Bassa_Vah,  Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
@@ -11108,7 +11117,7 @@ AUTHOR
 REVISION
-       Last updated: 10 December 2021
+       Last updated: 22 December 2021
       Copyright (c) 1997-2021 University of Cambridge.
 ------------------------------------------------------------------------------
@@ -11151,16 +11160,17 @@ UNICODE PROPERTY SUPPORT
       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
-       ting.   The  Unicode  properties  that can be tested are limited to the
+       ting.   The Unicode properties that can be tested are a subset of those
-       general category properties such as Lu for an upper case letter  or  Nd
+       that Perl supports. Currently they are limited to the general  category
-       for  a  decimal number, the Unicode script names such as Arabic or Han,
+       properties such as Lu for an upper case letter or Nd for a decimal num-
-       Bidi_Class, Bidi_Control, and the derived properties Any and  LC  (syn-
+       ber, the Unicode script  names  such  as  Arabic  or  Han,  Bidi_Class,
-       onym L&). Full lists are given in the pcre2pattern and pcre2syntax doc-
+       Bidi_Control,  and the derived properties Any and LC (synonym L&). Full
-       umentation. Only the short names for properties are supported. For  ex-
+       lists are given in the pcre2pattern and pcre2syntax  documentation.  In
-       ample,  \p{L}  matches a letter. Its longer synonym, \p{Letter}, is not
+       general,  only the short names for properties are supported.  For exam-
-       supported.  Furthermore, in Perl, many  properties  may  optionally  be
+       ple, \p{L} matches a letter. Its longer  synonym,  \p{Letter},  is  not
-       prefixed  by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
+       supported. Furthermore, in Perl, many properties may optionally be pre-
-       port this.
+       fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not  support
       this.
 WIDE CHARACTERS AND UTF MODES
@@ -11538,13 +11548,13 @@ MATCHING IN INVALID UTF STRINGS
 AUTHOR
       Philip Hazel
-       University Computing Service
+       Retired from University Computing Service
       Cambridge, England.
 REVISION
-       Last updated: 08 December 2021
+       Last updated: 22 December 2021
       Copyright (c) 1997-2021 University of Cambridge.
 ------------------------------------------------------------------------------
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
+.TH PCRE2PATTERN 3 "22 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -793,13 +793,17 @@ Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
 Note that \eP{Any} does not match any characters, so always causes a match
 failure.
 .P
-Sets of Unicode characters are defined as belonging to certain scripts. A
+There are three different syntax forms for matching a script. Each Unicode
-character from one of these sets can be matched using a script name. For
+character has a basic script and, optionally, a list of other scripts ("Script
-example:
+Extensions") with which it is commonly used. Using the Adlam script as an
-.sp
+example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
-  \ep{Greek}
+\ep{scx:Adlam} matches, in addition, characters that have Adlam in their
-  \eP{Han}
+extensions list. The full names "script" and "script extensions" for the
-.sp
+property types are recognized, and an equals sign is an alternative to the
 colon. If a script name is given without a property type, for example,
 \ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
 interpretation at release 5.26 and PCRE2 changed at release 10.40.
 .P
 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
 greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
 part of an identified script are lumped together as "Common". The current list
@@ -3952,6 +3956,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
+.TH PCRE2SYNTAX 3 "22 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -124,6 +124,7 @@ matching" rules.
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  Lc         Ll, Lu, or Lt
  L&         Ll, Lu, or Lt
 .sp
  M          Mark
@@ -171,9 +172,12 @@ Perl and POSIX space are now the same. Perl added VT to its space character set
 at release 5.18.
 .
 .
-.SH "SCRIPT NAMES FOR \ep AND \eP"
+.SH "SCRIPT MATCHING WITH \ep AND \eP"
 .rs
 .sp
 The following script names are recognized in \ep{sc:...} or \ep{scx:...} items,
 or on their own with \ep (and also \eP of course):
 .P
 Adlam,
 Ahom,
 Anatolian_Hieroglyphs,
@@ -723,6 +727,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 December 2021
+Last updated: 22 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi
--- a/doc/pcre2unicode.3
+++ b/doc/pcre2unicode.3
@@ -1,4 +1,4 @@
-.TH PCRE2UNICODE 3 "08 December 2021" "PCRE2 10.40"
+.TH PCRE2UNICODE 3 "22 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE - Perl-compatible regular expressions (revised API)
 .SH "UNICODE AND UTF SUPPORT"
@@ -40,10 +40,11 @@ handled, as documented below.
 .sp
 When PCRE2 is built with Unicode support, the escape sequences \ep{..},
 \eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
-The Unicode properties that can be tested are limited to the general category
+The Unicode properties that can be tested are a subset of those that Perl
-properties such as Lu for an upper case letter or Nd for a decimal number, the
+supports. Currently they are limited to the general category properties such as
-Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control, and the
+Lu for an upper case letter or Nd for a decimal number, the Unicode script
-derived properties Any and LC (synonym L&). Full lists are given in the
+names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
 properties Any and LC (synonym L&). Full lists are given in the
 .\" HREF
 \fBpcre2pattern\fP
 .\"
@@ -51,10 +52,10 @@ and
 .\" HREF
 \fBpcre2syntax\fP
 .\"
-documentation. Only the short names for properties are supported. For example,
+documentation. In general, only the short names for properties are supported.
-\ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not supported.
+For example, \ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not
-Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
+supported. Furthermore, in Perl, many properties may optionally be prefixed by
-compatibility with Perl 5.6. PCRE2 does not support this.
+"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
 .
 .
 .SH "WIDE CHARACTERS AND UTF MODES"
@@ -448,7 +449,7 @@ can be useful when searching for UTF text in executable or other binary files.
 .sp
 .nf
 Philip Hazel
-University Computing Service
+Retired from University Computing Service
 Cambridge, England.
 .fi
 .
@@ -457,6 +458,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 08 December 2021
+Last updated: 22 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi