Documentation clarifications.

2015-01-26 14:21:45 +00:00 · 221cf10975
parent 5438fc8a6a
commit 221cf10975
7 changed files with 84 additions and 53 deletions
--- a/16
+++ b/16
@@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.

 . If you do not want to make use of the support for UTF-8 Unicode character
  strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-  library, and UTF-32 Unicode character strings in the 32-bit library, you can
+  library, or UTF-32 Unicode character strings in the 32-bit library, you can
  add --disable-unicode to the "configure" command. This reduces the size of
  the libraries. It is not possible to configure one library with Unicode
  support, and another without, in the same configuration.

  When Unicode support is available, the use of a UTF encoding still has to be
-  enabled by an option at run time. When PCRE2 is compiled with Unicode
-  support, its input can only either be ASCII or UTF-8/16/32, even when running
-  on EBCDIC platforms. It is not possible to use both --enable-unicode and
-  --enable-ebcdic at the same time.
+  enabled by setting the PCRE2_UTF option at run time or starting a pattern
+  with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+  either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+  not possible to use both --enable-unicode and --enable-ebcdic at the same
+  time.

  As well as supporting UTF strings, Unicode support includes support for the
  \P, \p, and \X sequences that recognize Unicode character properties.
  However, only the basic two-letter properties such as Lu are supported.
+  Escape sequences such as \d and \w in patterns do not by default make use of
+  Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+  or starting a pattern with (*UCP).

 . You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
  of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015
--- a/doc/html/README.txt
+++ b/doc/html/README.txt
@@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.

 . If you do not want to make use of the support for UTF-8 Unicode character
  strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-  library, and UTF-32 Unicode character strings in the 32-bit library, you can
+  library, or UTF-32 Unicode character strings in the 32-bit library, you can
  add --disable-unicode to the "configure" command. This reduces the size of
  the libraries. It is not possible to configure one library with Unicode
  support, and another without, in the same configuration.

  When Unicode support is available, the use of a UTF encoding still has to be
-  enabled by an option at run time. When PCRE2 is compiled with Unicode
-  support, its input can only either be ASCII or UTF-8/16/32, even when running
-  on EBCDIC platforms. It is not possible to use both --enable-unicode and
-  --enable-ebcdic at the same time.
+  enabled by setting the PCRE2_UTF option at run time or starting a pattern
+  with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+  either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+  not possible to use both --enable-unicode and --enable-ebcdic at the same
+  time.

  As well as supporting UTF strings, Unicode support includes support for the
  \P, \p, and \X sequences that recognize Unicode character properties.
  However, only the basic two-letter properties such as Lu are supported.
+  Escape sequences such as \d and \w in patterns do not by default make use of
+  Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+  or starting a pattern with (*UCP).

 . You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
  of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015
--- a/doc/html/pcre2build.html
+++ b/doc/html/pcre2build.html
@@ -127,8 +127,10 @@ in the same configuration.
 </P>
 <P>
 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call <b>pcre2_compile()</b> to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
 </P>
 <P>
 UTF support allows the libraries to process character code points up to
@@ -139,6 +141,12 @@ as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation.
 </P>
+<P>
+Pattern escapes such as \d and \w do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
+</P>
 <br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
 <P>
 Just-in-time compiler support is included in the build by specifying
@@ -471,9 +479,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 23 November 2014
+Last updated: 26 January 2015
 <br>
-Copyright &copy; 1997-2014 University of Cambridge.
+Copyright &copy; 1997-2015 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -110,7 +110,7 @@ Unicode property support
 Another special sequence that may appear at the start of a pattern is (*UCP).
 This has the same effect as setting the PCRE2_UCP option: it causes sequences
 such as \d and \w to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
 table.
 </P>
 <P>
@@ -572,8 +572,8 @@ Unicode is discouraged.
 </P>
 <P>
 By default, characters whose code points are greater than 127 never match \d,
-\s, or \w, and always match \D, \S, and \W, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\s, or \w, and always match \D, \S, and \W, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
 These escape sequences retain their original meanings from before Unicode
 support was available, mainly for efficiency reasons. If the PCRE2_UCP option
 is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@ syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
 supported, and an error is given if they are encountered.
 </P>
 <P>
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-<b>pcre2_compile()</b>, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to <b>pcre2_compile()</b>, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
 <pre>
  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
@@ -1408,12 +1409,12 @@ not controls, that is, characters with the Zs property.
 <P>
 [:punct:]
 This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
 property.
 </P>
 <P>
 The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
 </P>
 <br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
 <P>
@@ -3248,7 +3249,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 January 2015
+Last updated: 26 January 2015
 <br>
 Copyright &copy; 1997-2015 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -2874,9 +2874,10 @@ UNICODE AND UTF SUPPORT
       another without, in the same configuration.

       Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8,
-       UTF-16 or UTF-32. To do that, applications that use the library have to
-       set  the  PCRE2_UTF  option when they call pcre2_compile() to compile a
-       pattern.
+       UTF-16 or UTF-32. To do that, applications that use the library can set
+       the  PCRE2_UTF  option when they call pcre2_compile() to compile a pat-
+       tern.  Alternatively, patterns may be started with  (*UTF)  unless  the
+       application has locked this out by setting PCRE2_NEVER_UTF.

       UTF support allows the libraries to process character code points up to
       0x10ffff in the strings that they handle. It also provides support  for
@@ -2885,6 +2886,11 @@ UNICODE AND UTF SUPPORT
       such  as Lu and Nd are supported. Details are given in the pcre2pattern
       documentation.

+       Pattern escapes such as \d and \w do not by default make use of Unicode
+       properties.  The  application  can  request that they do by setting the
+       PCRE2_UCP option. Unless the application  has  set  PCRE2_NEVER_UCP,  a
+       pattern may also request this by starting with (*UCP).
+

 JUST-IN-TIME COMPILER SUPPORT

@@ -3226,8 +3232,8 @@ AUTHOR

 REVISION

-       Last updated: 23 November 2014
-       Copyright (c) 1997-2014 University of Cambridge.
+       Last updated: 26 January 2015
+       Copyright (c) 1997-2015 University of Cambridge.
 ------------------------------------------------------------------------------


--- a/doc/pcre2build.3
+++ b/doc/pcre2build.3
@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "26 January 2015" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .
@@ -113,8 +113,10 @@ is not possible to build one library with Unicode support, and another without,
 in the same configuration.
 .P
 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call \fBpcre2_compile()\fP to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
 .P
 UTF support allows the libraries to process character code points up to
 0x10ffff in the strings that they handle. It also provides support for
@@ -125,6 +127,11 @@ as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
 \fBpcre2pattern\fP
 .\"
 documentation.
+.P
+Pattern escapes such as \ed and \ew do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
 .
 .
 .SH "JUST-IN-TIME COMPILER SUPPORT"
@@ -487,6 +494,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 26 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
 .fi
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PATTERN 3 "26 January 2015" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -73,7 +73,7 @@ appearance in a pattern causes an error.
 Another special sequence that may appear at the start of a pattern is (*UCP).
 This has the same effect as setting the PCRE2_UCP option: it causes sequences
 such as \ed and \ew to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
 table.
 .P
 Some applications that allow their users to supply patterns may wish to
@@ -575,8 +575,8 @@ accented letters, and these are then matched by \ew. The use of locales with
 Unicode is discouraged.
 .P
 By default, characters whose code points are greater than 127 never match \ed,
-\es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
 These escape sequences retain their original meanings from before Unicode
 support was available, mainly for efficiency reasons. If the PCRE2_UCP option
 is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@ matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
 supported, and an error is given if they are encountered.
 .P
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-\fBpcre2_compile()\fP, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
 .sp
  [:alnum:]  becomes  \ep{Xan}
  [:alpha:]  becomes  \ep{L}
@@ -1404,11 +1405,11 @@ not controls, that is, characters with the Zs property.
 .TP 10
 [:punct:]
 This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
 property.
 .P
 The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
 .
 .
 .SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
@@ -3292,6 +3293,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 02 January 2015
+Last updated: 26 January 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi