Documentation update.

2019-05-30 15:43:05 +00:00 · 2019-05-30 15:43:05 +00:00 · 16d47a9cb1
parent d5dc4e0c33
commit 16d47a9cb1
6 changed files with 951 additions and 920 deletions
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -1762,17 +1762,22 @@ subject string does not happen. The first match attempt is run starting from
 the overall result is "no match".
 </P>
 <P>
-There are also other start-up optimizations. For example, a minimum length for
-the subject may be recorded. Consider the pattern
+Another start-up optimization makes use of a minimum length for a matching
+subject, which is recorded when possible. Consider the pattern
 <pre>
-  (*MARK:A)(X|Y)
+  (*MARK:1)B(*MARK:2)(X|Y)
 </pre>
-The minimum length for a match is one character. If the subject is "ABC", there
-will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
-string at the end of the subject does not take place, because PCRE2 knows that
-the subject is now too short, and so the (*MARK) is never encountered. In this
-case, the optimization does not affect the overall match result, which is still
-"no match", but it does affect the auxiliary information that is returned.
+The minimum length for a match is two characters. If the subject is "XXBB", the 
+"starting character" optimization skips "XX", then tries to match "BB", which 
+is long enough. In the process, (*MARK:2) is encountered and remembered. When 
+the match attempt fails, the next "B" is found, but there is only one character
+left, so there are no more attempts, and "no match" is returned with the "last
+mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
+at every possible starting position, including at the end of the subject, where
+(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
+returned is "1". In this case, the optimizations do not affect the overall
+match result, which is still "no match", but they do affect the auxiliary
+information that is returned.
 <pre>
  PCRE2_NO_UTF_CHECK
 </pre>
@@ -3831,7 +3836,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 23 May 2019
+Last updated: 30 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
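
The revised (*MARK) paragraph above can be illustrated with a toy model. The following is a minimal sketch in Python, not PCRE2 itself: the function name `last_mark_seen` and its `start_optimize` parameter are hypothetical, and the pattern `(*MARK:1)B(*MARK:2)(X|Y)` is hard-coded. It assumes only the two start-up optimizations described in the text (skipping to the starting character "B", and giving up when fewer than two characters remain); the real library may differ in detail.

```python
def last_mark_seen(subject, start_optimize=True):
    """Toy model of matching (*MARK:1)B(*MARK:2)(X|Y) against subject.

    Returns (matched, last_mark). Passing start_optimize=False stands in
    for setting NO_START_OPTIMIZE. This is an illustration only.
    """
    min_length = 2      # "B" followed by "X" or "Y"
    first_char = "B"    # every match must start with this character
    last_mark = None
    n = len(subject)
    pos = 0
    while pos <= n:     # pos == n is the attempt at the end of the subject
        if start_optimize:
            # "Starting character" optimization: jump to the next "B".
            pos = subject.find(first_char, pos)
            if pos == -1:
                break
            # Minimum length optimization: too little subject left.
            if n - pos < min_length:
                break
        # A real match attempt at pos begins here.
        last_mark = "1"                       # (*MARK:1)
        if pos < n and subject[pos] == "B":
            last_mark = "2"                   # (*MARK:2)
            if pos + 1 < n and subject[pos + 1] in "XY":
                return True, last_mark
        pos += 1
    return False, last_mark
```

In this model, `last_mark_seen("XXBB")` returns `(False, "2")` and `last_mark_seen("XXBB", start_optimize=False)` returns `(False, "1")`, matching the two outcomes the paragraph describes: the overall result is the same, but the last mark seen differs.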
--- a/doc/html/pcre2grep.html
+++ b/doc/html/pcre2grep.html
@@ -741,13 +741,22 @@ ignored when used with <b>-L</b> (list files without matches), because the grand
 total would always be zero.
 </P>
 <P>
-<b>-u</b>, <b>--utf-8</b>
+<b>-u</b>, <b>--utf</b>
 Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
 with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
 <b>--include</b> options) and all subject lines that are scanned must be valid
 strings of UTF-8 characters.
 </P>
 <P>
+<b>-U</b>, <b>--utf-allow-invalid</b>
+As <b>--utf</b>, but in addition subject lines may contain invalid UTF-8 code
+unit sequences. These can never form part of any pattern match. This facility
+allows valid UTF-8 strings to be sought in executable or other binary files.
+For more details about matching in non-valid UTF-8 strings, see the
+<a href="pcre2unicode.html"><b>pcre2unicode</b>(3)</a>
+documentation.
+</P>
+<P>
 <b>-V</b>, <b>--version</b>
 Write the version numbers of <b>pcre2grep</b> and the PCRE2 library to the
 standard output and then exit. Anything else on the command line is
@@ -806,9 +815,9 @@ as in the GNU <b>grep</b> program. Any long option of the form
 <b>--file-offsets</b>, <b>--heap-limit</b>, <b>--include-dir</b>,
 <b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>, <b>-M</b>,
 <b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>,
-<b>--output</b>, <b>-u</b>, and <b>--utf-8</b> options are specific to
-<b>pcre2grep</b>, as is the use of the <b>--only-matching</b> option with a
-capturing parentheses number.
+<b>--output</b>, <b>-u</b>, <b>--utf</b>, <b>-U</b>, and <b>--utf-allow-invalid</b>
+options are specific to <b>pcre2grep</b>, as is the use of the
+<b>--only-matching</b> option with a capturing parentheses number.
 </P>
 <P>
 Although most of the common options work the same way, a few are different in
@@ -971,9 +980,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC16" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 24 November 2018
+Last updated: 28 May 2019
 <br>
-Copyright &copy; 1997-2018 University of Cambridge.
+Copyright &copy; 1997-2019 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
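
The idea behind <b>--utf-allow-invalid</b> (invalid UTF-8 sequences never take part in a match, so valid strings can still be found in binary data) can be sketched as follows. This is a rough Python model under stated assumptions, not how pcre2grep is implemented: the helper name `search_valid_utf8` is hypothetical, and it uses Python's `re` module on each validly encoded run rather than PCRE2, so pattern syntax and edge cases will differ.

```python
import re


def search_valid_utf8(data: bytes, pattern: str):
    """Return all matches of pattern found inside valid UTF-8 runs of data.

    Bytes that cannot be decoded as UTF-8 act as barriers: no match may
    contain them or span across them. Illustration only.
    """
    # "surrogateescape" turns every undecodable byte into a lone
    # surrogate in the range U+DC80..U+DCFF instead of raising an error.
    text = data.decode("utf-8", errors="surrogateescape")
    matches = []
    # Split on the escaped bytes so that only valid runs are searched.
    for run in re.split("[\udc80-\udcff]+", text):
        matches.extend(m.group(0) for m in re.finditer(pattern, run))
    return matches
```

For example, `search_valid_utf8(b"\xff\xfecaf\xc3\xa9\x80bar", r"\w+")` returns `["café", "bar"]`: the stray bytes `\xff`, `\xfe`, and `\x80` are skipped, while the valid two-byte encoding of "é" is kept.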
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -1741,18 +1741,23 @@ COMPILING A PATTERN
       (*COMMIT)  prevents  any  further  matches  being tried, so the overall
       result is "no match".

-       There are also other start-up optimizations.  For  example,  a  minimum
-       length for the subject may be recorded. Consider the pattern
+       Another  start-up  optimization  makes use of a minimum  length  for  a
+       matching subject, which is recorded when possible. Consider the pattern

-         (*MARK:A)(X|Y)
+         (*MARK:1)B(*MARK:2)(X|Y)

-       The  minimum  length  for  a  match is one character. If the subject is
-       "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
-       to match an empty string at the end of the subject does not take place,
-       because PCRE2 knows that the subject is  now  too  short,  and  so  the
-       (*MARK)  is  never encountered. In this case, the optimization does not
-       affect the overall match result, which is still "no match", but it does
-       affect the auxiliary information that is returned.
+       The  minimum  length  for  a match is two characters. If the subject is
+       "XXBB", the "starting character" optimization skips "XX", then tries to
+       match  "BB", which is long enough. In the process, (*MARK:2) is encoun-
+       tered and remembered. When the match attempt fails,  the  next  "B"  is
+       found,  but  there  is  only  one  character left, so there are no more
+       attempts, and "no match" is returned with the "last mark seen"  set  to
+       "2".  If  NO_START_OPTIMIZE is set, however, matches are tried at every
+       possible starting position, including at the end of the subject,  where
+       (*MARK:1)  is encountered, but there is no "B", so the "last mark seen"
+       that is returned is "1". In this case, the optimizations do not  affect
+       the overall match result, which is still "no match", but they do affect
+       the auxiliary information that is returned.

         PCRE2_NO_UTF_CHECK

@@ -3698,7 +3703,7 @@ AUTHOR

 REVISION

-       Last updated: 23 May 2019
+       Last updated: 30 May 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
 
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
+.TH PCRE2API 3 "30 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -1701,17 +1701,22 @@ subject string does not happen. The first match attempt is run starting from
 "D" and when this fails, (*COMMIT) prevents any further matches being tried, so
 the overall result is "no match".
 .P
-There are also other start-up optimizations. For example, a minimum length for
-the subject may be recorded. Consider the pattern
+Another start-up optimization makes use of a minimum length for a matching
+subject, which is recorded when possible. Consider the pattern
 .sp
-  (*MARK:A)(X|Y)
+  (*MARK:1)B(*MARK:2)(X|Y)
 .sp
-The minimum length for a match is one character. If the subject is "ABC", there
-will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
-string at the end of the subject does not take place, because PCRE2 knows that
-the subject is now too short, and so the (*MARK) is never encountered. In this
-case, the optimization does not affect the overall match result, which is still
-"no match", but it does affect the auxiliary information that is returned.
+The minimum length for a match is two characters. If the subject is "XXBB", the 
+"starting character" optimization skips "XX", then tries to match "BB", which 
+is long enough. In the process, (*MARK:2) is encountered and remembered. When 
+the match attempt fails, the next "B" is found, but there is only one character
+left, so there are no more attempts, and "no match" is returned with the "last
+mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
+at every possible starting position, including at the end of the subject, where
+(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
+returned is "1". In this case, the optimizations do not affect the overall
+match result, which is still "no match", but they do affect the auxiliary
+information that is returned.
 .sp
  PCRE2_NO_UTF_CHECK
 .sp
@@ -3843,6 +3848,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 23 May 2019
+Last updated: 30 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2grep.1
+++ b/doc/pcre2grep.1
@@ -650,7 +650,7 @@ with UTF-8 support. All patterns (including those for any \fB--exclude\fP and
 \fB--include\fP options) and all subject lines that are scanned must be valid
 strings of UTF-8 characters.
 .TP
-\fb-U\fP, \fB--utf-allow-invalid\fP
+\fB-U\fP, \fB--utf-allow-invalid\fP
 As \fB--utf\fP, but in addition subject lines may contain invalid UTF-8 code
 unit sequences. These can never form part of any pattern match. This facility
 allows valid UTF-8 strings to be sought in executable or other binary files.
--- a/doc/pcre2grep.txt
+++ b/doc/pcre2grep.txt
@@ -719,13 +719,20 @@ OPTIONS
                 (list  files  without matches), because the grand total would
                 always be zero.

-       -u, --utf-8
-                 Operate in UTF-8 mode. This option is available only if PCRE2
+       -u, --utf Operate in UTF-8 mode. This option is available only if PCRE2
                 has been compiled with UTF-8 support. All patterns (including
                 those for any --exclude and --include options) and  all  sub-
                 ject  lines  that  are scanned must be valid strings of UTF-8
                 characters.

+       -U, --utf-allow-invalid
+                 As --utf, but in addition subject lines may  contain  invalid
+                 UTF-8  code  unit sequences. These can never form part of any
+                 pattern match. This facility allows valid UTF-8 strings to be
+                 sought in executable or other binary files.  For more details
+                 about matching in non-valid UTF-8 strings, see the  pcre2uni-
+                 code(3) documentation.
+
       -V, --version
                 Write  the version numbers of pcre2grep and the PCRE2 library
                 to the standard output and then exit. Anything  else  on  the
@@ -785,9 +792,9 @@ OPTIONS COMPATIBILITY
       terminology) is also available as --xxx-regex (PCRE2 terminology). How-
       ever,  the  --depth-limit,  --file-list,  --file-offsets, --heap-limit,
       --include-dir, --line-offsets, --locale,  --match-limit,  -M,  --multi-
-       line, -N, --newline, --om-separator, --output, -u, and --utf-8  options
-       are  specific to pcre2grep, as is the use of the --only-matching option
-       with a capturing parentheses number.
+       line,  -N,  --newline,  --om-separator,  --output,  -u,  --utf, -U, and
+       --utf-allow-invalid options are specific to pcre2grep, as is the use of
+       the --only-matching option with a capturing parentheses number.

       Although  most  of the common options work the same way, a few are dif-
       ferent in pcre2grep. For example, the --include option's argument is  a
@@ -948,5 +955,5 @@ AUTHOR

 REVISION

-       Last updated: 24 November 2018
-       Copyright (c) 1997-2018 University of Cambridge.
+       Last updated: 28 May 2019
+       Copyright (c) 1997-2019 University of Cambridge.