Documentation update.

Philip.Hazel 2019-05-30 15:43:05 +00:00
parent d5dc4e0c33
commit 16d47a9cb1
6 changed files with 951 additions and 920 deletions

@ -1762,17 +1762,22 @@ subject string does not happen. The first match attempt is run starting from
the overall result is "no match".
</P>
<P>
There are also other start-up optimizations. For example, a minimum length for
the subject may be recorded. Consider the pattern
Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:A)(X|Y)
(*MARK:1)B(*MARK:2)(X|Y)
</pre>
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
at every possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
<pre>
PCRE2_NO_UTF_CHECK
</pre>
@ -3831,7 +3836,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 May 2019
Last updated: 30 May 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

@ -741,13 +741,22 @@ ignored when used with <b>-L</b> (list files without matches), because the grand
total would always be zero.
</P>
<P>
<b>-u</b>, <b>--utf-8</b>
<b>-u</b>, <b>--utf</b>
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
<b>--include</b> options) and all subject lines that are scanned must be valid
strings of UTF-8 characters.
</P>
<P>
<b>-U</b>, <b>--utf-allow-invalid</b>
As <b>--utf</b>, but in addition subject lines may contain invalid UTF-8 code
unit sequences. These can never form part of any pattern match. This facility
allows valid UTF-8 strings to be sought in executable or other binary files.
For more details about matching in non-valid UTF-8 strings, see the
<a href="pcre2unicode.html"><b>pcre2unicode</b>(3)</a>
documentation.
</P>
<P>
<b>-V</b>, <b>--version</b>
Write the version numbers of <b>pcre2grep</b> and the PCRE2 library to the
standard output and then exit. Anything else on the command line is
@ -806,9 +815,9 @@ as in the GNU <b>grep</b> program. Any long option of the form
<b>--file-offsets</b>, <b>--heap-limit</b>, <b>--include-dir</b>,
<b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>, <b>-M</b>,
<b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>,
<b>--output</b>, <b>-u</b>, and <b>--utf-8</b> options are specific to
<b>pcre2grep</b>, as is the use of the <b>--only-matching</b> option with a
capturing parentheses number.
<b>--output</b>, <b>-u</b>, <b>--utf</b>, <b>-U</b>, and <b>--utf-allow-invalid</b>
options are specific to <b>pcre2grep</b>, as is the use of the
<b>--only-matching</b> option with a capturing parentheses number.
</P>
<P>
Although most of the common options work the same way, a few are different in
@ -971,9 +980,9 @@ Cambridge, England.
</P>
<br><a name="SEC16" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 November 2018
Last updated: 28 May 2019
<br>
Copyright &copy; 1997-2018 University of Cambridge.
Copyright &copy; 1997-2019 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -1741,18 +1741,23 @@ COMPILING A PATTERN
(*COMMIT) prevents any further matches being tried, so the overall
result is "no match".
There are also other start-up optimizations. For example, a minimum
length for the subject may be recorded. Consider the pattern
Another start-up optimization makes use of a minimum length for a
matching subject, which is recorded when possible. Consider the pattern
(*MARK:A)(X|Y)
(*MARK:1)B(*MARK:2)(X|Y)
The minimum length for a match is one character. If the subject is
"ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
to match an empty string at the end of the subject does not take place,
because PCRE2 knows that the subject is now too short, and so the
(*MARK) is never encountered. In this case, the optimization does not
affect the overall match result, which is still "no match", but it does
affect the auxiliary information that is returned.
The minimum length for a match is two characters. If the subject is
"XXBB", the "starting character" optimization skips "XX", then tries to
match "BB", which is long enough. In the process, (*MARK:2) is encoun-
tered and remembered. When the match attempt fails, the next "B" is
found, but there is only one character left, so there are no more
attempts, and "no match" is returned with the "last mark seen" set to
"2". If NO_START_OPTIMIZE is set, however, matches are tried at every
possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen"
that is returned is "1". In this case, the optimizations do not affect
the overall match result, which is still "no match", but they do affect
the auxiliary information that is returned.
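The walk-through above can be traced with a small hand-written model. This is only a sketch in Python, not PCRE2 code: the function name, the constants, and the way the two optimizations are modelled are assumptions made for illustration, and the model handles only the one pattern (*MARK:1)B(*MARK:2)(X|Y).

```python
def last_mark_seen(subject, start_optimize=True):
    """Toy model of matching (*MARK:1)B(*MARK:2)(X|Y) against subject.

    Returns (matched, last_mark). Hypothetical helper, not part of PCRE2.
    """
    min_length = 2    # shortest possible match: "B" followed by "X" or "Y"
    first_char = "B"  # every match must begin with "B"
    last_mark = None
    pos = 0
    while pos <= len(subject):
        if start_optimize:
            # "Starting character" optimization: skip ahead to the next "B".
            pos = subject.find(first_char, pos)
            if pos < 0:
                break
            # Minimum length optimization: too little subject is left.
            if len(subject) - pos < min_length:
                break
        # A match attempt really starts here, so (*MARK:1) is passed.
        last_mark = "1"
        if subject[pos:pos + 1] == "B":
            # "B" matched, so (*MARK:2) is passed as well.
            last_mark = "2"
            if subject[pos + 1:pos + 2] in ("X", "Y"):
                return True, last_mark
        pos += 1
    return False, last_mark


print(last_mark_seen("XXBB"))                        # (False, '2')
print(last_mark_seen("XXBB", start_optimize=False))  # (False, '1')
```

In both calls the overall result is "no match"; only the remembered mark differs, which is the point the paragraph makes about auxiliary information.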
PCRE2_NO_UTF_CHECK
@ -3698,7 +3703,7 @@ AUTHOR
REVISION
Last updated: 23 May 2019
Last updated: 30 May 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
.TH PCRE2API 3 "30 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -1701,17 +1701,22 @@ subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match".
.P
There are also other start-up optimizations. For example, a minimum length for
the subject may be recorded. Consider the pattern
Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
.sp
(*MARK:A)(X|Y)
(*MARK:1)B(*MARK:2)(X|Y)
.sp
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
at every possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
.sp
PCRE2_NO_UTF_CHECK
.sp
@ -3843,6 +3848,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 May 2019
Last updated: 30 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -650,7 +650,7 @@ with UTF-8 support. All patterns (including those for any \fB--exclude\fP and
\fB--include\fP options) and all subject lines that are scanned must be valid
strings of UTF-8 characters.
.TP
\fb-U\fP, \fB--utf-allow-invalid\fP
\fB-U\fP, \fB--utf-allow-invalid\fP
As \fB--utf\fP, but in addition subject lines may contain invalid UTF-8 code
unit sequences. These can never form part of any pattern match. This facility
allows valid UTF-8 strings to be sought in executable or other binary files.

@ -719,13 +719,20 @@ OPTIONS
(list files without matches), because the grand total would
always be zero.
-u, --utf-8
Operate in UTF-8 mode. This option is available only if PCRE2
-u, --utf Operate in UTF-8 mode. This option is available only if PCRE2
has been compiled with UTF-8 support. All patterns (including
those for any --exclude and --include options) and all sub-
ject lines that are scanned must be valid strings of UTF-8
characters.
-U, --utf-allow-invalid
As --utf, but in addition subject lines may contain invalid
UTF-8 code unit sequences. These can never form part of any
pattern match. This facility allows valid UTF-8 strings to be
sought in executable or other binary files. For more details
about matching in non-valid UTF-8 strings, see the pcre2uni-
code(3) documentation.
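The idea that invalid UTF-8 sequences can never form part of a match can be pictured with a rough Python model. This is a sketch under stated assumptions, not how pcre2grep or PCRE2 is implemented: the helper names are hypothetical, Python's re module stands in for PCRE2, and details such as anchors or lookbehind next to an invalid sequence are not modelled. The subject is cut into runs of valid UTF-8, and the pattern is searched inside each run only.

```python
import re


def valid_utf8_runs(data):
    """Yield each maximal run of valid UTF-8 in data (bytes) as a str."""
    pos = 0
    while pos < len(data):
        try:
            # The rest of the data is valid: yield it and stop.
            yield data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets within data[pos:].
            if err.start > 0:
                yield data[pos:pos + err.start].decode("utf-8")
            # Step over the invalid sequence and carry on after it.
            pos += err.end


def search_allowing_invalid(pattern, line):
    """Return all matches of pattern that lie wholly inside valid UTF-8."""
    regex = re.compile(pattern)
    hits = []
    for run in valid_utf8_runs(line):
        hits.extend(m.group(0) for m in regex.finditer(run))
    return hits


binary_line = b"\x7fELF\xff\xfecaf\xc3\xa9\x80 bar"
print(search_allowing_invalid("caf\u00e9", binary_line))  # ['café']
print(search_allowing_invalid("a.b", b"a\xffb"))          # []
```

The second call finds nothing because the only candidate for "." is an invalid byte, which acts as a barrier between the two valid runs.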
-V, --version
Write the version numbers of pcre2grep and the PCRE2 library
to the standard output and then exit. Anything else on the
@ -785,9 +792,9 @@ OPTIONS COMPATIBILITY
terminology) is also available as --xxx-regex (PCRE2 terminology). How-
ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
--include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
line, -N, --newline, --om-separator, --output, -u, and --utf-8 options
are specific to pcre2grep, as is the use of the --only-matching option
with a capturing parentheses number.
line, -N, --newline, --om-separator, --output, -u, --utf, -U, and
--utf-allow-invalid options are specific to pcre2grep, as is the use of
the --only-matching option with a capturing parentheses number.
Although most of the common options work the same way, a few are dif-
ferent in pcre2grep. For example, the --include option's argument is a
@ -948,5 +955,5 @@ AUTHOR
REVISION
Last updated: 24 November 2018
Copyright (c) 1997-2018 University of Cambridge.
Last updated: 28 May 2019
Copyright (c) 1997-2019 University of Cambridge.