Documentation update.

Philip.Hazel 2019-05-30 15:43:05 +00:00
parent d5dc4e0c33
commit 16d47a9cb1
6 changed files with 951 additions and 920 deletions


@ -1762,17 +1762,22 @@ subject string does not happen. The first match attempt is run starting from
the overall result is "no match".
</P>
<P>
There are also other start-up optimizations. For example, a minimum length for
the subject may be recorded. Consider the pattern
Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:A)(X|Y)
(*MARK:1)B(*MARK:2)(X|Y)
</pre>
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
at every possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
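The behaviour described above can be mimicked with a small hand-written model.
The following Python sketch is not PCRE2 code and does not call the library:
match_at and search are hypothetical helpers that hard-code the pattern
(*MARK:1)B(*MARK:2)(X|Y), its starting character "B", and its minimum length
of two, purely to show how skipping start positions can change the last mark
seen.

```python
MIN_LENGTH = 2        # shortest possible match: "B" followed by "X" or "Y"
START_CHAR = "B"      # every match must begin with this character


def match_at(subject, pos):
    """Try the hard-coded pattern at one position.

    Returns (matched, last_mark_seen_in_this_attempt).
    """
    mark = "1"                                    # (*MARK:1) is passed first
    if pos < len(subject) and subject[pos] == "B":
        mark = "2"                                # (*MARK:2) follows the "B"
        if pos + 1 < len(subject) and subject[pos + 1] in "XY":
            return True, mark
    return False, mark


def search(subject, start_optimize=True):
    """Scan for a match; return (matched, last_mark_seen)."""
    last_mark = None
    pos = 0
    while pos <= len(subject):                    # includes the end position
        if start_optimize:
            pos = subject.find(START_CHAR, pos)   # skip to the next "B"
            if pos < 0 or len(subject) - pos < MIN_LENGTH:
                break                             # no candidate, or too short
        matched, last_mark = match_at(subject, pos)
        if matched:
            return True, last_mark
        pos += 1
    return False, last_mark


if __name__ == "__main__":
    print(search("XXBB", start_optimize=True))
    print(search("XXBB", start_optimize=False))
```

Under these assumptions, the optimized scan of "XXBB" ends with mark "2" and
the unoptimized scan ends with mark "1", while both report "no match".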
<pre>
PCRE2_NO_UTF_CHECK
</pre>
@ -3831,7 +3836,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 May 2019
Last updated: 30 May 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -741,13 +741,22 @@ ignored when used with <b>-L</b> (list files without matches), because the grand
total would always be zero.
</P>
<P>
<b>-u</b>, <b>--utf-8</b>
<b>-u</b>, <b>--utf</b>
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
<b>--include</b> options) and all subject lines that are scanned must be valid
strings of UTF-8 characters.
</P>
<P>
<b>-U</b>, <b>--utf-allow-invalid</b>
As <b>--utf</b>, but in addition subject lines may contain invalid UTF-8 code
unit sequences. These can never form part of any pattern match. This facility
allows valid UTF-8 strings to be sought in executable or other binary files.
For more details about matching in non-valid UTF-8 strings, see the
<a href="pcre2unicode.html"><b>pcre2unicode</b>(3)</a>
documentation.
</P>
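One way to picture the effect of -U is to split the subject into its maximal
runs of valid UTF-8 and search each run separately, so that no match can span
invalid bytes. The Python sketch below illustrates only that idea; it is not
how pcre2grep or PCRE2 implements the option, and valid_utf8_runs and
search_allowing_invalid are hypothetical names introduced for the example.

```python
import re


def valid_utf8_runs(data):
    """Yield each maximal run of valid UTF-8 in `data` as a str."""
    pos = 0
    while pos < len(data):
        chunk = data[pos:]
        try:
            text = chunk.decode("utf-8")
        except UnicodeDecodeError as err:
            if err.start > 0:                  # valid text before the bad bytes
                yield chunk[:err.start].decode("utf-8")
            pos += err.end                     # resume after the invalid bytes
        else:
            yield text                         # the rest was all valid
            return


def search_allowing_invalid(pattern, data):
    """Return all matches of `pattern` found inside valid UTF-8 runs."""
    regex = re.compile(pattern)
    found = []
    for run in valid_utf8_runs(data):
        found.extend(m.group(0) for m in regex.finditer(run))
    return found


if __name__ == "__main__":
    blob = b"\x7fELF\xff\xfecaf\xc3\xa9\x80menu\xc3"
    print(search_allowing_invalid("caf\u00e9|menu", blob))
```

Because each run is searched on its own, a pattern such as a.b cannot match
the bytes "a", 0xFF, "b": the invalid byte separates the two letters into
different runs.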
<P>
<b>-V</b>, <b>--version</b>
Write the version numbers of <b>pcre2grep</b> and the PCRE2 library to the
standard output and then exit. Anything else on the command line is
@ -806,9 +815,9 @@ as in the GNU <b>grep</b> program. Any long option of the form
<b>--file-offsets</b>, <b>--heap-limit</b>, <b>--include-dir</b>,
<b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>, <b>-M</b>,
<b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>,
<b>--output</b>, <b>-u</b>, and <b>--utf-8</b> options are specific to
<b>pcre2grep</b>, as is the use of the <b>--only-matching</b> option with a
capturing parentheses number.
<b>--output</b>, <b>-u</b>, <b>--utf</b>, <b>-U</b>, and <b>--utf-allow-invalid</b>
options are specific to <b>pcre2grep</b>, as is the use of the
<b>--only-matching</b> option with a capturing parentheses number.
</P>
<P>
Although most of the common options work the same way, a few are different in
@ -971,9 +980,9 @@ Cambridge, England.
</P>
<br><a name="SEC16" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 November 2018
Last updated: 28 May 2019
<br>
Copyright &copy; 1997-2018 University of Cambridge.
Copyright &copy; 1997-2019 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

File diff suppressed because it is too large.


@ -1,4 +1,4 @@
.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
.TH PCRE2API 3 "30 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -1701,17 +1701,22 @@ subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match".
.P
There are also other start-up optimizations. For example, a minimum length for
the subject may be recorded. Consider the pattern
Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
.sp
(*MARK:A)(X|Y)
(*MARK:1)B(*MARK:2)(X|Y)
.sp
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
at every possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
.sp
PCRE2_NO_UTF_CHECK
.sp
@ -3843,6 +3848,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 May 2019
Last updated: 30 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -650,7 +650,7 @@ with UTF-8 support. All patterns (including those for any \fB--exclude\fP and
\fB--include\fP options) and all subject lines that are scanned must be valid
strings of UTF-8 characters.
.TP
\fb-U\fP, \fB--utf-allow-invalid\fP
\fB-U\fP, \fB--utf-allow-invalid\fP
As \fB--utf\fP, but in addition subject lines may contain invalid UTF-8 code
unit sequences. These can never form part of any pattern match. This facility
allows valid UTF-8 strings to be sought in executable or other binary files.


@ -719,47 +719,54 @@ OPTIONS
(list files without matches), because the grand total would
always be zero.
-u, --utf-8
Operate in UTF-8 mode. This option is available only if PCRE2
-u, --utf Operate in UTF-8 mode. This option is available only if PCRE2
has been compiled with UTF-8 support. All patterns (including
those for any --exclude and --include options) and all sub-
ject lines that are scanned must be valid strings of UTF-8
characters.
-U, --utf-allow-invalid
As --utf, but in addition subject lines may contain invalid
UTF-8 code unit sequences. These can never form part of any
pattern match. This facility allows valid UTF-8 strings to be
sought in executable or other binary files. For more details
about matching in non-valid UTF-8 strings, see the pcre2uni-
code(3) documentation.
-V, --version
Write the version numbers of pcre2grep and the PCRE2 library
to the standard output and then exit. Anything else on the
Write the version numbers of pcre2grep and the PCRE2 library
to the standard output and then exit. Anything else on the
command line is ignored.
-v, --invert-match
Invert the sense of the match, so that lines which do not
Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp
Force the patterns only to match "words". That is, there must
be a word boundary at the start and end of each matched
string. This is equivalent to having "\b(?:" at the start of
each pattern, and ")\b" at the end. This option applies only
to the patterns that are matched against the contents of
files; it does not apply to patterns specified by any of the
be a word boundary at the start and end of each matched
string. This is equivalent to having "\b(?:" at the start of
each pattern, and ")\b" at the end. This option applies only
to the patterns that are matched against the contents of
files; it does not apply to patterns specified by any of the
--include or --exclude options.
-x, --line-regex, --line-regexp
Force the patterns to start matching only at the beginnings
of lines, and in addition, require them to match entire
Force the patterns to start matching only at the beginnings
of lines, and in addition, require them to match entire
lines. In multiline mode the match may be more than one line.
This is equivalent to having "^(?:" at the start of each pat-
tern and ")$" at the end. This option applies only to the
patterns that are matched against the contents of files; it
does not apply to patterns specified by any of the --include
tern and ")$" at the end. This option applies only to the
patterns that are matched against the contents of files; it
does not apply to patterns specified by any of the --include
or --exclude options.
ENVIRONMENT VARIABLES
The environment variables LC_ALL and LC_CTYPE are examined, in that
order, for a locale. The first one that is set is used. This can be
overridden by the --locale option. If no locale is set, the PCRE2
The environment variables LC_ALL and LC_CTYPE are examined, in that
order, for a locale. The first one that is set is used. This can be
overridden by the --locale option. If no locale is set, the PCRE2
library's default (usually the "C" locale) is used.
@ -767,107 +774,107 @@ NEWLINES
The -N (--newline) option allows pcre2grep to scan files with different
newline conventions from the default. Any parts of the input files that
are written to the standard output are copied identically, with what-
ever newline sequences they have in the input. However, the setting of
this option affects only the way scanned files are processed. It does
not affect the interpretation of files specified by the -f, --file-
are written to the standard output are copied identically, with what-
ever newline sequences they have in the input. However, the setting of
this option affects only the way scanned files are processed. It does
not affect the interpretation of files specified by the -f, --file-
list, --exclude-from, or --include-from options, nor does it affect the
way in which pcre2grep writes informational messages to the standard
way in which pcre2grep writes informational messages to the standard
error and output streams. For these it uses the string "\n" to indicate
newlines, relying on the C I/O library to convert this to an appropri-
newlines, relying on the C I/O library to convert this to an appropri-
ate sequence.
OPTIONS COMPATIBILITY
Many of the short and long forms of pcre2grep's options are the same as
in the GNU grep program. Any long option of the form --xxx-regexp (GNU
in the GNU grep program. Any long option of the form --xxx-regexp (GNU
terminology) is also available as --xxx-regex (PCRE2 terminology). How-
ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
--include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
line, -N, --newline, --om-separator, --output, -u, and --utf-8 options
are specific to pcre2grep, as is the use of the --only-matching option
with a capturing parentheses number.
ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
--include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
line, -N, --newline, --om-separator, --output, -u, --utf, -U, and
--utf-allow-invalid options are specific to pcre2grep, as is the use of
the --only-matching option with a capturing parentheses number.
Although most of the common options work the same way, a few are dif-
ferent in pcre2grep. For example, the --include option's argument is a
glob for GNU grep, but a regular expression for pcre2grep. If both the
-c and -l options are given, GNU grep lists only file names, without
Although most of the common options work the same way, a few are dif-
ferent in pcre2grep. For example, the --include option's argument is a
glob for GNU grep, but a regular expression for pcre2grep. If both the
-c and -l options are given, GNU grep lists only file names, without
counts, but pcre2grep gives the counts as well.
OPTIONS WITH DATA
There are four different ways in which an option with data can be spec-
ified. If a short form option is used, the data may follow immedi-
ified. If a short form option is used, the data may follow immedi-
ately, or (with one exception) in the next command line item. For exam-
ple:
-f/some/file
-f /some/file
The exception is the -o option, which may appear with or without data.
Because of this, if data is present, it must follow immediately in the
The exception is the -o option, which may appear with or without data.
Because of this, if data is present, it must follow immediately in the
same item, for example -o3.
If a long form option is used, the data may appear in the same command
line item, separated by an equals character, or (with two exceptions)
If a long form option is used, the data may appear in the same command
line item, separated by an equals character, or (with two exceptions)
it may appear in the next command line item. For example:
--file=/some/file
--file /some/file
Note, however, that if you want to supply a file name beginning with ~
as data in a shell command, and have the shell expand ~ to a home
Note, however, that if you want to supply a file name beginning with ~
as data in a shell command, and have the shell expand ~ to a home
directory, you must separate the file name from the option, because the
shell does not treat ~ specially unless it is at the start of an item.
The exceptions to the above are the --colour (or --color) and --only-
matching options, for which the data is optional. If one of these
options does have data, it must be given in the first form, using an
The exceptions to the above are the --colour (or --color) and --only-
matching options, for which the data is optional. If one of these
options does have data, it must be given in the first form, using an
equals character. Otherwise pcre2grep will assume that it has no data.
USING PCRE2'S CALLOUT FACILITY
pcre2grep has, by default, support for calling external programs or
scripts or echoing specific strings during matching by making use of
PCRE2's callout facility. However, this support can be completely or
partially disabled when pcre2grep is built. You can find out whether
your binary has support for callouts by running it with the --help
pcre2grep has, by default, support for calling external programs or
scripts or echoing specific strings during matching by making use of
PCRE2's callout facility. However, this support can be completely or
partially disabled when pcre2grep is built. You can find out whether
your binary has support for callouts by running it with the --help
option. If callout support is completely disabled, all callouts in pat-
terns are ignored by pcre2grep. If the facility is partially disabled,
calling external programs is not supported, and callouts that request
calling external programs is not supported, and callouts that request
it are ignored.
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
ment is either a number or a quoted string (see the pcre2callout docu-
mentation for details). Numbered callouts are ignored by pcre2grep;
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
ment is either a number or a quoted string (see the pcre2callout docu-
mentation for details). Numbered callouts are ignored by pcre2grep;
only callouts with string arguments are useful.
Calling external programs or scripts
This facility can be independently disabled when pcre2grep is built. It
is supported for Windows, where a call to _spawnvp() is used, for VMS,
where lib$spawn() is used, and for any other Unix-like environment
is supported for Windows, where a call to _spawnvp() is used, for VMS,
where lib$spawn() is used, and for any other Unix-like environment
where fork() and execv() are available.
If the callout string does not start with a pipe (vertical bar) charac-
ter, it is parsed into a list of substrings separated by pipe charac-
ters. The first substring must be an executable name, with the follow-
ter, it is parsed into a list of substrings separated by pipe charac-
ters. The first substring must be an executable name, with the follow-
ing substrings specifying arguments:
executable_name|arg1|arg2|...
Any substring (including the executable name) may contain escape
sequences started by a dollar character: $<digits> or ${<digits>} is
replaced by the captured substring of the given decimal number, which
must be greater than zero. If the number is greater than the number of
capturing substrings, or if the capture is unset, the replacement is
Any substring (including the executable name) may contain escape
sequences started by a dollar character: $<digits> or ${<digits>} is
replaced by the captured substring of the given decimal number, which
must be greater than zero. If the number is greater than the number of
capturing substrings, or if the capture is unset, the replacement is
empty.
Any other character is substituted by itself. In particular, $$ is
replaced by a single dollar and $| is replaced by a pipe character.
Any other character is substituted by itself. In particular, $$ is
replaced by a single dollar and $| is replaced by a pipe character.
Here is an example:
echo -e "abcde\n12345" | pcre2grep \
@ -881,13 +888,13 @@ USING PCRE2'S CALLOUT FACILITY
Arg1: [1] [234] [4] Arg2: |1| ()
12345
The parameters for the system call that is used to run the program or
The parameters for the system call that is used to run the program or
script are zero-terminated strings. This means that binary zero charac-
ters in the callout argument will cause premature termination of their
substrings, and therefore should not be present. Any syntax errors in
the string (for example, a dollar not followed by another character)
cause the callout to be ignored. If running the program fails for any
reason (including the non-existence of the executable), a local match-
ters in the callout argument will cause premature termination of their
substrings, and therefore should not be present. Any syntax errors in
the string (for example, a dollar not followed by another character)
cause the callout to be ignored. If running the program fails for any
reason (including the non-existence of the executable), a local match-
ing failure occurs and the matcher backtracks in the normal way.
Echoing a specific string
@ -896,41 +903,41 @@ USING PCRE2'S CALLOUT FACILITY
pletely disabled when pcre2grep was built. If the callout string starts
with a pipe (vertical bar) character, the rest of the string is written
to the output, having been passed through the same escape processing as
text from the --output option. This provides a simple echoing facility
that avoids calling an external program or script. No terminator is
added to the string, so if you want a newline, you must include it
explicitly. Matching continues normally after the string is output. If
you want to see only the callout output but not any output from an
text from the --output option. This provides a simple echoing facility
that avoids calling an external program or script. No terminator is
added to the string, so if you want a newline, you must include it
explicitly. Matching continues normally after the string is output. If
you want to see only the callout output but not any output from an
actual match, you should end the relevant pattern with (*FAIL).
MATCHING ERRORS
It is possible to supply a regular expression that takes a very long
time to fail to match certain lines. Such patterns normally involve
nested indefinite repeats, for example: (a+)*\d when matched against a
line of a's with no final digit. The PCRE2 matching function has a
resource limit that causes it to abort in these circumstances. If this
happens, pcre2grep outputs an error message and the line that caused
the problem to the standard error stream. If there are more than 20
It is possible to supply a regular expression that takes a very long
time to fail to match certain lines. Such patterns normally involve
nested indefinite repeats, for example: (a+)*\d when matched against a
line of a's with no final digit. The PCRE2 matching function has a
resource limit that causes it to abort in these circumstances. If this
happens, pcre2grep outputs an error message and the line that caused
the problem to the standard error stream. If there are more than 20
such errors, pcre2grep gives up.
The --match-limit option of pcre2grep can be used to set the overall
resource limit. There are also other limits that affect the amount of
memory used during matching; see the discussion of --heap-limit and
The --match-limit option of pcre2grep can be used to set the overall
resource limit. There are also other limits that affect the amount of
memory used during matching; see the discussion of --heap-limit and
--depth-limit above.
DIAGNOSTICS
Exit status is 0 if any matches were found, 1 if no matches were found,
and 2 for syntax errors, overlong lines, non-existent or inaccessible
files (even if matches were found in other files) or too many matching
and 2 for syntax errors, overlong lines, non-existent or inaccessible
files (even if matches were found in other files) or too many matching
errors. Using the -s option to suppress error messages about inaccessi-
ble files does not affect the return code.
When run under VMS, the return code is placed in the symbol
PCRE2GREP_RC because VMS does not distinguish between exit(0) and
When run under VMS, the return code is placed in the symbol
PCRE2GREP_RC because VMS does not distinguish between exit(0) and
exit(1).
@ -948,5 +955,5 @@ AUTHOR
REVISION
Last updated: 24 November 2018
Copyright (c) 1997-2018 University of Cambridge.
Last updated: 28 May 2019
Copyright (c) 1997-2019 University of Cambridge.