Documentation update.

Philip.Hazel 2019-05-30 15:43:05 +00:00
parent d5dc4e0c33
commit 16d47a9cb1
6 changed files with 951 additions and 920 deletions

@@ -1762,17 +1762,22 @@ subject string does not happen. The first match attempt is run starting from
the overall result is "no match". the overall result is "no match".
</P> </P>
<P> <P>
There are also other start-up optimizations. For example, a minimum length for Another start-up optimization makes use of a minimum length for a matching
the subject may be recorded. Consider the pattern subject, which is recorded when possible. Consider the pattern
<pre> <pre>
(*MARK:A)(X|Y) (*MARK:1)B(*MARK:2)(X|Y)
</pre> </pre>
The minimum length for a match is one character. If the subject is "ABC", there The minimum length for a match is two characters. If the subject is "XXBB", the
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty "starting character" optimization skips "XX", then tries to match "BB", which
string at the end of the subject does not take place, because PCRE2 knows that is long enough. In the process, (*MARK:2) is encountered and remembered. When
the subject is now too short, and so the (*MARK) is never encountered. In this the match attempt fails, the next "B" is found, but there is only one character
case, the optimization does not affect the overall match result, which is still left, so there are no more attempts, and "no match" is returned with the "last
"no match", but it does affect the auxiliary information that is returned. mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
at every possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
<pre> <pre>
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
</pre> </pre>
@@ -3831,7 +3836,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 23 May 2019 Last updated: 30 May 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

@@ -741,13 +741,22 @@ ignored when used with <b>-L</b> (list files without matches), because the grand
total would always be zero. total would always be zero.
</P> </P>
<P> <P>
<b>-u</b>, <b>--utf-8</b> <b>-u</b>, <b>--utf</b>
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
<b>--include</b> options) and all subject lines that are scanned must be valid <b>--include</b> options) and all subject lines that are scanned must be valid
strings of UTF-8 characters. strings of UTF-8 characters.
</P> </P>
<P> <P>
<b>-U</b>, <b>--utf-allow-invalid</b>
As <b>--utf</b>, but in addition subject lines may contain invalid UTF-8 code
unit sequences. These can never form part of any pattern match. This facility
allows valid UTF-8 strings to be sought in executable or other binary files.
For more details about matching in non-valid UTF-8 strings, see the
<a href="pcre2unicode.html"><b>pcre2unicode</b>(3)</a>
documentation.
</P>
<P>
<b>-V</b>, <b>--version</b> <b>-V</b>, <b>--version</b>
Write the version numbers of <b>pcre2grep</b> and the PCRE2 library to the Write the version numbers of <b>pcre2grep</b> and the PCRE2 library to the
standard output and then exit. Anything else on the command line is standard output and then exit. Anything else on the command line is
@@ -806,9 +815,9 @@ as in the GNU <b>grep</b> program. Any long option of the form
<b>--file-offsets</b>, <b>--heap-limit</b>, <b>--include-dir</b>, <b>--file-offsets</b>, <b>--heap-limit</b>, <b>--include-dir</b>,
<b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>, <b>-M</b>, <b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>, <b>-M</b>,
<b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>, <b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>,
<b>--output</b>, <b>-u</b>, and <b>--utf-8</b> options are specific to <b>--output</b>, <b>-u</b>, <b>--utf</b>, <b>-U</b>, and <b>--utf-allow-invalid</b>
<b>pcre2grep</b>, as is the use of the <b>--only-matching</b> option with a options are specific to <b>pcre2grep</b>, as is the use of the
capturing parentheses number. <b>--only-matching</b> option with a capturing parentheses number.
</P> </P>
<P> <P>
Although most of the common options work the same way, a few are different in Although most of the common options work the same way, a few are different in
@@ -971,9 +980,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC16" href="#TOC1">REVISION</a><br> <br><a name="SEC16" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 24 November 2018 Last updated: 28 May 2019
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

File diff suppressed because it is too large

@@ -1,4 +1,4 @@
.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34" .TH PCRE2API 3 "30 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@@ -1701,17 +1701,22 @@ subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so "D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match". the overall result is "no match".
.P .P
There are also other start-up optimizations. For example, a minimum length for Another start-up optimization makes use of a minimum length for a matching
the subject may be recorded. Consider the pattern subject, which is recorded when possible. Consider the pattern
.sp .sp
(*MARK:A)(X|Y) (*MARK:1)B(*MARK:2)(X|Y)
.sp .sp
The minimum length for a match is one character. If the subject is "ABC", there The minimum length for a match is two characters. If the subject is "XXBB", the
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty "starting character" optimization skips "XX", then tries to match "BB", which
string at the end of the subject does not take place, because PCRE2 knows that is long enough. In the process, (*MARK:2) is encountered and remembered. When
the subject is now too short, and so the (*MARK) is never encountered. In this the match attempt fails, the next "B" is found, but there is only one character
case, the optimization does not affect the overall match result, which is still left, so there are no more attempts, and "no match" is returned with the "last
"no match", but it does affect the auxiliary information that is returned. mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
at every possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
.sp .sp
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
.sp .sp
@@ -3843,6 +3848,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 23 May 2019 Last updated: 30 May 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

@@ -650,7 +650,7 @@ with UTF-8 support. All patterns (including those for any \fB--exclude\fP and
\fB--include\fP options) and all subject lines that are scanned must be valid \fB--include\fP options) and all subject lines that are scanned must be valid
strings of UTF-8 characters. strings of UTF-8 characters.
.TP .TP
\fb-U\fP, \fB--utf-allow-invalid\fP \fB-U\fP, \fB--utf-allow-invalid\fP
As \fB--utf\fP, but in addition subject lines may contain invalid UTF-8 code As \fB--utf\fP, but in addition subject lines may contain invalid UTF-8 code
unit sequences. These can never form part of any pattern match. This facility unit sequences. These can never form part of any pattern match. This facility
allows valid UTF-8 strings to be sought in executable or other binary files. allows valid UTF-8 strings to be sought in executable or other binary files.

@@ -719,47 +719,54 @@ OPTIONS
(list files without matches), because the grand total would (list files without matches), because the grand total would
always be zero. always be zero.
-u, --utf-8 -u, --utf Operate in UTF-8 mode. This option is available only if PCRE2
Operate in UTF-8 mode. This option is available only if PCRE2
has been compiled with UTF-8 support. All patterns (including has been compiled with UTF-8 support. All patterns (including
those for any --exclude and --include options) and all sub- those for any --exclude and --include options) and all sub-
ject lines that are scanned must be valid strings of UTF-8 ject lines that are scanned must be valid strings of UTF-8
characters. characters.
-U, --utf-allow-invalid
As --utf, but in addition subject lines may contain invalid
UTF-8 code unit sequences. These can never form part of any
pattern match. This facility allows valid UTF-8 strings to be
sought in executable or other binary files. For more details
about matching in non-valid UTF-8 strings, see the pcre2uni-
code(3) documentation.
-V, --version -V, --version
Write the version numbers of pcre2grep and the PCRE2 library Write the version numbers of pcre2grep and the PCRE2 library
to the standard output and then exit. Anything else on the to the standard output and then exit. Anything else on the
command line is ignored. command line is ignored.
-v, --invert-match -v, --invert-match
Invert the sense of the match, so that lines which do not Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found. match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp -w, --word-regex, --word-regexp
Force the patterns only to match "words". That is, there must Force the patterns only to match "words". That is, there must
be a word boundary at the start and end of each matched be a word boundary at the start and end of each matched
string. This is equivalent to having "\b(?:" at the start of string. This is equivalent to having "\b(?:" at the start of
each pattern, and ")\b" at the end. This option applies only each pattern, and ")\b" at the end. This option applies only
to the patterns that are matched against the contents of to the patterns that are matched against the contents of
files; it does not apply to patterns specified by any of the files; it does not apply to patterns specified by any of the
--include or --exclude options. --include or --exclude options.
-x, --line-regex, --line-regexp -x, --line-regex, --line-regexp
Force the patterns to start matching only at the beginnings Force the patterns to start matching only at the beginnings
of lines, and in addition, require them to match entire of lines, and in addition, require them to match entire
lines. In multiline mode the match may be more than one line. lines. In multiline mode the match may be more than one line.
This is equivalent to having "^(?:" at the start of each pat- This is equivalent to having "^(?:" at the start of each pat-
tern and ")$" at the end. This option applies only to the tern and ")$" at the end. This option applies only to the
patterns that are matched against the contents of files; it patterns that are matched against the contents of files; it
does not apply to patterns specified by any of the --include does not apply to patterns specified by any of the --include
or --exclude options. or --exclude options.
ENVIRONMENT VARIABLES ENVIRONMENT VARIABLES
The environment variables LC_ALL and LC_CTYPE are examined, in that The environment variables LC_ALL and LC_CTYPE are examined, in that
order, for a locale. The first one that is set is used. This can be order, for a locale. The first one that is set is used. This can be
overridden by the --locale option. If no locale is set, the PCRE2 overridden by the --locale option. If no locale is set, the PCRE2
library's default (usually the "C" locale) is used. library's default (usually the "C" locale) is used.
@@ -767,107 +774,107 @@ NEWLINES
The -N (--newline) option allows pcre2grep to scan files with different The -N (--newline) option allows pcre2grep to scan files with different
newline conventions from the default. Any parts of the input files that newline conventions from the default. Any parts of the input files that
are written to the standard output are copied identically, with what- are written to the standard output are copied identically, with what-
ever newline sequences they have in the input. However, the setting of ever newline sequences they have in the input. However, the setting of
this option affects only the way scanned files are processed. It does this option affects only the way scanned files are processed. It does
not affect the interpretation of files specified by the -f, --file- not affect the interpretation of files specified by the -f, --file-
list, --exclude-from, or --include-from options, nor does it affect the list, --exclude-from, or --include-from options, nor does it affect the
way in which pcre2grep writes informational messages to the standard way in which pcre2grep writes informational messages to the standard
error and output streams. For these it uses the string "\n" to indicate error and output streams. For these it uses the string "\n" to indicate
newlines, relying on the C I/O library to convert this to an appropri- newlines, relying on the C I/O library to convert this to an appropri-
ate sequence. ate sequence.
OPTIONS COMPATIBILITY OPTIONS COMPATIBILITY
Many of the short and long forms of pcre2grep's options are the same as Many of the short and long forms of pcre2grep's options are the same as
in the GNU grep program. Any long option of the form --xxx-regexp (GNU in the GNU grep program. Any long option of the form --xxx-regexp (GNU
terminology) is also available as --xxx-regex (PCRE2 terminology). How- terminology) is also available as --xxx-regex (PCRE2 terminology). How-
ever, the --depth-limit, --file-list, --file-offsets, --heap-limit, ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
--include-dir, --line-offsets, --locale, --match-limit, -M, --multi- --include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
line, -N, --newline, --om-separator, --output, -u, and --utf-8 options line, -N, --newline, --om-separator, --output, -u, --utf, -U, and
are specific to pcre2grep, as is the use of the --only-matching option --utf-allow-invalid options are specific to pcre2grep, as is the use of
with a capturing parentheses number. the --only-matching option with a capturing parentheses number.
Although most of the common options work the same way, a few are dif- Although most of the common options work the same way, a few are dif-
ferent in pcre2grep. For example, the --include option's argument is a ferent in pcre2grep. For example, the --include option's argument is a
glob for GNU grep, but a regular expression for pcre2grep. If both the glob for GNU grep, but a regular expression for pcre2grep. If both the
-c and -l options are given, GNU grep lists only file names, without -c and -l options are given, GNU grep lists only file names, without
counts, but pcre2grep gives the counts as well. counts, but pcre2grep gives the counts as well.
OPTIONS WITH DATA OPTIONS WITH DATA
There are four different ways in which an option with data can be spec- There are four different ways in which an option with data can be spec-
ified. If a short form option is used, the data may follow immedi- ified. If a short form option is used, the data may follow immedi-
ately, or (with one exception) in the next command line item. For exam- ately, or (with one exception) in the next command line item. For exam-
ple: ple:
-f/some/file -f/some/file
-f /some/file -f /some/file
The exception is the -o option, which may appear with or without data. The exception is the -o option, which may appear with or without data.
Because of this, if data is present, it must follow immediately in the Because of this, if data is present, it must follow immediately in the
same item, for example -o3. same item, for example -o3.
If a long form option is used, the data may appear in the same command If a long form option is used, the data may appear in the same command
line item, separated by an equals character, or (with two exceptions) line item, separated by an equals character, or (with two exceptions)
it may appear in the next command line item. For example: it may appear in the next command line item. For example:
--file=/some/file --file=/some/file
--file /some/file --file /some/file
Note, however, that if you want to supply a file name beginning with ~ Note, however, that if you want to supply a file name beginning with ~
as data in a shell command, and have the shell expand ~ to a home as data in a shell command, and have the shell expand ~ to a home
directory, you must separate the file name from the option, because the directory, you must separate the file name from the option, because the
shell does not treat ~ specially unless it is at the start of an item. shell does not treat ~ specially unless it is at the start of an item.
The exceptions to the above are the --colour (or --color) and --only- The exceptions to the above are the --colour (or --color) and --only-
matching options, for which the data is optional. If one of these matching options, for which the data is optional. If one of these
options does have data, it must be given in the first form, using an options does have data, it must be given in the first form, using an
equals character. Otherwise pcre2grep will assume that it has no data. equals character. Otherwise pcre2grep will assume that it has no data.
USING PCRE2'S CALLOUT FACILITY USING PCRE2'S CALLOUT FACILITY
pcre2grep has, by default, support for calling external programs or pcre2grep has, by default, support for calling external programs or
scripts or echoing specific strings during matching by making use of scripts or echoing specific strings during matching by making use of
PCRE2's callout facility. However, this support can be completely or PCRE2's callout facility. However, this support can be completely or
partially disabled when pcre2grep is built. You can find out whether partially disabled when pcre2grep is built. You can find out whether
your binary has support for callouts by running it with the --help your binary has support for callouts by running it with the --help
option. If callout support is completely disabled, all callouts in pat- option. If callout support is completely disabled, all callouts in pat-
terns are ignored by pcre2grep. If the facility is partially disabled, terns are ignored by pcre2grep. If the facility is partially disabled,
calling external programs is not supported, and callouts that request calling external programs is not supported, and callouts that request
it are ignored. it are ignored.
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu- A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
ment is either a number or a quoted string (see the pcre2callout docu- ment is either a number or a quoted string (see the pcre2callout docu-
mentation for details). Numbered callouts are ignored by pcre2grep; mentation for details). Numbered callouts are ignored by pcre2grep;
only callouts with string arguments are useful. only callouts with string arguments are useful.
Calling external programs or scripts Calling external programs or scripts
This facility can be independently disabled when pcre2grep is built. It This facility can be independently disabled when pcre2grep is built. It
is supported for Windows, where a call to _spawnvp() is used, for VMS, is supported for Windows, where a call to _spawnvp() is used, for VMS,
where lib$spawn() is used, and for any other Unix-like environment where lib$spawn() is used, and for any other Unix-like environment
where fork() and execv() are available. where fork() and execv() are available.
If the callout string does not start with a pipe (vertical bar) charac- If the callout string does not start with a pipe (vertical bar) charac-
ter, it is parsed into a list of substrings separated by pipe charac- ter, it is parsed into a list of substrings separated by pipe charac-
ters. The first substring must be an executable name, with the follow- ters. The first substring must be an executable name, with the follow-
ing substrings specifying arguments: ing substrings specifying arguments:
executable_name|arg1|arg2|... executable_name|arg1|arg2|...
Any substring (including the executable name) may contain escape Any substring (including the executable name) may contain escape
sequences started by a dollar character: $<digits> or ${<digits>} is sequences started by a dollar character: $<digits> or ${<digits>} is
replaced by the captured substring of the given decimal number, which replaced by the captured substring of the given decimal number, which
must be greater than zero. If the number is greater than the number of must be greater than zero. If the number is greater than the number of
capturing substrings, or if the capture is unset, the replacement is capturing substrings, or if the capture is unset, the replacement is
empty. empty.
Any other character is substituted by itself. In particular, $$ is Any other character is substituted by itself. In particular, $$ is
replaced by a single dollar and $| is replaced by a pipe character. replaced by a single dollar and $| is replaced by a pipe character.
Here is an example: Here is an example:
echo -e "abcde\n12345" | pcre2grep \ echo -e "abcde\n12345" | pcre2grep \
@@ -881,13 +888,13 @@ USING PCRE2'S CALLOUT FACILITY
Arg1: [1] [234] [4] Arg2: |1| () Arg1: [1] [234] [4] Arg2: |1| ()
12345 12345
The parameters for the system call that is used to run the program or The parameters for the system call that is used to run the program or
script are zero-terminated strings. This means that binary zero charac- script are zero-terminated strings. This means that binary zero charac-
ters in the callout argument will cause premature termination of their ters in the callout argument will cause premature termination of their
substrings, and therefore should not be present. Any syntax errors in substrings, and therefore should not be present. Any syntax errors in
the string (for example, a dollar not followed by another character) the string (for example, a dollar not followed by another character)
cause the callout to be ignored. If running the program fails for any cause the callout to be ignored. If running the program fails for any
reason (including the non-existence of the executable), a local match- reason (including the non-existence of the executable), a local match-
ing failure occurs and the matcher backtracks in the normal way. ing failure occurs and the matcher backtracks in the normal way.
Echoing a specific string Echoing a specific string
@@ -896,41 +903,41 @@ USING PCRE2'S CALLOUT FACILITY
pletely disabled when pcre2grep was built. If the callout string starts pletely disabled when pcre2grep was built. If the callout string starts
with a pipe (vertical bar) character, the rest of the string is written with a pipe (vertical bar) character, the rest of the string is written
to the output, having been passed through the same escape processing as to the output, having been passed through the same escape processing as
text from the --output option. This provides a simple echoing facility text from the --output option. This provides a simple echoing facility
that avoids calling an external program or script. No terminator is that avoids calling an external program or script. No terminator is
added to the string, so if you want a newline, you must include it added to the string, so if you want a newline, you must include it
explicitly. Matching continues normally after the string is output. If explicitly. Matching continues normally after the string is output. If
you want to see only the callout output but not any output from an you want to see only the callout output but not any output from an
actual match, you should end the relevant pattern with (*FAIL). actual match, you should end the relevant pattern with (*FAIL).
MATCHING ERRORS MATCHING ERRORS
It is possible to supply a regular expression that takes a very long It is possible to supply a regular expression that takes a very long
time to fail to match certain lines. Such patterns normally involve time to fail to match certain lines. Such patterns normally involve
nested indefinite repeats, for example: (a+)*\d when matched against a nested indefinite repeats, for example: (a+)*\d when matched against a
line of a's with no final digit. The PCRE2 matching function has a line of a's with no final digit. The PCRE2 matching function has a
resource limit that causes it to abort in these circumstances. If this resource limit that causes it to abort in these circumstances. If this
happens, pcre2grep outputs an error message and the line that caused happens, pcre2grep outputs an error message and the line that caused
the problem to the standard error stream. If there are more than 20 the problem to the standard error stream. If there are more than 20
such errors, pcre2grep gives up. such errors, pcre2grep gives up.
The --match-limit option of pcre2grep can be used to set the overall The --match-limit option of pcre2grep can be used to set the overall
resource limit. There are also other limits that affect the amount of resource limit. There are also other limits that affect the amount of
memory used during matching; see the discussion of --heap-limit and memory used during matching; see the discussion of --heap-limit and
--depth-limit above. --depth-limit above.
DIAGNOSTICS DIAGNOSTICS
Exit status is 0 if any matches were found, 1 if no matches were found, Exit status is 0 if any matches were found, 1 if no matches were found,
and 2 for syntax errors, overlong lines, non-existent or inaccessible and 2 for syntax errors, overlong lines, non-existent or inaccessible
files (even if matches were found in other files) or too many matching files (even if matches were found in other files) or too many matching
errors. Using the -s option to suppress error messages about inaccessi- errors. Using the -s option to suppress error messages about inaccessi-
ble files does not affect the return code. ble files does not affect the return code.
When run under VMS, the return code is placed in the symbol When run under VMS, the return code is placed in the symbol
PCRE2GREP_RC because VMS does not distinguish between exit(0) and PCRE2GREP_RC because VMS does not distinguish between exit(0) and
exit(1). exit(1).
@@ -948,5 +955,5 @@ AUTHOR
REVISION REVISION
Last updated: 24 November 2018 Last updated: 28 May 2019
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.