Implement REG_PEND (GNU extension) for the POSIX wrapper.
This commit is contained in:
parent
f850015168
commit
bcba497c0b
|
@ -182,6 +182,8 @@ deeply. (Compare item 10.23/36.) This should fix oss-fuzz #1761.
|
||||||
38. Fix returned offsets from regexec() when REG_STARTEND is used with a
|
38. Fix returned offsets from regexec() when REG_STARTEND is used with a
|
||||||
starting offset greater than zero.
|
starting offset greater than zero.
|
||||||
|
|
||||||
|
39. Implement REG_PEND (GNU extension) for the POSIX wrapper.
|
||||||
|
|
||||||
|
|
||||||
Version 10.23 14-February-2017
|
Version 10.23 14-February-2017
|
||||||
------------------------------
|
------------------------------
|
||||||
|
|
|
@ -69,7 +69,7 @@ replacement library. Other POSIX options are not even defined.
|
||||||
<P>
|
<P>
|
||||||
There are also some options that are not defined by POSIX. These have been
|
There are also some options that are not defined by POSIX. These have been
|
||||||
added at the request of users who want to make use of certain PCRE2-specific
|
added at the request of users who want to make use of certain PCRE2-specific
|
||||||
features via the POSIX calling interface.
|
features via the POSIX calling interface or to add BSD or GNU functionality.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||||
|
@ -91,10 +91,11 @@ identifying error codes.
|
||||||
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
|
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
|
||||||
<P>
|
<P>
|
||||||
The function <b>regcomp()</b> is called to compile a pattern into an
|
The function <b>regcomp()</b> is called to compile a pattern into an
|
||||||
internal form. The pattern is a C string terminated by a binary zero, and
|
internal form. By default, the pattern is a C string terminated by a binary
|
||||||
is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
|
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
|
||||||
to a <b>regex_t</b> structure that is used as a base for storing information
|
<b>regex_t</b> structure that is used as a base for storing information about
|
||||||
about the compiled regular expression.
|
the compiled regular expression. (It is also used for input when REG_PEND is
|
||||||
|
set.)
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The argument <i>cflags</i> is either zero, or contains one or more of the bits
|
The argument <i>cflags</i> is either zero, or contains one or more of the bits
|
||||||
|
@ -124,6 +125,16 @@ matching, the <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no
|
||||||
captured strings are returned. Versions of the PCRE library prior to 10.22 used
|
captured strings are returned. Versions of the PCRE library prior to 10.22 used
|
||||||
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
||||||
because it disables the use of back references.
|
because it disables the use of back references.
|
||||||
|
<pre>
|
||||||
|
REG_PEND
|
||||||
|
</pre>
|
||||||
|
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
|
||||||
|
(which has the type const char *) must be set to point to the character beyond
|
||||||
|
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
|
||||||
|
now contain binary zeroes, which are treated as data characters. Without
|
||||||
|
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
|
||||||
|
ignored. This is a GNU extension to the POSIX standard and should be used with
|
||||||
|
caution in software intended to be portable to other systems.
|
||||||
<pre>
|
<pre>
|
||||||
REG_UCP
|
REG_UCP
|
||||||
</pre>
|
</pre>
|
||||||
|
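The REG_PEND text in the hunk above can be illustrated with a minimal sketch of a counted-pattern compile. The helper name compile_counted and the build macro USE_PCRE2_POSIX are hypothetical and not part of PCRE2. Where the header does not define REG_PEND, the sketch assumes it is acceptable to reject patterns containing a binary zero and compile a NUL-terminated copy instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#ifdef USE_PCRE2_POSIX      /* hypothetical build macro, not part of PCRE2 */
#include <pcre2posix.h>
#else
#include <regex.h>          /* system header; may or may not define REG_PEND */
#endif

/* Hypothetical helper: compile the first `length` bytes of `pattern`.
   With REG_PEND available, re_endp marks the end of the pattern, so the
   pattern may contain binary zeros. Without it, fall back to a
   NUL-terminated copy and refuse patterns with embedded zeros. */
int compile_counted(regex_t *preg, const char *pattern, size_t length,
                    int cflags)
{
    assert(preg != NULL && pattern != NULL);
#ifdef REG_PEND
    preg->re_endp = pattern + length;   /* must be set before regcomp() */
    return regcomp(preg, pattern, cflags | REG_PEND);
#else
    char *copy;
    int rc;

    if (memchr(pattern, '\0', length) != NULL)
        return REG_BADPAT;              /* cannot express a zero portably */
    copy = malloc(length + 1);
    if (copy == NULL)
        return REG_ESPACE;
    memcpy(copy, pattern, length);
    copy[length] = '\0';
    rc = regcomp(preg, copy, cflags);
    free(copy);
    return rc;
#endif
}
```

As with a direct regcomp() call, the preg structure should only be used, and later passed to regfree(), when the helper returns zero.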
@ -156,9 +167,10 @@ class such as [^a] (they are).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
|
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
|
||||||
<i>preg</i> structure is filled in on success, and one member of the structure
|
<i>preg</i> structure is filled in on success, and one other member of the
|
||||||
is public: <i>re_nsub</i> contains the number of capturing subpatterns in
|
structure (as well as <i>re_endp</i>) is public: <i>re_nsub</i> contains the
|
||||||
the regular expression. Various error codes are defined in the header file.
|
number of capturing subpatterns in the regular expression. Various error codes
|
||||||
|
are defined in the header file.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
|
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
|
||||||
|
@ -228,15 +240,26 @@ function.
|
||||||
<pre>
|
<pre>
|
||||||
REG_STARTEND
|
REG_STARTEND
|
||||||
</pre>
|
</pre>
|
||||||
The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
|
When this option is set, the subject string starts at <i>string</i> +
|
||||||
to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
|
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
|
||||||
(there need not actually be a NUL at that location), regardless of the value of
|
should point to the first character beyond the string. There may be binary
|
||||||
<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
|
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
||||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
way to pass a subject string that contains a binary zero.
|
||||||
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
</P>
|
||||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
<P>
|
||||||
how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
|
Whatever the value of <i>pmatch[0].rm_so</i>, the offsets of the matched string
|
||||||
mutually exclusive; the error REG_INVARG is returned.
|
and any captured substrings are still given relative to the start of
|
||||||
|
<i>string</i> itself. (Before PCRE2 release 10.30 these were given relative to
|
||||||
|
<i>string</i> + <i>pmatch[0].rm_so</i>, but this differs from other
|
||||||
|
implementations.)
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
This is a BSD extension, compatible with but not specified by IEEE Standard
|
||||||
|
1003.2 (POSIX.2), and should be used with caution in software intended to be
|
||||||
|
portable to other systems. Note that a non-zero <i>rm_so</i> does not imply
|
||||||
|
REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
|
||||||
|
not how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL
|
||||||
|
are mutually exclusive; the error REG_INVARG is returned.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||||
|
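The REG_STARTEND behaviour described in the hunk above can be sketched as a small wrapper around regexec(). The helper name search_window and the build macro USE_PCRE2_POSIX are hypothetical. The sketch assumes a header that defines REG_STARTEND, either pcre2posix.h or a system regex.h that supports this BSD extension.

```c
#include <assert.h>
#include <stddef.h>
#ifdef USE_PCRE2_POSIX      /* hypothetical build macro, not part of PCRE2 */
#include <pcre2posix.h>
#else
#include <regex.h>          /* assumes the platform supports REG_STARTEND */
#endif

/* Hypothetical helper: search only the bytes buf[start] .. buf[end - 1].
   The window is passed in through match->rm_so and match->rm_eo, so the
   subject may contain binary zeros and need not be NUL-terminated at
   `end`. On success the same structure receives the offsets of the match. */
int search_window(const regex_t *preg, const char *buf,
                  size_t start, size_t end, regmatch_t *match)
{
    assert(preg != NULL && buf != NULL && match != NULL && start <= end);
    match->rm_so = (regoff_t)start;     /* first byte of the subject */
    match->rm_eo = (regoff_t)end;       /* first byte beyond the subject */
    return regexec(preg, buf, 1, match, REG_STARTEND);
}
```

For example, with the pattern b+ and the nine-byte buffer "ab\0cabbbc", a window of 2 to 9 should report a match at offsets 5 to 8. Those offsets are relative to the start of the buffer, not to the start of the window, which is the behaviour documented for PCRE2 from release 10.30.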
@ -291,9 +314,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 31 January 2016
|
Last updated: 05 June 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -1078,6 +1078,19 @@ are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
|
||||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
|
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
|
||||||
The other modifiers are ignored, with a warning message.
|
The other modifiers are ignored, with a warning message.
|
||||||
</P>
|
</P>
|
||||||
|
<P>
|
||||||
|
There is one additional modifier that can be used with the POSIX wrapper. It is
|
||||||
|
ignored (with a warning) if used for non-POSIX matching.
|
||||||
|
<pre>
|
||||||
|
posix_startend=<n>[:<m>]
|
||||||
|
</pre>
|
||||||
|
This causes the subject string to be passed to <b>regexec()</b> using the
|
||||||
|
REG_STARTEND option, which uses offsets to restrict which part of the string is
|
||||||
|
searched. If only one number is given, the end offset is passed as the end of
|
||||||
|
the subject string. For more detail of REG_STARTEND, see the
|
||||||
|
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||||
|
documentation.
|
||||||
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting match controls
|
Setting match controls
|
||||||
</b><br>
|
</b><br>
|
||||||
|
@ -1817,7 +1830,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 01 June 2017
|
Last updated: 03 June 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2017 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
147
doc/pcre2.txt
|
@ -8986,32 +8986,34 @@ DESCRIPTION
|
||||||
|
|
||||||
There are also some options that are not defined by POSIX. These have
|
There are also some options that are not defined by POSIX. These have
|
||||||
been added at the request of users who want to make use of certain
|
been added at the request of users who want to make use of certain
|
||||||
PCRE2-specific features via the POSIX calling interface.
|
PCRE2-specific features via the POSIX calling interface or to add BSD
|
||||||
|
or GNU functionality.
|
||||||
|
|
||||||
When PCRE2 is called via these functions, it is only the API that is
|
When PCRE2 is called via these functions, it is only the API that is
|
||||||
POSIX-like in style. The syntax and semantics of the regular expres-
|
POSIX-like in style. The syntax and semantics of the regular expres-
|
||||||
sions themselves are still those of Perl, subject to the setting of
|
sions themselves are still those of Perl, subject to the setting of
|
||||||
various PCRE2 options, as described below. "POSIX-like in style" means
|
various PCRE2 options, as described below. "POSIX-like in style" means
|
||||||
that the API approximates to the POSIX definition; it is not fully
|
that the API approximates to the POSIX definition; it is not fully
|
||||||
POSIX-compatible, and in multi-unit encoding domains it is probably
|
POSIX-compatible, and in multi-unit encoding domains it is probably
|
||||||
even less compatible.
|
even less compatible.
|
||||||
|
|
||||||
The header for these functions is supplied as pcre2posix.h to avoid any
|
The header for these functions is supplied as pcre2posix.h to avoid any
|
||||||
potential clash with other POSIX libraries. It can, of course, be
|
potential clash with other POSIX libraries. It can, of course, be
|
||||||
renamed or aliased as regex.h, which is the "correct" name. It provides
|
renamed or aliased as regex.h, which is the "correct" name. It provides
|
||||||
two structure types, regex_t for compiled internal forms, and reg-
|
two structure types, regex_t for compiled internal forms, and reg-
|
||||||
match_t for returning captured substrings. It also defines some con-
|
match_t for returning captured substrings. It also defines some con-
|
||||||
stants whose names start with "REG_"; these are used for setting
|
stants whose names start with "REG_"; these are used for setting
|
||||||
options and identifying error codes.
|
options and identifying error codes.
|
||||||
|
|
||||||
|
|
||||||
COMPILING A PATTERN
|
COMPILING A PATTERN
|
||||||
|
|
||||||
The function regcomp() is called to compile a pattern into an internal
|
The function regcomp() is called to compile a pattern into an internal
|
||||||
form. The pattern is a C string terminated by a binary zero, and is
|
form. By default, the pattern is a C string terminated by a binary zero
|
||||||
passed in the argument pattern. The preg argument is a pointer to a
|
(but see REG_PEND below). The preg argument is a pointer to a regex_t
|
||||||
regex_t structure that is used as a base for storing information about
|
structure that is used as a base for storing information about the com-
|
||||||
the compiled regular expression.
|
piled regular expression. (It is also used for input when REG_PEND is
|
||||||
|
set.)
|
||||||
|
|
||||||
The argument cflags is either zero, or contains one or more of the bits
|
The argument cflags is either zero, or contains one or more of the bits
|
||||||
defined by the following macros:
|
defined by the following macros:
|
||||||
|
@ -9042,38 +9044,50 @@ COMPILING A PATTERN
|
||||||
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
|
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
|
||||||
longer happens because it disables the use of back references.
|
longer happens because it disables the use of back references.
|
||||||
|
|
||||||
|
REG_PEND
|
||||||
|
|
||||||
|
If this option is set, the re_endp field in the preg structure (which
|
||||||
|
has the type const char *) must be set to point to the character beyond
|
||||||
|
the end of the pattern before calling regcomp(). The pattern itself may
|
||||||
|
now contain binary zeroes, which are treated as data characters. With-
|
||||||
|
out REG_PEND, a binary zero terminates the pattern and the re_endp
|
||||||
|
field is ignored. This is a GNU extension to the POSIX standard and
|
||||||
|
should be used with caution in software intended to be portable to
|
||||||
|
other systems.
|
||||||
|
|
||||||
REG_UCP
|
REG_UCP
|
||||||
|
|
||||||
The PCRE2_UCP option is set when the regular expression is passed for
|
The PCRE2_UCP option is set when the regular expression is passed for
|
||||||
compilation to the native function. This causes PCRE2 to use Unicode
|
compilation to the native function. This causes PCRE2 to use Unicode
|
||||||
properties when matching \d, \w, etc., instead of just recognizing
|
properties when matching \d, \w, etc., instead of just recognizing
|
||||||
ASCII values. Note that REG_UCP is not part of the POSIX standard.
|
ASCII values. Note that REG_UCP is not part of the POSIX standard.
|
||||||
|
|
||||||
REG_UNGREEDY
|
REG_UNGREEDY
|
||||||
|
|
||||||
The PCRE2_UNGREEDY option is set when the regular expression is passed
|
The PCRE2_UNGREEDY option is set when the regular expression is passed
|
||||||
for compilation to the native function. Note that REG_UNGREEDY is not
|
for compilation to the native function. Note that REG_UNGREEDY is not
|
||||||
part of the POSIX standard.
|
part of the POSIX standard.
|
||||||
|
|
||||||
REG_UTF
|
REG_UTF
|
||||||
|
|
||||||
The PCRE2_UTF option is set when the regular expression is passed for
|
The PCRE2_UTF option is set when the regular expression is passed for
|
||||||
compilation to the native function. This causes the pattern itself and
|
compilation to the native function. This causes the pattern itself and
|
||||||
all data strings used for matching it to be treated as UTF-8 strings.
|
all data strings used for matching it to be treated as UTF-8 strings.
|
||||||
Note that REG_UTF is not part of the POSIX standard.
|
Note that REG_UTF is not part of the POSIX standard.
|
||||||
|
|
||||||
In the absence of these flags, no options are passed to the native
|
In the absence of these flags, no options are passed to the native
|
||||||
function. This means that the regex is compiled with PCRE2 default
|
function. This means that the regex is compiled with PCRE2 default
|
||||||
semantics. In particular, the way it handles newline characters in the
|
semantics. In particular, the way it handles newline characters in the
|
||||||
subject string is the Perl way, not the POSIX way. Note that setting
|
subject string is the Perl way, not the POSIX way. Note that setting
|
||||||
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
|
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
|
||||||
It does not affect the way newlines are matched by the dot metacharac-
|
It does not affect the way newlines are matched by the dot metacharac-
|
||||||
ter (they are not) or by a negative class such as [^a] (they are).
|
ter (they are not) or by a negative class such as [^a] (they are).
|
||||||
|
|
||||||
The yield of regcomp() is zero on success, and non-zero otherwise. The
|
The yield of regcomp() is zero on success, and non-zero otherwise. The
|
||||||
preg structure is filled in on success, and one member of the structure
|
preg structure is filled in on success, and one other member of the
|
||||||
is public: re_nsub contains the number of capturing subpatterns in the
|
structure (as well as re_endp) is public: re_nsub contains the number
|
||||||
regular expression. Various error codes are defined in the header file.
|
of capturing subpatterns in the regular expression. Various error codes
|
||||||
|
are defined in the header file.
|
||||||
|
|
||||||
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
|
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
|
||||||
use the contents of the preg structure. If, for example, you pass it to
|
use the contents of the preg structure. If, for example, you pass it to
|
||||||
|
@ -9146,57 +9160,66 @@ MATCHING A PATTERN
|
||||||
|
|
||||||
REG_STARTEND
|
REG_STARTEND
|
||||||
|
|
||||||
The string is considered to start at string + pmatch[0].rm_so and to
|
When this option is set, the subject string starts at string +
|
||||||
have a terminating NUL located at string + pmatch[0].rm_eo (there need
|
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
|
||||||
not actually be a NUL at that location), regardless of the value of
|
point to the first character beyond the string. There may be binary
|
||||||
nmatch. This is a BSD extension, compatible with but not specified by
|
zeroes within the subject string, and indeed, using REG_STARTEND is the
|
||||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
|
only way to pass a subject string that contains a binary zero.
|
||||||
software intended to be portable to other systems. Note that a non-zero
|
|
||||||
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
|
Whatever the value of pmatch[0].rm_so, the offsets of the matched
|
||||||
of the string, not how it is matched. Setting REG_STARTEND and passing
|
string and any captured substrings are still given relative to the
|
||||||
pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
start of string itself. (Before PCRE2 release 10.30 these were given
|
||||||
|
relative to string + pmatch[0].rm_so, but this differs from other
|
||||||
|
implementations.)
|
||||||
|
|
||||||
|
This is a BSD extension, compatible with but not specified by IEEE
|
||||||
|
Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||||
|
intended to be portable to other systems. Note that a non-zero rm_so
|
||||||
|
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
|
||||||
|
length of the string, not how it is matched. Setting REG_STARTEND and
|
||||||
|
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
||||||
returned.
|
returned.
|
||||||
|
|
||||||
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
||||||
matched strings is returned. The nmatch and pmatch arguments of
|
matched strings is returned. The nmatch and pmatch arguments of
|
||||||
regexec() are ignored (except possibly as input for REG_STARTEND).
|
regexec() are ignored (except possibly as input for REG_STARTEND).
|
||||||
|
|
||||||
The value of nmatch may be zero, and the value of pmatch may be NULL
|
The value of nmatch may be zero, and the value of pmatch may be NULL
|
||||||
(unless REG_STARTEND is set); in both these cases no data about any
|
(unless REG_STARTEND is set); in both these cases no data about any
|
||||||
matched strings is returned.
|
matched strings is returned.
|
||||||
|
|
||||||
Otherwise, the portion of the string that was matched, and also any
|
Otherwise, the portion of the string that was matched, and also any
|
||||||
captured substrings, are returned via the pmatch argument, which points
|
captured substrings, are returned via the pmatch argument, which points
|
||||||
to an array of nmatch structures of type regmatch_t, containing the
|
to an array of nmatch structures of type regmatch_t, containing the
|
||||||
members rm_so and rm_eo. These contain the byte offset to the first
|
members rm_so and rm_eo. These contain the byte offset to the first
|
||||||
character of each substring and the offset to the first character after
|
character of each substring and the offset to the first character after
|
||||||
the end of each substring, respectively. The 0th element of the vector
|
the end of each substring, respectively. The 0th element of the vector
|
||||||
relates to the entire portion of string that was matched; subsequent
|
relates to the entire portion of string that was matched; subsequent
|
||||||
elements relate to the capturing subpatterns of the regular expression.
|
elements relate to the capturing subpatterns of the regular expression.
|
||||||
Unused entries in the array have both structure members set to -1.
|
Unused entries in the array have both structure members set to -1.
|
||||||
|
|
||||||
A successful match yields a zero return; various error codes are
|
A successful match yields a zero return; various error codes are
|
||||||
defined in the header file, of which REG_NOMATCH is the "expected"
|
defined in the header file, of which REG_NOMATCH is the "expected"
|
||||||
failure code.
|
failure code.
|
||||||
|
|
||||||
|
|
||||||
ERROR MESSAGES
|
ERROR MESSAGES
|
||||||
|
|
||||||
The regerror() function maps a non-zero errorcode from either regcomp()
|
The regerror() function maps a non-zero errorcode from either regcomp()
|
||||||
or regexec() to a printable message. If preg is not NULL, the error
|
or regexec() to a printable message. If preg is not NULL, the error
|
||||||
should have arisen from the use of that structure. A message terminated
|
should have arisen from the use of that structure. A message terminated
|
||||||
by a binary zero is placed in errbuf. If the buffer is too short, only
|
by a binary zero is placed in errbuf. If the buffer is too short, only
|
||||||
the first errbuf_size - 1 characters of the error message are used. The
|
the first errbuf_size - 1 characters of the error message are used. The
|
||||||
yield of the function is the size of buffer needed to hold the whole
|
yield of the function is the size of buffer needed to hold the whole
|
||||||
message, including the terminating zero. This value is greater than
|
message, including the terminating zero. This value is greater than
|
||||||
errbuf_size if the message was truncated.
|
errbuf_size if the message was truncated.
|
||||||
|
|
||||||
|
|
||||||
MEMORY USAGE
|
MEMORY USAGE
|
||||||
|
|
||||||
Compiling a regular expression causes memory to be allocated and asso-
|
Compiling a regular expression causes memory to be allocated and asso-
|
||||||
ciated with the preg structure. The function regfree() frees all such
|
ciated with the preg structure. The function regfree() frees all such
|
||||||
memory, after which preg may no longer be used as a compiled expres-
|
memory, after which preg may no longer be used as a compiled expres-
|
||||||
sion.
|
sion.
|
||||||
|
|
||||||
|
|
||||||
|
@ -9209,8 +9232,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 31 January 2016
|
Last updated: 05 June 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2POSIX 3 "03 June 2017" "PCRE2 10.30"
|
.TH PCRE2POSIX 3 "05 June 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "SYNOPSIS"
|
.SH "SYNOPSIS"
|
||||||
|
@ -46,7 +46,7 @@ replacement library. Other POSIX options are not even defined.
|
||||||
.P
|
.P
|
||||||
There are also some options that are not defined by POSIX. These have been
|
There are also some options that are not defined by POSIX. These have been
|
||||||
added at the request of users who want to make use of certain PCRE2-specific
|
added at the request of users who want to make use of certain PCRE2-specific
|
||||||
features via the POSIX calling interface.
|
features via the POSIX calling interface or to add BSD or GNU functionality.
|
||||||
.P
|
.P
|
||||||
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||||
in style. The syntax and semantics of the regular expressions themselves are
|
in style. The syntax and semantics of the regular expressions themselves are
|
||||||
|
@ -68,10 +68,11 @@ identifying error codes.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The function \fBregcomp()\fP is called to compile a pattern into an
|
The function \fBregcomp()\fP is called to compile a pattern into an
|
||||||
internal form. The pattern is a C string terminated by a binary zero, and
|
internal form. By default, the pattern is a C string terminated by a binary
|
||||||
is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
|
zero (but see REG_PEND below). The \fIpreg\fP argument is a pointer to a
|
||||||
to a \fBregex_t\fP structure that is used as a base for storing information
|
\fBregex_t\fP structure that is used as a base for storing information about
|
||||||
about the compiled regular expression.
|
the compiled regular expression. (It is also used for input when REG_PEND is
|
||||||
|
set.)
|
||||||
.P
|
.P
|
||||||
The argument \fIcflags\fP is either zero, or contains one or more of the bits
|
The argument \fIcflags\fP is either zero, or contains one or more of the bits
|
||||||
defined by the following macros:
|
defined by the following macros:
|
||||||
|
@ -100,6 +101,16 @@ matching, the \fInmatch\fP and \fIpmatch\fP arguments are ignored, and no
|
||||||
captured strings are returned. Versions of the PCRE library prior to 10.22 used
|
captured strings are returned. Versions of the PCRE library prior to 10.22 used
|
||||||
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
||||||
because it disables the use of back references.
|
because it disables the use of back references.
|
||||||
|
.sp
|
||||||
|
REG_PEND
|
||||||
|
.sp
|
||||||
|
If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure
|
||||||
|
(which has the type const char *) must be set to point to the character beyond
|
||||||
|
the end of the pattern before calling \fBregcomp()\fP. The pattern itself may
|
||||||
|
now contain binary zeroes, which are treated as data characters. Without
|
||||||
|
REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is
|
||||||
|
ignored. This is a GNU extension to the POSIX standard and should be used with
|
||||||
|
caution in software intended to be portable to other systems.
|
||||||
.sp
|
.sp
|
||||||
REG_UCP
|
REG_UCP
|
||||||
.sp
|
.sp
|
||||||
|
@ -130,9 +141,10 @@ newlines are matched by the dot metacharacter (they are not) or by a negative
|
||||||
class such as [^a] (they are).
|
class such as [^a] (they are).
|
||||||
.P
|
.P
|
||||||
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
|
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
|
||||||
\fIpreg\fP structure is filled in on success, and one member of the structure
|
\fIpreg\fP structure is filled in on success, and one other member of the
|
||||||
is public: \fIre_nsub\fP contains the number of capturing subpatterns in
|
structure (as well as \fIre_endp\fP) is public: \fIre_nsub\fP contains the
|
||||||
the regular expression. Various error codes are defined in the header file.
|
number of capturing subpatterns in the regular expression. Various error codes
|
||||||
|
are defined in the header file.
|
||||||
.P
|
.P
|
||||||
NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
|
NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
|
||||||
use the contents of the \fIpreg\fP structure. If, for example, you pass it to
|
use the contents of the \fIpreg\fP structure. If, for example, you pass it to
|
||||||
|
@ -204,21 +216,24 @@ function.
|
||||||
.sp
|
.sp
|
||||||
REG_STARTEND
|
REG_STARTEND
|
||||||
.sp
|
.sp
|
||||||
When this option is set, the string is considered to start at \fIstring\fP +
|
When this option is set, the subject string starts at \fIstring\fP +
|
||||||
\fIpmatch[0].rm_so\fP and to have a terminating NUL located at \fIstring\fP +
|
\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which
|
||||||
\fIpmatch[0].rm_eo\fP (there need not actually be a NUL at that location),
|
should point to the first character beyond the string. There may be binary
|
||||||
regardless of the value of \fInmatch\fP. However, the offsets of the matched
|
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
||||||
string and any captured substrings are still given relative to the start of
|
way to pass a subject string that contains a binary zero.
|
||||||
\fIstring\fP. (Before PCRE2 release 10.30 these were given relative to
|
.P
|
||||||
|
Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string
|
||||||
|
and any captured substrings are still given relative to the start of
|
||||||
|
\fIstring\fP itself. (Before PCRE2 release 10.30 these were given relative to
|
||||||
\fIstring\fP + \fIpmatch[0].rm_so\fP, but this differs from other
|
\fIstring\fP + \fIpmatch[0].rm_so\fP, but this differs from other
|
||||||
implementations.)
|
implementations.)
|
||||||
.P
|
.P
|
||||||
This is a BSD extension, compatible with but not specified by IEEE Standard
|
This is a BSD extension, compatible with but not specified by IEEE Standard
|
||||||
1003.2 (POSIX.2), and should be used with caution in software intended to be
|
1003.2 (POSIX.2), and should be used with caution in software intended to be
|
||||||
portable to other systems. Note that a non-zero \fIrm_so\fP does not imply
|
portable to other systems. Note that a non-zero \fIrm_so\fP does not imply
|
||||||
REG_NOTBOL; REG_STARTEND affects only the location of the string, not how it is
|
REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
|
||||||
matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL are mutually
|
not how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL
|
||||||
exclusive; the error REG_INVARG is returned.
|
are mutually exclusive; the error REG_INVARG is returned.
|
||||||
.P
|
.P
|
||||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||||
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
|
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
|
||||||
|
@ -277,6 +292,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 03 June 2017
|
Last updated: 05 June 2017
|
||||||
Copyright (c) 1997-2017 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -965,11 +965,22 @@ SUBJECT MODIFIERS
|
||||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
|
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
|
||||||
The other modifiers are ignored, with a warning message.
|
The other modifiers are ignored, with a warning message.
|
||||||
|
|
||||||
|
There is one additional modifier that can be used with the POSIX wrap-
|
||||||
|
per. It is ignored (with a warning) if used for non-POSIX matching.
|
||||||
|
|
||||||
|
posix_startend=<n>[:<m>]
|
||||||
|
|
||||||
|
This causes the subject string to be passed to regexec() using the
|
||||||
|
REG_STARTEND option, which uses offsets to restrict which part of the
|
||||||
|
string is searched. If only one number is given, the end offset is
|
||||||
|
passed as the end of the subject string. For more detail of REG_STAR-
|
||||||
|
TEND, see the pcre2posix documentation.
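
As a rough illustration of the `<n>[:<m>]` form, the following C sketch parses such a value and applies the default described above (a missing end offset becomes the end of the subject). The function name `parse_startend` is hypothetical and this is not pcre2test's actual parsing code; it does not check that the start offset is no greater than the end offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse "<n>" or "<n>:<m>".
   Returns 0 on success and -1 on malformed input. */
int parse_startend(const char *arg, size_t subject_len,
                   size_t *start, size_t *end)
{
    assert(arg != NULL && start != NULL && end != NULL);
    char *rest;
    if (*arg < '0' || *arg > '9') return -1;
    unsigned long n = strtoul(arg, &rest, 10);
    unsigned long m = (unsigned long)subject_len;  /* default: end of subject */
    if (*rest == ':') {
        const char *p = rest + 1;
        if (*p < '0' || *p > '9') return -1;
        m = strtoul(p, &rest, 10);
    }
    if (*rest != '\0') return -1;                  /* trailing junk */
    *start = (size_t)n;
    *end = (size_t)m;
    return 0;
}
```

With a 10-character subject, the value `2` would give the pair (2, 10) and the value `2:5` would give (2, 5).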
|
||||||
|
|
||||||
Setting match controls
|
Setting match controls
|
||||||
|
|
||||||
The following modifiers affect the matching process or request addi-
|
The following modifiers affect the matching process or request addi-
|
||||||
tional information. Some of them may also be specified on a pattern
|
tional information. Some of them may also be specified on a pattern
|
||||||
line (see above), in which case they apply to every subject line that
|
line (see above), in which case they apply to every subject line that
|
||||||
is matched against that pattern.
|
is matched against that pattern.
|
||||||
|
|
||||||
aftertext show text after match
|
aftertext show text after match
|
||||||
|
@ -1009,29 +1020,29 @@ SUBJECT MODIFIERS
|
||||||
zero_terminate pass the subject as zero-terminated
|
zero_terminate pass the subject as zero-terminated
|
||||||
|
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
||||||
and ovector subject modifiers work as described below. All other modi-
|
and ovector subject modifiers work as described below. All other modi-
|
||||||
fiers are either ignored, with a warning message, or cause an error.
|
fiers are either ignored, with a warning message, or cause an error.
|
||||||
|
|
||||||
Showing more text
|
Showing more text
|
||||||
|
|
||||||
The aftertext modifier requests that as well as outputting the part of
|
The aftertext modifier requests that as well as outputting the part of
|
||||||
the subject string that matched the entire pattern, pcre2test should in
|
the subject string that matched the entire pattern, pcre2test should in
|
||||||
addition output the remainder of the subject string. This is useful for
|
addition output the remainder of the subject string. This is useful for
|
||||||
tests where the subject contains multiple copies of the same substring.
|
tests where the subject contains multiple copies of the same substring.
|
||||||
The allaftertext modifier requests the same action for captured sub-
|
The allaftertext modifier requests the same action for captured sub-
|
||||||
strings as well as the main matched substring. In each case the remain-
|
strings as well as the main matched substring. In each case the remain-
|
||||||
der is output on the following line with a plus character following the
|
der is output on the following line with a plus character following the
|
||||||
capture number.
|
capture number.
|
||||||
|
|
||||||
The allusedtext modifier requests that all the text that was consulted
|
The allusedtext modifier requests that all the text that was consulted
|
||||||
during a successful pattern match by the interpreter should be shown.
|
during a successful pattern match by the interpreter should be shown.
|
||||||
This feature is not supported for JIT matching, and if requested with
|
This feature is not supported for JIT matching, and if requested with
|
||||||
JIT it is ignored (with a warning message). Setting this modifier
|
JIT it is ignored (with a warning message). Setting this modifier
|
||||||
affects the output if there is a lookbehind at the start of a match, or
|
affects the output if there is a lookbehind at the start of a match, or
|
||||||
a lookahead at the end, or if \K is used in the pattern. Characters
|
a lookahead at the end, or if \K is used in the pattern. Characters
|
||||||
that precede or follow the start and end of the actual match are indi-
|
that precede or follow the start and end of the actual match are indi-
|
||||||
cated in the output by '<' or '>' characters underneath them. Here is
|
cated in the output by '<' or '>' characters underneath them. Here is
|
||||||
an example:
|
an example:
|
||||||
|
|
||||||
re> /(?<=pqr)abc(?=xyz)/
|
re> /(?<=pqr)abc(?=xyz)/
|
||||||
|
@ -1039,16 +1050,16 @@ SUBJECT MODIFIERS
|
||||||
0: pqrabcxyz
|
0: pqrabcxyz
|
||||||
<<< >>>
|
<<< >>>
|
||||||
|
|
||||||
This shows that the matched string is "abc", with the preceding and
|
This shows that the matched string is "abc", with the preceding and
|
||||||
following strings "pqr" and "xyz" having been consulted during the
|
following strings "pqr" and "xyz" having been consulted during the
|
||||||
match (when processing the assertions).
|
match (when processing the assertions).
|
||||||
|
|
||||||
The startchar modifier requests that the starting character for the
|
The startchar modifier requests that the starting character for the
|
||||||
match be indicated, if it is different to the start of the matched
|
match be indicated, if it is different to the start of the matched
|
||||||
string. The only time when this occurs is when \K has been processed as
|
string. The only time when this occurs is when \K has been processed as
|
||||||
part of the match. In this situation, the output for the matched string
|
part of the match. In this situation, the output for the matched string
|
||||||
is displayed from the starting character instead of from the match
|
is displayed from the starting character instead of from the match
|
||||||
point, with circumflex characters under the earlier characters. For
|
point, with circumflex characters under the earlier characters. For
|
||||||
example:
|
example:
|
||||||
|
|
||||||
re> /abc\Kxyz/
|
re> /abc\Kxyz/
|
||||||
|
@ -1056,7 +1067,7 @@ SUBJECT MODIFIERS
|
||||||
0: abcxyz
|
0: abcxyz
|
||||||
^^^
|
^^^
|
||||||
|
|
||||||
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
||||||
ever, these two modifiers are mutually exclusive.
|
ever, these two modifiers are mutually exclusive.
|
||||||
|
|
||||||
Showing the value of all capture groups
|
Showing the value of all capture groups
|
||||||
|
@ -1064,98 +1075,98 @@ SUBJECT MODIFIERS
|
||||||
The allcaptures modifier requests that the values of all potential cap-
|
The allcaptures modifier requests that the values of all potential cap-
|
||||||
tured parentheses be output after a match. By default, only those up to
|
tured parentheses be output after a match. By default, only those up to
|
||||||
the highest one actually used in the match are output (corresponding to
|
the highest one actually used in the match are output (corresponding to
|
||||||
the return code from pcre2_match()). Groups that did not take part in
|
the return code from pcre2_match()). Groups that did not take part in
|
||||||
the match are output as "<unset>". This modifier is not relevant for
|
the match are output as "<unset>". This modifier is not relevant for
|
||||||
DFA matching (which does no capturing); it is ignored, with a warning
|
DFA matching (which does no capturing); it is ignored, with a warning
|
||||||
message, if present.
|
message, if present.
|
||||||
|
|
||||||
Testing callouts
|
Testing callouts
|
||||||
|
|
||||||
A callout function is supplied when pcre2test calls the library match-
|
A callout function is supplied when pcre2test calls the library match-
|
||||||
ing functions, unless callout_none is specified. If callout_capture is
|
ing functions, unless callout_none is specified. If callout_capture is
|
||||||
set, the current captured groups are output when a callout occurs. The
|
set, the current captured groups are output when a callout occurs. The
|
||||||
default return from the callout function is zero, which allows matching
|
default return from the callout function is zero, which allows matching
|
||||||
to continue.
|
to continue.
|
||||||
|
|
||||||
The callout_fail modifier can be given one or two numbers. If there is
|
The callout_fail modifier can be given one or two numbers. If there is
|
||||||
only one number, 1 is returned instead of 0 (causing matching to back-
|
only one number, 1 is returned instead of 0 (causing matching to back-
|
||||||
track) when a callout of that number is reached. If two numbers
|
track) when a callout of that number is reached. If two numbers
|
||||||
(<n>:<m>) are given, 1 is returned when callout <n> is reached and
|
(<n>:<m>) are given, 1 is returned when callout <n> is reached and
|
||||||
there have been at least <m> callouts. The callout_error modifier is
|
there have been at least <m> callouts. The callout_error modifier is
|
||||||
similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
|
similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
|
||||||
entire matching process to be aborted. If both these modifiers are set
|
entire matching process to be aborted. If both these modifiers are set
|
||||||
for the same callout number, callout_error takes precedence.
|
for the same callout number, callout_error takes precedence.
|
||||||
|
|
||||||
Note that callouts with string arguments are always given the number
|
Note that callouts with string arguments are always given the number
|
||||||
zero. See "Callouts" below for a description of the output when a call-
|
zero. See "Callouts" below for a description of the output when a call-
|
||||||
out is taken.
|
out is taken.
|
||||||
|
|
||||||
The callout_data modifier can be given an unsigned or a negative num-
|
The callout_data modifier can be given an unsigned or a negative num-
|
||||||
ber. This is set as the "user data" that is passed to the matching
|
ber. This is set as the "user data" that is passed to the matching
|
||||||
function, and passed back when the callout function is invoked. Any
|
function, and passed back when the callout function is invoked. Any
|
||||||
value other than zero is used as a return from pcre2test's callout
|
value other than zero is used as a return from pcre2test's callout
|
||||||
function.
|
function.
|
||||||
|
|
||||||
Finding all matches in a string
|
Finding all matches in a string
|
||||||
|
|
||||||
Searching for all possible matches within a subject can be requested by
|
Searching for all possible matches within a subject can be requested by
|
||||||
the global or altglobal modifier. After finding a match, the matching
|
the global or altglobal modifier. After finding a match, the matching
|
||||||
function is called again to search the remainder of the subject. The
|
function is called again to search the remainder of the subject. The
|
||||||
difference between global and altglobal is that the former uses the
|
difference between global and altglobal is that the former uses the
|
||||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||||
searching at a new point within the entire string (which is what Perl
|
searching at a new point within the entire string (which is what Perl
|
||||||
does), whereas the latter passes over a shortened subject. This makes a
|
does), whereas the latter passes over a shortened subject. This makes a
|
||||||
difference to the matching process if the pattern begins with a lookbe-
|
difference to the matching process if the pattern begins with a lookbe-
|
||||||
hind assertion (including \b or \B).
|
hind assertion (including \b or \B).
|
||||||
|
|
||||||
If an empty string is matched, the next match is done with the
|
If an empty string is matched, the next match is done with the
|
||||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
||||||
for another, non-empty, match at the same point in the subject. If this
|
for another, non-empty, match at the same point in the subject. If this
|
||||||
match fails, the start offset is advanced, and the normal match is
|
match fails, the start offset is advanced, and the normal match is
|
||||||
retried. This imitates the way Perl handles such cases when using the
|
retried. This imitates the way Perl handles such cases when using the
|
||||||
/g modifier or the split() function. Normally, the start offset is
|
/g modifier or the split() function. Normally, the start offset is
|
||||||
advanced by one character, but if the newline convention recognizes
|
advanced by one character, but if the newline convention recognizes
|
||||||
CRLF as a newline, and the current character is CR followed by LF, an
|
CRLF as a newline, and the current character is CR followed by LF, an
|
||||||
advance of two characters occurs.
|
advance of two characters occurs.
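
The advance rule for a failed retry after an empty match can be sketched as follows. The function name is hypothetical, and the sketch assumes 8-bit code units in non-UTF mode, so that one character is one code unit; it is not taken from pcre2test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: new start offset after an empty match at offset,
   when the anchored non-empty retry has also failed. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    assert(subject != NULL && offset < length);
    /* Step over CR LF as a unit when the newline convention recognizes it. */
    if (crlf_is_newline && subject[offset] == '\r' &&
        offset + 1 < length && subject[offset + 1] == '\n')
        return offset + 2;
    return offset + 1;
}
```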
|
||||||
|
|
||||||
Testing substring extraction functions
|
Testing substring extraction functions
|
||||||
|
|
||||||
The copy and get modifiers can be used to test the pcre2_sub-
|
The copy and get modifiers can be used to test the pcre2_sub-
|
||||||
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
|
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
|
||||||
given more than once, and each can specify a group name or number, for
|
given more than once, and each can specify a group name or number, for
|
||||||
example:
|
example:
|
||||||
|
|
||||||
abcd\=copy=1,copy=3,get=G1
|
abcd\=copy=1,copy=3,get=G1
|
||||||
|
|
||||||
If the #subject command is used to set default copy and/or get lists,
|
If the #subject command is used to set default copy and/or get lists,
|
||||||
these can be unset by specifying a negative number to cancel all num-
|
these can be unset by specifying a negative number to cancel all num-
|
||||||
bered groups and an empty name to cancel all named groups.
|
bered groups and an empty name to cancel all named groups.
|
||||||
|
|
||||||
The getall modifier tests pcre2_substring_list_get(), which extracts
|
The getall modifier tests pcre2_substring_list_get(), which extracts
|
||||||
all captured substrings.
|
all captured substrings.
|
||||||
|
|
||||||
If the subject line is successfully matched, the substrings extracted
|
If the subject line is successfully matched, the substrings extracted
|
||||||
by the convenience functions are output with C, G, or L after the
|
by the convenience functions are output with C, G, or L after the
|
||||||
string number instead of a colon. This is in addition to the normal
|
string number instead of a colon. This is in addition to the normal
|
||||||
full list. The string length (that is, the return from the extraction
|
full list. The string length (that is, the return from the extraction
|
||||||
function) is given in parentheses after each substring, followed by the
|
function) is given in parentheses after each substring, followed by the
|
||||||
name when the extraction was by name.
|
name when the extraction was by name.
|
||||||
|
|
||||||
Testing the substitution function
|
Testing the substitution function
|
||||||
|
|
||||||
If the replace modifier is set, the pcre2_substitute() function is
|
If the replace modifier is set, the pcre2_substitute() function is
|
||||||
called instead of one of the matching functions. Note that replacement
|
called instead of one of the matching functions. Note that replacement
|
||||||
strings cannot contain commas, because a comma signifies the end of a
|
strings cannot contain commas, because a comma signifies the end of a
|
||||||
modifier. This is not thought to be an issue in a test program.
|
modifier. This is not thought to be an issue in a test program.
|
||||||
|
|
||||||
Unlike subject strings, pcre2test does not process replacement strings
|
Unlike subject strings, pcre2test does not process replacement strings
|
||||||
for escape sequences. In UTF mode, a replacement string is checked to
|
for escape sequences. In UTF mode, a replacement string is checked to
|
||||||
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||||
a UTF string of the appropriate code unit width. If it is not a valid
|
a UTF string of the appropriate code unit width. If it is not a valid
|
||||||
UTF-8 string, the individual code units are copied directly. This pro-
|
UTF-8 string, the individual code units are copied directly. This pro-
|
||||||
vides a means of passing an invalid UTF-8 string for testing purposes.
|
vides a means of passing an invalid UTF-8 string for testing purposes.
|
||||||
|
|
||||||
The following modifiers set options (in addition to the normal match
|
The following modifiers set options (in addition to the normal match
|
||||||
options) for pcre2_substitute():
|
options) for pcre2_substitute():
|
||||||
|
|
||||||
global PCRE2_SUBSTITUTE_GLOBAL
|
global PCRE2_SUBSTITUTE_GLOBAL
|
||||||
|
@ -1165,8 +1176,8 @@ SUBJECT MODIFIERS
|
||||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
|
|
||||||
|
|
||||||
After a successful substitution, the modified string is output, pre-
|
After a successful substitution, the modified string is output, pre-
|
||||||
ceded by the number of replacements. This may be zero if there were no
|
ceded by the number of replacements. This may be zero if there were no
|
||||||
matches. Here is a simple example of a substitution test:
|
matches. Here is a simple example of a substitution test:
|
||||||
|
|
||||||
/abc/replace=xxx
|
/abc/replace=xxx
|
||||||
|
@ -1175,12 +1186,12 @@ SUBJECT MODIFIERS
|
||||||
=abc=abc=\=global
|
=abc=abc=\=global
|
||||||
2: =xxx=xxx=
|
2: =xxx=xxx=
|
||||||
|
|
||||||
Subject and replacement strings should be kept relatively short (fewer
|
Subject and replacement strings should be kept relatively short (fewer
|
||||||
than 256 characters) for substitution tests, as fixed-size buffers are
|
than 256 characters) for substitution tests, as fixed-size buffers are
|
||||||
used. To make it easy to test for buffer overflow, if the replacement
|
used. To make it easy to test for buffer overflow, if the replacement
|
||||||
string starts with a number in square brackets, that number is passed
|
string starts with a number in square brackets, that number is passed
|
||||||
to pcre2_substitute() as the size of the output buffer, with the
|
to pcre2_substitute() as the size of the output buffer, with the
|
||||||
replacement string starting at the next character. Here is an example
|
replacement string starting at the next character. Here is an example
|
||||||
that tests the edge case:
|
that tests the edge case:
|
||||||
|
|
||||||
/abc/
|
/abc/
|
||||||
|
@ -1189,11 +1200,11 @@ SUBJECT MODIFIERS
|
||||||
123abc123\=replace=[9]XYZ
|
123abc123\=replace=[9]XYZ
|
||||||
Failed: error -47: no more memory
|
Failed: error -47: no more memory
|
||||||
|
|
||||||
The default action of pcre2_substitute() is to return
|
The default action of pcre2_substitute() is to return
|
||||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
||||||
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
||||||
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
||||||
through the motions of matching and substituting, in order to compute
|
through the motions of matching and substituting, in order to compute
|
||||||
the size of buffer that is required. When this happens, pcre2test shows
|
the size of buffer that is required. When this happens, pcre2test shows
|
||||||
the required buffer length (which includes space for the trailing zero)
|
the required buffer length (which includes space for the trailing zero)
|
||||||
as part of the error message. For example:
|
as part of the error message. For example:
|
||||||
|
@ -1203,151 +1214,151 @@ SUBJECT MODIFIERS
|
||||||
Failed: error -47: no more memory: 10 code units are needed
|
Failed: error -47: no more memory: 10 code units are needed
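
The figure of 10 code units in this example can be checked with a little arithmetic. The C sketch below uses a hypothetical function name and covers only the simple case of a single, literal replacement. Replacing the 3-character match in the 9-character subject 123abc123 with the 3-character string XYZ yields 9 code units of output, plus one for the trailing zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: output buffer size, in code units and including
   the trailing zero, for one literal replacement of one match. */
size_t substitute_buffer_needed(size_t subject_len, size_t match_len,
                                size_t replacement_len)
{
    assert(match_len <= subject_len);
    return subject_len - match_len + replacement_len + 1;
}
```

This is why a buffer size of 9 fails in the edge-case test shown earlier: the result itself fits in 9 code units, but the terminating zero does not.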
|
||||||
|
|
||||||
A replacement string is ignored with POSIX and DFA matching. Specifying
|
A replacement string is ignored with POSIX and DFA matching. Specifying
|
||||||
partial matching provokes an error return ("bad option value") from
|
partial matching provokes an error return ("bad option value") from
|
||||||
pcre2_substitute().
|
pcre2_substitute().
|
||||||
|
|
||||||
Setting the JIT stack size
|
Setting the JIT stack size
|
||||||
|
|
||||||
The jitstack modifier provides a way of setting the maximum stack size
|
The jitstack modifier provides a way of setting the maximum stack size
|
||||||
that is used by the just-in-time optimization code. It is ignored if
|
that is used by the just-in-time optimization code. It is ignored if
|
||||||
JIT optimization is not being used. The value is a number of kilobytes.
|
JIT optimization is not being used. The value is a number of kilobytes.
|
||||||
Providing a stack that is larger than the default 32K is necessary only
|
Providing a stack that is larger than the default 32K is necessary only
|
||||||
for very complicated patterns.
|
for very complicated patterns.
|
||||||
|
|
||||||
Setting heap, match, and depth limits
|
Setting heap, match, and depth limits
|
||||||
|
|
||||||
The heap_limit, match_limit, and depth_limit modifiers set the appro-
|
The heap_limit, match_limit, and depth_limit modifiers set the appro-
|
||||||
priate limits in the match context. These values are ignored when the
|
priate limits in the match context. These values are ignored when the
|
||||||
find_limits modifier is specified.
|
find_limits modifier is specified.
|
||||||
|
|
||||||
Finding minimum limits
|
Finding minimum limits
|
||||||
|
|
||||||
If the find_limits modifier is present on a subject line, pcre2test
|
If the find_limits modifier is present on a subject line, pcre2test
|
||||||
calls the relevant matching function several times, setting different
|
calls the relevant matching function several times, setting different
|
||||||
values in the match context via pcre2_set_heap_limit(),
|
values in the match context via pcre2_set_heap_limit(),
|
||||||
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
|
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
|
||||||
minimum values for each parameter that allows the match to complete
|
minimum values for each parameter that allows the match to complete
|
||||||
without error.
|
without error.
|
||||||
|
|
||||||
If JIT is being used, only the match limit is relevant. If DFA matching
|
If JIT is being used, only the match limit is relevant. If DFA matching
|
||||||
is being used, only the depth limit is relevant.
|
is being used, only the depth limit is relevant.
|
||||||
|
|
||||||
The match_limit number is a measure of the amount of backtracking that
|
The match_limit number is a measure of the amount of backtracking that
|
||||||
takes place, and learning the minimum value can be instructive. For
|
takes place, and learning the minimum value can be instructive. For
|
||||||
most simple matches, the number is quite small, but for patterns with
|
most simple matches, the number is quite small, but for patterns with
|
||||||
very large numbers of matching possibilities, it can become large very
|
very large numbers of matching possibilities, it can become large very
|
||||||
quickly with increasing length of subject string.
|
quickly with increasing length of subject string.
|
||||||
|
|
||||||
For non-DFA matching, the minimum depth_limit number is a measure of
|
For non-DFA matching, the minimum depth_limit number is a measure of
|
||||||
how much nested backtracking happens (that is, how deeply the pattern's
|
how much nested backtracking happens (that is, how deeply the pattern's
|
||||||
tree is searched). In the case of DFA matching, depth_limit controls
|
tree is searched). In the case of DFA matching, depth_limit controls
|
||||||
the depth of recursive calls of the internal function that is used for
|
the depth of recursive calls of the internal function that is used for
|
||||||
handling pattern recursion, lookaround assertions, and atomic groups.
|
handling pattern recursion, lookaround assertions, and atomic groups.
|
||||||
|
|
||||||
Showing MARK names
|
Showing MARK names
|
||||||
|
|
||||||
|
|
||||||
The mark modifier causes the names from backtracking control verbs that
|
The mark modifier causes the names from backtracking control verbs that
|
||||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||||
it is added to the non-match message.
|
it is added to the non-match message.
|
||||||
|
|
||||||
Showing memory usage
|
Showing memory usage
|
||||||
|
|
||||||
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
||||||
ory allocation and freeing calls that occur during a call to
|
ory allocation and freeing calls that occur during a call to
|
||||||
pcre2_match(). These occur only when a match requires a bigger vector
|
pcre2_match(). These occur only when a match requires a bigger vector
|
||||||
than the default for remembering backtracking points. In many cases
|
than the default for remembering backtracking points. In many cases
|
||||||
there will be no heap memory used and therefore no additional output.
|
there will be no heap memory used and therefore no additional output.
|
||||||
No heap memory is allocated during matching with pcre2_dfa_match or
|
No heap memory is allocated during matching with pcre2_dfa_match or
|
||||||
with JIT, so in those cases the memory modifier never has any effect.
|
with JIT, so in those cases the memory modifier never has any effect.
|
||||||
For this modifier to work, the null_context modifier must not be set on
|
For this modifier to work, the null_context modifier must not be set on
|
||||||
both the pattern and the subject, though it can be set on one or the
|
both the pattern and the subject, though it can be set on one or the
|
||||||
other.
|
other.
|
||||||
|
|
||||||
Setting a starting offset
|
Setting a starting offset
|
||||||
|
|
||||||
The offset modifier sets an offset in the subject string at which
|
The offset modifier sets an offset in the subject string at which
|
||||||
matching starts. Its value is a number of code units, not characters.
|
matching starts. Its value is a number of code units, not characters.
|
||||||
|
|
||||||
Setting an offset limit
|
Setting an offset limit
|
||||||
|
|
||||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||||
match cannot be found starting at or before this offset in the subject,
|
match cannot be found starting at or before this offset in the subject,
|
||||||
a "no match" return is given. The data value is a number of code units,
|
a "no match" return is given. The data value is a number of code units,
|
||||||
not characters. When this modifier is used, the use_offset_limit modi-
|
not characters. When this modifier is used, the use_offset_limit modi-
|
||||||
fier must have been set for the pattern; if not, an error is generated.
|
fier must have been set for the pattern; if not, an error is generated.
|
||||||
|
|
||||||
Setting the size of the output vector
|
Setting the size of the output vector
|
||||||
|
|
||||||
The ovector modifier applies only to the subject line in which it
|
The ovector modifier applies only to the subject line in which it
|
||||||
appears, though of course it can also be used to set a default in a
|
appears, though of course it can also be used to set a default in a
|
||||||
#subject command. It specifies the number of pairs of offsets that are
|
#subject command. It specifies the number of pairs of offsets that are
|
||||||
available for storing matching information. The default is 15.
|
available for storing matching information. The default is 15.
|
||||||
|
|
||||||
A value of zero is useful when testing the POSIX API because it causes
|
A value of zero is useful when testing the POSIX API because it causes
|
||||||
regexec() to be called with a NULL capture vector. When not testing the
|
regexec() to be called with a NULL capture vector. When not testing the
|
||||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||||
ate_from_pattern() to be called, in order to create a match block of
|
ate_from_pattern() to be called, in order to create a match block of
|
||||||
exactly the right size for the pattern. (It is not possible to create a
|
exactly the right size for the pattern. (It is not possible to create a
|
||||||
match block with a zero-length ovector; there is always at least one
|
match block with a zero-length ovector; there is always at least one
|
||||||
pair of offsets.)
|
pair of offsets.)
|
||||||
|
|
||||||
Passing the subject as zero-terminated
|
Passing the subject as zero-terminated
|
||||||
|
|
||||||
By default, the subject string is passed to a native API matching func-
|
By default, the subject string is passed to a native API matching func-
|
||||||
tion with its correct length. In order to test the facility for passing
|
tion with its correct length. In order to test the facility for passing
|
||||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
||||||
via the POSIX interface, this modifier has no effect, as there is no
|
via the POSIX interface, this modifier has no effect, as there is no
|
||||||
facility for passing a length.)
|
facility for passing a length.)
|
||||||
|
|
||||||
When testing pcre2_substitute(), this modifier also has the effect of
|
When testing pcre2_substitute(), this modifier also has the effect of
|
||||||
passing the replacement string as zero-terminated.
|
passing the replacement string as zero-terminated.
|
||||||
|
|
||||||
Passing a NULL context
|
Passing a NULL context
|
||||||
|
|
||||||
Normally, pcre2test passes a context block to pcre2_match(),
|
Normally, pcre2test passes a context block to pcre2_match(),
|
||||||
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
|
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
|
||||||
set, however, NULL is passed. This is for testing that the matching
|
set, however, NULL is passed. This is for testing that the matching
|
||||||
functions behave correctly in this case (they use default values). This
|
functions behave correctly in this case (they use default values). This
|
||||||
modifier cannot be used with the find_limits modifier or when testing
|
modifier cannot be used with the find_limits modifier or when testing
|
||||||
the substitution function.
|
the substitution function.
|
||||||
|
|
||||||
|
|
||||||
THE ALTERNATIVE MATCHING FUNCTION
|
THE ALTERNATIVE MATCHING FUNCTION
|
||||||
|
|
||||||
By default, pcre2test uses the standard PCRE2 matching function,
|
By default, pcre2test uses the standard PCRE2 matching function,
|
||||||
pcre2_match() to match each subject line. PCRE2 also supports an alter-
|
pcre2_match() to match each subject line. PCRE2 also supports an alter-
|
||||||
native matching function, pcre2_dfa_match(), which operates in a dif-
|
native matching function, pcre2_dfa_match(), which operates in a dif-
|
||||||
ferent way, and has some restrictions. The differences between the two
|
ferent way, and has some restrictions. The differences between the two
|
||||||
functions are described in the pcre2matching documentation.
|
functions are described in the pcre2matching documentation.
|
||||||
|
|
||||||
If the dfa modifier is set, the alternative matching function is used.
|
If the dfa modifier is set, the alternative matching function is used.
|
||||||
This function finds all possible matches at a given point in the sub-
|
This function finds all possible matches at a given point in the sub-
|
||||||
ject. If, however, the dfa_shortest modifier is set, processing stops
|
ject. If, however, the dfa_shortest modifier is set, processing stops
|
||||||
after the first match is found. This is always the shortest possible
|
after the first match is found. This is always the shortest possible
|
||||||
match.
|
match.
|
||||||
|
|
||||||
|
|
||||||
DEFAULT OUTPUT FROM pcre2test
|
DEFAULT OUTPUT FROM pcre2test
|
||||||
|
|
||||||
This section describes the output when the normal matching function,
|
This section describes the output when the normal matching function,
|
||||||
pcre2_match(), is being used.
|
pcre2_match(), is being used.
|
||||||
|
|
||||||
When a match succeeds, pcre2test outputs the list of captured sub-
|
When a match succeeds, pcre2test outputs the list of captured sub-
|
||||||
strings, starting with number 0 for the string that matched the whole
|
strings, starting with number 0 for the string that matched the whole
|
||||||
pattern. Otherwise, it outputs "No match" when the return is
|
pattern. Otherwise, it outputs "No match" when the return is
|
||||||
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
||||||
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
||||||
this is the entire substring that was inspected during the partial
|
this is the entire substring that was inspected during the partial
|
||||||
match; it may include characters before the actual match start if a
|
match; it may include characters before the actual match start if a
|
||||||
lookbehind assertion, \K, \b, or \B was involved.)
|
lookbehind assertion, \K, \b, or \B was involved.)
|
||||||
|
|
||||||
For any other return, pcre2test outputs the PCRE2 negative error number
|
For any other return, pcre2test outputs the PCRE2 negative error number
|
||||||
and a short descriptive phrase. If the error is a failed UTF string
|
and a short descriptive phrase. If the error is a failed UTF string
|
||||||
check, the code unit offset of the start of the failing character is
|
check, the code unit offset of the start of the failing character is
|
||||||
also output. Here is an example of an interactive pcre2test run.
|
also output. Here is an example of an interactive pcre2test run.
|
||||||
|
|
||||||
$ pcre2test
|
$ pcre2test
|
||||||
|
@ -1363,8 +1374,8 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
Unset capturing substrings that are not followed by one that is set are
|
Unset capturing substrings that are not followed by one that is set are
|
||||||
not shown by pcre2test unless the allcaptures modifier is specified. In
|
not shown by pcre2test unless the allcaptures modifier is specified. In
|
||||||
the following example, there are two capturing substrings, but when the
|
the following example, there are two capturing substrings, but when the
|
||||||
first data line is matched, the second, unset substring is not shown.
|
first data line is matched, the second, unset substring is not shown.
|
||||||
An "internal" unset substring is shown as "<unset>", as for the second
|
An "internal" unset substring is shown as "<unset>", as for the second
|
||||||
data line.
|
data line.
|
||||||
|
|
||||||
re> /(a)|(b)/
|
re> /(a)|(b)/
|
||||||
|
@ -1376,11 +1387,11 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
1: <unset>
|
1: <unset>
|
||||||
2: b
|
2: b
|
||||||
|
|
||||||
If the strings contain any non-printing characters, they are output as
|
If the strings contain any non-printing characters, they are output as
|
||||||
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
||||||
Otherwise they are output as \x{hh...} escapes. See below for the defi-
|
Otherwise they are output as \x{hh...} escapes. See below for the defi-
|
||||||
nition of non-printing characters. If the aftertext modifier is set,
|
nition of non-printing characters. If the aftertext modifier is set,
|
||||||
the output for substring 0 is followed by the rest of the subject
|
the output for substring 0 is followed by the rest of the subject
|
||||||
string, identified by "0+" like this:
|
string, identified by "0+" like this:
|
||||||
|
|
||||||
re> /cat/aftertext
|
re> /cat/aftertext
|
||||||
|
@ -1388,7 +1399,7 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
0: cat
|
0: cat
|
||||||
0+ aract
|
0+ aract
|
||||||
|
|
||||||
If global matching is requested, the results of successive matching
|
If global matching is requested, the results of successive matching
|
||||||
attempts are output in sequence, like this:
|
attempts are output in sequence, like this:
|
||||||
|
|
||||||
re> /\Bi(\w\w)/g
|
re> /\Bi(\w\w)/g
|
||||||
|
@ -1400,8 +1411,8 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
0: ipp
|
0: ipp
|
||||||
1: pp
|
1: pp
|
||||||
|
|
||||||
"No match" is output only if the first match attempt fails. Here is an
|
"No match" is output only if the first match attempt fails. Here is an
|
||||||
example of a failure message (the offset 4 that is specified by the
|
example of a failure message (the offset 4 that is specified by the
|
||||||
offset modifier is past the end of the subject string):
|
offset modifier is past the end of the subject string):
|
||||||
|
|
||||||
re> /xyz/
|
re> /xyz/
|
||||||
|
@ -1409,7 +1420,7 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
Error -24 (bad offset value)
|
Error -24 (bad offset value)
|
||||||
|
|
||||||
Note that whereas patterns can be continued over several lines (a plain
|
Note that whereas patterns can be continued over several lines (a plain
|
||||||
">" prompt is used for continuations), subject lines may not. However
|
">" prompt is used for continuations), subject lines may not. However
|
||||||
newlines can be included in a subject by means of the \n escape (or \r,
|
newlines can be included in a subject by means of the \n escape (or \r,
|
||||||
\r\n, etc., depending on the newline sequence setting).
|
\r\n, etc., depending on the newline sequence setting).
|
||||||
|
|
||||||
|
@ -1417,7 +1428,7 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||||
|
|
||||||
When the alternative matching function, pcre2_dfa_match(), is used, the
|
When the alternative matching function, pcre2_dfa_match(), is used, the
|
||||||
output consists of a list of all the matches that start at the first
|
output consists of a list of all the matches that start at the first
|
||||||
point in the subject where there is at least one match. For example:
|
point in the subject where there is at least one match. For example:
|
||||||
|
|
||||||
re> /(tang|tangerine|tan)/
|
re> /(tang|tangerine|tan)/
|
||||||
|
@ -1426,11 +1437,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||||
1: tang
|
1: tang
|
||||||
2: tan
|
2: tan
|
||||||
|
|
||||||
Using the normal matching function on this data finds only "tang". The
|
Using the normal matching function on this data finds only "tang". The
|
||||||
longest matching string is always given first (and numbered zero).
|
longest matching string is always given first (and numbered zero).
|
||||||
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
||||||
followed by the partially matching substring. Note that this is the
|
followed by the partially matching substring. Note that this is the
|
||||||
entire substring that was inspected during the partial match; it may
|
entire substring that was inspected during the partial match; it may
|
||||||
include characters before the actual match start if a lookbehind asser-
|
include characters before the actual match start if a lookbehind asser-
|
||||||
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
|
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
|
||||||
|
|
||||||
|
@ -1446,16 +1457,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||||
1: tan
|
1: tan
|
||||||
0: tan
|
0: tan
|
||||||
|
|
||||||
The alternative matching function does not support substring capture,
|
The alternative matching function does not support substring capture,
|
||||||
so the modifiers that are concerned with captured substrings are not
|
so the modifiers that are concerned with captured substrings are not
|
||||||
relevant.
|
relevant.
|
||||||
|
|
||||||
|
|
||||||
RESTARTING AFTER A PARTIAL MATCH
|
RESTARTING AFTER A PARTIAL MATCH
|
||||||
|
|
||||||
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
||||||
TIAL return, indicating that the subject partially matched the pattern,
|
TIAL return, indicating that the subject partially matched the pattern,
|
||||||
you can restart the match with additional subject data by means of the
|
you can restart the match with additional subject data by means of the
|
||||||
dfa_restart modifier. For example:
|
dfa_restart modifier. For example:
|
||||||
|
|
||||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
|
@ -1464,45 +1475,45 @@ RESTARTING AFTER A PARTIAL MATCH
|
||||||
data> n05\=dfa,dfa_restart
|
data> n05\=dfa,dfa_restart
|
||||||
0: n05
|
0: n05
|
||||||
|
|
||||||
For further information about partial matching, see the pcre2partial
|
For further information about partial matching, see the pcre2partial
|
||||||
documentation.
|
documentation.
|
||||||
|
|
||||||
|
|
||||||
CALLOUTS
|
CALLOUTS
|
||||||
|
|
||||||
If the pattern contains any callout requests, pcre2test's callout func-
|
If the pattern contains any callout requests, pcre2test's callout func-
|
||||||
tion is called during matching unless callout_none is specified. This
|
tion is called during matching unless callout_none is specified. This
|
||||||
works with both matching functions.
|
works with both matching functions.
|
||||||
|
|
||||||
The callout function in pcre2test returns zero (carry on matching) by
|
The callout function in pcre2test returns zero (carry on matching) by
|
||||||
default, but you can use a callout_fail modifier in a subject line (as
|
default, but you can use a callout_fail modifier in a subject line (as
|
||||||
described above) to change this and other parameters of the callout.
|
described above) to change this and other parameters of the callout.
|
||||||
|
|
||||||
Inserting callouts can be helpful when using pcre2test to check compli-
|
Inserting callouts can be helpful when using pcre2test to check compli-
|
||||||
cated regular expressions. For further information about callouts, see
|
cated regular expressions. For further information about callouts, see
|
||||||
the pcre2callout documentation.
|
the pcre2callout documentation.
|
||||||
|
|
||||||
The output for callouts with numerical arguments and those with string
|
The output for callouts with numerical arguments and those with string
|
||||||
arguments is slightly different.
|
arguments is slightly different.
|
||||||
|
|
||||||
Callouts with numerical arguments
|
Callouts with numerical arguments
|
||||||
|
|
||||||
By default, the callout function displays the callout number, the start
|
By default, the callout function displays the callout number, the start
|
||||||
and current positions in the subject text at the callout time, and the
|
and current positions in the subject text at the callout time, and the
|
||||||
next pattern item to be tested. For example:
|
next pattern item to be tested. For example:
|
||||||
|
|
||||||
--->pqrabcdef
|
--->pqrabcdef
|
||||||
0 ^ ^ \d
|
0 ^ ^ \d
|
||||||
|
|
||||||
This output indicates that callout number 0 occurred for a match
|
This output indicates that callout number 0 occurred for a match
|
||||||
attempt starting at the fourth character of the subject string, when
|
attempt starting at the fourth character of the subject string, when
|
||||||
the pointer was at the seventh character, and when the next pattern
|
the pointer was at the seventh character, and when the next pattern
|
||||||
item was \d. Just one circumflex is output if the start and current
|
item was \d. Just one circumflex is output if the start and current
|
||||||
positions are the same, or if the current position precedes the start
|
positions are the same, or if the current position precedes the start
|
||||||
position, which can happen if the callout is in a lookbehind assertion.
|
position, which can happen if the callout is in a lookbehind assertion.
|
||||||
|
|
||||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
||||||
a result of the /auto_callout pattern modifier. In this case, instead
|
a result of the /auto_callout pattern modifier. In this case, instead
|
||||||
of showing the callout number, the offset in the pattern, preceded by a
|
of showing the callout number, the offset in the pattern, preceded by a
|
||||||
plus, is output. For example:
|
plus, is output. For example:
|
||||||
|
|
||||||
|
@ -1516,7 +1527,7 @@ CALLOUTS
|
||||||
0: E*
|
0: E*
|
||||||
|
|
||||||
If a pattern contains (*MARK) items, an additional line is output when-
|
If a pattern contains (*MARK) items, an additional line is output when-
|
||||||
ever a change of latest mark is passed to the callout function. For
|
ever a change of latest mark is passed to the callout function. For
|
||||||
example:
|
example:
|
||||||
|
|
||||||
re> /a(*MARK:X)bc/auto_callout
|
re> /a(*MARK:X)bc/auto_callout
|
||||||
|
@ -1530,17 +1541,17 @@ CALLOUTS
|
||||||
+12 ^ ^
|
+12 ^ ^
|
||||||
0: abc
|
0: abc
|
||||||
|
|
||||||
The mark changes between matching "a" and "b", but stays the same for
|
The mark changes between matching "a" and "b", but stays the same for
|
||||||
the rest of the match, so nothing more is output. If, as a result of
|
the rest of the match, so nothing more is output. If, as a result of
|
||||||
backtracking, the mark reverts to being unset, the text "<unset>" is
|
backtracking, the mark reverts to being unset, the text "<unset>" is
|
||||||
output.
|
output.
|
||||||
|
|
||||||
Callouts with string arguments
|
Callouts with string arguments
|
||||||
|
|
||||||
The output for a callout with a string argument is similar, except that
|
The output for a callout with a string argument is similar, except that
|
||||||
instead of outputting a callout number before the position indicators,
|
instead of outputting a callout number before the position indicators,
|
||||||
the callout string and its offset in the pattern string are output
|
the callout string and its offset in the pattern string are output
|
||||||
before the reflection of the subject string, and the subject string is
|
before the reflection of the subject string, and the subject string is
|
||||||
reflected for each callout. For example:
|
reflected for each callout. For example:
|
||||||
|
|
||||||
re> /^ab(?C'first')cd(?C"second")ef/
|
re> /^ab(?C'first')cd(?C"second")ef/
|
||||||
|
@ -1557,43 +1568,43 @@ CALLOUTS
|
||||||
NON-PRINTING CHARACTERS
|
NON-PRINTING CHARACTERS
|
||||||
|
|
||||||
When pcre2test is outputting text in the compiled version of a pattern,
|
When pcre2test is outputting text in the compiled version of a pattern,
|
||||||
bytes other than 32-126 are always treated as non-printing characters
|
bytes other than 32-126 are always treated as non-printing characters
|
||||||
and are therefore shown as hex escapes.
|
and are therefore shown as hex escapes.
|
||||||
|
|
||||||
When pcre2test is outputting text that is a matched part of a subject
|
When pcre2test is outputting text that is a matched part of a subject
|
||||||
string, it behaves in the same way, unless a different locale has been
|
string, it behaves in the same way, unless a different locale has been
|
||||||
set for the pattern (using the locale modifier). In this case, the
|
set for the pattern (using the locale modifier). In this case, the
|
||||||
isprint() function is used to distinguish printing and non-printing
|
isprint() function is used to distinguish printing and non-printing
|
||||||
characters.
|
characters.
|
||||||
|
|
||||||
|
|
||||||
SAVING AND RESTORING COMPILED PATTERNS
|
SAVING AND RESTORING COMPILED PATTERNS
|
||||||
|
|
||||||
It is possible to save compiled patterns on disc or elsewhere, and
|
It is possible to save compiled patterns on disc or elsewhere, and
|
||||||
reload them later, subject to a number of restrictions. JIT data cannot
|
reload them later, subject to a number of restrictions. JIT data cannot
|
||||||
be saved. The host on which the patterns are reloaded must be running
|
be saved. The host on which the patterns are reloaded must be running
|
||||||
the same version of PCRE2, with the same code unit width, and must also
|
the same version of PCRE2, with the same code unit width, and must also
|
||||||
have the same endianness, pointer width and PCRE2_SIZE type. Before
|
have the same endianness, pointer width and PCRE2_SIZE type. Before
|
||||||
compiled patterns can be saved they must be serialized, that is, con-
|
compiled patterns can be saved they must be serialized, that is, con-
|
||||||
verted to a stream of bytes. A single byte stream may contain any num-
|
verted to a stream of bytes. A single byte stream may contain any num-
|
||||||
ber of compiled patterns, but they must all use the same character
|
ber of compiled patterns, but they must all use the same character
|
||||||
tables. A single copy of the tables is included in the byte stream (its
|
tables. A single copy of the tables is included in the byte stream (its
|
||||||
size is 1088 bytes).
|
size is 1088 bytes).
|
||||||
|
|
||||||
The functions whose names begin with pcre2_serialize_ are used for
|
The functions whose names begin with pcre2_serialize_ are used for
|
||||||
serializing and de-serializing. They are described in the pcre2serial-
|
serializing and de-serializing. They are described in the pcre2serial-
|
||||||
ize documentation. In this section we describe the features of
|
ize documentation. In this section we describe the features of
|
||||||
pcre2test that can be used to test these functions.
|
pcre2test that can be used to test these functions.
|
||||||
|
|
||||||
When a pattern with push modifier is successfully compiled, it is
|
When a pattern with push modifier is successfully compiled, it is
|
||||||
pushed onto a stack of compiled patterns, and pcre2test expects the
|
pushed onto a stack of compiled patterns, and pcre2test expects the
|
||||||
next line to contain a new pattern (or command) instead of a subject
|
next line to contain a new pattern (or command) instead of a subject
|
||||||
line. By contrast, the pushcopy modifier causes a copy of the compiled
|
line. By contrast, the pushcopy modifier causes a copy of the compiled
|
||||||
pattern to be stacked, leaving the original available for immediate
|
pattern to be stacked, leaving the original available for immediate
|
||||||
matching. By using push and/or pushcopy, a number of patterns can be
|
matching. By using push and/or pushcopy, a number of patterns can be
|
||||||
compiled and retained. These modifiers are incompatible with posix, and
|
compiled and retained. These modifiers are incompatible with posix, and
|
||||||
control modifiers that act at match time are ignored (with a message)
|
control modifiers that act at match time are ignored (with a message)
|
||||||
for the stacked patterns. The jitverify modifier applies only at com-
|
for the stacked patterns. The jitverify modifier applies only at com-
|
||||||
pile time.
|
pile time.
|
||||||
|
|
||||||
The command
|
The command
|
||||||
|
@ -1601,21 +1612,21 @@ SAVING AND RESTORING COMPILED PATTERNS
|
||||||
#save <filename>
|
#save <filename>
|
||||||
|
|
||||||
causes all the stacked patterns to be serialized and the result written
|
causes all the stacked patterns to be serialized and the result written
|
||||||
to the named file. Afterwards, all the stacked patterns are freed. The
|
to the named file. Afterwards, all the stacked patterns are freed. The
|
||||||
command
|
command
|
||||||
|
|
||||||
#load <filename>
|
#load <filename>
|
||||||
|
|
||||||
reads the data in the file, and then arranges for it to be de-serial-
|
reads the data in the file, and then arranges for it to be de-serial-
|
||||||
ized, with the resulting compiled patterns added to the pattern stack.
|
ized, with the resulting compiled patterns added to the pattern stack.
|
||||||
The pattern on the top of the stack can be retrieved by the #pop com-
|
The pattern on the top of the stack can be retrieved by the #pop com-
|
||||||
mand, which must be followed by lines of subjects that are to be
|
mand, which must be followed by lines of subjects that are to be
|
||||||
matched with the pattern, terminated as usual by an empty line or end
|
matched with the pattern, terminated as usual by an empty line or end
|
||||||
of file. This command may be followed by a modifier list containing
|
of file. This command may be followed by a modifier list containing
|
||||||
only control modifiers that act after a pattern has been compiled. In
|
only control modifiers that act after a pattern has been compiled. In
|
||||||
particular, hex, posix, posix_nosub, push, and pushcopy are not
|
particular, hex, posix, posix_nosub, push, and pushcopy are not
|
||||||
allowed, nor are any option-setting modifiers. The JIT modifiers are,
|
allowed, nor are any option-setting modifiers. The JIT modifiers are,
|
||||||
however permitted. Here is an example that saves and reloads two pat-
|
however permitted. Here is an example that saves and reloads two pat-
|
||||||
terns.
|
terns.
|
||||||
|
|
||||||
/abc/push
|
/abc/push
|
||||||
|
@ -1628,10 +1639,10 @@ SAVING AND RESTORING COMPILED PATTERNS
|
||||||
#pop jit,bincode
|
#pop jit,bincode
|
||||||
abc
|
abc
|
||||||
|
|
||||||
If jitverify is used with #pop, it does not automatically imply jit,
|
If jitverify is used with #pop, it does not automatically imply jit,
|
||||||
which is different behaviour from when it is used on a pattern.
|
which is different behaviour from when it is used on a pattern.
|
||||||
|
|
||||||
The #popcopy command is analogous to the pushcopy modifier in that it
|
The #popcopy command is analogous to the pushcopy modifier in that it
|
||||||
makes current a copy of the topmost stack pattern, leaving the original
|
makes current a copy of the topmost stack pattern, leaving the original
|
||||||
still on the stack.
|
still on the stack.
|
||||||
|
|
||||||
|
@ -1651,5 +1662,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 01 June 2017
|
Last updated: 03 June 2017
|
||||||
Copyright (c) 1997-2017 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
|
|
|
@ -231,10 +231,14 @@ PCRE2POSIX_EXP_DEFN int PCRE2_CALL_CONVENTION
|
||||||
regcomp(regex_t *preg, const char *pattern, int cflags)
|
regcomp(regex_t *preg, const char *pattern, int cflags)
|
||||||
{
|
{
|
||||||
PCRE2_SIZE erroffset;
|
PCRE2_SIZE erroffset;
|
||||||
|
PCRE2_SIZE patlen;
|
||||||
int errorcode;
|
int errorcode;
|
||||||
int options = 0;
|
int options = 0;
|
||||||
int re_nsub = 0;
|
int re_nsub = 0;
|
||||||
|
|
||||||
|
patlen = ((cflags & REG_PEND) != 0)? (PCRE2_SIZE)(preg->re_endp - pattern) :
|
||||||
|
PCRE2_ZERO_TERMINATED;
|
||||||
|
|
||||||
if ((cflags & REG_ICASE) != 0) options |= PCRE2_CASELESS;
|
if ((cflags & REG_ICASE) != 0) options |= PCRE2_CASELESS;
|
||||||
if ((cflags & REG_NEWLINE) != 0) options |= PCRE2_MULTILINE;
|
if ((cflags & REG_NEWLINE) != 0) options |= PCRE2_MULTILINE;
|
||||||
if ((cflags & REG_DOTALL) != 0) options |= PCRE2_DOTALL;
|
if ((cflags & REG_DOTALL) != 0) options |= PCRE2_DOTALL;
|
||||||
|
@ -243,8 +247,8 @@ if ((cflags & REG_UCP) != 0) options |= PCRE2_UCP;
|
||||||
if ((cflags & REG_UNGREEDY) != 0) options |= PCRE2_UNGREEDY;
|
if ((cflags & REG_UNGREEDY) != 0) options |= PCRE2_UNGREEDY;
|
||||||
|
|
||||||
preg->re_cflags = cflags;
|
preg->re_cflags = cflags;
|
||||||
preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED,
|
preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, patlen, options,
|
||||||
options, &errorcode, &erroffset, NULL);
|
&errorcode, &erroffset, NULL);
|
||||||
preg->re_erroffset = erroffset;
|
preg->re_erroffset = erroffset;
|
||||||
|
|
||||||
if (preg->re_pcre2_code == NULL)
|
if (preg->re_pcre2_code == NULL)
|
||||||
|
|
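The length selection that this hunk adds to regcomp() can be sketched in isolation. The following is a minimal, standalone illustration and not the wrapper's actual code: the names demo_regex_t, DEMO_REG_PEND, DEMO_ZERO_TERMINATED, demo_pattern_length() and demo_effective_length() are hypothetical stand-ins chosen so that the fragment compiles without the PCRE2 headers. It assumes, as the diff suggests, that the caller stores the end-of-pattern pointer in re_endp before the call and that the pointer is only read when the flag is set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for REG_PEND and PCRE2_ZERO_TERMINATED. */
#define DEMO_REG_PEND        0x0800
#define DEMO_ZERO_TERMINATED (~(size_t)0)

/* Hypothetical cut-down regex_t: only the field that matters here. */
typedef struct {
  const char *re_endp;  /* end of pattern; read only when DEMO_REG_PEND is set */
} demo_regex_t;

/* Choose the length to hand to the compiler: an explicit length when the
   caller supplied an end pointer, otherwise the "zero-terminated" sentinel. */
static size_t demo_pattern_length(const demo_regex_t *preg,
                                  const char *pattern, int cflags)
{
  if ((cflags & DEMO_REG_PEND) != 0)
    {
    assert(preg->re_endp >= pattern);  /* end pointer must not precede start */
    return (size_t)(preg->re_endp - pattern);
    }
  return DEMO_ZERO_TERMINATED;
}

/* Resolve the sentinel to a real length, as a compiler would have to do. */
static size_t demo_effective_length(size_t patlen, const char *pattern)
{
  return (patlen == DEMO_ZERO_TERMINATED) ? strlen(pattern) : patlen;
}
```

With a pattern buffer holding the three bytes 'A', 0, 'B' and re_endp pointing just past the 'B', the explicit-length path covers the embedded zero, whereas the zero-terminated path would stop after the 'A'. This is the situation the new hex test patterns exercise.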
|
@ -62,6 +62,7 @@ extern "C" {
|
||||||
#define REG_NOTEMPTY 0x0100 /* NOT defined by POSIX; maps to PCRE2_NOTEMPTY */
|
#define REG_NOTEMPTY 0x0100 /* NOT defined by POSIX; maps to PCRE2_NOTEMPTY */
|
||||||
#define REG_UNGREEDY 0x0200 /* NOT defined by POSIX; maps to PCRE2_UNGREEDY */
|
#define REG_UNGREEDY 0x0200 /* NOT defined by POSIX; maps to PCRE2_UNGREEDY */
|
||||||
#define REG_UCP 0x0400 /* NOT defined by POSIX; maps to PCRE2_UCP */
|
#define REG_UCP 0x0400 /* NOT defined by POSIX; maps to PCRE2_UCP */
|
||||||
|
#define REG_PEND 0x0800 /* GNU feature: pass end pattern by re_endp */
|
||||||
|
|
||||||
/* This is not used by PCRE2, but by defining it we make it easier
|
/* This is not used by PCRE2, but by defining it we make it easier
|
||||||
to slot PCRE2 into existing programs that make POSIX calls. */
|
to slot PCRE2 into existing programs that make POSIX calls. */
|
||||||
|
@ -91,11 +92,13 @@ enum {
|
||||||
};
|
};
|
||||||
|
|
||||||
|
|
||||||
/* The structure representing a compiled regular expression. */
|
/* The structure representing a compiled regular expression. It is also used
|
||||||
|
for passing the pattern end pointer when REG_PEND is set. */
|
||||||
|
|
||||||
typedef struct {
|
typedef struct {
|
||||||
void *re_pcre2_code;
|
void *re_pcre2_code;
|
||||||
void *re_match_data;
|
void *re_match_data;
|
||||||
|
const char *re_endp;
|
||||||
size_t re_nsub;
|
size_t re_nsub;
|
||||||
size_t re_erroffset;
|
size_t re_erroffset;
|
||||||
int re_cflags;
|
int re_cflags;
|
||||||
|
|
|
@ -699,7 +699,8 @@ static modstruct modlist[] = {
|
||||||
#define POSIX_SUPPORTED_COMPILE_EXTRA_OPTIONS (0)
|
#define POSIX_SUPPORTED_COMPILE_EXTRA_OPTIONS (0)
|
||||||
|
|
||||||
#define POSIX_SUPPORTED_COMPILE_CONTROLS ( \
|
#define POSIX_SUPPORTED_COMPILE_CONTROLS ( \
|
||||||
CTL_AFTERTEXT|CTL_ALLAFTERTEXT|CTL_EXPAND|CTL_POSIX|CTL_POSIX_NOSUB)
|
CTL_AFTERTEXT|CTL_ALLAFTERTEXT|CTL_EXPAND|CTL_HEXPAT|CTL_POSIX| \
|
||||||
|
CTL_POSIX_NOSUB|CTL_USE_LENGTH)
|
||||||
|
|
||||||
#define POSIX_SUPPORTED_COMPILE_CONTROLS2 (0)
|
#define POSIX_SUPPORTED_COMPILE_CONTROLS2 (0)
|
||||||
|
|
||||||
|
@ -733,11 +734,9 @@ the first control word. Note that CTL_POSIX_NOSUB is always accompanied by
|
||||||
CTL_POSIX, so it doesn't need its own entries. */
|
CTL_POSIX, so it doesn't need its own entries. */
|
||||||
|
|
||||||
static uint32_t exclusive_pat_controls[] = {
|
static uint32_t exclusive_pat_controls[] = {
|
||||||
CTL_POSIX | CTL_HEXPAT,
|
|
||||||
CTL_POSIX | CTL_PUSH,
|
CTL_POSIX | CTL_PUSH,
|
||||||
CTL_POSIX | CTL_PUSHCOPY,
|
CTL_POSIX | CTL_PUSHCOPY,
|
||||||
CTL_POSIX | CTL_PUSHTABLESCOPY,
|
CTL_POSIX | CTL_PUSHTABLESCOPY,
|
||||||
CTL_POSIX | CTL_USE_LENGTH,
|
|
||||||
CTL_PUSH | CTL_PUSHCOPY,
|
CTL_PUSH | CTL_PUSHCOPY,
|
||||||
CTL_PUSH | CTL_PUSHTABLESCOPY,
|
CTL_PUSH | CTL_PUSHTABLESCOPY,
|
||||||
CTL_PUSHCOPY | CTL_PUSHTABLESCOPY,
|
CTL_PUSHCOPY | CTL_PUSHTABLESCOPY,
|
||||||
|
@ -896,7 +895,7 @@ static PCRE2_SIZE malloclistlength[MALLOCLISTSIZE];
|
||||||
static uint32_t malloclistptr = 0;
|
static uint32_t malloclistptr = 0;
|
||||||
|
|
||||||
#ifdef SUPPORT_PCRE2_8
|
#ifdef SUPPORT_PCRE2_8
|
||||||
static regex_t preg = { NULL, NULL, 0, 0, 0 };
|
static regex_t preg = { NULL, NULL, 0, 0, 0, 0 };
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
static int *dfa_workspace = NULL;
|
static int *dfa_workspace = NULL;
|
||||||
|
@ -5264,6 +5263,12 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
|
||||||
if ((pat_patctl.options & PCRE2_DOTALL) != 0) cflags |= REG_DOTALL;
|
if ((pat_patctl.options & PCRE2_DOTALL) != 0) cflags |= REG_DOTALL;
|
||||||
if ((pat_patctl.options & PCRE2_UNGREEDY) != 0) cflags |= REG_UNGREEDY;
|
if ((pat_patctl.options & PCRE2_UNGREEDY) != 0) cflags |= REG_UNGREEDY;
|
||||||
|
|
||||||
|
if ((pat_patctl.control & (CTL_HEXPAT|CTL_USE_LENGTH)) != 0)
|
||||||
|
{
|
||||||
|
preg.re_endp = (char *)pbuffer8 + patlen;
|
||||||
|
cflags |= REG_PEND;
|
||||||
|
}
|
||||||
|
|
||||||
rc = regcomp(&preg, (char *)pbuffer8, cflags);
|
rc = regcomp(&preg, (char *)pbuffer8, cflags);
|
||||||
|
|
||||||
/* Compiling failed */
|
/* Compiling failed */
|
||||||
|
|
|
@ -123,4 +123,10 @@
|
||||||
/^a\x{00}b$/posix
|
/^a\x{00}b$/posix
|
||||||
a\x{00}b\=posix_startend=0:3
|
a\x{00}b\=posix_startend=0:3
|
||||||
|
|
||||||
|
/"A" 00 "B"/hex
|
||||||
|
A\x{00}B\=posix_startend=0:3
|
||||||
|
|
||||||
|
/ABC/use_length
|
||||||
|
ABC
|
||||||
|
|
||||||
# End of testdata/testinput18
|
# End of testdata/testinput18
|
||||||
|
|
|
@ -15,4 +15,7 @@
|
||||||
/\w/ucp
|
/\w/ucp
|
||||||
+++\x{c2}
|
+++\x{c2}
|
||||||
|
|
||||||
|
/"^AB" 00 "\x{1234}$"/hex,utf
|
||||||
|
AB\x{00}\x{1234}\=posix_startend=0:6
|
||||||
|
|
||||||
# End of testdata/testinput19
|
# End of testdata/testinput19
|
||||||
|
|
|
@ -191,4 +191,12 @@ No match: POSIX code 17: match failed
|
||||||
a\x{00}b\=posix_startend=0:3
|
a\x{00}b\=posix_startend=0:3
|
||||||
0: a\x00b
|
0: a\x00b
|
||||||
|
|
||||||
|
/"A" 00 "B"/hex
|
||||||
|
A\x{00}B\=posix_startend=0:3
|
||||||
|
0: A\x00B
|
||||||
|
|
||||||
|
/ABC/use_length
|
||||||
|
ABC
|
||||||
|
0: ABC
|
||||||
|
|
||||||
# End of testdata/testinput18
|
# End of testdata/testinput18
|
||||||
|
|
|
@ -18,4 +18,8 @@ No match: POSIX code 17: match failed
|
||||||
+++\x{c2}
|
+++\x{c2}
|
||||||
0: \xc2
|
0: \xc2
|
||||||
|
|
||||||
|
/"^AB" 00 "\x{1234}$"/hex,utf
|
||||||
|
AB\x{00}\x{1234}\=posix_startend=0:6
|
||||||
|
0: AB\x{00}\x{1234}
|
||||||
|
|
||||||
# End of testdata/testinput19
|
# End of testdata/testinput19
|
||||||
|
|