File tidies, version updates, etc. for 10.21-RC1
parent
293da188aa
commit
dffd559601
41
NEWS
|
@ -1,6 +1,47 @@
|
||||||
News about PCRE2 releases
|
News about PCRE2 releases
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
|
Version 10.21 15-December-2015
|
||||||
|
------------------------------
|
||||||
|
|
||||||
|
1. Many bugs have been fixed. A large number of them were provoked only by very
|
||||||
|
strange pattern input, and were discovered by fuzzers. Some others were
|
||||||
|
discovered by code auditing. See ChangeLog for details.
|
||||||
|
|
||||||
|
2. The Unicode tables have been updated to Unicode version 8.0.0.
|
||||||
|
|
||||||
|
3. For Perl compatibility in EBCDIC environments, ranges such as a-z in a
|
||||||
|
class, where both values are literal letters in the same case, omit the
|
||||||
|
non-letter EBCDIC code points within the range.
|
||||||
|
|
||||||
|
4. There have been a number of enhancements to the pcre2_substitute() function,
|
||||||
|
giving more flexibility to replacement facilities. It is now also possible to
|
||||||
|
cause the function to return the needed buffer size if the one given is too
|
||||||
|
small.
|
||||||
|
|
||||||
|
5. The PCRE2_ALT_VERBNAMES option causes the "name" parts of special verbs such
|
||||||
|
as (*THEN:name) to be processed for backslashes and to take note of
|
||||||
|
PCRE2_EXTENDED.
|
||||||
|
|
||||||
|
6. PCRE2_INFO_HASBACKSLASHC makes it possible for a client to find out if a
|
||||||
|
pattern uses \C, and --never-backslash-C makes it possible to compile a version
|
||||||
|
of PCRE2 in which the use of \C is always forbidden.
|
||||||
|
|
||||||
|
7. A limit to the length of pattern that can be handled can now be set by
|
||||||
|
calling pcre2_set_max_pattern_length().
|
||||||
|
|
||||||
|
8. When matching an unanchored pattern, a match can be required to begin within
|
||||||
|
a given number of code units after the start of the subject by calling
|
||||||
|
pcre2_set_offset_limit().
|
||||||
|
|
||||||
|
9. The pcre2test program has been extended to test new facilities, and it can
|
||||||
|
now run the tests when LF on its own is not a valid newline sequence.
|
||||||
|
|
||||||
|
10. The RunTest script has also been updated to enable more tests to be run.
|
||||||
|
|
||||||
|
11. There have been some minor performance enhancements.
|
||||||
|
|
||||||
|
|
||||||
Version 10.20 30-June-2015
|
Version 10.20 30-June-2015
|
||||||
--------------------------
|
--------------------------
|
||||||
|
|
||||||
|
|
10
configure.ac
|
@ -11,16 +11,16 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
|
||||||
m4_define(pcre2_major, [10])
|
m4_define(pcre2_major, [10])
|
||||||
m4_define(pcre2_minor, [21])
|
m4_define(pcre2_minor, [21])
|
||||||
m4_define(pcre2_prerelease, [-RC1])
|
m4_define(pcre2_prerelease, [-RC1])
|
||||||
m4_define(pcre2_date, [2015-07-06])
|
m4_define(pcre2_date, [2015-12-15])
|
||||||
|
|
||||||
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
||||||
# 50 lines of this file. Please update that if the variables above are moved.
|
# 50 lines of this file. Please update that if the variables above are moved.
|
||||||
|
|
||||||
# Libtool shared library interface versions (current:revision:age)
|
# Libtool shared library interface versions (current:revision:age)
|
||||||
m4_define(libpcre2_8_version, [2:0:2])
|
m4_define(libpcre2_8_version, [3:0:3])
|
||||||
m4_define(libpcre2_16_version, [2:0:2])
|
m4_define(libpcre2_16_version, [3:0:3])
|
||||||
m4_define(libpcre2_32_version, [2:0:2])
|
m4_define(libpcre2_32_version, [3:0:3])
|
||||||
m4_define(libpcre2_posix_version, [0:0:0])
|
m4_define(libpcre2_posix_version, [0:1:0])
|
||||||
|
|
||||||
AC_PREREQ(2.57)
|
AC_PREREQ(2.57)
|
||||||
AC_INIT(PCRE2, pcre2_major.pcre2_minor[]pcre2_prerelease, , pcre2)
|
AC_INIT(PCRE2, pcre2_major.pcre2_minor[]pcre2_prerelease, , pcre2)
|
||||||
|
|
|
@ -42,19 +42,20 @@ request are as follows:
|
||||||
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
|
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
|
||||||
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
|
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
|
||||||
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
|
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
|
||||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
|
||||||
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
|
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
|
||||||
0 nothing set
|
0 nothing set
|
||||||
1 first code unit is set
|
1 first code unit is set
|
||||||
2 start of string or after newline
|
2 start of string or after newline
|
||||||
|
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||||
|
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \C
|
||||||
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
|
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
|
||||||
exist in the pattern
|
exist in the pattern
|
||||||
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
|
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
|
||||||
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
|
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
|
||||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
|
||||||
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
|
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
|
||||||
0 nothing set
|
0 nothing set
|
||||||
1 code unit is set
|
1 code unit is set
|
||||||
|
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||||
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
|
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
|
||||||
empty string, 0 otherwise
|
empty string, 0 otherwise
|
||||||
PCRE2_INFO_MATCHLIMIT Match limit if set,
|
PCRE2_INFO_MATCHLIMIT Match limit if set,
|
||||||
|
@ -62,8 +63,8 @@ request are as follows:
|
||||||
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
|
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
|
||||||
lookbehind assertion
|
lookbehind assertion
|
||||||
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
|
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
|
||||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
|
||||||
PCRE2_INFO_NAMECOUNT Number of named subpatterns
|
PCRE2_INFO_NAMECOUNT Number of named subpatterns
|
||||||
|
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||||
PCRE2_INFO_NAMETABLE Pointer to name table
|
PCRE2_INFO_NAMETABLE Pointer to name table
|
||||||
PCRE2_CONFIG_NEWLINE Code for the newline sequence:
|
PCRE2_CONFIG_NEWLINE Code for the newline sequence:
|
||||||
PCRE2_NEWLINE_CR
|
PCRE2_NEWLINE_CR
|
||||||
|
|
|
@ -70,6 +70,9 @@ The options are:
|
||||||
PCRE2_UTF was set at compile time)
|
PCRE2_UTF was set at compile time)
|
||||||
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
|
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
|
||||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||||
|
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
|
||||||
|
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
|
||||||
|
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
|
||||||
</pre>
|
</pre>
|
||||||
The function returns the number of substitutions, which may be zero if there
|
The function returns the number of substitutions, which may be zero if there
|
||||||
were no matches. The result can be greater than one only when
|
were no matches. The result can be greater than one only when
|
||||||
|
|
|
@ -716,8 +716,8 @@ of the following match-time parameters:
|
||||||
<pre>
|
<pre>
|
||||||
A callout function
|
A callout function
|
||||||
The offset limit for matching an unanchored pattern
|
The offset limit for matching an unanchored pattern
|
||||||
The limit for calling <i>match()</i>
|
The limit for calling <b>match()</b> (see below)
|
||||||
The limit for calling <i>match()</i> recursively
|
The limit for calling <b>match()</b> recursively
|
||||||
</pre>
|
</pre>
|
||||||
A match context is also required if you are using custom memory management.
|
A match context is also required if you are using custom memory management.
|
||||||
If none of these apply, just pass NULL as the context argument of
|
If none of these apply, just pass NULL as the context argument of
|
||||||
|
@ -771,7 +771,9 @@ PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
|
||||||
<P>
|
<P>
|
||||||
The offset limit facility can be used to track progress when searching large
|
The offset limit facility can be used to track progress when searching large
|
||||||
subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
|
subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
|
||||||
start within the first line of the subject.
|
start within the first line of the subject. If this is set with an offset
|
||||||
|
limit, a match must occur in the first line and also within the offset limit.
|
||||||
|
In other words, whichever limit comes first is used.
|
||||||
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||||
<b> uint32_t <i>value</i>);</b>
|
<b> uint32_t <i>value</i>);</b>
|
||||||
<br>
|
<br>
|
||||||
|
@ -1212,7 +1214,9 @@ built.
|
||||||
If this option is set, an unanchored pattern is required to match before or at
|
If this option is set, an unanchored pattern is required to match before or at
|
||||||
the first newline in the subject string, though the matched text may continue
|
the first newline in the subject string, though the matched text may continue
|
||||||
over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
|
over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
|
||||||
general limiting facility.
|
general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a
|
||||||
|
match must occur in the first line and also within the offset limit. In other
|
||||||
|
words, whichever limit comes first is used.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_MATCH_UNSET_BACKREF
|
PCRE2_MATCH_UNSET_BACKREF
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1563,11 +1567,10 @@ are as follows:
|
||||||
Return a copy of the pattern's options. The third argument should point to a
|
Return a copy of the pattern's options. The third argument should point to a
|
||||||
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
|
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
|
||||||
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
|
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
|
||||||
the compile options as modified by any top-level option settings at the start
|
the compile options as modified by any top-level option settings such as (*UTF)
|
||||||
of the pattern itself. In other words, they are the options that will be in
|
at the start of the pattern itself. For example, if the pattern /(*UTF)abc/ is
|
||||||
force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is
|
compiled with the PCRE2_EXTENDED option, the result is PCRE2_EXTENDED and
|
||||||
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
PCRE2_UTF.
|
||||||
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||||
|
@ -1609,18 +1612,27 @@ matches only CR, LF, or CRLF.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_CAPTURECOUNT
|
PCRE2_INFO_CAPTURECOUNT
|
||||||
</pre>
|
</pre>
|
||||||
Return the number of capturing subpatterns in the pattern. The third argument
|
Return the highest capturing subpattern number in the pattern. In patterns
|
||||||
should point to an <b>uint32_t</b> variable.
|
where (?| is not used, this is also the total number of capturing subpatterns.
|
||||||
|
The third argument should point to an <b>uint32_t</b> variable.
|
||||||
|
<pre>
|
||||||
|
PCRE2_INFO_FIRSTBITMAP
|
||||||
|
</pre>
|
||||||
|
In the absence of a single first code unit for a non-anchored pattern,
|
||||||
|
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
|
||||||
|
values for the first code unit in any match. For example, a pattern that starts
|
||||||
|
with [abc] results in a table with three bits set. When code unit values
|
||||||
|
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
||||||
|
value 255 or above". If such a table was constructed, a pointer to it is
|
||||||
|
returned. Otherwise NULL is returned. The third argument should point to an
|
||||||
|
<b>const uint8_t *</b> variable.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_FIRSTCODETYPE
|
PCRE2_INFO_FIRSTCODETYPE
|
||||||
</pre>
|
</pre>
|
||||||
Return information about the first code unit of any matched string, for a
|
Return information about the first code unit of any matched string, for a
|
||||||
non-anchored pattern. The third argument should point to an <b>uint32_t</b>
|
non-anchored pattern. The third argument should point to an <b>uint32_t</b>
|
||||||
variable.
|
variable. If there is a fixed first value, for example, the letter "c" from a
|
||||||
</P>
|
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||||
<P>
|
|
||||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
|
||||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
|
||||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||||
it is known that a match can occur only at the start of the subject or
|
it is known that a match can occur only at the start of the subject or
|
||||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||||
|
@ -1635,16 +1647,10 @@ value is always less than 256. In the 16-bit library the value can be up to
|
||||||
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
|
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
|
||||||
and up to 0xffffffff when not using UTF-32 mode.
|
and up to 0xffffffff when not using UTF-32 mode.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_FIRSTBITMAP
|
PCRE2_INFO_HASBACKSLASHC
|
||||||
</pre>
|
</pre>
|
||||||
In the absence of a single first code unit for a non-anchored pattern,
|
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
|
||||||
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
|
argument should point to an <b>uint32_t</b> variable.
|
||||||
values for the first code unit in any match. For example, a pattern that starts
|
|
||||||
with [abc] results in a table with three bits set. When code unit values
|
|
||||||
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
|
||||||
value 255 or above". If such a table was constructed, a pointer to it is
|
|
||||||
returned. Otherwise NULL is returned. The third argument should point to an
|
|
||||||
<b>const uint8_t *</b> variable.
|
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_HASCRORLF
|
PCRE2_INFO_HASCRORLF
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1670,13 +1676,10 @@ Returns 1 if there is a rightmost literal code unit that must exist in any
|
||||||
matched string, other than at its start. The third argument should point to an
|
matched string, other than at its start. The third argument should point to an
|
||||||
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
|
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
|
||||||
returned, the code unit value itself can be retrieved using
|
returned, the code unit value itself can be retrieved using
|
||||||
PCRE2_INFO_LASTCODEUNIT.
|
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
||||||
</P>
|
recorded only if it follows something of variable length. For example, for the
|
||||||
<P>
|
pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
|
||||||
For anchored patterns, a last literal value is recorded only if it follows
|
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
|
||||||
something of variable length. For example, for the pattern /^a\d+z\d+/ the
|
|
||||||
returned value is 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for
|
|
||||||
/^a\dz\d/ the returned value is 0.
|
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_LASTCODEUNIT
|
PCRE2_INFO_LASTCODEUNIT
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1687,8 +1690,11 @@ value, 0 is returned.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_MATCHEMPTY
|
PCRE2_INFO_MATCHEMPTY
|
||||||
</pre>
|
</pre>
|
||||||
Return 1 if the pattern can match an empty string, otherwise 0. The third
|
Return 1 if the pattern might match an empty string, otherwise 0. The third
|
||||||
argument should point to an <b>uint32_t</b> variable.
|
argument should point to an <b>uint32_t</b> variable. When a pattern contains
|
||||||
|
recursive subroutine calls it is not always possible to determine whether or
|
||||||
|
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
|
||||||
|
in such cases.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_MATCHLIMIT
|
PCRE2_INFO_MATCHLIMIT
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -2142,8 +2148,13 @@ documentation.
|
||||||
When PCRE2 is built, a default newline convention is set; this is usually the
|
When PCRE2 is built, a default newline convention is set; this is usually the
|
||||||
standard convention for the operating system. The default can be overridden in
|
standard convention for the operating system. The default can be overridden in
|
||||||
a
|
a
|
||||||
<a href="#compilecontext">compile context.</a>
|
<a href="#compilecontext">compile context</a>
|
||||||
During matching, the newline choice affects the behaviour of the dot,
|
by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a
|
||||||
|
pattern string with, for example, (*CRLF), as described in the
|
||||||
|
<a href="pcre2pattern.html#newlines">section on newline conventions</a>
|
||||||
|
in the
|
||||||
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
|
page. During matching, the newline choice affects the behaviour of the dot,
|
||||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||||
starting position is advanced after a match failure for an unanchored pattern.
|
starting position is advanced after a match failure for an unanchored pattern.
|
||||||
</P>
|
</P>
|
||||||
|
@ -2191,19 +2202,20 @@ function can be used to find out how many capturing subpatterns there are in a
|
||||||
compiled pattern.
|
compiled pattern.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A successful match returns the overall matched string and any captured
|
You can use auxiliary functions for accessing captured substrings
|
||||||
substrings to the caller via a vector of PCRE2_SIZE values. This is called the
|
|
||||||
<b>ovector</b>, and is contained within the
|
|
||||||
<a href="#matchdatablock">match data block.</a>
|
|
||||||
You can obtain direct access to the ovector by calling
|
|
||||||
<b>pcre2_get_ovector_pointer()</b> to find its address, and
|
|
||||||
<b>pcre2_get_ovector_count()</b> to find the number of pairs of values it
|
|
||||||
contains. Alternatively, you can use the auxiliary functions for accessing
|
|
||||||
captured substrings
|
|
||||||
<a href="#extractbynumber">by number</a>
|
<a href="#extractbynumber">by number</a>
|
||||||
or
|
or
|
||||||
<a href="#extractbyname">by name</a>
|
<a href="#extractbyname">by name,</a>
|
||||||
(see below).
|
as described in sections below.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
|
||||||
|
called the <b>ovector</b>, which contains the offsets of captured strings. It is
|
||||||
|
part of the
|
||||||
|
<a href="#matchdatablock">match data block.</a>
|
||||||
|
The function <b>pcre2_get_ovector_pointer()</b> returns the address of the
|
||||||
|
ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of
|
||||||
|
values it contains.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Within the ovector, the first in each pair of values is set to the offset of
|
Within the ovector, the first in each pair of values is set to the offset of
|
||||||
|
@ -2292,7 +2304,13 @@ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
|
||||||
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
||||||
<b>pcre2_get_mark()</b> can be called. It returns a pointer to the
|
<b>pcre2_get_mark()</b> can be called. It returns a pointer to the
|
||||||
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
||||||
returned. After a successful match, the (*MARK) name that is returned is the
|
returned. The length of the (*MARK) name (excluding the terminating zero) is
|
||||||
|
stored in the code unit that precedes the name. You should use this instead of
|
||||||
|
relying on the terminating zero if the (*MARK) name might contain a binary
|
||||||
|
zero.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
After a successful match, the (*MARK) name that is returned is the
|
||||||
last one encountered on the matching path through the pattern. After a "no
|
last one encountered on the matching path through the pattern. After a "no
|
||||||
match" or a partial match, the last encountered (*MARK) name is returned. For
|
match" or a partial match, the last encountered (*MARK) name is returned. For
|
||||||
example, consider this pattern:
|
example, consider this pattern:
|
||||||
|
@ -2313,7 +2331,7 @@ escape sequence. After a partial match, however, this value is always the same
|
||||||
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
|
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
After a UTF check failure, \fBpcre2_get_startchar()\fB can be used to obtain
|
After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
|
||||||
the code unit offset of the invalid UTF character. Details are given in the
|
the code unit offset of the invalid UTF character. Details are given in the
|
||||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||||
page.
|
page.
|
||||||
|
@ -2650,12 +2668,21 @@ allocate memory for the compiled code.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <i>outlengthptr</i> argument must point to a variable that contains the
|
The <i>outlengthptr</i> argument must point to a variable that contains the
|
||||||
length, in code units, of the output buffer. If the function is successful,
|
length, in code units, of the output buffer. If the function is successful, the
|
||||||
the value is updated to contain the length of the new string, excluding the
|
value is updated to contain the length of the new string, excluding the
|
||||||
trailing zero that is automatically added. If the function is not successful,
|
trailing zero that is automatically added.
|
||||||
the value is set to PCRE2_UNSET for general errors (such as output buffer too
|
</P>
|
||||||
small). For syntax errors in the replacement string, the value is set to the
|
<P>
|
||||||
offset in the replacement string where the error was detected.
|
If the function is not successful, the value set via <i>outlengthptr</i> depends
|
||||||
|
on the type of error. For syntax errors in the replacement string, the value is
|
||||||
|
the offset in the replacement string where the error was detected. For other
|
||||||
|
errors, the value is PCRE2_UNSET by default. This includes the case of the
|
||||||
|
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
|
||||||
|
(see below), in which case the value is the minimum length needed, including
|
||||||
|
space for the trailing zero. Note that in order to compute the required length,
|
||||||
|
<b>pcre2_substitute()</b> has to simulate all the matching and copying, instead
|
||||||
|
of giving an error return as soon as the buffer overflows. Note also that the
|
||||||
|
length is in code units, not bytes.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In the replacement string, which is interpreted as a UTF string in UTF mode,
|
In the replacement string, which is interpreted as a UTF string in UTF mode,
|
||||||
|
@ -2682,15 +2709,53 @@ simultaneous substitutions, as this <b>pcre2test</b> example shows:
|
||||||
apple lemon
|
apple lemon
|
||||||
2: pear orange
|
2: pear orange
|
||||||
</pre>
|
</pre>
|
||||||
There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes the
|
As well as the usual options for <b>pcre2_match()</b>, a number of additional
|
||||||
function to iterate over the subject string, replacing every matching
|
options can be set in the <i>options</i> argument.
|
||||||
substring. If this is not set, only the first matching substring is replaced.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra processing
|
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
|
||||||
to be applied to the replacement string. Without this option, only the dollar
|
replacing every matching substring. If this is not set, only the first matching
|
||||||
character is special, and only the group insertion forms listed above are
|
substring is replaced. If any matched substring has zero length, after the
|
||||||
valid. When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
substitution has happened, an attempt to find a non-empty match at the same
|
||||||
|
position is performed. If this is not successful, the current position is
|
||||||
|
advanced by one character except when CRLF is a valid newline sequence and the
|
||||||
|
next two characters are CR, LF. In this case, the current position is advanced
|
||||||
|
by two characters.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
|
||||||
|
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
|
||||||
|
this option is set, however, <b>pcre2_substitute()</b> continues to go through
|
||||||
|
the motions of matching and substituting (without, of course, writing anything)
|
||||||
|
in order to compute the size of buffer that is needed. This value is passed
|
||||||
|
back via the <i>outlengthptr</i> variable, with the result of the function still
|
||||||
|
being PCRE2_ERROR_NOMEMORY.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Passing a buffer size of zero is a permitted way of finding out how much memory
|
||||||
|
is needed for a given substitution. However, this does mean that the entire
|
||||||
|
operation is carried out twice. Depending on the application, it may be more
|
||||||
|
efficient to allocate a large buffer and free the excess afterwards, instead of
|
||||||
|
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups that do
|
||||||
|
not appear in the pattern to be treated as unset groups. This option should be
|
||||||
|
used with care, because it means that a typo in a group name or number no
|
||||||
|
longer causes the PCRE2_ERROR_NOSUBSTRING error.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown
|
||||||
|
groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
|
||||||
|
strings when inserted as described above. If this option is not set, an attempt
|
||||||
|
to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
|
||||||
|
not influence the extended substitution syntax described below.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
||||||
|
replacement string. Without this option, only the dollar character is special,
|
||||||
|
and only the group insertion forms listed above are valid. When
|
||||||
|
PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Firstly, backslash in a replacement string is interpreted as an escape
|
Firstly, backslash in a replacement string is interpreted as an escape
|
||||||
|
@ -2740,22 +2805,46 @@ string remains in force afterwards, as shown in this <b>pcre2test</b> example:
|
||||||
somebody
|
somebody
|
||||||
1: HELLO
|
1: HELLO
|
||||||
</pre>
|
</pre>
|
||||||
If successful, the function returns the number of replacements that were made.
|
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
||||||
This may be zero if no matches were found, and is never greater than 1 unless
|
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
|
||||||
PCRE2_SUBSTITUTE_GLOBAL is set.
|
groups in the extended syntax forms to be treated as unset.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If successful, <b>pcre2_substitute()</b> returns the number of replacements that
|
||||||
|
were made. This may be zero if no matches were found, and is never greater than
|
||||||
|
1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In the event of an error, a negative error code is returned. Except for
|
In the event of an error, a negative error code is returned. Except for
|
||||||
PCRE2_ERROR_NOMATCH (which is never returned), errors from <b>pcre2_match()</b>
|
PCRE2_ERROR_NOMATCH (which is never returned), errors from <b>pcre2_match()</b>
|
||||||
are passed straight back. PCRE2_ERROR_NOMEMORY is returned if the output buffer
|
are passed straight back.
|
||||||
is not big enough. PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax
|
</P>
|
||||||
errors in the replacement string, with more particular errors being
|
<P>
|
||||||
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
|
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
|
||||||
PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket not found),
|
unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
|
||||||
PCRE2_BADSUBSTITUTION (syntax error in extended group substitution), and
|
</P>
|
||||||
PCRE2_BADSUBPATTERN (the pattern match ended before it started). As for all
|
<P>
|
||||||
PCRE2 errors, a text message that describes the error can be obtained by
|
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
|
||||||
calling <b>pcre2_get_error_message()</b>.
|
unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
|
||||||
|
(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
|
||||||
|
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
|
||||||
|
needed is returned via <i>outlengthptr</i>. Note that this does not happen by
|
||||||
|
default.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
|
||||||
|
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
|
||||||
|
(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket
|
||||||
|
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
|
||||||
|
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it
|
||||||
|
started, which can happen if \K is used in an assertion).
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
As for all PCRE2 errors, a text message that describes the error can be
|
||||||
|
obtained by calling <b>pcre2_get_error_message()</b>.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC35" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
|
<br><a name="SEC35" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2796,11 +2885,11 @@ function returns the length of each entry in code units. In both cases,
|
||||||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The format of the name table is described above in the section entitled
|
The format of the name table is described
|
||||||
<i>Information about a pattern</i>
|
<a href="#infoaboutpattern">above</a>
|
||||||
<a href="#infoaboutpattern">above.</a>
|
in the section entitled <i>Information about a pattern</i>. Given all the
|
||||||
Given all the relevant entries for the name, you can extract each of their
|
relevant entries for the name, you can extract each of their numbers, and hence
|
||||||
numbers, and hence the captured data.
|
the captured data.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC36" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
|
<br><a name="SEC36" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -3032,7 +3121,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC40" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC40" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 05 November 2015
|
Last updated: 16 December 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -86,6 +86,13 @@ results. The returned value from <b>pcre2_jit_compile()</b> is zero on success,
|
||||||
or a negative error code.
|
or a negative error code.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
There is a limit to the size of pattern that JIT supports, imposed by the size
|
||||||
|
of machine stack that it uses. The exact rules are not documented because they
|
||||||
|
may change at any time, in particular, when new optimizations are introduced.
|
||||||
|
If a pattern is too big, a call to <b>pcre2_jit_compile()</b> returns
|
||||||
|
PCRE2_ERROR_NOMEMORY.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
|
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
|
||||||
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
|
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
|
||||||
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
|
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
|
||||||
|
@ -425,7 +432,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 28 July 2015
|
Last updated: 14 November 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -669,8 +669,8 @@ This is an example of an "atomic group", details of which are given
|
||||||
This particular group matches either the two-character sequence CR followed by
|
This particular group matches either the two-character sequence CR followed by
|
||||||
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
||||||
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
||||||
line, U+0085). The two-character sequence is treated as a single unit that
|
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||||
cannot be split.
|
treated as a single unit that cannot be split.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In other modes, two additional characters whose codepoints are greater than 255
|
In other modes, two additional characters whose codepoints are greater than 255
|
||||||
|
@ -1186,6 +1186,16 @@ when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
|
||||||
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
When the newline convention (see
|
||||||
|
<a href="#newlines">"Newline conventions"</a>
|
||||||
|
below) recognizes the two-character sequence CRLF as a newline, this is
|
||||||
|
preferred, even if the single characters CR and LF are also recognized as
|
||||||
|
newlines. For example, if the newline convention is "any", a multiline mode
|
||||||
|
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
|
||||||
|
CR, even though CR on its own is a valid newline. (It also matches at the very
|
||||||
|
start of the string, of course.)
|
||||||
|
</P>
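The CRLF preference described above can be sketched in a few lines of C. This is an illustration only, not PCRE2's implementation: the function name circumflex_matches_at() is hypothetical, the subject is assumed to be an 8-bit non-UTF string, and the "any" convention is approximated by the single characters LF, VT, FF, CR and NEL (0x85), leaving out the wider Unicode line separators.

```c
#include <stddef.h>

/* Single-character newlines recognized by this sketch (8-bit, non-UTF). */
static int is_single_newline(unsigned char c)
{
    return c == '\n' || c == '\v' || c == '\f' || c == '\r' || c == 0x85;
}

/* Hypothetical helper: return 1 if a multiline circumflex could match at
   offset pos in subject (length len), treating CRLF as one unit. */
int circumflex_matches_at(const char *subject, size_t len, size_t pos)
{
    const unsigned char *s = (const unsigned char *)subject;

    if (pos > len) return 0;   /* offset outside the subject */
    if (pos == 0) return 1;    /* the very start always qualifies */
    if (pos == len) return 0;  /* not after a newline that ends the subject */

    /* Between CR and LF: the pair is preferred as a single newline,
       so the position after the lone CR is not a line start. */
    if (s[pos - 1] == '\r' && s[pos] == '\n') return 0;

    return is_single_newline(s[pos - 1]);
}
```

For the subject "abc\r\nxyz", this sketch accepts offsets 0 and 5 (before "xyz") and rejects offset 4, which lies between CR and LF.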
|
||||||
|
<P>
|
||||||
Note that the sequences \A, \Z, and \z can be used to match the start and
|
Note that the sequences \A, \Z, and \z can be used to match the start and
|
||||||
end of the subject in both modes, and if all branches of a pattern start with
|
end of the subject in both modes, and if all branches of a pattern start with
|
||||||
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
|
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
|
||||||
|
@ -1672,6 +1682,10 @@ first one in the pattern with the given number. The following pattern matches
|
||||||
<pre>
|
<pre>
|
||||||
/(?|(abc)|(def))(?1)/
|
/(?|(abc)|(def))(?1)/
|
||||||
</pre>
|
</pre>
|
||||||
|
A relative reference such as (?-1) is no different: it is just a convenient way
|
||||||
|
of computing an absolute group number.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
If a
|
If a
|
||||||
<a href="#conditions">condition test</a>
|
<a href="#conditions">condition test</a>
|
||||||
for a subpattern's having matched refers to a non-unique number, the test is
|
for a subpattern's having matched refers to a non-unique number, the test is
|
||||||
|
@ -2626,6 +2640,21 @@ parentheses preceding the recursion. In other words, a negative number counts
|
||||||
capturing parentheses leftwards from the point at which it is encountered.
|
capturing parentheses leftwards from the point at which it is encountered.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
Be aware, however, that if
|
||||||
|
<a href="#dupsubpatternnumber">duplicate subpattern numbers</a>
|
||||||
|
are in use, relative references refer to the earliest subpattern with the
|
||||||
|
appropriate number. Consider, for example:
|
||||||
|
<pre>
|
||||||
|
(?|(a)|(b)) (c) (?-2)
|
||||||
|
</pre>
|
||||||
|
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
|
||||||
|
is number 2. When the reference (?-2) is encountered, the second most recently
|
||||||
|
opened parentheses has the number 1, but it is the first such group (the (a)
|
||||||
|
group) to which the recursion refers. This would be the same if an absolute
|
||||||
|
reference (?1) was used. In other words, relative references are just a
|
||||||
|
shorthand for computing a group number.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
It is also possible to refer to subsequently opened parentheses, by writing
|
It is also possible to refer to subsequently opened parentheses, by writing
|
||||||
references such as (?+2). However, these cannot be recursive because the
|
references such as (?+2). However, these cannot be recursive because the
|
||||||
reference is not inside the parentheses that are referenced. They are always
|
reference is not inside the parentheses that are referenced. They are always
|
||||||
|
@ -3359,7 +3388,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 01 November 2015
|
Last updated: 13 November 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -235,7 +235,8 @@ to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
|
||||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||||
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
||||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||||
how it is matched.
|
how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
|
||||||
|
mutually exclusive; doing both causes the error REG_INVARG to be returned.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||||
|
@ -289,7 +290,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 30 October 2015
|
Last updated: 29 November 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -900,6 +900,10 @@ compilation process.
|
||||||
mark show mark values
|
mark show mark values
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show starting character when relevant
|
startchar show starting character when relevant
|
||||||
|
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||||
|
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||||
|
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||||
|
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
</pre>
|
</pre>
|
||||||
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
||||||
defaults, set them in a <b>#subject</b> command.
|
defaults, set them in a <b>#subject</b> command.
|
||||||
|
@ -990,6 +994,11 @@ pattern.
|
||||||
recursion_limit=<n> set a recursion limit
|
recursion_limit=<n> set a recursion limit
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show startchar when relevant
|
startchar show startchar when relevant
|
||||||
|
startoffset=<n> same as offset=<n>
|
||||||
|
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||||
|
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||||
|
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||||
|
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
zero_terminate pass the subject as zero-terminated
|
zero_terminate pass the subject as zero-terminated
|
||||||
</pre>
|
</pre>
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
|
@ -1129,19 +1138,34 @@ Testing the substitution function
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
|
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
|
||||||
called instead of one of the matching functions. Unlike subject strings,
|
called instead of one of the matching functions. Note that replacement strings
|
||||||
<b>pcre2test</b> does not process replacement strings for escape sequences. In
|
cannot contain commas, because a comma signifies the end of a modifier. This is
|
||||||
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
|
not thought to be an issue in a test program.
|
||||||
If so, it is correctly converted to a UTF string of the appropriate code unit
|
|
||||||
width. If it is not a valid UTF-8 string, the individual code units are copied
|
|
||||||
directly. This provides a means of passing an invalid UTF-8 string for testing
|
|
||||||
purposes.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
|
||||||
<b>pcre2_substitute()</b>. After a successful substitution, the modified string
|
for escape sequences. In UTF mode, a replacement string is checked to see if it
|
||||||
is output, preceded by the number of replacements. This may be zero if there
|
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
|
||||||
were no matches. Here is a simple example of a substitution test:
|
the appropriate code unit width. If it is not a valid UTF-8 string, the
|
||||||
|
individual code units are copied directly. This provides a means of passing an
|
||||||
|
invalid UTF-8 string for testing purposes.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The following modifiers set options (in addition to the normal match options)
|
||||||
|
for <b>pcre2_substitute()</b>:
|
||||||
|
<pre>
|
||||||
|
global PCRE2_SUBSTITUTE_GLOBAL
|
||||||
|
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||||
|
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||||
|
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||||
|
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
|
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
After a successful substitution, the modified string is output, preceded by the
|
||||||
|
number of replacements. This may be zero if there were no matches. Here is a
|
||||||
|
simple example of a substitution test:
|
||||||
<pre>
|
<pre>
|
||||||
/abc/replace=xxx
|
/abc/replace=xxx
|
||||||
=abc=abc=
|
=abc=abc=
|
||||||
|
@ -1149,12 +1173,12 @@ were no matches. Here is a simple example of a substitution test:
|
||||||
=abc=abc=\=global
|
=abc=abc=\=global
|
||||||
2: =xxx=xxx=
|
2: =xxx=xxx=
|
||||||
</pre>
|
</pre>
|
||||||
Subject and replacement strings should be kept relatively short for
|
Subject and replacement strings should be kept relatively short (fewer than 256
|
||||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
characters) for substitution tests, as fixed-size buffers are used. To make it
|
||||||
buffer overflow, if the replacement string starts with a number in square
|
easy to test for buffer overflow, if the replacement string starts with a
|
||||||
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
|
number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
|
||||||
output buffer, with the replacement string starting at the next character. Here
|
the size of the output buffer, with the replacement string starting at the next
|
||||||
is an example that tests the edge case:
|
character. Here is an example that tests the edge case:
|
||||||
<pre>
|
<pre>
|
||||||
/abc/
|
/abc/
|
||||||
123abc123\=replace=[10]XYZ
|
123abc123\=replace=[10]XYZ
|
||||||
|
@ -1162,6 +1186,19 @@ is an example that tests the edge case:
|
||||||
123abc123\=replace=[9]XYZ
|
123abc123\=replace=[9]XYZ
|
||||||
Failed: error -47: no more memory
|
Failed: error -47: no more memory
|
||||||
</pre>
|
</pre>
|
||||||
|
The default action of <b>pcre2_substitute()</b> is to return
|
||||||
|
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
|
||||||
|
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
|
||||||
|
<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
|
||||||
|
to go through the motions of matching and substituting, in order to compute the
|
||||||
|
size of buffer that is required. When this happens, <b>pcre2test</b> shows the
|
||||||
|
required buffer length (which includes space for the trailing zero) as part of
|
||||||
|
the error message. For example:
|
||||||
|
<pre>
|
||||||
|
/abc/substitute_overflow_length
|
||||||
|
123abc123\=replace=[9]XYZ
|
||||||
|
Failed: error -47: no more memory: 10 code units are needed
|
||||||
|
</pre>
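The overflow behaviour can be illustrated without PCRE2 at all. The sketch below is a stand-in, not the library's code: replace_first() and SKETCH_NOMEMORY are hypothetical names, the "pattern" is a plain literal string, and SIZE_MAX plays the role that PCRE2_UNSET plays in the real API. It shows only the contract: on overflow the caller either learns nothing about the size or, when the reporting flag is set, receives the required length including the trailing zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NOMEMORY (-1)  /* hypothetical error value, for this sketch only */

/* Replace the first occurrence of needle in subject, writing to out.
   On entry *outlen is the buffer size; on success it becomes the result
   length excluding the trailing zero. Returns the replacement count. */
int replace_first(const char *subject, const char *needle,
                  const char *replacement, char *out, size_t *outlen,
                  int report_needed)
{
    size_t slen = strlen(subject);
    size_t nlen = strlen(needle);
    size_t rlen = strlen(replacement);
    const char *hit = (nlen > 0) ? strstr(subject, needle) : NULL;

    /* Size of the result, including space for the trailing zero. */
    size_t needed = (hit != NULL) ? slen - nlen + rlen + 1 : slen + 1;

    if (needed > *outlen) {
        /* Report the required size only when asked to, mirroring the
           "not by default" rule described in the text. */
        *outlen = report_needed ? needed : SIZE_MAX;
        return SKETCH_NOMEMORY;
    }
    if (hit == NULL) {
        memcpy(out, subject, slen + 1);
        *outlen = slen;
        return 0;
    }

    size_t prefix = (size_t)(hit - subject);
    memcpy(out, subject, prefix);
    memcpy(out + prefix, replacement, rlen);
    strcpy(out + prefix + rlen, hit + nlen);
    *outlen = needed - 1;
    return 1;
}
```

With the subject "123abc123", needle "abc", replacement "XYZ" and a 9-unit buffer, the sketch fails and reports 10 when the flag is set, which is the same figure as in the pcre2test output above.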
|
||||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||||
matching provokes an error return ("bad option value") from
|
matching provokes an error return ("bad option value") from
|
||||||
<b>pcre2_substitute()</b>.
|
<b>pcre2_substitute()</b>.
|
||||||
|
@ -1623,7 +1660,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 05 November 2015
|
Last updated: 12 December 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
275
doc/pcre2.txt
|
@ -774,7 +774,7 @@ PCRE2 CONTEXTS
|
||||||
|
|
||||||
A callout function
|
A callout function
|
||||||
The offset limit for matching an unanchored pattern
|
The offset limit for matching an unanchored pattern
|
||||||
The limit for calling match()
|
The limit for calling match() (see below)
|
||||||
The limit for calling match() recursively
|
The limit for calling match() recursively
|
||||||
|
|
||||||
A match context is also required if you are using custom memory manage-
|
A match context is also required if you are using custom memory manage-
|
||||||
|
@ -824,7 +824,10 @@ PCRE2 CONTEXTS
|
||||||
|
|
||||||
The offset limit facility can be used to track progress when searching
|
The offset limit facility can be used to track progress when searching
|
||||||
large subject strings. See also the PCRE2_FIRSTLINE option, which
|
large subject strings. See also the PCRE2_FIRSTLINE option, which
|
||||||
requires a match to start within the first line of the subject.
|
requires a match to start within the first line of the subject. If this
|
||||||
|
is set with an offset limit, a match must occur in the first line and
|
||||||
|
also within the offset limit. In other words, whichever limit comes
|
||||||
|
first is used.
|
||||||
|
|
||||||
int pcre2_set_match_limit(pcre2_match_context *mcontext,
|
int pcre2_set_match_limit(pcre2_match_context *mcontext,
|
||||||
uint32_t value);
|
uint32_t value);
|
||||||
|
@ -1251,7 +1254,10 @@ COMPILING A PATTERN
|
||||||
If this option is set, an unanchored pattern is required to match
|
If this option is set, an unanchored pattern is required to match
|
||||||
before or at the first newline in the subject string, though the
|
before or at the first newline in the subject string, though the
|
||||||
matched text may continue over the newline. See also PCRE2_USE_OFF-
|
matched text may continue over the newline. See also PCRE2_USE_OFF-
|
||||||
SET_LIMIT, which provides a more general limiting facility.
|
SET_LIMIT, which provides a more general limiting facility. If
|
||||||
|
PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the
|
||||||
|
first line and also within the offset limit. In other words, whichever
|
||||||
|
limit comes first is used.
|
||||||
|
|
||||||
PCRE2_MATCH_UNSET_BACKREF
|
PCRE2_MATCH_UNSET_BACKREF
|
||||||
|
|
||||||
|
@ -1590,11 +1596,9 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
|
to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
|
||||||
options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
|
options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
|
||||||
TIONS returns the compile options as modified by any top-level option
|
TIONS returns the compile options as modified by any top-level option
|
||||||
settings at the start of the pattern itself. In other words, they are
|
settings such as (*UTF) at the start of the pattern itself. For exam-
|
||||||
the options that will be in force when matching starts. For example, if
|
ple, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
|
||||||
the pattern /(?im)abc(?-i)d/ is compiled with the PCRE2_EXTENDED
|
option, the result is PCRE2_EXTENDED and PCRE2_UTF.
|
||||||
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
|
|
||||||
PCRE2_EXTENDED.
|
|
||||||
|
|
||||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
|
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
|
||||||
PCRE2 if the first significant item in every top-level branch is one of
|
PCRE2 if the first significant item in every top-level branch is one of
|
||||||
|
@ -1638,20 +1642,30 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
|
|
||||||
PCRE2_INFO_CAPTURECOUNT
|
PCRE2_INFO_CAPTURECOUNT
|
||||||
|
|
||||||
Return the number of capturing subpatterns in the pattern. The third
|
Return the highest capturing subpattern number in the pattern. In pat-
|
||||||
argument should point to an uint32_t variable.
|
terns where (?| is not used, this is also the total number of capturing
|
||||||
|
subpatterns. The third argument should point to an uint32_t variable.
|
||||||
|
|
||||||
|
PCRE2_INFO_FIRSTBITMAP
|
||||||
|
|
||||||
|
In the absence of a single first code unit for a non-anchored pattern,
|
||||||
|
pcre2_compile() may construct a 256-bit table that defines a fixed set
|
||||||
|
of values for the first code unit in any match. For example, a pattern
|
||||||
|
that starts with [abc] results in a table with three bits set. When
|
||||||
|
code unit values greater than 255 are supported, the flag bit for 255
|
||||||
|
means "any code unit of value 255 or above". If such a table was con-
|
||||||
|
structed, a pointer to it is returned. Otherwise NULL is returned. The
|
||||||
|
third argument should point to an const uint8_t * variable.
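A 256-bit table of this kind occupies 32 bytes. The following sketch shows one plausible way to build and query such a table; the helper names are hypothetical, and the bit order within each byte (least significant bit first) is an assumption made for illustration rather than something stated in the text above.

```c
#include <stdint.h>

/* Set the bit for code unit value c. Values above 255 share bit 255,
   matching the "255 or above" rule in the text. */
void bitmap_set(uint8_t table[32], uint32_t c)
{
    if (c > 255) c = 255;
    table[c / 8] |= (uint8_t)(1u << (c % 8));  /* assumed LSB-first layout */
}

/* Return 1 if a match could start with code unit value c. */
int bitmap_test(const uint8_t table[32], uint32_t c)
{
    if (c > 255) c = 255;
    return (table[c / 8] >> (c % 8)) & 1;
}

/* Count how many of the 256 bits are set. */
int bitmap_count(const uint8_t table[32])
{
    int count = 0;
    for (int i = 0; i < 32; i++)
        for (int bit = 0; bit < 8; bit++)
            count += (table[i] >> bit) & 1;
    return count;
}
```

Setting the bits for 'a', 'b' and 'c' in a zeroed table leaves exactly three bits set, which corresponds to the [abc] example in the paragraph above.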
|
||||||
|
|
||||||
PCRE2_INFO_FIRSTCODETYPE
|
PCRE2_INFO_FIRSTCODETYPE
|
||||||
|
|
||||||
Return information about the first code unit of any matched string, for
|
Return information about the first code unit of any matched string, for
|
||||||
a non-anchored pattern. The third argument should point to an uint32_t
|
a non-anchored pattern. The third argument should point to an uint32_t
|
||||||
variable.
|
variable. If there is a fixed first value, for example, the letter "c"
|
||||||
|
from a pattern such as (cat|cow|coyote), 1 is returned, and the charac-
|
||||||
If there is a fixed first value, for example, the letter "c" from a
|
ter value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is
|
||||||
pattern such as (cat|cow|coyote), 1 is returned, and the character
|
no fixed first value, but it is known that a match can occur only at
|
||||||
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
|
the start of the subject or following a newline in the subject, 2 is
|
||||||
fixed first value, but it is known that a match can occur only at the
|
|
||||||
start of the subject or following a newline in the subject, 2 is
|
|
||||||
returned. Otherwise, and for anchored patterns, 0 is returned.
|
returned. Otherwise, and for anchored patterns, 0 is returned.
|
||||||
|
|
||||||
PCRE2_INFO_FIRSTCODEUNIT
|
PCRE2_INFO_FIRSTCODEUNIT
|
||||||
|
@ -1664,16 +1678,10 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
|
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
|
||||||
mode.
|
mode.
|
||||||
|
|
||||||
PCRE2_INFO_FIRSTBITMAP
|
PCRE2_INFO_HASBACKSLASHC
|
||||||
|
|
||||||
In the absence of a single first code unit for a non-anchored pattern,
|
Return 1 if the pattern contains any instances of \C, otherwise 0. The
|
||||||
pcre2_compile() may construct a 256-bit table that defines a fixed set
|
third argument should point to an uint32_t variable.
|
||||||
of values for the first code unit in any match. For example, a pattern
|
|
||||||
that starts with [abc] results in a table with three bits set. When
|
|
||||||
code unit values greater than 255 are supported, the flag bit for 255
|
|
||||||
means "any code unit of value 255 or above". If such a table was con-
|
|
||||||
structed, a pointer to it is returned. Otherwise NULL is returned. The
|
|
||||||
third argument should point to an const uint8_t * variable.
|
|
||||||
|
|
||||||
PCRE2_INFO_HASCRORLF
|
PCRE2_INFO_HASCRORLF
|
||||||
|
|
||||||
|
@ -1701,12 +1709,11 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
any matched string, other than at its start. The third argument should
|
any matched string, other than at its start. The third argument should
|
||||||
point to an uint32_t variable. If there is no such value, 0 is
|
point to an uint32_t variable. If there is no such value, 0 is
|
||||||
returned. When 1 is returned, the code unit value itself can be
|
returned. When 1 is returned, the code unit value itself can be
|
||||||
retrieved using PCRE2_INFO_LASTCODEUNIT.
|
retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
|
||||||
|
literal value is recorded only if it follows something of variable
|
||||||
For anchored patterns, a last literal value is recorded only if it fol-
|
length. For example, for the pattern /^a\d+z\d+/ the returned value is
|
||||||
lows something of variable length. For example, for the pattern
|
1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
|
||||||
/^a\d+z\d+/ the returned value is 1 (with "z" returned from
|
the returned value is 0.
|
||||||
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
|
|
||||||
|
|
||||||
PCRE2_INFO_LASTCODEUNIT
|
PCRE2_INFO_LASTCODEUNIT
|
||||||
|
|
||||||
|
@ -1717,8 +1724,11 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
|
|
||||||
PCRE2_INFO_MATCHEMPTY
|
PCRE2_INFO_MATCHEMPTY
|
||||||
|
|
||||||
Return 1 if the pattern can match an empty string, otherwise 0. The
|
Return 1 if the pattern might match an empty string, otherwise 0. The
|
||||||
third argument should point to an uint32_t variable.
|
third argument should point to an uint32_t variable. When a pattern
|
||||||
|
contains recursive subroutine calls it is not always possible to deter-
|
||||||
|
mine whether or not it can match an empty string. PCRE2 takes a cau-
|
||||||
|
tious approach and returns 1 in such cases.
|
||||||
|
|
||||||
PCRE2_INFO_MATCHLIMIT
|
PCRE2_INFO_MATCHLIMIT
|
||||||
|
|
||||||
|
@ -2142,10 +2152,13 @@ NEWLINE HANDLING WHEN MATCHING
|
||||||
|
|
||||||
When PCRE2 is built, a default newline convention is set; this is usu-
|
When PCRE2 is built, a default newline convention is set; this is usu-
|
||||||
ally the standard convention for the operating system. The default can
|
ally the standard convention for the operating system. The default can
|
||||||
be overridden in a compile context. During matching, the newline
|
be overridden in a compile context by calling pcre2_set_newline(). It
|
||||||
choice affects the behaviour of the dot, circumflex, and dollar
|
can also be overridden by starting a pattern string with, for example,
|
||||||
metacharacters. It may also alter the way the match starting position
|
(*CRLF), as described in the section on newline conventions in the
|
||||||
is advanced after a match failure for an unanchored pattern.
|
pcre2pattern page. During matching, the newline choice affects the be-
|
||||||
|
haviour of the dot, circumflex, and dollar metacharacters. It may also
|
||||||
|
alter the way the match starting position is advanced after a match
|
||||||
|
failure for an unanchored pattern.
|
||||||
|
|
||||||
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
|
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
|
||||||
set as the newline convention, and a match attempt for an unanchored
|
set as the newline convention, and a match attempt for an unanchored
|
||||||
|
@ -2188,14 +2201,15 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
|
||||||
be captured. The pcre2_pattern_info() function can be used to find out
|
be captured. The pcre2_pattern_info() function can be used to find out
|
||||||
how many capturing subpatterns there are in a compiled pattern.
|
how many capturing subpatterns there are in a compiled pattern.
|
||||||
|
|
||||||
A successful match returns the overall matched string and any captured
|
You can use auxiliary functions for accessing captured substrings by
|
||||||
substrings to the caller via a vector of PCRE2_SIZE values. This is
|
number or by name, as described in sections below.
|
||||||
called the ovector, and is contained within the match data block. You
|
|
||||||
can obtain direct access to the ovector by calling pcre2_get_ovec-
|
Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
|
||||||
tor_pointer() to find its address, and pcre2_get_ovector_count() to
|
ues, called the ovector, which contains the offsets of captured
|
||||||
find the number of pairs of values it contains. Alternatively, you can
|
strings. It is part of the match data block. The function
|
||||||
use the auxiliary functions for accessing captured substrings by number
|
pcre2_get_ovector_pointer() returns the address of the ovector, and
|
||||||
or by name (see below).
|
pcre2_get_ovector_count() returns the number of pairs of values it con-
|
||||||
|
tains.
|
||||||
|
|
||||||
Within the ovector, the first in each pair of values is set to the off-
|
Within the ovector, the first in each pair of values is set to the off-
|
||||||
set of the first code unit of a substring, and the second is set to the
|
set of the first code unit of a substring, and the second is set to the
|
||||||
|
@ -2274,10 +2288,15 @@ OTHER INFORMATION ABOUT A MATCH
|
||||||
failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
|
failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
|
||||||
able, and pcre2_get_mark() can be called. It returns a pointer to the
|
able, and pcre2_get_mark() can be called. It returns a pointer to the
|
||||||
zero-terminated name, which is within the compiled pattern. Otherwise
|
zero-terminated name, which is within the compiled pattern. Otherwise
|
||||||
NULL is returned. After a successful match, the (*MARK) name that is
|
NULL is returned. The length of the (*MARK) name (excluding the
|
||||||
returned is the last one encountered on the matching path through the
|
terminating zero) is stored in the code unit that precedes the name. You
|
||||||
pattern. After a "no match" or a partial match, the last encountered
|
should use this instead of relying on the terminating zero if the
|
||||||
(*MARK) name is returned. For example, consider this pattern:
|
(*MARK) name might contain a binary zero.
|
||||||
|
|
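The length convention for (*MARK) names can be sketched as follows. This is a minimal illustration for the 8-bit library only, and it does not call PCRE2: the names mark_length, example_mark and stored are hypothetical, and the hand-built buffer merely stands in for the storage inside a compiled pattern. In real use the pointer would come from pcre2_get_mark().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given a pointer of the kind returned by
   pcre2_get_mark() in the 8-bit library, read the name length from the
   code unit that precedes the name. Returns 0 for a NULL pointer. */
static size_t mark_length(const uint8_t *mark)
{
  return (mark == NULL) ? 0 : (size_t)mark[-1];
}

/* Stand-in for storage inside a compiled pattern (an assumption made for
   illustration): one length code unit, then the name "A\0B", which
   contains a binary zero, then the terminating zero. */
static const uint8_t stored[] = { 3, 'A', 0, 'B', 0 };

/* Points at the first code unit of the name, as pcre2_get_mark() would. */
static const uint8_t *example_mark(void)
{
  return stored + 1;
}
```

Here a scan for the terminating zero would stop after the first code unit of the name, whereas the preceding code unit gives the full length.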
||||||
|
After a successful match, the (*MARK) name that is returned is the last
|
||||||
|
one encountered on the matching path through the pattern. After a "no
|
||||||
|
match" or a partial match, the last encountered (*MARK) name is
|
||||||
|
returned. For example, consider this pattern:
|
||||||
|
|
||||||
^(*MARK:A)((*MARK:B)a|b)c
|
^(*MARK:A)((*MARK:B)a|b)c
|
||||||
|
|
||||||
|
@ -2609,11 +2628,19 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
||||||
The outlengthptr argument must point to a variable that contains the
|
The outlengthptr argument must point to a variable that contains the
|
||||||
length, in code units, of the output buffer. If the function is suc-
|
length, in code units, of the output buffer. If the function is suc-
|
||||||
cessful, the value is updated to contain the length of the new string,
|
cessful, the value is updated to contain the length of the new string,
|
||||||
excluding the trailing zero that is automatically added. If the func-
|
excluding the trailing zero that is automatically added.
|
||||||
tion is not successful, the value is set to PCRE2_UNSET for general
|
|
||||||
errors (such as output buffer too small). For syntax errors in the
|
If the function is not successful, the value set via outlengthptr
|
||||||
replacement string, the value is set to the offset in the replacement
|
depends on the type of error. For syntax errors in the replacement
|
||||||
string where the error was detected.
|
string, the value is the offset in the replacement string where the
|
||||||
|
error was detected. For other errors, the value is PCRE2_UNSET by
|
||||||
|
default. This includes the case of the output buffer being too small,
|
||||||
|
unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
|
||||||
|
case the value is the minimum length needed, including space for the
|
||||||
|
trailing zero. Note that in order to compute the required length,
|
||||||
|
pcre2_substitute() has to simulate all the matching and copying,
|
||||||
|
instead of giving an error return as soon as the buffer overflows. Note
|
||||||
|
also that the length is in code units, not bytes.
|
||||||
|
|
||||||
In the replacement string, which is interpreted as a UTF string in UTF
|
In the replacement string, which is interpreted as a UTF string in UTF
|
||||||
mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
|
mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
|
||||||
|
@ -2639,16 +2666,51 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
||||||
apple lemon
|
apple lemon
|
||||||
2: pear orange
|
2: pear orange
|
||||||
|
|
||||||
There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes
|
As well as the usual options for pcre2_match(), a number of additional
|
||||||
the function to iterate over the subject string, replacing every match-
|
options can be set in the options argument.
|
||||||
ing substring. If this is not set, only the first matching substring is
|
|
||||||
replaced.
|
|
||||||
|
|
||||||
A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra
|
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
|
||||||
processing to be applied to the replacement string. Without this
|
string, replacing every matching substring. If this is not set, only
|
||||||
option, only the dollar character is special, and only the group inser-
|
the first matching substring is replaced. If any matched substring has
|
||||||
tion forms listed above are valid. When PCRE2_SUBSTITUTE_EXTENDED is
|
zero length, after the substitution has happened, an attempt to find a
|
||||||
set, two things change:
|
non-empty match at the same position is performed. If this is not suc-
|
||||||
|
cessful, the current position is advanced by one character except when
|
||||||
|
CRLF is a valid newline sequence and the next two characters are CR,
|
||||||
|
LF. In this case, the current position is advanced by two characters.
|
||||||
|
|
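The rule for moving on after an empty match can be sketched as below. This is an illustration only, not the library's code: advance_after_empty is a hypothetical name, and the sketch assumes a non-UTF 8-bit subject, so that one character is one code unit.

```c
#include <stddef.h>

/* Hypothetical helper, non-UTF 8-bit subjects only: return the position
   from which the search resumes after an empty match at pos, when no
   non-empty match was found at that position. crlf_is_newline is nonzero
   when the newline convention in force recognizes CRLF. */
static size_t advance_after_empty(const char *subject, size_t length,
                                  size_t pos, int crlf_is_newline)
{
  if (pos >= length) return length;        /* nothing left to step over */
  if (crlf_is_newline && pos + 1 < length &&
      subject[pos] == '\r' && subject[pos + 1] == '\n')
    return pos + 2;                        /* step over CR LF as a unit */
  return pos + 1;                          /* step over one character */
}
```

In UTF mode a character may occupy more than one code unit, which this sketch does not attempt to handle.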
||||||
|
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
|
||||||
|
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
|
||||||
|
ORY immediately. If this option is set, however, pcre2_substitute()
|
||||||
|
continues to go through the motions of matching and substituting (with-
|
||||||
|
out, of course, writing anything) in order to compute the size of buf-
|
||||||
|
fer that is needed. This value is passed back via the outlengthptr
|
||||||
|
variable, with the result of the function still being
|
||||||
|
PCRE2_ERROR_NOMEMORY.
|
||||||
|
|
||||||
|
Passing a buffer size of zero is a permitted way of finding out how
|
||||||
|
much memory is needed for a given substitution. However, this does mean
|
||||||
|
that the entire operation is carried out twice. Depending on the appli-
|
||||||
|
cation, it may be more efficient to allocate a large buffer and free
|
||||||
|
the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
|
||||||
|
FLOW_LENGTH.
|
||||||
|
|
||||||
|
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
|
||||||
|
that do not appear in the pattern to be treated as unset groups. This
|
||||||
|
option should be used with care, because it means that a typo in a
|
||||||
|
group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
|
||||||
|
error.
|
||||||
|
|
||||||
|
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
|
||||||
|
unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
|
||||||
|
treated as empty strings when inserted as described above. If this
|
||||||
|
option is not set, an attempt to insert an unset group causes the
|
||||||
|
PCRE2_ERROR_UNSET error. This option does not influence the extended
|
||||||
|
substitution syntax described below.
|
||||||
|
|
||||||
|
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
||||||
|
replacement string. Without this option, only the dollar character is
|
||||||
|
special, and only the group insertion forms listed above are valid.
|
||||||
|
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||||
|
|
||||||
Firstly, backslash in a replacement string is interpreted as an escape
|
Firstly, backslash in a replacement string is interpreted as an escape
|
||||||
character. The usual forms such as \n or \x{ddd} can be used to specify
|
character. The usual forms such as \n or \x{ddd} can be used to specify
|
||||||
|
@ -2698,22 +2760,41 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
||||||
somebody
|
somebody
|
||||||
1: HELLO
|
1: HELLO
|
||||||
|
|
||||||
If successful, the function returns the number of replacements that
|
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
||||||
were made. This may be zero if no matches were found, and is never
|
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
|
||||||
|
unknown groups in the extended syntax forms to be treated as unset.
|
||||||
|
|
||||||
|
If successful, pcre2_substitute() returns the number of replacements
|
||||||
|
that were made. This may be zero if no matches were found, and is never
|
||||||
greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
|
greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||||
|
|
||||||
In the event of an error, a negative error code is returned. Except for
|
In the event of an error, a negative error code is returned. Except for
|
||||||
PCRE2_ERROR_NOMATCH (which is never returned), errors from
|
PCRE2_ERROR_NOMATCH (which is never returned), errors from
|
||||||
pcre2_match() are passed straight back. PCRE2_ERROR_NOMEMORY is
|
pcre2_match() are passed straight back.
|
||||||
returned if the output buffer is not big enough.
|
|
||||||
|
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
|
||||||
|
tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
|
||||||
|
|
||||||
|
PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
|
||||||
|
ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
|
||||||
|
when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
|
||||||
|
TUTE_UNSET_EMPTY is not set.
|
||||||
|
|
||||||
|
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
|
||||||
|
enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
|
||||||
|
of buffer that is needed is returned via outlengthptr. Note that this
|
||||||
|
does not happen by default.
|
||||||
|
|
||||||
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
|
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
|
||||||
the replacement string, with more particular errors being
|
the replacement string, with more particular errors being
|
||||||
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
|
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
|
||||||
PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket not found),
|
PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket not found),
|
||||||
PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
|
PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
|
||||||
(the pattern match ended before it started). As for all PCRE2 errors, a
|
(the pattern match ended before it started, which can happen if \K is
|
||||||
text message that describes the error can be obtained by calling
|
used in an assertion).
|
||||||
pcre2_get_error_message().
|
|
||||||
|
As for all PCRE2 errors, a text message that describes the error can be
|
||||||
|
obtained by calling pcre2_get_error_message().
|
||||||
|
|
||||||
|
|
||||||
DUPLICATE SUBPATTERN NAMES
|
DUPLICATE SUBPATTERN NAMES
|
||||||
|
@ -2752,8 +2833,8 @@ DUPLICATE SUBPATTERN NAMES
|
||||||
no entries for the given name.
|
no entries for the given name.
|
||||||
|
|
||||||
The format of the name table is described above in the section entitled
|
The format of the name table is described above in the section entitled
|
||||||
Information about a pattern above. Given all the relevant entries for
|
Information about a pattern. Given all the relevant entries for the
|
||||||
the name, you can extract each of their numbers, and hence the captured
|
name, you can extract each of their numbers, and hence the captured
|
||||||
data.
|
data.
|
||||||
|
|
||||||
|
|
||||||
|
@ -2974,7 +3055,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 05 November 2015
|
Last updated: 16 December 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -4078,6 +4159,12 @@ SIMPLE USE OF JIT
|
||||||
exactly the same results. The returned value from pcre2_jit_compile()
|
exactly the same results. The returned value from pcre2_jit_compile()
|
||||||
is zero on success, or a negative error code.
|
is zero on success, or a negative error code.
|
||||||
|
|
||||||
|
There is a limit to the size of pattern that JIT supports, imposed by
|
||||||
|
the size of machine stack that it uses. The exact rules are not docu-
|
||||||
|
mented because they may change at any time, in particular, when new
|
||||||
|
optimizations are introduced. If a pattern is too big, a call to
|
||||||
|
pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
|
||||||
|
|
||||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
|
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
|
||||||
plete matches. If you want to run partial matches using the PCRE2_PAR-
|
plete matches. If you want to run partial matches using the PCRE2_PAR-
|
||||||
TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
|
TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
|
||||||
|
@ -4394,7 +4481,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 28 July 2015
|
Last updated: 14 November 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -5706,8 +5793,9 @@ BACKSLASH
|
||||||
below. This particular group matches either the two-character sequence
|
below. This particular group matches either the two-character sequence
|
||||||
CR followed by LF, or one of the single characters LF (linefeed,
|
CR followed by LF, or one of the single characters LF (linefeed,
|
||||||
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
|
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
|
||||||
riage return, U+000D), or NEL (next line, U+0085). The two-character
|
riage return, U+000D), or NEL (next line, U+0085). Because this is an
|
||||||
sequence is treated as a single unit that cannot be split.
|
atomic group, the two-character sequence is treated as a single unit
|
||||||
|
that cannot be split.
|
||||||
|
|
||||||
In other modes, two additional characters whose codepoints are greater
|
In other modes, two additional characters whose codepoints are greater
|
||||||
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
|
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
|
||||||
|
@ -6076,6 +6164,14 @@ CIRCUMFLEX AND DOLLAR
|
||||||
pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
|
pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
|
||||||
if PCRE2_MULTILINE is set.
|
if PCRE2_MULTILINE is set.
|
||||||
|
|
||||||
|
When the newline convention (see "Newline conventions" below) recog-
|
||||||
|
nizes the two-character sequence CRLF as a newline, this is preferred,
|
||||||
|
even if the single characters CR and LF are also recognized as new-
|
||||||
|
lines. For example, if the newline convention is "any", a multiline
|
||||||
|
mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
|
||||||
|
than after CR, even though CR on its own is a valid newline. (It also
|
||||||
|
matches at the very start of the string, of course.)
|
||||||
|
|
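The preference for CRLF can be sketched as a test of whether a multiline circumflex may match at a given offset. This is an illustration, not PCRE2's implementation: is_line_start is a hypothetical name, and only CR, LF and CRLF are treated as newlines, which is a simplification of the "any" convention. The sketch also assumes that a circumflex does not match after a newline that ends the subject.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if a multiline circumflex could match at
   offset pos in subject, 0 otherwise. Only CR, LF and CRLF count as
   newlines here, and CRLF is treated as a single unit. */
static int is_line_start(const char *subject, size_t length, size_t pos)
{
  if (pos == 0) return 1;              /* very start of the subject */
  if (pos >= length) return 0;         /* after a final newline: no match */
  if (subject[pos - 1] == '\n') return 1;   /* after LF or after CRLF */
  if (subject[pos - 1] == '\r')
    return subject[pos] != '\n';       /* not between the CR and LF of CRLF */
  return 0;
}
```

For the subject "abc\r\nxyz", this accepts offsets 0 and 5 but rejects offset 4, which lies between the CR and the LF.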
||||||
Note that the sequences \A, \Z, and \z can be used to match the start
|
Note that the sequences \A, \Z, and \z can be used to match the start
|
||||||
and end of the subject in both modes, and if all branches of a pattern
|
and end of the subject in both modes, and if all branches of a pattern
|
||||||
start with \A it is always anchored, whether or not PCRE2_MULTILINE is
|
start with \A it is always anchored, whether or not PCRE2_MULTILINE is
|
||||||
|
@ -6545,6 +6641,9 @@ DUPLICATE SUBPATTERN NUMBERS
|
||||||
|
|
||||||
/(?|(abc)|(def))(?1)/
|
/(?|(abc)|(def))(?1)/
|
||||||
|
|
||||||
|
A relative reference such as (?-1) is no different: it is just a conve-
|
||||||
|
nient way of computing an absolute group number.
|
||||||
|
|
||||||
If a condition test for a subpattern's having matched refers to a non-
|
If a condition test for a subpattern's having matched refers to a non-
|
||||||
unique number, the test is true if any of the subpatterns of that num-
|
unique number, the test is true if any of the subpatterns of that num-
|
||||||
ber have matched.
|
ber have matched.
|
||||||
|
@ -7444,6 +7543,20 @@ RECURSIVE PATTERNS
|
||||||
words, a negative number counts capturing parentheses leftwards from
|
words, a negative number counts capturing parentheses leftwards from
|
||||||
the point at which it is encountered.
|
the point at which it is encountered.
|
||||||
|
|
||||||
|
Be aware however, that if duplicate subpattern numbers are in use, rel-
|
||||||
|
ative references refer to the earliest subpattern with the appropriate
|
||||||
|
number. Consider, for example:
|
||||||
|
|
||||||
|
(?|(a)|(b)) (c) (?-2)
|
||||||
|
|
||||||
|
The first two capturing groups (a) and (b) are both numbered 1, and
|
||||||
|
group (c) is number 2. When the reference (?-2) is encountered, the
|
||||||
|
second most recently opened parentheses has the number 1, but it is the
|
||||||
|
first such group (the (a) group) to which the recursion refers. This
|
||||||
|
would be the same if an absolute reference (?1) was used. In other
|
||||||
|
words, relative references are just a shorthand for computing a group
|
||||||
|
number.
|
||||||
|
|
||||||
It is also possible to refer to subsequently opened parentheses, by
|
It is also possible to refer to subsequently opened parentheses, by
|
||||||
writing references such as (?+2). However, these cannot be recursive
|
writing references such as (?+2). However, these cannot be recursive
|
||||||
because the reference is not inside the parentheses that are refer-
|
because the reference is not inside the parentheses that are refer-
|
||||||
|
@ -8141,7 +8254,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 01 November 2015
|
Last updated: 13 November 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -8534,7 +8647,9 @@ MATCHING A PATTERN
|
||||||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
|
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
|
||||||
software intended to be portable to other systems. Note that a non-zero
|
software intended to be portable to other systems. Note that a non-zero
|
||||||
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
|
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
|
||||||
of the string, not how it is matched.
|
of the string, not how it is matched. Setting REG_STARTEND and passing
|
||||||
|
pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
||||||
|
returned.
|
||||||
|
|
||||||
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
||||||
matched strings is returned. The nmatch and pmatch arguments of
|
matched strings is returned. The nmatch and pmatch arguments of
|
||||||
|
@ -8587,7 +8702,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 30 October 2015
|
Last updated: 29 November 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "12 December 2015" "PCRE2 10.21"
|
.TH PCRE2API 3 "16 December 2015" "PCRE2 10.21"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -678,8 +678,8 @@ of the following match-time parameters:
|
||||||
.sp
|
.sp
|
||||||
A callout function
|
A callout function
|
||||||
The offset limit for matching an unanchored pattern
|
The offset limit for matching an unanchored pattern
|
||||||
The limit for calling \fImatch()\fP
|
The limit for calling \fBmatch()\fP (see below)
|
||||||
The limit for calling \fImatch()\fP recursively
|
The limit for calling \fBmatch()\fP recursively
|
||||||
.sp
|
.sp
|
||||||
A match context is also required if you are using custom memory management.
|
A match context is also required if you are using custom memory management.
|
||||||
If none of these apply, just pass NULL as the context argument of
|
If none of these apply, just pass NULL as the context argument of
|
||||||
|
@ -1611,8 +1611,9 @@ matches only CR, LF, or CRLF.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_CAPTURECOUNT
|
PCRE2_INFO_CAPTURECOUNT
|
||||||
.sp
|
.sp
|
||||||
Return the number of capturing subpatterns in the pattern. The third argument
|
Return the highest capturing subpattern number in the pattern. In patterns
|
||||||
should point to an \fBuint32_t\fP variable.
|
where (?| is not used, this is also the total number of capturing subpatterns.
|
||||||
|
The third argument should point to an \fBuint32_t\fP variable.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_FIRSTBITMAP
|
PCRE2_INFO_FIRSTBITMAP
|
||||||
.sp
|
.sp
|
||||||
|
@ -1629,10 +1630,8 @@ returned. Otherwise NULL is returned. The third argument should point to an
|
||||||
.sp
|
.sp
|
||||||
Return information about the first code unit of any matched string, for a
|
Return information about the first code unit of any matched string, for a
|
||||||
non-anchored pattern. The third argument should point to an \fBuint32_t\fP
|
non-anchored pattern. The third argument should point to an \fBuint32_t\fP
|
||||||
variable.
|
variable. If there is a fixed first value, for example, the letter "c" from a
|
||||||
.P
|
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
|
||||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
|
||||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||||
it is known that a match can occur only at the start of the subject or
|
it is known that a match can occur only at the start of the subject or
|
||||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||||
|
@ -1676,12 +1675,10 @@ Returns 1 if there is a rightmost literal code unit that must exist in any
|
||||||
matched string, other than at its start. The third argument should point to an
|
matched string, other than at its start. The third argument should point to an
|
||||||
\fBuint32_t\fP variable. If there is no such value, 0 is returned. When 1 is
|
\fBuint32_t\fP variable. If there is no such value, 0 is returned. When 1 is
|
||||||
returned, the code unit value itself can be retrieved using
|
returned, the code unit value itself can be retrieved using
|
||||||
PCRE2_INFO_LASTCODEUNIT.
|
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
||||||
.P
|
recorded only if it follows something of variable length. For example, for the
|
||||||
For anchored patterns, a last literal value is recorded only if it follows
|
pattern /^a\ed+z\ed+/ the returned value is 1 (with "z" returned from
|
||||||
something of variable length. For example, for the pattern /^a\ed+z\ed+/ the
|
PCRE2_INFO_LASTCODEUNIT), but for /^a\edz\ed/ the returned value is 0.
|
||||||
returned value is 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for
|
|
||||||
/^a\edz\ed/ the returned value is 0.
|
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_LASTCODEUNIT
|
PCRE2_INFO_LASTCODEUNIT
|
||||||
.sp
|
.sp
|
||||||
|
@ -2181,9 +2178,19 @@ standard convention for the operating system. The default can be overridden in
|
||||||
a
|
a
|
||||||
.\" HTML <a href="#compilecontext">
|
.\" HTML <a href="#compilecontext">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
compile context.
|
compile context
|
||||||
.\"
|
.\"
|
||||||
During matching, the newline choice affects the behaviour of the dot,
|
by calling \fBpcre2_set_newline()\fP. It can also be overridden by starting a
|
||||||
|
pattern string with, for example, (*CRLF), as described in the
|
||||||
|
.\" HTML <a href="pcre2pattern.html#newlines">
|
||||||
|
.\" </a>
|
||||||
|
section on newline conventions
|
||||||
|
.\"
|
||||||
|
in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2pattern\fP
|
||||||
|
.\"
|
||||||
|
page. During matching, the newline choice affects the behaviour of the dot,
|
||||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||||
starting position is advanced after a match failure for an unanchored pattern.
|
starting position is advanced after a match failure for an unanchored pattern.
|
||||||
.P
|
.P
|
||||||
|
@ -2229,18 +2236,7 @@ that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
|
||||||
function can be used to find out how many capturing subpatterns there are in a
|
function can be used to find out how many capturing subpatterns there are in a
|
||||||
compiled pattern.
|
compiled pattern.
|
||||||
.P
|
.P
|
||||||
A successful match returns the overall matched string and any captured
|
You can use auxiliary functions for accessing captured substrings
|
||||||
substrings to the caller via a vector of PCRE2_SIZE values. This is called the
|
|
||||||
\fBovector\fP, and is contained within the
|
|
||||||
.\" HTML <a href="#matchdatablock">
|
|
||||||
.\" </a>
|
|
||||||
match data block.
|
|
||||||
.\"
|
|
||||||
You can obtain direct access to the ovector by calling
|
|
||||||
\fBpcre2_get_ovector_pointer()\fP to find its address, and
|
|
||||||
\fBpcre2_get_ovector_count()\fP to find the number of pairs of values it
|
|
||||||
contains. Alternatively, you can use the auxiliary functions for accessing
|
|
||||||
captured substrings
|
|
||||||
.\" HTML <a href="#extractbynumber">
|
.\" HTML <a href="#extractbynumber">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
by number
|
by number
|
||||||
|
@ -2248,9 +2244,20 @@ by number
|
||||||
or
|
or
|
||||||
.\" HTML <a href="#extractbyname">
|
.\" HTML <a href="#extractbyname">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
by name
|
by name,
|
||||||
.\"
|
.\"
|
||||||
(see below).
|
as described in sections below.
|
||||||
|
.P
|
||||||
|
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
|
||||||
|
called the \fBovector\fP, which contains the offsets of captured strings. It is
|
||||||
|
part of the
|
||||||
|
.\" HTML <a href="#matchdatablock">
|
||||||
|
.\" </a>
|
||||||
|
match data block.
|
||||||
|
.\"
|
||||||
|
The function \fBpcre2_get_ovector_pointer()\fP returns the address of the
|
||||||
|
ovector, and \fBpcre2_get_ovector_count()\fP returns the number of pairs of
|
||||||
|
values it contains.
|
||||||
.P
|
.P
|
||||||
Within the ovector, the first in each pair of values is set to the offset of
|
Within the ovector, the first in each pair of values is set to the offset of
|
||||||
the first code unit of a substring, and the second is set to the offset of the
|
the first code unit of a substring, and the second is set to the offset of the
|
||||||
|
@ -2334,7 +2341,12 @@ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
|
||||||
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
||||||
\fBpcre2_get_mark()\fP can be called. It returns a pointer to the
|
\fBpcre2_get_mark()\fP can be called. It returns a pointer to the
|
||||||
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
||||||
returned. After a successful match, the (*MARK) name that is returned is the
|
returned. The length of the (*MARK) name (excluding the terminating zero) is
|
||||||
|
stored in the code unit that precedes the name. You should use this instead of
|
||||||
|
relying on the terminating zero if the (*MARK) name might contain a binary
|
||||||
|
zero.
|
||||||
|
.P
|
||||||
|
After a successful match, the (*MARK) name that is returned is the
|
||||||
last one encountered on the matching path through the pattern. After a "no
|
last one encountered on the matching path through the pattern. After a "no
|
||||||
match" or a partial match, the last encountered (*MARK) name is returned. For
|
match" or a partial match, the last encountered (*MARK) name is returned. For
|
||||||
example, consider this pattern:
|
example, consider this pattern:
|
||||||
|
@ -2353,7 +2365,7 @@ different to the value of \fIovector[0]\fP if the pattern contains the \eK
|
||||||
escape sequence. After a partial match, however, this value is always the same
|
escape sequence. After a partial match, however, this value is always the same
|
||||||
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
||||||
.P
|
.P
|
||||||
After a UTF check failure, \fBpcre2_get_startchar()\fB can be used to obtain
|
After a UTF check failure, \fBpcre2_get_startchar()\fP can be used to obtain
|
||||||
the code unit offset of the invalid UTF character. Details are given in the
|
the code unit offset of the invalid UTF character. Details are given in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2unicode\fP
|
\fBpcre2unicode\fP
|
||||||
|
@ -2901,14 +2913,14 @@ first and last entries in the name-to-number table for the given name, and the
|
||||||
function returns the length of each entry in code units. In both cases,
|
function returns the length of each entry in code units. In both cases,
|
||||||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||||
.P
|
.P
|
||||||
The format of the name table is described above in the section entitled
|
The format of the name table is described
|
||||||
\fIInformation about a pattern\fP
|
|
||||||
.\" HTML <a href="#infoaboutpattern">
|
.\" HTML <a href="#infoaboutpattern">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
above.
|
above
|
||||||
.\"
|
.\"
|
||||||
Given all the relevant entries for the name, you can extract each of their
|
in the section entitled \fIInformation about a pattern\fP. Given all the
|
||||||
numbers, and hence the captured data.
|
relevant entries for the name, you can extract each of their numbers, and hence
|
||||||
|
the captured data.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "FINDING ALL POSSIBLE MATCHES AT ONE POSITION"
|
.SH "FINDING ALL POSSIBLE MATCHES AT ONE POSITION"
|
||||||
|
@ -3154,6 +3166,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 21 December 2015
|
Last updated: 16 December 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -805,6 +805,10 @@ PATTERN MODIFIERS
|
||||||
mark show mark values
|
mark show mark values
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show starting character when relevant
|
startchar show starting character when relevant
|
||||||
|
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||||
|
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||||
|
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||||
|
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
|
|
||||||
These modifiers may not appear in a #pattern command. If you want them
|
These modifiers may not appear in a #pattern command. If you want them
|
||||||
as defaults, set them in a #subject command.
|
as defaults, set them in a #subject command.
|
||||||
|
@ -886,6 +890,11 @@ SUBJECT MODIFIERS
|
||||||
recursion_limit=<n> set a recursion limit
|
recursion_limit=<n> set a recursion limit
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show startchar when relevant
|
startchar show startchar when relevant
|
||||||
|
startoffset=<n> same as offset=<n>
|
||||||
|
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||||
|
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||||
|
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||||
|
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
zero_terminate pass the subject as zero-terminated
|
zero_terminate pass the subject as zero-terminated
|
||||||
|
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
|
@ -1011,19 +1020,30 @@ SUBJECT MODIFIERS
|
||||||
Testing the substitution function
|
Testing the substitution function
|
||||||
|
|
||||||
If the replace modifier is set, the pcre2_substitute() function is
|
If the replace modifier is set, the pcre2_substitute() function is
|
||||||
called instead of one of the matching functions. Unlike subject
|
called instead of one of the matching functions. Note that replacement
|
||||||
strings, pcre2test does not process replacement strings for escape
|
strings cannot contain commas, because a comma signifies the end of a
|
||||||
sequences. In UTF mode, a replacement string is checked to see if it is
|
modifier. This is not thought to be an issue in a test program.
|
||||||
a valid UTF-8 string. If so, it is correctly converted to a UTF string
|
|
||||||
of the appropriate code unit width. If it is not a valid UTF-8 string,
|
|
||||||
the individual code units are copied directly. This provides a means of
|
|
||||||
passing an invalid UTF-8 string for testing purposes.
|
|
||||||
|
|
||||||
If the global modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
Unlike subject strings, pcre2test does not process replacement strings
|
||||||
pcre2_substitute(). After a successful substitution, the modified
|
for escape sequences. In UTF mode, a replacement string is checked to
|
||||||
string is output, preceded by the number of replacements. This may be
|
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||||
zero if there were no matches. Here is a simple example of a substitu-
|
a UTF string of the appropriate code unit width. If it is not a valid
|
||||||
tion test:
|
UTF-8 string, the individual code units are copied directly. This pro-
|
||||||
|
vides a means of passing an invalid UTF-8 string for testing purposes.
|
||||||
|
|
||||||
|
The following modifiers set options (in addition to the normal match
|
||||||
|
options) for pcre2_substitute():
|
||||||
|
|
||||||
|
global PCRE2_SUBSTITUTE_GLOBAL
|
||||||
|
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||||
|
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||||
|
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||||
|
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
|
|
||||||
|
|
||||||
|
After a successful substitution, the modified string is output, pre-
|
||||||
|
ceded by the number of replacements. This may be zero if there were no
|
||||||
|
matches. Here is a simple example of a substitution test:
|
||||||
|
|
||||||
/abc/replace=xxx
|
/abc/replace=xxx
|
||||||
=abc=abc=
|
=abc=abc=
|
||||||
|
@ -1031,12 +1051,13 @@ SUBJECT MODIFIERS
|
||||||
=abc=abc=\=global
|
=abc=abc=\=global
|
||||||
2: =xxx=xxx=
|
2: =xxx=xxx=
|
||||||
|
|
||||||
Subject and replacement strings should be kept relatively short for
|
Subject and replacement strings should be kept relatively short (fewer
|
||||||
substitution tests, as fixed-size buffers are used. To make it easy to
|
than 256 characters) for substitution tests, as fixed-size buffers are
|
||||||
test for buffer overflow, if the replacement string starts with a num-
|
used. To make it easy to test for buffer overflow, if the replacement
|
||||||
ber in square brackets, that number is passed to pcre2_substitute() as
|
string starts with a number in square brackets, that number is passed
|
||||||
the size of the output buffer, with the replacement string starting at
|
to pcre2_substitute() as the size of the output buffer, with the
|
||||||
the next character. Here is an example that tests the edge case:
|
replacement string starting at the next character. Here is an example
|
||||||
|
that tests the edge case:
|
||||||
|
|
||||||
/abc/
|
/abc/
|
||||||
123abc123\=replace=[10]XYZ
|
123abc123\=replace=[10]XYZ
|
||||||
|
@ -1044,6 +1065,19 @@ SUBJECT MODIFIERS
|
||||||
123abc123\=replace=[9]XYZ
|
123abc123\=replace=[9]XYZ
|
||||||
Failed: error -47: no more memory
|
Failed: error -47: no more memory
|
||||||
|
|
||||||
|
The default action of pcre2_substitute() is to return
|
||||||
|
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
||||||
|
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
||||||
|
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
||||||
|
through the motions of matching and substituting, in order to compute
|
||||||
|
the size of buffer that is required. When this happens, pcre2test shows
|
||||||
|
the required buffer length (which includes space for the trailing zero)
|
||||||
|
as part of the error message. For example:
|
||||||
|
|
||||||
|
/abc/substitute_overflow_length
|
||||||
|
123abc123\=replace=[9]XYZ
|
||||||
|
Failed: error -47: no more memory: 10 code units are needed
|
||||||
|
|
||||||
A replacement string is ignored with POSIX and DFA matching. Specifying
|
A replacement string is ignored with POSIX and DFA matching. Specifying
|
||||||
partial matching provokes an error return ("bad option value") from
|
partial matching provokes an error return ("bad option value") from
|
||||||
pcre2_substitute().
|
pcre2_substitute().
|
||||||
|
@ -1471,5 +1505,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 05 November 2015
|
Last updated: 12 December 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
|
|
|
@ -44,7 +44,7 @@ POSSIBILITY OF SUCH DAMAGE.
|
||||||
#define PCRE2_MAJOR 10
|
#define PCRE2_MAJOR 10
|
||||||
#define PCRE2_MINOR 21
|
#define PCRE2_MINOR 21
|
||||||
#define PCRE2_PRERELEASE -RC1
|
#define PCRE2_PRERELEASE -RC1
|
||||||
#define PCRE2_DATE 2015-07-06
|
#define PCRE2_DATE 2015-12-15
|
||||||
|
|
||||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||||
imported have to be identified as such. When building PCRE2, the appropriate
|
imported have to be identified as such. When building PCRE2, the appropriate
|
||||||
|
|