Fix global search/replace in pcre2test and pcre2_substitute() when the pattern
matches an empty string, but never at the starting offset.
Philip.Hazel 2018-07-02 10:54:03 +00:00
parent 462f25d7d3
commit 1c79bdf36f
15 changed files with 333 additions and 229 deletions

@ -90,6 +90,17 @@ standard systems:
when linking pcre2test with MSVC. This gets rid of a stack overflow error in when linking pcre2test with MSVC. This gets rid of a stack overflow error in
the standard set of tests. the standard set of tests.
20. Output a warning in pcre2test when ignoring the "altglobal" modifier when
it is given with the "replace" modifier.
21. In both pcre2test and pcre2_substitute(), with global matching, a pattern
that matched an empty string, but never at the starting match offset, was not
handled in a Perl-compatible way. The pattern /(?<=\G.)/ is an example of such
a pattern. Because \G is in a lookbehind assertion, there has to be a
"bumpalong" before there can be a match. The automatic "advance by one
character after an empty string match" rule is therefore inappropriate. A more
complicated algorithm has now been implemented.
Version 10.31 12-February-2018 Version 10.31 12-February-2018
------------------------------ ------------------------------

@ -500,7 +500,7 @@ for bmode in "$test8" "$test16" "$test32"; do
for opt in "" $jitopt; do for opt in "" $jitopt; do
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $setstack $bmode $opt $testdata/testinput2 testtry $sim $valgrind ${opt:+$vjs} ./pcre2test -q $setstack $bmode $opt $testdata/testinput2 testtry
if [ $? = 0 ] ; then if [ $? = 0 ] ; then
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -65,-62,-2,-1,0,100,101,191,200 >>testtry $sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -70,-62,-2,-1,0,100,101,191,200 >>testtry
checkresult $? 2 "$opt" checkresult $? 2 "$opt"
fi fi
done done

@ -3154,7 +3154,10 @@ string in <i>outputbuffer</i>, replacing the part that was matched with the
<i>replacement</i> string, whose length is supplied in <b>rlength</b>. This can <i>replacement</i> string, whose length is supplied in <b>rlength</b>. This can
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \K item in a lookahead in the pattern causes the match to end before which a \K item in a lookahead in the pattern causes the match to end before
it starts are not supported, and give rise to an error return. it starts are not supported, and give rise to an error return. For global
replacements, matches in which \K in a lookbehind causes the match to start
earlier than the point that was reached in the previous iteration are also not
supported.
</P> </P>
<P> <P>
The first seven arguments of <b>pcre2_substitute()</b> are the same as for The first seven arguments of <b>pcre2_substitute()</b> are the same as for
@ -3631,7 +3634,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 30 June 2018 Last updated: 02 July 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>

@ -1084,9 +1084,9 @@ sequences but the characters that they represent.)
Resetting the match start Resetting the match start
</b><br> </b><br>
<P> <P>
The escape sequence \K causes any previously matched characters not to be In normal use, the escape sequence \K causes any previously matched characters
included in the final matched sequence that is returned. For example, the not to be included in the final matched sequence that is returned. For example,
pattern: the pattern:
<pre> <pre>
foo\Kbar foo\Kbar
</pre> </pre>
@ -1115,7 +1115,13 @@ PCRE2, \K is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\K) ignored in negative assertions. Note that when a pattern such as (?=ab\K)
matches, the reported start of the match can be greater than the end of the matches, the reported start of the match can be greater than the end of the
match. Using \K in a lookbehind assertion at the start of a pattern can also match. Using \K in a lookbehind assertion at the start of a pattern can also
lead to odd effects. lead to odd effects. For example, consider this pattern:
<pre>
(?&#60;=\Kfoo)bar
</pre>
If the subject is "foobar", a call to <b>pcre2_match()</b> with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
<a name="smallassertions"></a></P> <a name="smallassertions"></a></P>
<br><b> <br><b>
Simple assertions Simple assertions
@ -3484,7 +3490,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 June 2018 Last updated: 30 June 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>

@ -3059,75 +3059,78 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
replacement string, whose length is supplied in rlength. This can be replacement string, whose length is supplied in rlength. This can be
given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \K item in a lookahead in the pattern causes the match to end which a \K item in a lookahead in the pattern causes the match to end
before it starts are not supported, and give rise to an error return. before it starts are not supported, and give rise to an error return.
For global replacements, matches in which \K in a lookbehind causes the
match to start earlier than the point that was reached in the previous
iteration are also not supported.
The first seven arguments of pcre2_substitute() are the same as for The first seven arguments of pcre2_substitute() are the same as for
pcre2_match(), except that the partial matching options are not permit- pcre2_match(), except that the partial matching options are not permit-
ted, and match_data may be passed as NULL, in which case a match data ted, and match_data may be passed as NULL, in which case a match data
block is obtained and freed within this function, using memory manage- block is obtained and freed within this function, using memory manage-
ment functions from the match context, if provided, or else those that ment functions from the match context, if provided, or else those that
were used to allocate memory for the compiled code. were used to allocate memory for the compiled code.
The outlengthptr argument must point to a variable that contains the The outlengthptr argument must point to a variable that contains the
length, in code units, of the output buffer. If the function is suc- length, in code units, of the output buffer. If the function is suc-
cessful, the value is updated to contain the length of the new string, cessful, the value is updated to contain the length of the new string,
excluding the trailing zero that is automatically added. excluding the trailing zero that is automatically added.
If the function is not successful, the value set via outlengthptr If the function is not successful, the value set via outlengthptr
depends on the type of error. For syntax errors in the replacement depends on the type of error. For syntax errors in the replacement
string, the value is the offset in the replacement string where the string, the value is the offset in the replacement string where the
error was detected. For other errors, the value is PCRE2_UNSET by error was detected. For other errors, the value is PCRE2_UNSET by
default. This includes the case of the output buffer being too small, default. This includes the case of the output buffer being too small,
unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
case the value is the minimum length needed, including space for the case the value is the minimum length needed, including space for the
trailing zero. Note that in order to compute the required length, trailing zero. Note that in order to compute the required length,
pcre2_substitute() has to simulate all the matching and copying, pcre2_substitute() has to simulate all the matching and copying,
instead of giving an error return as soon as the buffer overflows. Note instead of giving an error return as soon as the buffer overflows. Note
also that the length is in code units, not bytes. also that the length is in code units, not bytes.
In the replacement string, which is interpreted as a UTF string in UTF In the replacement string, which is interpreted as a UTF string in UTF
mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
option is set, a dollar character is an escape character that can spec- option is set, a dollar character is an escape character that can spec-
ify the insertion of characters from capturing groups or (*MARK), ify the insertion of characters from capturing groups or (*MARK),
(*PRUNE), or (*THEN) items in the pattern. The following forms are (*PRUNE), or (*THEN) items in the pattern. The following forms are
always recognized: always recognized:
$$ insert a dollar character $$ insert a dollar character
$<n> or ${<n>} insert the contents of group <n> $<n> or ${<n>} insert the contents of group <n>
$*MARK or ${*MARK} insert a (*MARK), (*PRUNE), or (*THEN) name $*MARK or ${*MARK} insert a (*MARK), (*PRUNE), or (*THEN) name
Either a group number or a group name can be given for <n>. Curly Either a group number or a group name can be given for <n>. Curly
brackets are required only if the following character would be inter- brackets are required only if the following character would be inter-
preted as part of the number or name. The number may be zero to include preted as part of the number or name. The number may be zero to include
the entire matched string. For example, if the pattern a(b)c is the entire matched string. For example, if the pattern a(b)c is
matched with "=abc=" and the replacement string "+$1$0$1+", the result matched with "=abc=" and the replacement string "+$1$0$1+", the result
is "=+babcb+=". is "=+babcb+=".
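The simple insertion forms described above can be modelled in a few lines of C. What follows is only an illustrative sketch and is not taken from the PCRE2 sources: the name expand_simple() and its arguments are hypothetical, it understands only "$$" and the single-digit forms "$<n>" and "${<n>}" (no group names, no $*MARK), and it is handed the captured groups as ready-made strings instead of running a match.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. Expands "$$", "$<d>" and "${<d>}"
   in rep, where <d> is one decimal digit. groups[0] is the whole match and
   groups[1..ngroups-1] are the capturing groups. Returns the number of
   characters written (excluding the trailing zero), or -1 for a syntax
   error, an unknown group, or an output buffer that is too small. */
static int expand_simple(const char *rep, const char *const *groups,
                         int ngroups, char *out, size_t outsize)
{
  size_t used = 0;
  if (outsize == 0) return -1;
  while (*rep != '\0') {
    const char *piece = rep;     /* default: copy one literal character */
    size_t piecelen = 1;
    if (*rep == '$' && rep[1] == '$') {
      rep += 2;                  /* "$$" inserts a single dollar */
    } else if (*rep == '$') {
      int braces = (rep[1] == '{');
      int n;
      rep += braces ? 2 : 1;
      if (!isdigit((unsigned char)*rep)) return -1;
      n = *rep++ - '0';
      if (braces && *rep++ != '}') return -1;
      if (n >= ngroups) return -1;
      piece = groups[n];         /* insert the contents of group n */
      piecelen = strlen(piece);
    } else {
      rep++;
    }
    if (used + piecelen + 1 > outsize) return -1;
    memcpy(out + used, piece, piecelen);
    used += piecelen;
  }
  out[used] = '\0';
  return (int)used;
}
```

With groups { "abc", "b" } (the whole match and group 1 for a(b)c against "=abc="), expanding "+$1$0$1+" gives "+babcb+"; placing that between the unmatched "=" characters on either side yields the "=+babcb+=" quoted in the text.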
$*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or $*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or
(*THEN) on the matching path that has a name. (*MARK) must always (*THEN) on the matching path that has a name. (*MARK) must always
include a name, but (*PRUNE) and (*THEN) need not. For example, in the include a name, but (*PRUNE) and (*THEN) need not. For example, in the
case of (*MARK:A)(*PRUNE) the name inserted is "A", but for case of (*MARK:A)(*PRUNE) the name inserted is "A", but for
(*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be
used to perform simple simultaneous substitutions, as this pcre2test used to perform simple simultaneous substitutions, as this pcre2test
example shows: example shows:
/(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK} /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
apple lemon apple lemon
2: pear orange 2: pear orange
As well as the usual options for pcre2_match(), a number of additional As well as the usual options for pcre2_match(), a number of additional
options can be set in the options argument of pcre2_substitute(). options can be set in the options argument of pcre2_substitute().
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
string, replacing every matching substring. If this option is not set, string, replacing every matching substring. If this option is not set,
only the first matching substring is replaced. The search for matches only the first matching substring is replaced. The search for matches
takes place in the original subject string (that is, previous replace- takes place in the original subject string (that is, previous replace-
ments do not affect it). Iteration is implemented by advancing the ments do not affect it). Iteration is implemented by advancing the
startoffset value for each search, which is always passed the entire startoffset value for each search, which is always passed the entire
subject string. If an offset limit is set in the match context, search- subject string. If an offset limit is set in the match context, search-
ing stops when that limit is reached. ing stops when that limit is reached.
You can restrict the effect of a global substitution to a portion of You can restrict the effect of a global substitution to a portion of
the subject string by setting either or both of startoffset and an off- the subject string by setting either or both of startoffset and an off-
set limit. Here is a pcre2test example: set limit. Here is a pcre2test example:
@ -3135,87 +3138,87 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
ABC ABC ABC ABC\=offset=3,offset_limit=12 ABC ABC ABC ABC\=offset=3,offset_limit=12
2: ABC A!C A!C ABC 2: ABC A!C A!C ABC
When continuing with global substitutions after matching a substring When continuing with global substitutions after matching a substring
with zero length, an attempt to find a non-empty match at the same off- with zero length, an attempt to find a non-empty match at the same off-
set is performed. If this is not successful, the offset is advanced by set is performed. If this is not successful, the offset is advanced by
one character except when CRLF is a valid newline sequence and the next one character except when CRLF is a valid newline sequence and the next
two characters are CR, LF. In this case, the offset is advanced by two two characters are CR, LF. In this case, the offset is advanced by two
characters. characters.
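The advance rule in the paragraph above can be written down as a small function. This is a sketch under stated assumptions, not the library's code: next_offset_after_empty() is a hypothetical name, the subject is assumed to be non-UTF so that one character is one code unit, and the caller is assumed to have already tried and failed to find a non-empty match at the same offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PCRE2 source. Returns the offset at which the
   next search should start after an empty match at offset. crlf_is_newline
   is non-zero when CRLF is a valid newline sequence. */
static size_t next_offset_after_empty(const char *subject, size_t length,
                                      size_t offset, int crlf_is_newline)
{
  if (offset >= length) return length;          /* nothing left to skip */
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;                          /* step over CR LF as a unit */
  return offset + 1;                            /* step over one character */
}
```

For the subject "a\r\nb" with CRLF as a valid newline, an empty match at offset 1 moves the next search to offset 3; with CRLF not a valid newline it moves to offset 2.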
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
ORY immediately. If this option is set, however, pcre2_substitute() ORY immediately. If this option is set, however, pcre2_substitute()
continues to go through the motions of matching and substituting (with- continues to go through the motions of matching and substituting (with-
out, of course, writing anything) in order to compute the size of buf- out, of course, writing anything) in order to compute the size of buf-
fer that is needed. This value is passed back via the outlengthptr fer that is needed. This value is passed back via the outlengthptr
variable, with the result of the function still being variable, with the result of the function still being
PCRE2_ERROR_NOMEMORY. PCRE2_ERROR_NOMEMORY.
Passing a buffer size of zero is a permitted way of finding out how Passing a buffer size of zero is a permitted way of finding out how
much memory is needed for a given substitution. However, this does mean much memory is needed for a given substitution. However, this does mean
that the entire operation is carried out twice. Depending on the appli- that the entire operation is carried out twice. Depending on the appli-
cation, it may be more efficient to allocate a large buffer and free cation, it may be more efficient to allocate a large buffer and free
the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER- the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
FLOW_LENGTH. FLOW_LENGTH.
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
that do not appear in the pattern to be treated as unset groups. This that do not appear in the pattern to be treated as unset groups. This
option should be used with care, because it means that a typo in a option should be used with care, because it means that a typo in a
group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
error. error.
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
treated as empty strings when inserted as described above. If this treated as empty strings when inserted as described above. If this
option is not set, an attempt to insert an unset group causes the option is not set, an attempt to insert an unset group causes the
PCRE2_ERROR_UNSET error. This option does not influence the extended PCRE2_ERROR_UNSET error. This option does not influence the extended
substitution syntax described below. substitution syntax described below.
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
replacement string. Without this option, only the dollar character is replacement string. Without this option, only the dollar character is
special, and only the group insertion forms listed above are valid. special, and only the group insertion forms listed above are valid.
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change: When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
Firstly, backslash in a replacement string is interpreted as an escape Firstly, backslash in a replacement string is interpreted as an escape
character. The usual forms such as \n or \x{ddd} can be used to specify character. The usual forms such as \n or \x{ddd} can be used to specify
particular character codes, and backslash followed by any non-alphanu- particular character codes, and backslash followed by any non-alphanu-
meric character quotes that character. Extended quoting can be coded meric character quotes that character. Extended quoting can be coded
using \Q...\E, exactly as in pattern strings. using \Q...\E, exactly as in pattern strings.
There are also four escape sequences for forcing the case of inserted There are also four escape sequences for forcing the case of inserted
letters. The insertion mechanism has three states: no case forcing, letters. The insertion mechanism has three states: no case forcing,
force upper case, and force lower case. The escape sequences change the force upper case, and force lower case. The escape sequences change the
current state: \U and \L change to upper or lower case forcing, respec- current state: \U and \L change to upper or lower case forcing, respec-
tively, and \E (when not terminating a \Q quoted sequence) reverts to tively, and \E (when not terminating a \Q quoted sequence) reverts to
no case forcing. The sequences \u and \l force the next character (if no case forcing. The sequences \u and \l force the next character (if
it is a letter) to upper or lower case, respectively, and then the it is a letter) to upper or lower case, respectively, and then the
state automatically reverts to no case forcing. Case forcing applies to state automatically reverts to no case forcing. Case forcing applies to
all inserted characters, including those from captured groups and let- all inserted characters, including those from captured groups and let-
ters within \Q...\E quoted sequences. ters within \Q...\E quoted sequences.
Note that case forcing sequences such as \U...\E do not nest. For exam- Note that case forcing sequences such as \U...\E do not nest. For exam-
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
\E has no effect. \E has no effect.
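The case-forcing rules above amount to a small state machine, which the following C sketch models. It is an illustration only and is not PCRE2's implementation: force_case() is a hypothetical name, only the escapes \U, \L, \E, \u and \l are recognized (other backslash sequences, \Q quoting and group insertion are ignored), letters are assumed to be ASCII, and the one-shot state set by \u or \l is assumed to revert after the next character whether or not it is a letter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Hypothetical model, not PCRE2 source. Copies rep to out, applying the
   case-forcing escapes. out must have room for strlen(rep) + 1 bytes. */
static void force_case(const char *rep, char *out)
{
  enum force state = FORCE_NONE;
  int single = 0;   /* non-zero: state applies to the next character only */

  for (; *rep != '\0'; rep++) {
    if (*rep == '\\' && rep[1] != '\0' && strchr("ULEul", rep[1]) != NULL) {
      switch (*++rep) {
        case 'U': state = FORCE_UPPER; single = 0; break;
        case 'L': state = FORCE_LOWER; single = 0; break;
        case 'E': state = FORCE_NONE;  single = 0; break;
        case 'u': state = FORCE_UPPER; single = 1; break;
        case 'l': state = FORCE_LOWER; single = 1; break;
      }
      continue;     /* the escape itself produces no output */
    }
    if (state == FORCE_UPPER)
      *out++ = (char)toupper((unsigned char)*rep);
    else if (state == FORCE_LOWER)
      *out++ = (char)tolower((unsigned char)*rep);
    else
      *out++ = *rep;
    if (single) {   /* \u and \l revert to no case forcing afterwards */
      state = FORCE_NONE;
      single = 0;
    }
  }
  *out = '\0';
}
```

Because there is a single state and no stack, \L simply replaces the state set by \U, and the first \E already returns to no case forcing. That is why the second \E in "\Uaa\LBB\Ecc\E" has nothing left to undo.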
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to group substitution. The syntax is similar to that used flexibility to group substitution. The syntax is similar to that used
by Bash: by Bash:
${<n>:-<string>} ${<n>:-<string>}
${<n>:+<string1>:<string2>} ${<n>:+<string1>:<string2>}
As before, <n> may be a group number or a name. The first form speci- As before, <n> may be a group number or a name. The first form speci-
fies a default value. If group <n> is set, its value is inserted; if fies a default value. If group <n> is set, its value is inserted; if
not, <string> is expanded and the result inserted. The second form not, <string> is expanded and the result inserted. The second form
specifies strings that are expanded and inserted when group <n> is set specifies strings that are expanded and inserted when group <n> is set
or unset, respectively. The first form is just a convenient shorthand or unset, respectively. The first form is just a convenient shorthand
for for
${<n>:+${<n>}:<string>} ${<n>:+${<n>}:<string>}
Backslash can be used to escape colons and closing curly brackets in Backslash can be used to escape colons and closing curly brackets in
the replacement strings. A change of the case forcing state within a the replacement strings. A change of the case forcing state within a
replacement string remains in force afterwards, as shown in this replacement string remains in force afterwards, as shown in this
pcre2test example: pcre2test example:
/(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
@ -3224,42 +3227,42 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
somebody somebody
1: HELLO 1: HELLO
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
unknown groups in the extended syntax forms to be treated as unset. unknown groups in the extended syntax forms to be treated as unset.
If successful, pcre2_substitute() returns the number of replacements If successful, pcre2_substitute() returns the number of replacements
that were made. This may be zero if no matches were found, and is never that were made. This may be zero if no matches were found, and is never
greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set. greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
In the event of an error, a negative error code is returned. Except for In the event of an error, a negative error code is returned. Except for
PCRE2_ERROR_NOMATCH (which is never returned), errors from PCRE2_ERROR_NOMATCH (which is never returned), errors from
pcre2_match() are passed straight back. pcre2_match() are passed straight back.
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
when the simple (non-extended) syntax is used and PCRE2_SUBSTI- when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
TUTE_UNSET_EMPTY is not set. TUTE_UNSET_EMPTY is not set.
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
of buffer that is needed is returned via outlengthptr. Note that this of buffer that is needed is returned via outlengthptr. Note that this
does not happen by default. does not happen by default.
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
the replacement string, with more particular errors being the replacement string, with more particular errors being
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP- PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI- MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
TUTION (syntax error in extended group substitution), and TUTION (syntax error in extended group substitution), and
PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started
or the match started earlier than the current position in the subject, or the match started earlier than the current position in the subject,
which can happen if \K is used in an assertion). which can happen if \K is used in an assertion).
As for all PCRE2 errors, a text message that describes the error can be As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the pcre2_get_error_message() function (see obtained by calling the pcre2_get_error_message() function (see
"Obtaining a textual error message" above). "Obtaining a textual error message" above).
@ -3268,56 +3271,56 @@ DUPLICATE SUBPATTERN NAMES
int pcre2_substring_nametable_scan(const pcre2_code *code, int pcre2_substring_nametable_scan(const pcre2_code *code,
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
When a pattern is compiled with the PCRE2_DUPNAMES option, names for When a pattern is compiled with the PCRE2_DUPNAMES option, names for
subpatterns are not required to be unique. Duplicate names are always subpatterns are not required to be unique. Duplicate names are always
allowed for subpatterns with the same number, created by using the (?| allowed for subpatterns with the same number, created by using the (?|
feature. Indeed, if such subpatterns are named, they are required to feature. Indeed, if such subpatterns are named, they are required to
use the same names. use the same names.
Normally, patterns with duplicate names are such that in any one match, Normally, patterns with duplicate names are such that in any one match,
only one of the named subpatterns participates. An example is shown in only one of the named subpatterns participates. An example is shown in
the pcre2pattern documentation. the pcre2pattern documentation.
When duplicates are present, pcre2_substring_copy_byname() and When duplicates are present, pcre2_substring_copy_byname() and
pcre2_substring_get_byname() return the first substring corresponding pcre2_substring_get_byname() return the first substring corresponding
to the given name that is set. Only if none are set is to the given name that is set. Only if none are set is
PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
duplicate names. duplicate names.
If you want to get full details of all captured substrings for a given If you want to get full details of all captured substrings for a given
name, you must use the pcre2_substring_nametable_scan() function. The name, you must use the pcre2_substring_nametable_scan() function. The
first argument is the compiled pattern, and the second is the name. If first argument is the compiled pattern, and the second is the name. If
the third and fourth arguments are NULL, the function returns a group the third and fourth arguments are NULL, the function returns a group
number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
When the third and fourth arguments are not NULL, they must be pointers When the third and fourth arguments are not NULL, they must be pointers
to variables that are updated by the function. After it has run, they to variables that are updated by the function. After it has run, they
point to the first and last entries in the name-to-number table for the point to the first and last entries in the name-to-number table for the
given name, and the function returns the length of each entry in code given name, and the function returns the length of each entry in code
units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
no entries for the given name. no entries for the given name.
The format of the name table is described above in the section entitled The format of the name table is described above in the section entitled
Information about a pattern. Given all the relevant entries for the Information about a pattern. Given all the relevant entries for the
name, you can extract each of their numbers, and hence the captured name, you can extract each of their numbers, and hence the captured
data. data.
FINDING ALL POSSIBLE MATCHES AT ONE POSITION FINDING ALL POSSIBLE MATCHES AT ONE POSITION
The traditional matching function uses a similar algorithm to Perl, The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match at a given point in the sub- which stops when it finds the first match at a given point in the sub-
ject. If you want to find all possible matches, or the longest possible ject. If you want to find all possible matches, or the longest possible
match at a given position, consider using the alternative matching match at a given position, consider using the alternative matching
function (see below) instead. If you cannot use the alternative func- function (see below) instead. If you cannot use the alternative func-
tion, you can kludge it up by making use of the callout facility, which tion, you can kludge it up by making use of the callout facility, which
is described in the pcre2callout documentation. is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pat- What you have to do is to insert a callout right at the end of the pat-
tern. When your callout function is called, extract and save the cur- tern. When your callout function is called, extract and save the cur-
rent matched substring. Then return 1, which forces pcre2_match() to rent matched substring. Then return 1, which forces pcre2_match() to
backtrack and try other alternatives. Ultimately, when it runs out of backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
@ -3329,26 +3332,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
pcre2_match_context *mcontext, pcre2_match_context *mcontext,
int *workspace, PCRE2_SIZE wscount); int *workspace, PCRE2_SIZE wscount);
The function pcre2_dfa_match() is called to match a subject string The function pcre2_dfa_match() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the against a compiled pattern, using a matching algorithm that scans the
subject string just once (not counting lookaround assertions), and does subject string just once (not counting lookaround assertions), and does
not backtrack. This has different characteristics to the normal algo- not backtrack. This has different characteristics to the normal algo-
rithm, and is not compatible with Perl. Some of the features of PCRE2 rithm, and is not compatible with Perl. Some of the features of PCRE2
patterns are not supported. Nevertheless, there are times when this patterns are not supported. Nevertheless, there are times when this
kind of matching can be useful. For a discussion of the two matching kind of matching can be useful. For a discussion of the two matching
algorithms, and a list of features that pcre2_dfa_match() does not sup- algorithms, and a list of features that pcre2_dfa_match() does not sup-
port, see the pcre2matching documentation. port, see the pcre2matching documentation.
The arguments for the pcre2_dfa_match() function are the same as for The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block pcre2_match(), plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other com- is used in a different way, and this is described below. The other com-
mon arguments are used in the same way as for pcre2_match(), so their mon arguments are used in the same way as for pcre2_match(), so their
description is not repeated here. description is not repeated here.
The two additional arguments provide workspace for the function. The The two additional arguments provide workspace for the function. The
workspace vector should contain at least 20 elements. It is used for workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More keeping track of multiple paths through the pattern tree. More
workspace is needed for patterns and subjects where there are a lot of workspace is needed for patterns and subjects where there are a lot of
potential matches. potential matches.
Here is an example of a simple call to pcre2_dfa_match(): Here is an example of a simple call to pcre2_dfa_match():
@ -3368,45 +3371,45 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre2_dfa_match() Option bits for pcre2_dfa_match()
The unused bits of the options argument for pcre2_dfa_match() must be The unused bits of the options argument for pcre2_dfa_match() must be
zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN- zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
the last four of these are exactly the same as for pcre2_match(), so the last four of these are exactly the same as for pcre2_match(), so
their description is not repeated here. their description is not repeated here.
PCRE2_PARTIAL_HARD PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT PCRE2_PARTIAL_SOFT
These have the same general effect as they do for pcre2_match(), but These have the same general effect as they do for pcre2_match(), but
the details are slightly different. When PCRE2_PARTIAL_HARD is set for the details are slightly different. When PCRE2_PARTIAL_HARD is set for
pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility subject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete that requires additional characters. This happens even if some complete
matches have already been found. When PCRE2_PARTIAL_SOFT is set, the matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
if the end of the subject is reached, there have been no complete if the end of the subject is reached, there have been no complete
matches, but there is still at least one matching possibility. The por- matches, but there is still at least one matching possibility. The por-
tion of the string that was inspected when the longest partial match tion of the string that was inspected when the longest partial match
was found is set as the first matching string in both cases. There is a was found is set as the first matching string in both cases. There is a
more detailed discussion of partial and multi-segment matching, with more detailed discussion of partial and multi-segment matching, with
examples, in the pcre2partial documentation. examples, in the pcre2partial documentation.
PCRE2_DFA_SHORTEST PCRE2_DFA_SHORTEST
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna- stop as soon as it has found one match. Because of the way the alterna-
tive algorithm works, this is necessarily the shortest possible match tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string. at the first possible matching point in the subject string.
PCRE2_DFA_RESTART PCRE2_DFA_RESTART
When pcre2_dfa_match() returns a partial match, it is possible to call When pcre2_dfa_match() returns a partial match, it is possible to call
it again, with additional subject characters, and have it continue with it again, with additional subject characters, and have it continue with
the same match. The PCRE2_DFA_RESTART option requests this action; when the same match. The PCRE2_DFA_RESTART option requests this action; when
it is set, the workspace and wscount options must reference the same it is set, the workspace and wscount options must reference the same
vector as before because data about the match so far is left in them vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the after a partial match. There is more discussion of this facility in the
pcre2partial documentation. pcre2partial documentation.
@ -3414,8 +3417,8 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
When pcre2_dfa_match() succeeds, it may have matched more than one sub- When pcre2_dfa_match() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run string in the subject. Note, however, that all the matches from one run
of the function start at the same point in the subject. The shorter of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example, matches are all initial substrings of the longer matches. For example,
if the pattern if the pattern
<.*> <.*>
@ -3430,73 +3433,73 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else> <something> <something else>
<something> <something>
On success, the yield of the function is a number greater than zero, On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The offsets of the sub- which is the number of matched substrings. The offsets of the sub-
strings are returned in the ovector, and can be extracted by number in strings are returned in the ovector, and can be extracted by number in
the same way as for pcre2_match(), but the numbers bear no relation to the same way as for pcre2_match(), but the numbers bear no relation to
any capturing groups that may exist in the pattern, because DFA match- any capturing groups that may exist in the pattern, because DFA match-
ing does not support group capture. ing does not support group capture.
Calls to the convenience functions that extract substrings by name Calls to the convenience functions that extract substrings by name
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
after a DFA match. The convenience functions that extract substrings by after a DFA match. The convenience functions that extract substrings by
number never return PCRE2_ERROR_NOSUBSTRING. number never return PCRE2_ERROR_NOSUBSTRING.
The matched strings are stored in the ovector in reverse order of The matched strings are stored in the ovector in reverse order of
length; that is, the longest matching string is first. If there were length; that is, the longest matching string is first. If there were
too many matches to fit into the ovector, the yield of the function is too many matches to fit into the ovector, the yield of the function is
zero, and the vector is filled with the longest matches. zero, and the vector is filled with the longest matches.
NOTE: PCRE2's "auto-possessification" optimization usually applies to NOTE: PCRE2's "auto-possessification" optimization usually applies to
character repeats at the end of a pattern (as well as internally). For character repeats at the end of a pattern (as well as internally). For
example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
matching, this means that only one possible match is found. If you matching, this means that only one possible match is found. If you
really do want multiple matches in such cases, either use an ungreedy really do want multiple matches in such cases, either use an ungreedy
repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
compiling. compiling.
Error returns from pcre2_dfa_match() Error returns from pcre2_dfa_match()
The pcre2_dfa_match() function returns a negative number when it fails. The pcre2_dfa_match() function returns a negative number when it fails.
Many of the errors are the same as for pcre2_match(), as described Many of the errors are the same as for pcre2_match(), as described
above. There are in addition the following errors that are specific to above. There are in addition the following errors that are specific to
pcre2_dfa_match(): pcre2_dfa_match():
PCRE2_ERROR_DFA_UITEM PCRE2_ERROR_DFA_UITEM
This return is given if pcre2_dfa_match() encounters an item in the This return is given if pcre2_dfa_match() encounters an item in the
pattern that it does not support, for instance, the use of \C in a UTF pattern that it does not support, for instance, the use of \C in a UTF
mode or a backreference. mode or a backreference.
PCRE2_ERROR_DFA_UCOND PCRE2_ERROR_DFA_UCOND
This return is given if pcre2_dfa_match() encounters a condition item This return is given if pcre2_dfa_match() encounters a condition item
that uses a backreference for the condition, or a test for recursion in that uses a backreference for the condition, or a test for recursion in
a specific group. These are not supported. a specific group. These are not supported.
PCRE2_ERROR_DFA_WSSIZE PCRE2_ERROR_DFA_WSSIZE
This return is given if pcre2_dfa_match() runs out of space in the This return is given if pcre2_dfa_match() runs out of space in the
workspace vector. workspace vector.
PCRE2_ERROR_DFA_RECURSE PCRE2_ERROR_DFA_RECURSE
When a recursive subpattern is processed, the matching function calls When a recursive subpattern is processed, the matching function calls
itself recursively, using private memory for the ovector and workspace. itself recursively, using private memory for the ovector and workspace.
This error is given if the internal ovector is not large enough. This This error is given if the internal ovector is not large enough. This
should be extremely rare, as a vector of size 1000 is used. should be extremely rare, as a vector of size 1000 is used.
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
some plausibility checks are made on the contents of the workspace, some plausibility checks are made on the contents of the workspace,
which should contain data about the previous partial match. If any of which should contain data about the previous partial match. If any of
these checks fail, this error is given. these checks fail, this error is given.
SEE ALSO SEE ALSO
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3). pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
@ -3509,7 +3512,7 @@ AUTHOR
REVISION REVISION
Last updated: 30 June 2018 Last updated: 02 July 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -6664,9 +6667,9 @@ BACKSLASH
Resetting the match start Resetting the match start
The escape sequence \K causes any previously matched characters not to In normal use, the escape sequence \K causes any previously matched
be included in the final matched sequence that is returned. For exam- characters not to be included in the final matched sequence that is
ple, the pattern: returned. For example, the pattern:
foo\Kbar foo\Kbar
@ -6692,7 +6695,15 @@ BACKSLASH
assertions, but is ignored in negative assertions. Note that when a assertions, but is ignored in negative assertions. Note that when a
pattern such as (?=ab\K) matches, the reported start of the match can pattern such as (?=ab\K) matches, the reported start of the match can
be greater than the end of the match. Using \K in a lookbehind asser- be greater than the end of the match. Using \K in a lookbehind asser-
tion at the start of a pattern can also lead to odd effects. tion at the start of a pattern can also lead to odd effects. For exam-
ple, consider this pattern:
(?<=\Kfoo)bar
If the subject is "foobar", a call to pcre2_match() with a starting
offset of 3 succeeds and reports the matching string as "foobar", that
is, the start of the reported match is earlier than where the match
started.
Simple assertions Simple assertions
@ -8930,7 +8941,7 @@ AUTHOR
REVISION REVISION
Last updated: 28 June 2018 Last updated: 30 June 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2API 3 "30 June 2018" "PCRE2 10.32" .TH PCRE2API 3 "02 July 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -3163,7 +3163,10 @@ string in \fIoutputbuffer\fP, replacing the part that was matched with the
\fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This can \fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This can
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \eK item in a lookahead in the pattern causes the match to end before which a \eK item in a lookahead in the pattern causes the match to end before
it starts are not supported, and give rise to an error return. it starts are not supported, and give rise to an error return. For global
replacements, matches in which \eK in a lookbehind causes the match to start
earlier than the point that was reached in the previous iteration are also not
supported.
.P .P
The first seven arguments of \fBpcre2_substitute()\fP are the same as for The first seven arguments of \fBpcre2_substitute()\fP are the same as for
\fBpcre2_match()\fP, except that the partial matching options are not \fBpcre2_match()\fP, except that the partial matching options are not
@ -3637,6 +3640,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 30 June 2018 Last updated: 02 July 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1110,7 +1110,7 @@ matches, the reported start of the match can be greater than the end of the
match. Using \eK in a lookbehind assertion at the start of a pattern can also match. Using \eK in a lookbehind assertion at the start of a pattern can also
lead to odd effects. For example, consider this pattern: lead to odd effects. For example, consider this pattern:
.sp .sp
(?<=\Kfoo)bar (?<=\eKfoo)bar
.sp .sp
If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the offset of 3 succeeds and reports the matching string as "foobar", that is, the


@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be /* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions. #included by applications that call PCRE2 functions.
Copyright (c) 2016-2017 University of Cambridge Copyright (c) 2016-2018 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -399,6 +399,7 @@ released, the numbers must not be changed. */
#define PCRE2_ERROR_BADSERIALIZEDDATA (-62) #define PCRE2_ERROR_BADSERIALIZEDDATA (-62)
#define PCRE2_ERROR_HEAPLIMIT (-63) #define PCRE2_ERROR_HEAPLIMIT (-63)
#define PCRE2_ERROR_CONVERT_SYNTAX (-64) #define PCRE2_ERROR_CONVERT_SYNTAX (-64)
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
/* Request types for pcre2_pattern_info() */ /* Request types for pcre2_pattern_info() */


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2017 University of Cambridge New API code Copyright (c) 2016-2018 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -260,6 +260,8 @@ static const unsigned char match_error_texts[] =
"bad serialized data\0" "bad serialized data\0"
"heap limit exceeded\0" "heap limit exceeded\0"
"invalid syntax\0" "invalid syntax\0"
/* 65 */
"internal error - duplicate substitution match\0"
; ;
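The error texts above are held as one long string in which each message ends with \0, so message number n is found by skipping n terminators. The following is a minimal sketch of that kind of lookup, not the library's own code: the names toy_texts and toy_lookup and the three-entry table are hypothetical. It shows why an unused number falls off the end of the table and has to be reported as bad data, which is why the test script now probes -70 rather than -65.

```c
#include <stddef.h>

/* Hypothetical table: messages are concatenated, each ended by \0. The
implicit \0 that the compiler adds after the last one marks the end. */
static const char toy_texts[] =
  "no error\0"
  "no match\0"
  "partial match\0"
  ;

/* Return message number n (counting from 0, n must not be negative), or NULL
if n is past the end of the table. */
static const char *toy_lookup(const char *table, int n)
{
const char *p = table;
for (; n > 0; n--)
  {
  while (*p++ != '\0') {}        /* Skip one message and its terminator */
  if (*p == '\0') return NULL;   /* Hit the empty string that ends the table */
  }
return p;
}
```

With this table, toy_lookup(toy_texts, 2) points at "partial match", and toy_lookup(toy_texts, 3) is NULL because only three messages exist.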


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016 University of Cambridge New API code Copyright (c) 2016-2018 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -238,10 +238,12 @@ PCRE2_SPTR repend;
PCRE2_SIZE extra_needed = 0; PCRE2_SIZE extra_needed = 0;
PCRE2_SIZE buff_offset, buff_length, lengthleft, fraglength; PCRE2_SIZE buff_offset, buff_length, lengthleft, fraglength;
PCRE2_SIZE *ovector; PCRE2_SIZE *ovector;
PCRE2_SIZE ovecsave[3];
buff_offset = 0; buff_offset = 0;
lengthleft = buff_length = *blength; lengthleft = buff_length = *blength;
*blength = PCRE2_UNSET; *blength = PCRE2_UNSET;
ovecsave[0] = ovecsave[1] = ovecsave[2] = PCRE2_UNSET;
/* Partial matching is not valid. */ /* Partial matching is not valid. */
@ -369,6 +371,26 @@ do
goto EXIT; goto EXIT;
} }
/* Check for the same match as previous. This is legitimate after matching an
empty string that starts after the initial match offset. We have tried again
at the match point in case the pattern is one like /(?<=\G.)/ which can never
match at its starting point, so running the match achieves the bumpalong. If
we do get the same (null) match at the original match point, it isn't such a
pattern, so we now do the empty string magic. In all other cases, a repeat
match should never occur. */
if (ovecsave[0] == ovector[0] && ovecsave[1] == ovector[1])
{
if (ovector[0] == ovector[1] && ovecsave[2] != start_offset)
{
goptions = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
ovecsave[2] = start_offset;
continue; /* Back to the top of the loop */
}
rc = PCRE2_ERROR_INTERNAL_DUPMATCH;
goto EXIT;
}
/* Count substitutions with a paranoid check for integer overflow; surely no /* Count substitutions with a paranoid check for integer overflow; surely no
real call to this function would ever hit this! */ real call to this function would ever hit this! */
@ -799,13 +821,18 @@ do
} /* End handling a literal code unit */ } /* End handling a literal code unit */
} /* End of loop for scanning the replacement. */ } /* End of loop for scanning the replacement. */
/* The replacement has been copied to the output. Update the start offset to /* The replacement has been copied to the output. Save the details of this
point to the rest of the subject string. If we matched an empty string, match. See above for how this data is used. If we matched an empty string, do
do the magic for global matches. */ the magic for global matches. Finally, update the start offset to point to
the rest of the subject string. */
start_offset = ovector[1]; ovecsave[0] = ovector[0];
goptions = (ovector[0] != ovector[1])? 0 : ovecsave[1] = ovector[1];
ovecsave[2] = start_offset;
goptions = (ovector[0] != ovector[1] || ovector[0] > start_offset)? 0 :
PCRE2_ANCHORED|PCRE2_NOTEMPTY_ATSTART; PCRE2_ANCHORED|PCRE2_NOTEMPTY_ATSTART;
start_offset = ovector[1];
} while ((suboptions & PCRE2_SUBSTITUTE_GLOBAL) != 0); /* Repeat "do" loop */ } while ((suboptions & PCRE2_SUBSTITUTE_GLOBAL) != 0); /* Repeat "do" loop */
/* Copy the rest of the subject. */ /* Copy the rest of the subject. */
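The control flow added in the hunks above can be tried out without the library. The sketch below is an illustration under stated assumptions, not the code of pcre2_substitute(): toy_match() is a hypothetical stand-in for pcre2_match() that only handles patterns matching the empty string, the two predicates model /(?<=\G.)/ and /(?<=b)/, and the TOY_ names stand in for the real option bits. The loop itself follows the same save, compare and retry steps as the patch.

```c
#include <stddef.h>
#include <string.h>

#define TOY_ANCHORED         0x1u  /* stands in for PCRE2_ANCHORED */
#define TOY_NOTEMPTY_ATSTART 0x2u  /* stands in for PCRE2_NOTEMPTY_ATSTART */
#define TOY_UNSET            ((size_t)-1)

/* Says whether an empty match is possible at p for a match begun at start. */
typedef int (*toy_pred)(const char *s, size_t start, size_t p);

/* Models /(?<=\G.)/: only one character after the starting offset. */
static int after_g_dot(const char *s, size_t start, size_t p)
{
(void)s;
return p == start + 1;
}

/* Models /(?<=b)/: any position preceded by 'b'. */
static int after_b(const char *s, size_t start, size_t p)
{
(void)start;
return p > 0 && s[p - 1] == 'b';
}

/* Toy matcher for empty-string-only patterns: 1 on success, -1 on no match. */
static int toy_match(toy_pred pred, const char *s, size_t len, size_t start,
  unsigned opts, size_t ov[2])
{
size_t p;
for (p = start; p <= len; p++)
  {
  if (p > start && (opts & TOY_ANCHORED) != 0) break;
  if (p == start && (opts & TOY_NOTEMPTY_ATSTART) != 0) continue;
  if (pred(s, start, p)) { ov[0] = ov[1] = p; return 1; }
  }
return -1;
}

/* Global replace. Returns the number of substitutions, or -1 for a repeated
match that should never happen. The caller must make out large enough. */
static int toy_substitute_global(toy_pred pred, const char *s,
  const char *rep, char *out)
{
size_t len = strlen(s), rlen = strlen(rep);
size_t start = 0, n = 0, ov[2];
size_t save[3] = { TOY_UNSET, TOY_UNSET, TOY_UNSET };
unsigned gopts = 0;
int count = 0;

for (;;)
  {
  if (toy_match(pred, s, len, start, gopts, ov) < 0)
    {
    if (gopts == 0 || start >= len) break;  /* No more matches */
    out[n++] = s[start++];                  /* Failed retry: advance by one */
    gopts = 0;
    continue;
    }

  /* Same empty match as last time, but from a new starting offset: discard
  it and retry at this point with the empty string excluded. */
  if (save[0] == ov[0] && save[1] == ov[1])
    {
    if (ov[0] != ov[1] || save[2] == start) return -1;
    gopts = TOY_NOTEMPTY_ATSTART | TOY_ANCHORED;
    save[2] = start;
    continue;
    }

  memcpy(out + n, s + start, ov[0] - start);  /* Text before the match */
  n += ov[0] - start;
  memcpy(out + n, rep, rlen);                 /* The replacement */
  n += rlen;
  count++;

  save[0] = ov[0]; save[1] = ov[1]; save[2] = start;
  gopts = (ov[0] != ov[1] || ov[0] > start)? 0 :
    (TOY_ANCHORED | TOY_NOTEMPTY_ATSTART);
  start = ov[1];
  }

memcpy(out + n, s + start, len - start);      /* Rest of the subject */
out[n + len - start] = '\0';
return count;
}
```

For the subject "abc" and replacement "+", the after_g_dot model makes three substitutions and yields "a+b+c+", agreeing with the new entry in the testinput2 output. The after_b model takes the retry path: the empty match at offset 2 is found a second time, is discarded, and the result is "ab+c".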


@ -6302,6 +6302,7 @@ size_t needlen;
void *use_dat_context; void *use_dat_context;
BOOL utf; BOOL utf;
BOOL subject_literal; BOOL subject_literal;
PCRE2_SIZE ovecsave[3];
#ifdef SUPPORT_PCRE2_8 #ifdef SUPPORT_PCRE2_8
uint8_t *q8 = NULL; uint8_t *q8 = NULL;
@ -6949,6 +6950,9 @@ if (dat_datctl.replacement[0] != 0)
if (timeitm) if (timeitm)
fprintf(outfile, "** Timing is not supported with replace: ignored\n"); fprintf(outfile, "** Timing is not supported with replace: ignored\n");
if ((dat_datctl.control & CTL_ALTGLOBAL) != 0)
fprintf(outfile, "** Altglobal is not supported with replace: ignored\n");
xoptions = (((dat_datctl.control & CTL_GLOBAL) == 0)? 0 : xoptions = (((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
PCRE2_SUBSTITUTE_GLOBAL) | PCRE2_SUBSTITUTE_GLOBAL) |
(((dat_datctl.control2 & CTL2_SUBSTITUTE_EXTENDED) == 0)? 0 : (((dat_datctl.control2 & CTL2_SUBSTITUTE_EXTENDED) == 0)? 0 :
@ -7067,35 +7071,24 @@ if (dat_datctl.replacement[0] != 0)
} }
fprintf(outfile, "\n"); fprintf(outfile, "\n");
show_memory = FALSE;
return PR_OK;
} /* End of substitution handling */ } /* End of substitution handling */
/* When a replacement string is not provided, run a loop for global matching /* When a replacement string is not provided, run a loop for global matching
with one of the basic matching functions. */ with one of the basic matching functions. For altglobal (or first time round
the loop), set an "unset" value for the previous match info. */
else for (gmatched = 0;; gmatched++) ovecsave[0] = ovecsave[1] = ovecsave[2] = PCRE2_UNSET;
for (gmatched = 0;; gmatched++)
{ {
PCRE2_SIZE j; PCRE2_SIZE j;
int capcount; int capcount;
PCRE2_SIZE *ovector; PCRE2_SIZE *ovector;
PCRE2_SIZE ovecsave[2];
ovector = FLD(match_data, ovector); ovector = FLD(match_data, ovector);
/* After the first time round a global loop, for a normal global (/g)
iteration, save the current ovector[0,1] so that we can check that they do
change each time. Otherwise a matching bug that returns the same string
causes an infinite loop. It has happened! */
if (gmatched > 0 && (dat_datctl.control & CTL_GLOBAL) != 0)
{
ovecsave[0] = ovector[0];
ovecsave[1] = ovector[1];
}
/* For altglobal (or first time round the loop), set an "unset" value. */
else ovecsave[0] = ovecsave[1] = PCRE2_UNSET;
/* Fill the ovector with junk to detect elements that do not get set /* Fill the ovector with junk to detect elements that do not get set
when they should be. */ when they should be. */
@ -7266,12 +7259,23 @@ else for (gmatched = 0;; gmatched++)
} }
/* If this is not the first time round a global loop, check that the /* If this is not the first time round a global loop, check that the
returned string has changed. If not, there is a bug somewhere and we must returned string has changed. If it has not, check for an empty string match
break the loop because it will go on for ever. We know that there are at a different starting offset from the previous match. This is a failed test
always at least two elements in the ovector. */ retry for null-matching patterns that don't match at their starting offset,
for example /(?<=\G.)/. A repeated match at the same point is not such a
pattern, and must be discarded, and we then proceed to seek a non-null
match at the current point. For any other repeated match, there is a bug
somewhere and we must break the loop because it will go on for ever. We
know that there are always at least two elements in the ovector. */
if (gmatched > 0 && ovecsave[0] == ovector[0] && ovecsave[1] == ovector[1]) if (gmatched > 0 && ovecsave[0] == ovector[0] && ovecsave[1] == ovector[1])
{ {
if (ovector[0] == ovector[1] && ovecsave[2] != dat_datctl.offset)
{
g_notempty = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
ovecsave[2] = dat_datctl.offset;
continue; /* Back to the top of the loop */
}
fprintf(outfile, fprintf(outfile,
"** PCRE2 error: global repeat returned the same string as previous\n"); "** PCRE2 error: global repeat returned the same string as previous\n");
fprintf(outfile, "** Global loop abandoned\n"); fprintf(outfile, "** Global loop abandoned\n");
@ -7579,6 +7583,7 @@ else for (gmatched = 0;; gmatched++)
if ((dat_datctl.control & CTL_ANYGLOB) == 0) break; else if ((dat_datctl.control & CTL_ANYGLOB) == 0) break; else
{ {
PCRE2_SIZE match_offset = FLD(match_data, ovector)[0];
PCRE2_SIZE end_offset = FLD(match_data, ovector)[1]; PCRE2_SIZE end_offset = FLD(match_data, ovector)[1];
/* We must now set up for the next iteration of a global search. If we have /* We must now set up for the next iteration of a global search. If we have
@ -7586,12 +7591,19 @@ else for (gmatched = 0;; gmatched++)
subject. If so, the loop is over. Otherwise, mimic what Perl's /g option subject. If so, the loop is over. Otherwise, mimic what Perl's /g option
does. Set PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED and try the match again does. Set PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED and try the match again
at the same point. If this fails it will be picked up above, where a fake at the same point. If this fails it will be picked up above, where a fake
match is set up so that at this point we advance to the next character. */ match is set up so that at this point we advance to the next character.
if (FLD(match_data, ovector)[0] == end_offset) However, in order to cope with patterns that never match at their starting
offset (e.g. /(?<=\G.)/) we don't do this when the match offset is greater
than the starting offset. This means there will be a retry with the
starting offset at the match offset. If this returns the same match again,
it is picked up above and ignored, and the special action is then taken. */
if (match_offset == end_offset)
{ {
if (end_offset == ulen) break; /* End of subject */ if (end_offset == ulen) break; /* End of subject */
g_notempty = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED; if (match_offset <= dat_datctl.offset)
g_notempty = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
} }
/* However, even after matching a non-empty string, there is still one /* However, even after matching a non-empty string, there is still one
@ -7629,10 +7641,19 @@ else for (gmatched = 0;; gmatched++)
} }
} }
/* For /g (global), update the start offset, leaving the rest alone. */ /* For a normal global (/g) iteration, save the current ovector[0,1] and
the starting offset so that we can check that they do change each time.
Otherwise a matching bug that returns the same string causes an infinite
loop. It has happened! Then update the start offset, leaving other
parameters alone. */
if ((dat_datctl.control & CTL_GLOBAL) != 0) if ((dat_datctl.control & CTL_GLOBAL) != 0)
{
ovecsave[0] = ovector[0];
ovecsave[1] = ovector[1];
ovecsave[2] = dat_datctl.offset;
dat_datctl.offset = end_offset; dat_datctl.offset = end_offset;
}
/* For altglobal, just update the pointer and length. */ /* For altglobal, just update the pointer and length. */

testdata/testinput1

@ -6189,4 +6189,7 @@ ef) x/x,mark
/(?=a+)a(a+)++b/ /(?=a+)a(a+)++b/
aab aab
/(?<=\G.)/g,aftertext
abc
# End of testinput1 # End of testinput1

testdata/testinput2

@ -4938,6 +4938,9 @@ a)"xI
//replace=0 //replace=0
\=offset=7 \=offset=7
/(?<=\G.)/g,replace=+
abc
".+\QX\E+"B,no_auto_possess ".+\QX\E+"B,no_auto_possess
".+\QX\E+"B,auto_callout,no_auto_possess ".+\QX\E+"B,auto_callout,no_auto_possess


@ -9822,4 +9822,13 @@ No match
0: aab 0: aab
1: a 1: a
/(?<=\G.)/g,aftertext
abc
0:
0+ bc
0:
0+ c
0:
0+
# End of testinput1 # End of testinput1


@ -15549,6 +15549,10 @@ Failed: error -57 at offset 2 in replacement: bad escape sequence in replacement
\=offset=7 \=offset=7
Failed: error -33: bad offset value Failed: error -33: bad offset value
/(?<=\G.)/g,replace=+
abc
3: a+b+c+
".+\QX\E+"B,no_auto_possess ".+\QX\E+"B,no_auto_possess
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra
@ -16580,7 +16584,7 @@ No match
------------------------------------------------------------------ ------------------------------------------------------------------
# End of testinput2 # End of testinput2
Error -65: PCRE2_ERROR_BADDATA (unknown error number) Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data Error -62: bad serialized data
Error -2: partial match Error -2: partial match
Error -1: no match Error -1: no match