Fix global search/replace in pcre2test and pcre2_substitute() when the pattern
matches an empty string, but never at the starting offset.
Philip.Hazel 2018-07-02 10:54:03 +00:00
parent 462f25d7d3
commit 1c79bdf36f
15 changed files with 333 additions and 229 deletions


@ -90,6 +90,17 @@ standard systems:
when linking pcre2test with MSVC. This gets rid of a stack overflow error in
the standard set of tests.
20. Output a warning in pcre2test when ignoring the "altglobal" modifier when
it is given with the "replace" modifier.
21. In both pcre2test and pcre2_substitute(), with global matching, a pattern
that matched an empty string, but never at the starting match offset, was not
handled in a Perl-compatible way. The pattern /(?<=\G.)/ is an example of such
a pattern. Because \G is in a lookbehind assertion, there has to be a
"bumpalong" before there can be a match. The automatic "advance by one
character after an empty string match" rule is therefore inappropriate. A more
complicated algorithm has now been implemented.
Version 10.31 12-February-2018
------------------------------
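
The retry rule described in item 21 above can be sketched in isolation. The following is a minimal, hypothetical model rather than the library code: `toy_substitute_global`, `toy_matcher`, and the two toy matchers are names invented here, the matchers stand in for pcre2_match() on /(?<=\G.)/ and /(?=c)/, and the two flag macros are local stand-ins for PCRE2_ANCHORED and PCRE2_NOTEMPTY_ATSTART. It assumes a literal replacement string and single-byte characters, and it ignores CRLF and UTF handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNSET ((size_t)-1)
#define ANCHORED 1u          /* stand-in for PCRE2_ANCHORED */
#define NOTEMPTY_ATSTART 2u  /* stand-in for PCRE2_NOTEMPTY_ATSTART */

/* Hypothetical matcher: returns 1 and fills ov[0], ov[1] on a match, else 0. */
typedef int (*toy_matcher)(const char *s, size_t len, size_t start,
                           unsigned opts, size_t ov[2]);

/* Toy /(?<=\G.)/: the only possible match is empty, one character past start,
   so it can never succeed when anchored at the starting offset. */
static int match_lookbehind_g(const char *s, size_t len, size_t start,
                              unsigned opts, size_t ov[2])
{
  (void)s;
  if ((opts & ANCHORED) != 0 || start + 1 > len) return 0;
  ov[0] = ov[1] = start + 1;
  return 1;
}

/* Toy /(?=c)/: an empty match just before the next 'c' at or after start. */
static int match_before_c(const char *s, size_t len, size_t start,
                          unsigned opts, size_t ov[2])
{
  for (size_t p = start; p < len; p++)
  {
    if (s[p] == 'c' && !((opts & NOTEMPTY_ATSTART) != 0 && p == start))
    {
      ov[0] = ov[1] = p;
      return 1;
    }
    if ((opts & ANCHORED) != 0) break;  /* anchored: try only at start */
  }
  return 0;
}

/* Global replace. Returns the substitution count, or -1 on a true duplicate
   match or when the output buffer is too small. */
static int toy_substitute_global(toy_matcher m, const char *s, const char *rep,
                                 char *out, size_t outsize)
{
  size_t len = strlen(s), rlen = strlen(rep), start = 0, used = 0;
  size_t ov[2], save[3] = { UNSET, UNSET, UNSET };
  unsigned gopts = 0;
  int subs = 0;

  for (;;)
  {
    if (!m(s, len, start, gopts, ov))
    {
      if (gopts == 0 || start >= len) break;  /* genuinely no more matches */
      if (used + 1 >= outsize) return -1;
      out[used++] = s[start++];  /* anchored retry failed: advance one char */
      gopts = 0;
      continue;
    }
    assert(ov[0] >= start && ov[1] >= ov[0]);  /* \K oddities not modelled */

    if (save[0] == ov[0] && save[1] == ov[1])
    {
      /* Same match as last time. Legitimate only for an empty match found
         from a new starting offset: discard it and retry non-empty here. */
      if (ov[0] != ov[1] || save[2] == start) return -1;
      gopts = NOTEMPTY_ATSTART | ANCHORED;
      save[2] = start;
      continue;
    }

    if (used + (ov[0] - start) + rlen >= outsize) return -1;
    memcpy(out + used, s + start, ov[0] - start);
    used += ov[0] - start;
    memcpy(out + used, rep, rlen);
    used += rlen;
    subs++;

    /* Save this match, then apply the empty-string rule only when the empty
       match was found at the starting offset itself. */
    save[0] = ov[0]; save[1] = ov[1]; save[2] = start;
    gopts = (ov[0] != ov[1] || ov[0] > start) ? 0
                                              : (ANCHORED | NOTEMPTY_ATSTART);
    start = ov[1];
  }

  if (used + (len - start) >= outsize) return -1;
  memcpy(out + used, s + start, len - start);
  out[used + (len - start)] = '\0';
  return subs;
}
```

Calling `toy_substitute_global(match_lookbehind_g, "abc", "+", out, sizeof out)` with a sufficiently large `out` yields 3 substitutions and "a+b+c+", which agrees with the new testoutput2 entry in this commit. With `match_before_c` on the same subject, the duplicate-match branch is exercised and the result is "ab+c".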


@ -500,7 +500,7 @@ for bmode in "$test8" "$test16" "$test32"; do
for opt in "" $jitopt; do
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $setstack $bmode $opt $testdata/testinput2 testtry
if [ $? = 0 ] ; then
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -65,-62,-2,-1,0,100,101,191,200 >>testtry
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -70,-62,-2,-1,0,100,101,191,200 >>testtry
checkresult $? 2 "$opt"
fi
done


@ -3154,7 +3154,10 @@ string in <i>outputbuffer</i>, replacing the part that was matched with the
<i>replacement</i> string, whose length is supplied in <b>rlength</b>. This can
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \K item in a lookahead in the pattern causes the match to end before
it starts are not supported, and give rise to an error return.
it starts are not supported, and give rise to an error return. For global
replacements, matches in which \K in a lookbehind causes the match to start
earlier than the point that was reached in the previous iteration are also not
supported.
</P>
<P>
The first seven arguments of <b>pcre2_substitute()</b> are the same as for
@ -3631,7 +3634,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 30 June 2018
Last updated: 02 July 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>


@ -1084,9 +1084,9 @@ sequences but the characters that they represent.)
Resetting the match start
</b><br>
<P>
The escape sequence \K causes any previously matched characters not to be
included in the final matched sequence that is returned. For example, the
pattern:
In normal use, the escape sequence \K causes any previously matched characters
not to be included in the final matched sequence that is returned. For example,
the pattern:
<pre>
foo\Kbar
</pre>
@ -1115,7 +1115,13 @@ PCRE2, \K is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\K)
matches, the reported start of the match can be greater than the end of the
match. Using \K in a lookbehind assertion at the start of a pattern can also
lead to odd effects.
lead to odd effects. For example, consider this pattern:
<pre>
(?&#60;=\Kfoo)bar
</pre>
If the subject is "foobar", a call to <b>pcre2_match()</b> with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
<a name="smallassertions"></a></P>
<br><b>
Simple assertions
@ -3484,7 +3490,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 June 2018
Last updated: 30 June 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>


@ -3060,6 +3060,9 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \K item in a lookahead in the pattern causes the match to end
before it starts are not supported, and give rise to an error return.
For global replacements, matches in which \K in a lookbehind causes the
match to start earlier than the point that was reached in the previous
iteration are also not supported.
The first seven arguments of pcre2_substitute() are the same as for
pcre2_match(), except that the partial matching options are not permit-
@ -3509,7 +3512,7 @@ AUTHOR
REVISION
Last updated: 30 June 2018
Last updated: 02 July 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
@ -6664,9 +6667,9 @@ BACKSLASH
Resetting the match start
The escape sequence \K causes any previously matched characters not to
be included in the final matched sequence that is returned. For exam-
ple, the pattern:
In normal use, the escape sequence \K causes any previously matched
characters not to be included in the final matched sequence that is
returned. For example, the pattern:
foo\Kbar
@ -6692,7 +6695,15 @@ BACKSLASH
assertions, but is ignored in negative assertions. Note that when a
pattern such as (?=ab\K) matches, the reported start of the match can
be greater than the end of the match. Using \K in a lookbehind asser-
tion at the start of a pattern can also lead to odd effects.
tion at the start of a pattern can also lead to odd effects. For exam-
ple, consider this pattern:
(?<=\Kfoo)bar
If the subject is "foobar", a call to pcre2_match() with a starting
offset of 3 succeeds and reports the matching string as "foobar", that
is, the start of the reported match is earlier than where the match
started.
Simple assertions
@ -8930,7 +8941,7 @@ AUTHOR
REVISION
Last updated: 28 June 2018
Last updated: 30 June 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2API 3 "30 June 2018" "PCRE2 10.32"
.TH PCRE2API 3 "02 July 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -3163,7 +3163,10 @@ string in \fIoutputbuffer\fP, replacing the part that was matched with the
\fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This can
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \eK item in a lookahead in the pattern causes the match to end before
it starts are not supported, and give rise to an error return.
it starts are not supported, and give rise to an error return. For global
replacements, matches in which \eK in a lookbehind causes the match to start
earlier than the point that was reached in the previous iteration are also not
supported.
.P
The first seven arguments of \fBpcre2_substitute()\fP are the same as for
\fBpcre2_match()\fP, except that the partial matching options are not
@ -3637,6 +3640,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 30 June 2018
Last updated: 02 July 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi


@ -1110,7 +1110,7 @@ matches, the reported start of the match can be greater than the end of the
match. Using \eK in a lookbehind assertion at the start of a pattern can also
lead to odd effects. For example, consider this pattern:
.sp
(?<=\Kfoo)bar
(?<=\eKfoo)bar
.sp
If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the


@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions.
Copyright (c) 2016-2017 University of Cambridge
Copyright (c) 2016-2018 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -399,6 +399,7 @@ released, the numbers must not be changed. */
#define PCRE2_ERROR_BADSERIALIZEDDATA (-62)
#define PCRE2_ERROR_HEAPLIMIT (-63)
#define PCRE2_ERROR_CONVERT_SYNTAX (-64)
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
/* Request types for pcre2_pattern_info() */


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2017 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -260,6 +260,8 @@ static const unsigned char match_error_texts[] =
"bad serialized data\0"
"heap limit exceeded\0"
"invalid syntax\0"
/* 65 */
"internal error - duplicate substitution match\0"
;


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -238,10 +238,12 @@ PCRE2_SPTR repend;
PCRE2_SIZE extra_needed = 0;
PCRE2_SIZE buff_offset, buff_length, lengthleft, fraglength;
PCRE2_SIZE *ovector;
PCRE2_SIZE ovecsave[3];
buff_offset = 0;
lengthleft = buff_length = *blength;
*blength = PCRE2_UNSET;
ovecsave[0] = ovecsave[1] = ovecsave[2] = PCRE2_UNSET;
/* Partial matching is not valid. */
@ -369,6 +371,26 @@ do
goto EXIT;
}
/* Check for the same match as previous. This is legitimate after matching an
empty string that starts after the initial match offset. We have tried again
at the match point in case the pattern is one like /(?<=\G.)/ which can never
match at its starting point, so running the match achieves the bumpalong. If
we do get the same (null) match at the original match point, it isn't such a
pattern, so we now do the empty string magic. In all other cases, a repeat
match should never occur. */
if (ovecsave[0] == ovector[0] && ovecsave[1] == ovector[1])
{
if (ovector[0] == ovector[1] && ovecsave[2] != start_offset)
{
goptions = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
ovecsave[2] = start_offset;
continue; /* Back to the top of the loop */
}
rc = PCRE2_ERROR_INTERNAL_DUPMATCH;
goto EXIT;
}
/* Count substitutions with a paranoid check for integer overflow; surely no
real call to this function would ever hit this! */
@ -799,13 +821,18 @@ do
} /* End handling a literal code unit */
} /* End of loop for scanning the replacement. */
/* The replacement has been copied to the output. Update the start offset to
point to the rest of the subject string. If we matched an empty string,
do the magic for global matches. */
/* The replacement has been copied to the output. Save the details of this
match. See above for how this data is used. If we matched an empty string, do
the magic for global matches. Finally, update the start offset to point to
the rest of the subject string. */
start_offset = ovector[1];
goptions = (ovector[0] != ovector[1])? 0 :
ovecsave[0] = ovector[0];
ovecsave[1] = ovector[1];
ovecsave[2] = start_offset;
goptions = (ovector[0] != ovector[1] || ovector[0] > start_offset)? 0 :
PCRE2_ANCHORED|PCRE2_NOTEMPTY_ATSTART;
start_offset = ovector[1];
} while ((suboptions & PCRE2_SUBSTITUTE_GLOBAL) != 0); /* Repeat "do" loop */
/* Copy the rest of the subject. */


@ -6302,6 +6302,7 @@ size_t needlen;
void *use_dat_context;
BOOL utf;
BOOL subject_literal;
PCRE2_SIZE ovecsave[3];
#ifdef SUPPORT_PCRE2_8
uint8_t *q8 = NULL;
@ -6949,6 +6950,9 @@ if (dat_datctl.replacement[0] != 0)
if (timeitm)
fprintf(outfile, "** Timing is not supported with replace: ignored\n");
if ((dat_datctl.control & CTL_ALTGLOBAL) != 0)
fprintf(outfile, "** Altglobal is not supported with replace: ignored\n");
xoptions = (((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
PCRE2_SUBSTITUTE_GLOBAL) |
(((dat_datctl.control2 & CTL2_SUBSTITUTE_EXTENDED) == 0)? 0 :
@ -7067,35 +7071,24 @@ if (dat_datctl.replacement[0] != 0)
}
fprintf(outfile, "\n");
show_memory = FALSE;
return PR_OK;
} /* End of substitution handling */
/* When a replacement string is not provided, run a loop for global matching
with one of the basic matching functions. */
with one of the basic matching functions. For altglobal (or first time round
the loop), set an "unset" value for the previous match info. */
else for (gmatched = 0;; gmatched++)
ovecsave[0] = ovecsave[1] = ovecsave[2] = PCRE2_UNSET;
for (gmatched = 0;; gmatched++)
{
PCRE2_SIZE j;
int capcount;
PCRE2_SIZE *ovector;
PCRE2_SIZE ovecsave[2];
ovector = FLD(match_data, ovector);
/* After the first time round a global loop, for a normal global (/g)
iteration, save the current ovector[0,1] so that we can check that they do
change each time. Otherwise a matching bug that returns the same string
causes an infinite loop. It has happened! */
if (gmatched > 0 && (dat_datctl.control & CTL_GLOBAL) != 0)
{
ovecsave[0] = ovector[0];
ovecsave[1] = ovector[1];
}
/* For altglobal (or first time round the loop), set an "unset" value. */
else ovecsave[0] = ovecsave[1] = PCRE2_UNSET;
/* Fill the ovector with junk to detect elements that do not get set
when they should be. */
@ -7266,12 +7259,23 @@ else for (gmatched = 0;; gmatched++)
}
/* If this is not the first time round a global loop, check that the
returned string has changed. If not, there is a bug somewhere and we must
break the loop because it will go on for ever. We know that there are
always at least two elements in the ovector. */
returned string has changed. If it has not, check for an empty string match
at a different starting offset from the previous match. This is a failed test
retry for null-matching patterns that don't match at their starting offset,
for example /(?<=\G.)/. A repeated match at the same point is not such a
pattern, and must be discarded, and we then proceed to seek a non-null
match at the current point. For any other repeated match, there is a bug
somewhere and we must break the loop because it will go on for ever. We
know that there are always at least two elements in the ovector. */
if (gmatched > 0 && ovecsave[0] == ovector[0] && ovecsave[1] == ovector[1])
{
if (ovector[0] == ovector[1] && ovecsave[2] != dat_datctl.offset)
{
g_notempty = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
ovecsave[2] = dat_datctl.offset;
continue; /* Back to the top of the loop */
}
fprintf(outfile,
"** PCRE2 error: global repeat returned the same string as previous\n");
fprintf(outfile, "** Global loop abandoned\n");
@ -7579,6 +7583,7 @@ else for (gmatched = 0;; gmatched++)
if ((dat_datctl.control & CTL_ANYGLOB) == 0) break; else
{
PCRE2_SIZE match_offset = FLD(match_data, ovector)[0];
PCRE2_SIZE end_offset = FLD(match_data, ovector)[1];
/* We must now set up for the next iteration of a global search. If we have
@ -7586,11 +7591,18 @@ else for (gmatched = 0;; gmatched++)
subject. If so, the loop is over. Otherwise, mimic what Perl's /g option
does. Set PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED and try the match again
at the same point. If this fails it will be picked up above, where a fake
match is set up so that at this point we advance to the next character. */
match is set up so that at this point we advance to the next character.
if (FLD(match_data, ovector)[0] == end_offset)
However, in order to cope with patterns that never match at their starting
offset (e.g. /(?<=\G.)/) we don't do this when the match offset is greater
than the starting offset. This means there will be a retry with the
starting offset at the match offset. If this returns the same match again,
it is picked up above and ignored, and the special action is then taken. */
if (match_offset == end_offset)
{
if (end_offset == ulen) break; /* End of subject */
if (match_offset <= dat_datctl.offset)
g_notempty = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
}
@ -7629,10 +7641,19 @@ else for (gmatched = 0;; gmatched++)
}
}
/* For /g (global), update the start offset, leaving the rest alone. */
/* For a normal global (/g) iteration, save the current ovector[0,1] and
the starting offset so that we can check that they do change each time.
Otherwise a matching bug that returns the same string causes an infinite
loop. It has happened! Then update the start offset, leaving other
parameters alone. */
if ((dat_datctl.control & CTL_GLOBAL) != 0)
{
ovecsave[0] = ovector[0];
ovecsave[1] = ovector[1];
ovecsave[2] = dat_datctl.offset;
dat_datctl.offset = end_offset;
}
/* For altglobal, just update the pointer and length. */

testdata/testinput1

@ -6189,4 +6189,7 @@ ef) x/x,mark
/(?=a+)a(a+)++b/
aab
/(?<=\G.)/g,aftertext
abc
# End of testinput1

testdata/testinput2

@ -4938,6 +4938,9 @@ a)"xI
//replace=0
\=offset=7
/(?<=\G.)/g,replace=+
abc
".+\QX\E+"B,no_auto_possess
".+\QX\E+"B,auto_callout,no_auto_possess


@ -9822,4 +9822,13 @@ No match
0: aab
1: a
/(?<=\G.)/g,aftertext
abc
0:
0+ bc
0:
0+ c
0:
0+
# End of testinput1


@ -15549,6 +15549,10 @@ Failed: error -57 at offset 2 in replacement: bad escape sequence in replacement
\=offset=7
Failed: error -33: bad offset value
/(?<=\G.)/g,replace=+
abc
3: a+b+c+
".+\QX\E+"B,no_auto_possess
------------------------------------------------------------------
Bra
@ -16580,7 +16584,7 @@ No match
------------------------------------------------------------------
# End of testinput2
Error -65: PCRE2_ERROR_BADDATA (unknown error number)
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data
Error -2: partial match
Error -1: no match