Fix global search/replace in pcre2test and pcre2_substitute() when the pattern
matches an empty string, but never at the starting offset.
parent 462f25d7d3
commit 1c79bdf36f

ChangeLog (11 lines changed)
@@ -90,6 +90,17 @@ standard systems:
 when linking pcre2test with MSVC. This gets rid of a stack overflow error in
 the standard set of tests.
 
+20. Output a warning in pcre2test when ignoring the "altglobal" modifier when
+it is given with the "replace" modifier.
+
+21. In both pcre2test and pcre2_substitute(), with global matching, a pattern
+that matched an empty string, but never at the starting match offset, was not
+handled in a Perl-compatible way. The pattern /(?<=\G.)/ is an example of such
+a pattern. Because \G is in a lookbehind assertion, there has to be a
+"bumpalong" before there can be a match. The automatic "advance by one
+character after an empty string match" rule is therefore inappropriate. A more
+complicated algorithm has now been implemented.
+
 
 Version 10.31 12-February-2018
 ------------------------------

RunTest (2 lines changed)
@@ -500,7 +500,7 @@ for bmode in "$test8" "$test16" "$test32"; do
 for opt in "" $jitopt; do
   $sim $valgrind ${opt:+$vjs} ./pcre2test -q $setstack $bmode $opt $testdata/testinput2 testtry
   if [ $? = 0 ] ; then
-    $sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -65,-62,-2,-1,0,100,101,191,200 >>testtry
+    $sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -70,-62,-2,-1,0,100,101,191,200 >>testtry
     checkresult $? 2 "$opt"
   fi
 done

@@ -3154,7 +3154,10 @@ string in <i>outputbuffer</i>, replacing the part that was matched with the
 <i>replacement</i> string, whose length is supplied in <b>rlength</b>. This can
 be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
 which a \K item in a lookahead in the pattern causes the match to end before
-it starts are not supported, and give rise to an error return.
+it starts are not supported, and give rise to an error return. For global
+replacements, matches in which \K in a lookbehind causes the match to start
+earlier than the point that was reached in the previous iteration are also not
+supported.
 </P>
 <P>
 The first seven arguments of <b>pcre2_substitute()</b> are the same as for
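The two unsupported cases named in this hunk can be pictured as a sanity check on the offsets of each match. The following is a minimal sketch in C, not code from PCRE2: the function `classify_match` and the `match_kind` enum are hypothetical names invented for illustration. Only the convention that a match is reported as a start offset and an end offset is taken from the documentation.

```c
#include <stddef.h>

/* Hypothetical result codes for illustration; these are not PCRE2 names. */
enum match_kind {
  MATCH_OK,                 /* usable for substitution */
  MATCH_ENDS_BEFORE_START,  /* e.g. \K in a lookahead: end < start */
  MATCH_STARTS_TOO_EARLY    /* e.g. \K in a lookbehind: start < resume point */
};

/* resume_offset is the point reached by the previous iteration of a global
   replacement (or the initial starting offset on the first iteration). */
static enum match_kind classify_match(size_t resume_offset,
                                      size_t match_start, size_t match_end)
{
  if (match_end < match_start) return MATCH_ENDS_BEFORE_START;
  if (match_start < resume_offset) return MATCH_STARTS_TOO_EARLY;
  return MATCH_OK;
}
```

With the pattern (?<=\Kfoo)bar, the subject "foobar" and a starting offset of 3, the reported match covers offsets 0 to 6, so `classify_match(3, 0, 6)` falls into the second case. A pattern such as (?=ab\K) applied to "ab" reports a start of 2 and an end of 0, which is the first case.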
@@ -3631,7 +3634,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 30 June 2018
+Last updated: 02 July 2018
 <br>
 Copyright © 1997-2018 University of Cambridge.
 <br>

@@ -1084,9 +1084,9 @@ sequences but the characters that they represent.)
 Resetting the match start
 </b><br>
 <P>
-The escape sequence \K causes any previously matched characters not to be
-included in the final matched sequence that is returned. For example, the
-pattern:
+In normal use, the escape sequence \K causes any previously matched characters
+not to be included in the final matched sequence that is returned. For example,
+the pattern:
 <pre>
 foo\Kbar
 </pre>
@@ -1115,7 +1115,13 @@ PCRE2, \K is acted upon when it occurs inside positive assertions, but is
 ignored in negative assertions. Note that when a pattern such as (?=ab\K)
 matches, the reported start of the match can be greater than the end of the
 match. Using \K in a lookbehind assertion at the start of a pattern can also
-lead to odd effects.
+lead to odd effects. For example, consider this pattern:
+<pre>
+(?<=\Kfoo)bar
+</pre>
+If the subject is "foobar", a call to <b>pcre2_match()</b> with a starting
+offset of 3 succeeds and reports the matching string as "foobar", that is, the
+start of the reported match is earlier than where the match started.
 <a name="smallassertions"></a></P>
 <br><b>
 Simple assertions
@@ -3484,7 +3490,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 June 2018
+Last updated: 30 June 2018
 <br>
 Copyright © 1997-2018 University of Cambridge.
 <br>

@@ -3060,6 +3060,9 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
 given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
 which a \K item in a lookahead in the pattern causes the match to end
 before it starts are not supported, and give rise to an error return.
+For global replacements, matches in which \K in a lookbehind causes the
+match to start earlier than the point that was reached in the previous
+iteration are also not supported.
 
 The first seven arguments of pcre2_substitute() are the same as for
 pcre2_match(), except that the partial matching options are not permit-
@@ -3509,7 +3512,7 @@ AUTHOR
 
 REVISION
 
-Last updated: 30 June 2018
+Last updated: 02 July 2018
 Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
 
@@ -6664,9 +6667,9 @@ BACKSLASH
 
 Resetting the match start
 
-The escape sequence \K causes any previously matched characters not to
-be included in the final matched sequence that is returned. For exam-
-ple, the pattern:
+In normal use, the escape sequence \K causes any previously matched
+characters not to be included in the final matched sequence that is
+returned. For example, the pattern:
 
 foo\Kbar
 
@@ -6692,7 +6695,15 @@ BACKSLASH
 assertions, but is ignored in negative assertions. Note that when a
 pattern such as (?=ab\K) matches, the reported start of the match can
 be greater than the end of the match. Using \K in a lookbehind asser-
-tion at the start of a pattern can also lead to odd effects.
+tion at the start of a pattern can also lead to odd effects. For exam-
+ple, consider this pattern:
+
+(?<=\Kfoo)bar
+
+If the subject is "foobar", a call to pcre2_match() with a starting
+offset of 3 succeeds and reports the matching string as "foobar", that
+is, the start of the reported match is earlier than where the match
+started.
 
 Simple assertions
 
@@ -8930,7 +8941,7 @@ AUTHOR
 
 REVISION
 
-Last updated: 28 June 2018
+Last updated: 30 June 2018
 Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
 

@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "30 June 2018" "PCRE2 10.32"
+.TH PCRE2API 3 "02 July 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -3163,7 +3163,10 @@ string in \fIoutputbuffer\fP, replacing the part that was matched with the
 \fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This can
 be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
 which a \eK item in a lookahead in the pattern causes the match to end before
-it starts are not supported, and give rise to an error return.
+it starts are not supported, and give rise to an error return. For global
+replacements, matches in which \eK in a lookbehind causes the match to start
+earlier than the point that was reached in the previous iteration are also not
+supported.
 .P
 The first seven arguments of \fBpcre2_substitute()\fP are the same as for
 \fBpcre2_match()\fP, except that the partial matching options are not
@@ -3637,6 +3640,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 30 June 2018
+Last updated: 02 July 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi

@@ -1110,7 +1110,7 @@ matches, the reported start of the match can be greater than the end of the
 match. Using \eK in a lookbehind assertion at the start of a pattern can also
 lead to odd effects. For example, consider this pattern:
 .sp
-(?<=\Kfoo)bar
+(?<=\eKfoo)bar
 .sp
 If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
 offset of 3 succeeds and reports the matching string as "foobar", that is, the

@@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, second API, to be
 #included by applications that call PCRE2 functions.
 
-Copyright (c) 2016-2017 University of Cambridge
+Copyright (c) 2016-2018 University of Cambridge
 
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -399,6 +399,7 @@ released, the numbers must not be changed. */
 #define PCRE2_ERROR_BADSERIALIZEDDATA (-62)
 #define PCRE2_ERROR_HEAPLIMIT (-63)
 #define PCRE2_ERROR_CONVERT_SYNTAX (-64)
+#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
 
 
 /* Request types for pcre2_pattern_info() */

@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
 
 Written by Philip Hazel
 Original API code Copyright (c) 1997-2012 University of Cambridge
-New API code Copyright (c) 2016-2017 University of Cambridge
+New API code Copyright (c) 2016-2018 University of Cambridge
 
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -260,6 +260,8 @@ static const unsigned char match_error_texts[] =
   "bad serialized data\0"
   "heap limit exceeded\0"
   "invalid syntax\0"
+  /* 65 */
+  "internal error - duplicate substitution match\0"
   ;
 
 

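The table extended in this hunk is a single string literal in which entries are separated by \0, so the text for an error is found by counting terminators. The sketch below shows one way such a table can be indexed. It is an illustration under assumptions, not the library's lookup code: `toy_texts` and `toy_lookup` are hypothetical names, and the table holds only three sample entries. Running off the end of the table is the situation that the test output in this commit reports as "unknown error number", which presumably explains why RunTest now probes -70 rather than the newly defined -65.

```c
#include <stddef.h>

/* Toy table in the same layout as match_error_texts: entries separated by
   \0, with the list ended by the string literal's own terminating \0. */
static const char toy_texts[] =
  "no error\0"
  "no match\0"
  "partial match\0"
  ;

/* Return entry n (0-based), or NULL if n is past the end of the table. */
static const char *toy_lookup(const char *table, size_t n)
{
  const char *p = table;
  for (; n > 0; n--) {
    while (*p++ != '\0') { }      /* skip one entry and its terminator */
    if (*p == '\0') return NULL;  /* empty entry: end of table reached */
  }
  return p;
}
```

Because positions in the table are implied by order, a new message has to be appended in step with its error number, which is what the `/* 65 */` marker in the hunk records.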
@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
 
 Written by Philip Hazel
 Original API code Copyright (c) 1997-2012 University of Cambridge
-New API code Copyright (c) 2016 University of Cambridge
+New API code Copyright (c) 2016-2018 University of Cambridge
 
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -238,10 +238,12 @@ PCRE2_SPTR repend;
 PCRE2_SIZE extra_needed = 0;
 PCRE2_SIZE buff_offset, buff_length, lengthleft, fraglength;
 PCRE2_SIZE *ovector;
+PCRE2_SIZE ovecsave[3];
 
 buff_offset = 0;
 lengthleft = buff_length = *blength;
 *blength = PCRE2_UNSET;
+ovecsave[0] = ovecsave[1] = ovecsave[2] = PCRE2_UNSET;
 
 /* Partial matching is not valid. */
 
@@ -369,6 +371,26 @@ do
     goto EXIT;
     }
 
+  /* Check for the same match as previous. This is legitimate after matching an
+  empty string that starts after the initial match offset. We have tried again
+  at the match point in case the pattern is one like /(?<=\G.)/ which can never
+  match at its starting point, so running the match achieves the bumpalong. If
+  we do get the same (null) match at the original match point, it isn't such a
+  pattern, so we now do the empty string magic. In all other cases, a repeat
+  match should never occur. */
+
+  if (ovecsave[0] == ovector[0] && ovecsave[1] == ovector[1])
+    {
+    if (ovector[0] == ovector[1] && ovecsave[2] != start_offset)
+      {
+      goptions = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
+      ovecsave[2] = start_offset;
+      continue; /* Back to the top of the loop */
+      }
+    rc = PCRE2_ERROR_INTERNAL_DUPMATCH;
+    goto EXIT;
+    }
+
   /* Count substitutions with a paranoid check for integer overflow; surely no
   real call to this function would ever hit this! */
 
@@ -799,13 +821,18 @@ do
     } /* End handling a literal code unit */
   } /* End of loop for scanning the replacement. */
 
-  /* The replacement has been copied to the output. Update the start offset to
-  point to the rest of the subject string. If we matched an empty string,
-  do the magic for global matches. */
+  /* The replacement has been copied to the output. Save the details of this
+  match. See above for how this data is used. If we matched an empty string, do
+  the magic for global matches. Finally, update the start offset to point to
+  the rest of the subject string. */
 
-  start_offset = ovector[1];
-  goptions = (ovector[0] != ovector[1])? 0 :
+  ovecsave[0] = ovector[0];
+  ovecsave[1] = ovector[1];
+  ovecsave[2] = start_offset;
+
+  goptions = (ovector[0] != ovector[1] || ovector[0] > start_offset)? 0 :
     PCRE2_ANCHORED|PCRE2_NOTEMPTY_ATSTART;
+  start_offset = ovector[1];
   } while ((suboptions & PCRE2_SUBSTITUTE_GLOBAL) != 0); /* Repeat "do" loop */
 
 /* Copy the rest of the subject. */

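The control flow added to the substitution loop can be tried out without the library. The sketch below is a simplified model in C, not the PCRE2 implementation: `toy_match` is a hypothetical stand-in for the real matcher that hard-codes the behaviour of two empty-matching patterns, the option macros are local stand-ins for the PCRE2 flags, and UTF and CRLF handling are omitted. The caller is assumed to supply an output buffer that is large enough.

```c
#include <assert.h>
#include <string.h>

#define ANCHORED         1u  /* stand-in for PCRE2_ANCHORED */
#define NOTEMPTY_ATSTART 2u  /* stand-in for PCRE2_NOTEMPTY_ATSTART */
#define UNSET ((size_t)-1)   /* stand-in for PCRE2_UNSET */

enum toy_pattern { TOY_LOOKBEHIND_G, TOY_LOOKAHEAD_B };

/* Toy matcher for two patterns that only ever match the empty string:
   TOY_LOOKBEHIND_G behaves like /(?<=\G.)/ and TOY_LOOKAHEAD_B like /(?=b)/.
   Returns 1 and sets ov[0], ov[1] on a match, 0 otherwise. */
static int toy_match(enum toy_pattern pat, const char *s, size_t len,
                     size_t start, unsigned opts, size_t ov[2])
{
  size_t last = (opts & ANCHORED) ? start : len;
  for (size_t p = start; p <= last; p++) {       /* the "bumpalong" */
    int ok = (pat == TOY_LOOKBEHIND_G) ? (p == start + 1)
                                       : (p < len && s[p] == 'b');
    if (!ok) continue;
    if ((opts & NOTEMPTY_ATSTART) && p == start) continue;
    ov[0] = ov[1] = p;
    return 1;
  }
  return 0;
}

/* Global replace following the shape of the patched loop. Returns the number
   of substitutions, or -1 for an unexpected duplicate match. */
static int toy_substitute(enum toy_pattern pat, const char *s,
                          const char *rep, char *out)
{
  size_t len = strlen(s), start = 0, ov[2];
  size_t save[3] = { UNSET, UNSET, UNSET };
  unsigned gopts = 0;
  int subs = 0;

  out[0] = '\0';
  for (;;) {
    if (!toy_match(pat, s, len, start, gopts, ov)) {
      if (gopts == 0 || start >= len) break;
      strncat(out, s + start, 1);  /* retry failed: copy one char, move on */
      start++;
      gopts = 0;
      continue;
    }
    if (save[0] == ov[0] && save[1] == ov[1]) {
      if (ov[0] == ov[1] && save[2] != start) {
        gopts = NOTEMPTY_ATSTART | ANCHORED;  /* same empty match: retry */
        save[2] = start;
        continue;
      }
      return -1;                              /* genuine duplicate */
    }
    assert(ov[0] >= start);
    subs++;
    strncat(out, s + start, ov[0] - start);   /* text before the match */
    strcat(out, rep);
    save[0] = ov[0]; save[1] = ov[1]; save[2] = start;
    gopts = (ov[0] != ov[1] || ov[0] > start) ? 0
                                              : (ANCHORED | NOTEMPTY_ATSTART);
    start = ov[1];
  }
  strcat(out, s + start);                     /* rest of the subject */
  return subs;
}
```

In this model, `toy_substitute(TOY_LOOKBEHIND_G, "abc", "+", out)` makes 3 substitutions and leaves "a+b+c+" in `out`, in line with the expected output added to the test files in this commit. The lookahead-style pattern exercises the other path: its second attempt returns the same empty match, the loop retries anchored and non-empty, and the result for "abc" is "a+bc".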
@@ -6302,6 +6302,7 @@ size_t needlen;
 void *use_dat_context;
 BOOL utf;
 BOOL subject_literal;
+PCRE2_SIZE ovecsave[3];
 
 #ifdef SUPPORT_PCRE2_8
 uint8_t *q8 = NULL;
@@ -6949,6 +6950,9 @@ if (dat_datctl.replacement[0] != 0)
   if (timeitm)
     fprintf(outfile, "** Timing is not supported with replace: ignored\n");
 
+  if ((dat_datctl.control & CTL_ALTGLOBAL) != 0)
+    fprintf(outfile, "** Altglobal is not supported with replace: ignored\n");
+
   xoptions = (((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
     PCRE2_SUBSTITUTE_GLOBAL) |
     (((dat_datctl.control2 & CTL2_SUBSTITUTE_EXTENDED) == 0)? 0 :
@@ -7067,35 +7071,24 @@ if (dat_datctl.replacement[0] != 0)
     }
 
   fprintf(outfile, "\n");
+  show_memory = FALSE;
+  return PR_OK;
   } /* End of substitution handling */
 
 /* When a replacement string is not provided, run a loop for global matching
-with one of the basic matching functions. */
+with one of the basic matching functions. For altglobal (or first time round
+the loop), set an "unset" value for the previous match info. */
 
-else for (gmatched = 0;; gmatched++)
+ovecsave[0] = ovecsave[1] = ovecsave[2] = PCRE2_UNSET;
+
+for (gmatched = 0;; gmatched++)
   {
   PCRE2_SIZE j;
   int capcount;
   PCRE2_SIZE *ovector;
-  PCRE2_SIZE ovecsave[2];
 
   ovector = FLD(match_data, ovector);
 
-  /* After the first time round a global loop, for a normal global (/g)
-  iteration, save the current ovector[0,1] so that we can check that they do
-  change each time. Otherwise a matching bug that returns the same string
-  causes an infinite loop. It has happened! */
-
-  if (gmatched > 0 && (dat_datctl.control & CTL_GLOBAL) != 0)
-    {
-    ovecsave[0] = ovector[0];
-    ovecsave[1] = ovector[1];
-    }
-
-  /* For altglobal (or first time round the loop), set an "unset" value. */
-
-  else ovecsave[0] = ovecsave[1] = PCRE2_UNSET;
-
   /* Fill the ovector with junk to detect elements that do not get set
   when they should be. */
 
@@ -7266,12 +7259,23 @@ else for (gmatched = 0;; gmatched++)
       }
 
     /* If this is not the first time round a global loop, check that the
-    returned string has changed. If not, there is a bug somewhere and we must
-    break the loop because it will go on for ever. We know that there are
-    always at least two elements in the ovector. */
+    returned string has changed. If it has not, check for an empty string match
+    at different starting offset from the previous match. This is a failed test
+    retry for null-matching patterns that don't match at their starting offset,
+    for example /(?<=\G.)/. A repeated match at the same point is not such a
+    pattern, and must be discarded, and we then proceed to seek a non-null
+    match at the current point. For any other repeated match, there is a bug
+    somewhere and we must break the loop because it will go on for ever. We
+    know that there are always at least two elements in the ovector. */
 
     if (gmatched > 0 && ovecsave[0] == ovector[0] && ovecsave[1] == ovector[1])
       {
+      if (ovector[0] == ovector[1] && ovecsave[2] != dat_datctl.offset)
+        {
+        g_notempty = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
+        ovecsave[2] = dat_datctl.offset;
+        continue; /* Back to the top of the loop */
+        }
       fprintf(outfile,
         "** PCRE2 error: global repeat returned the same string as previous\n");
       fprintf(outfile, "** Global loop abandoned\n");
@@ -7579,6 +7583,7 @@ else for (gmatched = 0;; gmatched++)
 
   if ((dat_datctl.control & CTL_ANYGLOB) == 0) break; else
     {
+    PCRE2_SIZE match_offset = FLD(match_data, ovector)[0];
     PCRE2_SIZE end_offset = FLD(match_data, ovector)[1];
 
     /* We must now set up for the next iteration of a global search. If we have
@@ -7586,11 +7591,18 @@ else for (gmatched = 0;; gmatched++)
     subject. If so, the loop is over. Otherwise, mimic what Perl's /g option
     does. Set PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED and try the match again
     at the same point. If this fails it will be picked up above, where a fake
-    match is set up so that at this point we advance to the next character. */
+    match is set up so that at this point we advance to the next character.
 
-    if (FLD(match_data, ovector)[0] == end_offset)
+    However, in order to cope with patterns that never match at their starting
+    offset (e.g. /(?<=\G.)/) we don't do this when the match offset is greater
+    than the starting offset. This means there will be a retry with the
+    starting offset at the match offset. If this returns the same match again,
+    it is picked up above and ignored, and the special action is then taken. */
+
+    if (match_offset == end_offset)
       {
       if (end_offset == ulen) break; /* End of subject */
+      if (match_offset <= dat_datctl.offset)
         g_notempty = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
       }
 
@@ -7629,10 +7641,19 @@ else for (gmatched = 0;; gmatched++)
       }
     }
 
-    /* For /g (global), update the start offset, leaving the rest alone. */
+    /* For a normal global (/g) iteration, save the current ovector[0,1] and
+    the starting offset so that we can check that they do change each time.
+    Otherwise a matching bug that returns the same string causes an infinite
+    loop. It has happened! Then update the start offset, leaving other
+    parameters alone. */
 
     if ((dat_datctl.control & CTL_GLOBAL) != 0)
+      {
+      ovecsave[0] = ovector[0];
+      ovecsave[1] = ovector[1];
+      ovecsave[2] = dat_datctl.offset;
       dat_datctl.offset = end_offset;
+      }
 
     /* For altglobal, just update the pointer and length. */
 

@@ -6189,4 +6189,7 @@ ef) x/x,mark
 /(?=a+)a(a+)++b/
 aab
 
+/(?<=\G.)/g,aftertext
+abc
+
 # End of testinput1

@@ -4938,6 +4938,9 @@ a)"xI
 //replace=0
 \=offset=7
 
+/(?<=\G.)/g,replace=+
+abc
+
 ".+\QX\E+"B,no_auto_possess
 
 ".+\QX\E+"B,auto_callout,no_auto_possess

@@ -9822,4 +9822,13 @@ No match
 0: aab
 1: a
 
+/(?<=\G.)/g,aftertext
+abc
+0:
+0+ bc
+0:
+0+ c
+0:
+0+
+
 # End of testinput1

@@ -15549,6 +15549,10 @@ Failed: error -57 at offset 2 in replacement: bad escape sequence in replacement
 \=offset=7
 Failed: error -33: bad offset value
 
+/(?<=\G.)/g,replace=+
+abc
+3: a+b+c+
+
 ".+\QX\E+"B,no_auto_possess
 ------------------------------------------------------------------
 Bra
@@ -16580,7 +16584,7 @@ No match
 ------------------------------------------------------------------
 
 # End of testinput2
-Error -65: PCRE2_ERROR_BADDATA (unknown error number)
+Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data
 Error -2: partial match
 Error -1: no match