Improve maximum lookbehind calculation for nested lookbehinds.

Philip.Hazel 2019-06-25 15:40:42 +00:00
parent 7f24a98cfb
commit d21f7daf9b
7 changed files with 741 additions and 619 deletions


@ -66,6 +66,11 @@ is made possessive and applied to an item in parentheses, because a
parenthesized item may contain multiple branches or other backtracking points,
for example /(a|ab){1}+c/ or /(a+){1}+a/.
13. Nested lookbehinds are now taken into account when computing the maximum
lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
it to 3, because matching looks back 3 characters.
Version 10.33 16-April-2019
---------------------------


@ -1767,9 +1767,9 @@ subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:1)B(*MARK:2)(X|Y)
</pre>
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
@ -2252,16 +2252,26 @@ defaulted by the caller of the match function.
<pre>
PCRE2_INFO_MAXLOOKBEHIND
</pre>
Return the number of characters (not code units) in the longest lookbehind
assertion in the pattern. The third argument should point to a uint32_t
integer. This information is useful when doing multi-segment matching using the
partial matching facilities. Note that the simple assertions \b and \B
require a one-character lookbehind. \A also registers a one-character
lookbehind, though it does not actually inspect the previous character. This is
to ensure that at least one character from the old segment is retained when a
new segment is processed. Otherwise, if there are no lookbehinds in the
pattern, \A might match incorrectly at the start of a second or subsequent
segment.
Return the largest number of characters (not code units) before the current
matching point that could be inspected while processing a lookbehind assertion
in the pattern. Before release 10.34 this request used to give the largest
value for any individual assertion. Now it takes into account nested
lookbehinds, which can mean that the overall value is greater. For example, the
pattern (?&#60;=a(?&#60;=ba)c) previously returned 2, because that is the length of the
largest individual lookbehind. Now it returns 3, because matching actually
looks back 3 characters.
</P>
<P>
The third argument should point to a uint32_t integer. This information is
useful when doing multi-segment matching using the partial matching facilities.
Note that the simple assertions \b and \B require a one-character lookbehind.
\A also registers a one-character lookbehind, though it does not actually
inspect the previous character. This is to ensure that at least one character
from the old segment is retained when a new segment is processed. Otherwise, if
there are no lookbehinds in the pattern, \A might match incorrectly at the
start of a second or subsequent segment. There are more details in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
<pre>
PCRE2_INFO_MINLENGTH
</pre>
@ -3836,7 +3846,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 11 June 2019
Last updated: 25 June 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

File diff suppressed because it is too large.


@ -1,4 +1,4 @@
.TH PCRE2API 3 "11 June 2019" "PCRE2 10.34"
.TH PCRE2API 3 "25 June 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -1706,9 +1706,9 @@ subject, which is recorded when possible. Consider the pattern
.sp
(*MARK:1)B(*MARK:2)(X|Y)
.sp
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
@ -2215,16 +2215,27 @@ defaulted by the caller of the match function.
.sp
PCRE2_INFO_MAXLOOKBEHIND
.sp
Return the number of characters (not code units) in the longest lookbehind
assertion in the pattern. The third argument should point to a uint32_t
integer. This information is useful when doing multi-segment matching using the
partial matching facilities. Note that the simple assertions \eb and \eB
require a one-character lookbehind. \eA also registers a one-character
lookbehind, though it does not actually inspect the previous character. This is
to ensure that at least one character from the old segment is retained when a
new segment is processed. Otherwise, if there are no lookbehinds in the
pattern, \eA might match incorrectly at the start of a second or subsequent
segment.
Return the largest number of characters (not code units) before the current
matching point that could be inspected while processing a lookbehind assertion
in the pattern. Before release 10.34 this request used to give the largest
value for any individual assertion. Now it takes into account nested
lookbehinds, which can mean that the overall value is greater. For example, the
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
largest individual lookbehind. Now it returns 3, because matching actually
looks back 3 characters.
.P
The third argument should point to a uint32_t integer. This information is
useful when doing multi-segment matching using the partial matching facilities.
Note that the simple assertions \eb and \eB require a one-character lookbehind.
\eA also registers a one-character lookbehind, though it does not actually
inspect the previous character. This is to ensure that at least one character
from the old segment is retained when a new segment is processed. Otherwise, if
there are no lookbehinds in the pattern, \eA might match incorrectly at the
start of a second or subsequent segment. There are more details in the
.\" HREF
\fBpcre2partial\fP
.\"
documentation.
.sp
PCRE2_INFO_MINLENGTH
.sp
@ -3848,6 +3859,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 11 June 2019
Last updated: 25 June 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi
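
The documentation change above stresses that PCRE2_INFO_MAXLOOKBEHIND is counted in characters, not code units. The sketch below illustrates what that means when deciding how much of an old segment to keep in a UTF-8 subject during multi-segment matching. The helper name back_up_chars_utf8 is hypothetical and not part of PCRE2; it assumes the buffer holds valid UTF-8 and that nchars is the value previously obtained with pcre2_pattern_info() and PCRE2_INFO_MAXLOOKBEHIND. How the result is combined with a partial match is covered in the pcre2partial documentation, not here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function. Starting at byte offset pos in
   a valid UTF-8 buffer, step back nchars characters and return the byte
   offset reached. If fewer than nchars characters precede pos, return 0. */
size_t back_up_chars_utf8(const unsigned char *buf, size_t pos,
                          uint32_t nchars)
{
    assert(buf != NULL || pos == 0);
    while (nchars > 0 && pos > 0) {
        pos--;
        /* Bytes of the form 10xxxxxx are continuation bytes; keep going
           until the first byte of the character is reached. */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80) pos--;
        nchars--;
    }
    return pos;
}
```

With a maximum lookbehind of 3, for example, a caller would retain the bytes from back_up_chars_utf8(segment, pos, 3) onwards, which may be more than 3 bytes when the subject contains multi-byte characters.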


@ -132,8 +132,8 @@ static int
compile_block *);
static BOOL
set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
compile_block *);
set_lookbehind_lengths(uint32_t **, int *, int *, int *,
parsed_recurse_check *, compile_block *);
@ -8902,7 +8902,8 @@ return -1;
/* Return a fixed length for a branch in a lookbehind, giving an error if the
length is not fixed. If any lookbehinds are encountered on the way, they get
their length set. On entry, *pptrptr points to the first element inside the
their length set, and there is a check for them looking further back than the
current lookbehind. On entry, *pptrptr points to the first element inside the
branch. On exit it is set to point to the ALT or KET.
Arguments:
@ -8921,6 +8922,8 @@ get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
{
int branchlength = 0;
int grouplength;
int max;
int extra = 0; /* Additional lookbehind from nesting */
uint32_t lastitemlength = 0;
uint32_t *pptr = *pptrptr;
PCRE2_SIZE offset;
@ -9067,12 +9070,17 @@ for (;; pptr++)
}
break;
/* Lookbehinds can be ignored, but must themselves be checked. */
/* A lookbehind does not contribute any length to this lookbehind, but must
itself be checked and have its lengths set. If the maximum lookbehind of
any branch is greater than the length so far computed for this branch, we
must set an extra value for use when setting the maximum overall
lookbehind. */
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
return -1;
if (max - branchlength > extra) extra = max - branchlength;
break;
/* Back references and recursions are handled by very similar code. At this
@ -9264,7 +9272,15 @@ for (;; pptr++)
EXIT:
*pptrptr = pptr;
if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
/* The overall maximum lookbehind for any branch in the pattern takes note of
any extra value that is generated from a nested lookbehind. For example, for
/(?<=a(?<=ba)c)/ each individual lookbehind has length 2, but the
max_lookbehind setting is 3 because matching inspects 3 characters before the
match starting point. */
if (branchlength + extra > cb->max_lookbehind)
cb->max_lookbehind = branchlength + extra;
return branchlength;
PARSED_SKIP_FAILED:
@ -9285,6 +9301,7 @@ ket.
Arguments:
pptrptr pointer to pointer in the parsed pattern
maxptr where to return maximum length for the whole group
errcodeptr pointer to error code
lcptr pointer to loop counter
recurses chain of recurse_check to catch mutual recursion
@ -9295,11 +9312,12 @@ Returns: TRUE if all is well
*/
static BOOL
set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
parsed_recurse_check *recurses, compile_block *cb)
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
{
PCRE2_SIZE offset;
int branchlength;
int max = 0;
uint32_t *bptr = *pptrptr;
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
@ -9316,11 +9334,13 @@ do
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
return FALSE;
}
if (branchlength > max) max = branchlength;
*bptr |= branchlength; /* branchlength never more than 65535 */
bptr = *pptrptr;
}
while (*bptr == META_ALT);
*maxptr = max;
return TRUE;
}
@ -9344,6 +9364,7 @@ static int
check_lookbehinds(compile_block *cb)
{
uint32_t *pptr;
int max;
int errorcode = 0;
int loopcount = 0;
@ -9446,7 +9467,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, NULL, cb))
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
return errorcode;
break;
}
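
The bookkeeping added to get_branchlength() and set_lookbehind_lengths() can be tried out in isolation with a toy model. The sketch below is not the PCRE2 code: it works directly on the pattern string rather than the parsed pattern, treats every character inside a lookbehind as a literal of length 1, understands only "(?<=", "(?<!", "|" and ")", and does no error checking. All function names are hypothetical. It keeps the same two rules as the diff: a nested lookbehind adds nothing to the enclosing branch length, but max - branchlength is remembered as "extra", and the overall maximum is updated with branchlength + extra at the end of each branch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int overall_max;   /* stands in for cb->max_lookbehind */

static int is_lookbehind(const char *p)
{
    return strncmp(p, "(?<=", 4) == 0 || strncmp(p, "(?<!", 4) == 0;
}

static int group_max(const char **pp);

/* Toy counterpart of get_branchlength(): returns the length of one branch
   and leaves *pp at the '|' or ')' that ends it. */
static int branch_length(const char **pp)
{
    const char *p = *pp;
    int branchlength = 0;
    int extra = 0;            /* additional lookbehind from nesting */

    while (*p != '\0' && *p != '|' && *p != ')') {
        if (is_lookbehind(p)) {
            int max = group_max(&p);
            if (max - branchlength > extra) extra = max - branchlength;
        } else {
            branchlength++;   /* every other character counts as one literal */
            p++;
        }
    }
    if (branchlength + extra > overall_max)
        overall_max = branchlength + extra;
    *pp = p;
    return branchlength;
}

/* Toy counterpart of set_lookbehind_lengths(): returns the largest branch
   length in the group and leaves *pp just after the closing ')'. */
static int group_max(const char **pp)
{
    const char *p = *pp + 4;  /* skip "(?<=" or "(?<!" */
    int max = 0;

    for (;;) {
        int len = branch_length(&p);
        if (len > max) max = len;
        if (*p != '|') break;
        p++;                  /* move on to the next alternative */
    }
    if (*p == ')') p++;
    *pp = p;
    return max;
}

/* Scan a pattern and report the modelled maximum lookbehind. */
int toy_max_lookbehind(const char *pattern)
{
    const char *p = pattern;

    assert(pattern != NULL);
    overall_max = 0;
    while (*p != '\0') {
        if (is_lookbehind(p)) (void)group_max(&p);
        else p++;             /* anything outside a lookbehind is ignored */
    }
    return overall_max;
}
```

Calling toy_max_lookbehind("(?<=a(?<=ba)c)") walks the case described in the new comment: the outer branch has length 2, the nested group reports a maximum of 2 after one character of the outer branch, so extra is 1 and the result is 3.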

testdata/testinput2 (18 lines changed)

@ -5629,4 +5629,22 @@ a)"xI
ABC
AXY
/(?<=(?<=a)b)c.*/I
abc\=ph
\= Expect no match
xbc\=ph
/(?<=ab)c.*/I
abc\=ph
\= Expect no match
xbc\=ph
/(?<=a(?<=a|a)c)/I
/(?<=a(?<=a|ba)c)/I
/(?<=(?<=a)b)(?<!abcd)/I
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
# End of testinput2

testdata/testoutput2 (48 lines changed)

@ -17039,6 +17039,54 @@ Subject length lower bound = 1
0: A
1: A
/(?<=(?<=a)b)c.*/I
Capture group count = 0
Max lookbehind = 2
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
Partial match: abc
<<
\= Expect no match
xbc\=ph
No match
/(?<=ab)c.*/I
Capture group count = 0
Max lookbehind = 2
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
Partial match: abc
<<
\= Expect no match
xbc\=ph
No match
/(?<=a(?<=a|a)c)/I
Capture group count = 0
Max lookbehind = 2
May match empty string
Subject length lower bound = 0
/(?<=a(?<=a|ba)c)/I
Capture group count = 0
Max lookbehind = 3
May match empty string
Subject length lower bound = 0
/(?<=(?<=a)b)(?<!abcd)/I
Capture group count = 0
Max lookbehind = 4
May match empty string
Subject length lower bound = 0
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
Capture group count = 0
Max lookbehind = 5
May match empty string
Subject length lower bound = 0
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data