Improve maximum lookbehind calculation for nested lookbehinds.

Philip.Hazel 2019-06-25 15:40:42 +00:00
parent 7f24a98cfb
commit d21f7daf9b
7 changed files with 741 additions and 619 deletions

View File

@ -66,6 +66,11 @@ is made possessive and applied to an item in parentheses, because a
 parenthesized item may contain multiple branches or other backtracking points,
 for example /(a|ab){1}+c/ or /(a+){1}+a/.
+13. Nested lookbehinds are now taken into account when computing the maximum
+lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
+lookbehind of 2, because that is the largest individual lookbehind. Now it sets
+it to 3, because matching looks back 3 characters.
 Version 10.33 16-April-2019
 ---------------------------

View File

@ -2252,16 +2252,26 @@ defaulted by the caller of the match function.
 <pre>
 PCRE2_INFO_MAXLOOKBEHIND
 </pre>
-Return the number of characters (not code units) in the longest lookbehind
-assertion in the pattern. The third argument should point to a uint32_t
-integer. This information is useful when doing multi-segment matching using the
-partial matching facilities. Note that the simple assertions \b and \B
-require a one-character lookbehind. \A also registers a one-character
-lookbehind, though it does not actually inspect the previous character. This is
-to ensure that at least one character from the old segment is retained when a
-new segment is processed. Otherwise, if there are no lookbehinds in the
-pattern, \A might match incorrectly at the start of a second or subsequent
-segment.
+Return the largest number of characters (not code units) before the current
+matching point that could be inspected while processing a lookbehind assertion
+in the pattern. Before release 10.34 this request used to give the largest
+value for any individual assertion. Now it takes into account nested
+lookbehinds, which can mean that the overall value is greater. For example, the
+pattern (?&#60;=a(?&#60;=ba)c) previously returned 2, because that is the length of the
+largest individual lookbehind. Now it returns 3, because matching actually
+looks back 3 characters.
+</P>
+<P>
+The third argument should point to a uint32_t integer. This information is
+useful when doing multi-segment matching using the partial matching facilities.
+Note that the simple assertions \b and \B require a one-character lookbehind.
+\A also registers a one-character lookbehind, though it does not actually
+inspect the previous character. This is to ensure that at least one character
+from the old segment is retained when a new segment is processed. Otherwise, if
+there are no lookbehinds in the pattern, \A might match incorrectly at the
+start of a second or subsequent segment. There are more details in the
+<a href="pcre2partial.html"><b>pcre2partial</b></a>
+documentation.
 <pre>
 PCRE2_INFO_MINLENGTH
 </pre>
@ -3836,7 +3846,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 11 June 2019
+Last updated: 25 June 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
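
The documentation text above says the PCRE2_INFO_MAXLOOKBEHIND value is useful for multi-segment matching. As a rough sketch of why the larger value matters, the helpers below compute how much of an old segment a caller might keep before appending the next one. This assumes 8-bit non-UTF mode (so one character is one code unit) and a caller that already knows the offset at which a partial match started. The names `retain_offset` and `keep_tail` are hypothetical and are not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function): given the offset where a
   partial match started and the pattern's maximum lookbehind, return the
   offset of the first code unit that should be kept for the next segment.
   Assumes one character == one code unit (8-bit, non-UTF). */
size_t retain_offset(size_t partial_start, size_t max_lookbehind)
{
  return (partial_start > max_lookbehind) ? partial_start - max_lookbehind : 0;
}

/* Hypothetical helper: slide the retained tail of buf to the front so that
   the next segment can be appended after it. Returns the number of code
   units kept. */
size_t keep_tail(char *buf, size_t len, size_t partial_start,
  size_t max_lookbehind)
{
  size_t from;
  assert(partial_start <= len);
  from = retain_offset(partial_start, max_lookbehind);
  memmove(buf, buf + from, len - from);
  return len - from;
}
```

For the nested pattern in this commit, a partial match starting at offset 10 would be retained from offset 7 with the new value of 3. With the old value of 2 it would be retained from offset 8, which drops a character that the inner lookbehind inspects.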

View File

@ -2218,16 +2218,25 @@ INFORMATION ABOUT A COMPILED PATTERN
 PCRE2_INFO_MAXLOOKBEHIND
-Return the number of characters (not code units) in the longest lookbe-
-hind assertion in the pattern. The third argument should point to a
-uint32_t integer. This information is useful when doing multi-segment
-matching using the partial matching facilities. Note that the simple
-assertions \b and \B require a one-character lookbehind. \A also regis-
-ters a one-character lookbehind, though it does not actually inspect
-the previous character. This is to ensure that at least one character
-from the old segment is retained when a new segment is processed. Oth-
-erwise, if there are no lookbehinds in the pattern, \A might match in-
-correctly at the start of a second or subsequent segment.
+Return the largest number of characters (not code units) before the
+current matching point that could be inspected while processing a look-
+behind assertion in the pattern. Before release 10.34 this request used
+to give the largest value for any individual assertion. Now it takes
+into account nested lookbehinds, which can mean that the overall value
+is greater. For example, the pattern (?<=a(?<=ba)c) previously returned
+2, because that is the length of the largest individual lookbehind. Now
+it returns 3, because matching actually looks back 3 characters.
+
+The third argument should point to a uint32_t integer. This information
+is useful when doing multi-segment matching using the partial matching
+facilities. Note that the simple assertions \b and \B require a one-
+character lookbehind. \A also registers a one-character lookbehind,
+though it does not actually inspect the previous character. This is to
+ensure that at least one character from the old segment is retained
+when a new segment is processed. Otherwise, if there are no lookbehinds
+in the pattern, \A might match incorrectly at the start of a second or
+subsequent segment. There are more details in the pcre2partial documen-
+tation.
 PCRE2_INFO_MINLENGTH
@ -3693,7 +3702,7 @@ AUTHOR
 REVISION
-Last updated: 11 June 2019
+Last updated: 25 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------

View File

@ -1,4 +1,4 @@
-.TH PCRE2API 3 "11 June 2019" "PCRE2 10.34"
+.TH PCRE2API 3 "25 June 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@ -2215,16 +2215,27 @@ defaulted by the caller of the match function.
 .sp
 PCRE2_INFO_MAXLOOKBEHIND
 .sp
-Return the number of characters (not code units) in the longest lookbehind
-assertion in the pattern. The third argument should point to a uint32_t
-integer. This information is useful when doing multi-segment matching using the
-partial matching facilities. Note that the simple assertions \eb and \eB
-require a one-character lookbehind. \eA also registers a one-character
-lookbehind, though it does not actually inspect the previous character. This is
-to ensure that at least one character from the old segment is retained when a
-new segment is processed. Otherwise, if there are no lookbehinds in the
-pattern, \eA might match incorrectly at the start of a second or subsequent
-segment.
+Return the largest number of characters (not code units) before the current
+matching point that could be inspected while processing a lookbehind assertion
+in the pattern. Before release 10.34 this request used to give the largest
+value for any individual assertion. Now it takes into account nested
+lookbehinds, which can mean that the overall value is greater. For example, the
+pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
+largest individual lookbehind. Now it returns 3, because matching actually
+looks back 3 characters.
+.P
+The third argument should point to a uint32_t integer. This information is
+useful when doing multi-segment matching using the partial matching facilities.
+Note that the simple assertions \eb and \eB require a one-character lookbehind.
+\eA also registers a one-character lookbehind, though it does not actually
+inspect the previous character. This is to ensure that at least one character
+from the old segment is retained when a new segment is processed. Otherwise, if
+there are no lookbehinds in the pattern, \eA might match incorrectly at the
+start of a second or subsequent segment. There are more details in the
+.\" HREF
+\fBpcre2partial\fP
+.\"
+documentation.
 .sp
 PCRE2_INFO_MINLENGTH
 .sp
@ -3848,6 +3859,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 11 June 2019
+Last updated: 25 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi

View File

@ -132,8 +132,8 @@ static int
 compile_block *);
 static BOOL
-set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
-  compile_block *);
+set_lookbehind_lengths(uint32_t **, int *, int *, int *,
+  parsed_recurse_check *, compile_block *);
@ -8902,7 +8902,8 @@ return -1;
 /* Return a fixed length for a branch in a lookbehind, giving an error if the
 length is not fixed. If any lookbehinds are encountered on the way, they get
-their length set. On entry, *pptrptr points to the first element inside the
+their length set, and there is a check for them looking further back than the
+current lookbehind. On entry, *pptrptr points to the first element inside the
 branch. On exit it is set to point to the ALT or KET.
 Arguments:
@ -8921,6 +8922,8 @@ get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
 {
 int branchlength = 0;
 int grouplength;
+int max;
+int extra = 0; /* Additional lookbehind from nesting */
 uint32_t lastitemlength = 0;
 uint32_t *pptr = *pptrptr;
 PCRE2_SIZE offset;
@ -9067,12 +9070,17 @@ for (;; pptr++)
 }
 break;
-/* Lookbehinds can be ignored, but must themselves be checked. */
+/* A lookbehind does not contribute any length to this lookbehind, but must
+itself be checked and have its lengths set. If the maximum lookbehind of
+any branch is greater than the length so far computed for this branch, we
+must set an extra value for use when setting the maximum overall
+lookbehind. */
 case META_LOOKBEHIND:
 case META_LOOKBEHINDNOT:
-if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
+if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
   return -1;
+if (max - branchlength > extra) extra = max - branchlength;
 break;
 /* Back references and recursions are handled by very similar code. At this
@ -9264,7 +9272,15 @@ for (;; pptr++)
 EXIT:
 *pptrptr = pptr;
-if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
+/* The overall maximum lookbehind for any branch in the pattern takes note of
+any extra value that is generated from a nested lookbehind. For example, for
+/(?<=a(?<=ba)c)/ each individual lookbehind has length 2, but the
+max_lookbehind setting is 3 because matching inspects 3 characters before the
+match starting point. */
+if (branchlength + extra > cb->max_lookbehind)
+  cb->max_lookbehind = branchlength + extra;
 return branchlength;
 PARSED_SKIP_FAILED:
@ -9285,6 +9301,7 @@ ket.
 Arguments:
   pptrptr     pointer to pointer in the parsed pattern
+  maxptr      where to return maximum length for the whole group
   errcodeptr  pointer to error code
   lcptr       pointer to loop counter
   recurses    chain of recurse_check to catch mutual recursion
@ -9295,11 +9312,12 @@ Returns: TRUE if all is well
 */
 static BOOL
-set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
-  parsed_recurse_check *recurses, compile_block *cb)
+set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
+  int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
 {
 PCRE2_SIZE offset;
 int branchlength;
+int max = 0;
 uint32_t *bptr = *pptrptr;
 READPLUSOFFSET(offset, bptr); /* Offset for error messages */
@ -9316,11 +9334,13 @@ do
 if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
 return FALSE;
 }
+if (branchlength > max) max = branchlength;
 *bptr |= branchlength; /* branchlength never more than 65535 */
 bptr = *pptrptr;
 }
 while (*bptr == META_ALT);
+*maxptr = max;
 return TRUE;
 }
@ -9344,6 +9364,7 @@ static int
 check_lookbehinds(compile_block *cb)
 {
 uint32_t *pptr;
+int max;
 int errorcode = 0;
 int loopcount = 0;
@ -9446,7 +9467,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
 case META_LOOKBEHIND:
 case META_LOOKBEHINDNOT:
-if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, NULL, cb))
+if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
   return errorcode;
 break;
 }
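
The nesting rule added to get_branchlength() can be illustrated with a small standalone model. This is only a sketch: the `lb_item` and `lb_branch` types, and the functions `branch_length`, `group_max` and `example_max_lookbehind`, are hypothetical and much simpler than PCRE2's parsed-pattern representation. Only the `branchlength`, `extra` and `max` arithmetic is taken from the diff above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: a lookbehind group holds branches, and a branch
   holds single characters and nested lookbehind groups. */
struct lb_branch;

typedef enum { ITEM_CHAR, ITEM_LOOKBEHIND } item_kind;

typedef struct lb_item {
  item_kind kind;
  const struct lb_branch *branches;   /* used only for ITEM_LOOKBEHIND */
  size_t nbranches;
} lb_item;

typedef struct lb_branch {
  const lb_item *items;
  size_t nitems;
} lb_branch;

static int group_max(const lb_item *group, int *overall_max);

/* Returns the fixed length of one branch and raises *overall_max when this
   branch, plus any extra reach from nested lookbehinds, looks further back. */
static int branch_length(const lb_branch *b, int *overall_max)
{
  int branchlength = 0;
  int extra = 0;                      /* additional lookbehind from nesting */
  size_t i;
  for (i = 0; i < b->nitems; i++) {
    const lb_item *it = &b->items[i];
    if (it->kind == ITEM_CHAR) {
      branchlength++;
    } else {
      /* A nested lookbehind adds no length, but may reach further back. */
      int max = group_max(it, overall_max);
      if (max - branchlength > extra) extra = max - branchlength;
    }
  }
  if (branchlength + extra > *overall_max)
    *overall_max = branchlength + extra;
  return branchlength;
}

/* Returns the longest branch length in a lookbehind group. */
static int group_max(const lb_item *group, int *overall_max)
{
  int max = 0;
  size_t i;
  assert(group->kind == ITEM_LOOKBEHIND);
  for (i = 0; i < group->nbranches; i++) {
    int len = branch_length(&group->branches[i], overall_max);
    if (len > max) max = len;
  }
  return max;
}

/* Model of /(?<=a(?<=ba)c)/ : outer branch is a, (?<=ba), c. */
static const lb_item inner_items[] = {
  { ITEM_CHAR, NULL, 0 }, { ITEM_CHAR, NULL, 0 }
};
static const lb_branch inner_branches[] = { { inner_items, 2 } };
static const lb_item outer_items[] = {
  { ITEM_CHAR, NULL, 0 },
  { ITEM_LOOKBEHIND, inner_branches, 1 },
  { ITEM_CHAR, NULL, 0 }
};
static const lb_branch outer_branches[] = { { outer_items, 3 } };
static const lb_item outer_group = { ITEM_LOOKBEHIND, outer_branches, 1 };

int example_max_lookbehind(void)
{
  int overall = 0;
  (void)group_max(&outer_group, &overall);
  return overall;
}
```

In this model the inner group is met when `branchlength` is 1 and its longest branch is 2, so `extra` becomes 1. The outer branch ends with length 2, giving an overall value of 3, which agrees with the figure quoted in the commit for this pattern.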

testdata/testinput2

@ -5629,4 +5629,22 @@ a)"xI
 ABC
 AXY
+/(?<=(?<=a)b)c.*/I
+abc\=ph
+\= Expect no match
+xbc\=ph
+/(?<=ab)c.*/I
+abc\=ph
+\= Expect no match
+xbc\=ph
+/(?<=a(?<=a|a)c)/I
+/(?<=a(?<=a|ba)c)/I
+/(?<=(?<=a)b)(?<!abcd)/I
+/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
 # End of testinput2

testdata/testoutput2
@ -17039,6 +17039,54 @@ Subject length lower bound = 1
 0: A
 1: A
+/(?<=(?<=a)b)c.*/I
+Capture group count = 0
+Max lookbehind = 2
+First code unit = 'c'
+Subject length lower bound = 1
+abc\=ph
+Partial match: abc
+<<
+\= Expect no match
+xbc\=ph
+No match
+/(?<=ab)c.*/I
+Capture group count = 0
+Max lookbehind = 2
+First code unit = 'c'
+Subject length lower bound = 1
+abc\=ph
+Partial match: abc
+<<
+\= Expect no match
+xbc\=ph
+No match
+/(?<=a(?<=a|a)c)/I
+Capture group count = 0
+Max lookbehind = 2
+May match empty string
+Subject length lower bound = 0
+/(?<=a(?<=a|ba)c)/I
+Capture group count = 0
+Max lookbehind = 3
+May match empty string
+Subject length lower bound = 0
+/(?<=(?<=a)b)(?<!abcd)/I
+Capture group count = 0
+Max lookbehind = 4
+May match empty string
+Subject length lower bound = 0
+/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
+Capture group count = 0
+Max lookbehind = 5
+May match empty string
+Subject length lower bound = 0
 # End of testinput2
 Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data