Improve maximum lookbehind calculation for nested lookbehinds.
Parent: 7f24a98cfb
Commit: d21f7daf9b

@@ -66,6 +66,11 @@ is made possessive and applied to an item in parentheses, because a
 parenthesized item may contain multiple branches or other backtracking points,
 for example /(a|ab){1}+c/ or /(a+){1}+a/.
 
+13. Nested lookbehinds are now taken into account when computing the maximum
+lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
+lookbehind of 2, because that is the largest individual lookbehind. Now it sets
+it to 3, because matching looks back 3 characters.
+
 
 Version 10.33 16-April-2019
 ---------------------------

@@ -2252,16 +2252,26 @@ defaulted by the caller of the match function.
 <pre>
   PCRE2_INFO_MAXLOOKBEHIND
 </pre>
-Return the number of characters (not code units) in the longest lookbehind
-assertion in the pattern. The third argument should point to a uint32_t
-integer. This information is useful when doing multi-segment matching using the
-partial matching facilities. Note that the simple assertions \b and \B
-require a one-character lookbehind. \A also registers a one-character
-lookbehind, though it does not actually inspect the previous character. This is
-to ensure that at least one character from the old segment is retained when a
-new segment is processed. Otherwise, if there are no lookbehinds in the
-pattern, \A might match incorrectly at the start of a second or subsequent
-segment.
+Return the largest number of characters (not code units) before the current
+matching point that could be inspected while processing a lookbehind assertion
+in the pattern. Before release 10.34 this request used to give the largest
+value for any individual assertion. Now it takes into account nested
+lookbehinds, which can mean that the overall value is greater. For example, the
+pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
+largest individual lookbehind. Now it returns 3, because matching actually
+looks back 3 characters.
+</P>
+<P>
+The third argument should point to a uint32_t integer. This information is
+useful when doing multi-segment matching using the partial matching facilities.
+Note that the simple assertions \b and \B require a one-character lookbehind.
+\A also registers a one-character lookbehind, though it does not actually
+inspect the previous character. This is to ensure that at least one character
+from the old segment is retained when a new segment is processed. Otherwise, if
+there are no lookbehinds in the pattern, \A might match incorrectly at the
+start of a second or subsequent segment. There are more details in the
+<a href="pcre2partial.html"><b>pcre2partial</b></a>
+documentation.
 <pre>
   PCRE2_INFO_MINLENGTH
 </pre>
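The paragraph added in the hunk above says the reported value matters for multi-segment matching. As a rough sketch only (this is not PCRE2 code), the value bounds how much of the old segment has to be kept in front of the point where a partial match started. The helper name `sketch_keep_from` is hypothetical, and the sketch assumes an 8-bit, non-UTF subject in which one character is one code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API. Given the offset at which a
   partial match started in the old segment and the value reported for
   PCRE2_INFO_MAXLOOKBEHIND, return the offset of the first code unit that
   should be retained when the next segment is appended. Assumes one code unit
   per character (8-bit library, no UTF). */
size_t sketch_keep_from(size_t partial_start, uint32_t max_lookbehind)
{
  /* Fewer characters precede the match than the lookbehind needs: keep all. */
  if ((size_t)max_lookbehind >= partial_start) return 0;
  return partial_start - (size_t)max_lookbehind;
}
```

With the nested pattern from the example, a reported value of 3 instead of 2 means one more character is retained: for a partial match starting at offset 7, the sketch keeps everything from offset 4 onwards.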
@@ -3836,7 +3846,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 11 June 2019
+Last updated: 25 June 2019
 <br>
 Copyright © 1997-2019 University of Cambridge.
 <br>

1171
doc/pcre2.txt
File diff suppressed because it is too large

@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "11 June 2019" "PCRE2 10.34"
+.TH PCRE2API 3 "25 June 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -2215,16 +2215,27 @@ defaulted by the caller of the match function.
 .sp
   PCRE2_INFO_MAXLOOKBEHIND
 .sp
-Return the number of characters (not code units) in the longest lookbehind
-assertion in the pattern. The third argument should point to a uint32_t
-integer. This information is useful when doing multi-segment matching using the
-partial matching facilities. Note that the simple assertions \eb and \eB
-require a one-character lookbehind. \eA also registers a one-character
-lookbehind, though it does not actually inspect the previous character. This is
-to ensure that at least one character from the old segment is retained when a
-new segment is processed. Otherwise, if there are no lookbehinds in the
-pattern, \eA might match incorrectly at the start of a second or subsequent
-segment.
+Return the largest number of characters (not code units) before the current
+matching point that could be inspected while processing a lookbehind assertion
+in the pattern. Before release 10.34 this request used to give the largest
+value for any individual assertion. Now it takes into account nested
+lookbehinds, which can mean that the overall value is greater. For example, the
+pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
+largest individual lookbehind. Now it returns 3, because matching actually
+looks back 3 characters.
+.P
+The third argument should point to a uint32_t integer. This information is
+useful when doing multi-segment matching using the partial matching facilities.
+Note that the simple assertions \eb and \eB require a one-character lookbehind.
+\eA also registers a one-character lookbehind, though it does not actually
+inspect the previous character. This is to ensure that at least one character
+from the old segment is retained when a new segment is processed. Otherwise, if
+there are no lookbehinds in the pattern, \eA might match incorrectly at the
+start of a second or subsequent segment. There are more details in the
+.\" HREF
+\fBpcre2partial\fP
+.\"
+documentation.
 .sp
   PCRE2_INFO_MINLENGTH
 .sp
@@ -3848,6 +3859,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 11 June 2019
+Last updated: 25 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi

@@ -132,8 +132,8 @@ static int
   compile_block *);
 
 static BOOL
-  set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
-    compile_block *);
+  set_lookbehind_lengths(uint32_t **, int *, int *, int *,
+    parsed_recurse_check *, compile_block *);
 
 
 
@@ -8902,7 +8902,8 @@ return -1;
 
 /* Return a fixed length for a branch in a lookbehind, giving an error if the
 length is not fixed. If any lookbehinds are encountered on the way, they get
-their length set. On entry, *pptrptr points to the first element inside the
+their length set, and there is a check for them looking further back than the
+current lookbehind. On entry, *pptrptr points to the first element inside the
 branch. On exit it is set to point to the ALT or KET.
 
 Arguments:
@@ -8921,6 +8922,8 @@ get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
 {
 int branchlength = 0;
 int grouplength;
+int max;
+int extra = 0;   /* Additional lookbehind from nesting */
 uint32_t lastitemlength = 0;
 uint32_t *pptr = *pptrptr;
 PCRE2_SIZE offset;
@@ -9067,12 +9070,17 @@ for (;; pptr++)
       }
     break;
 
-    /* Lookbehinds can be ignored, but must themselves be checked. */
+    /* A lookbehind does not contribute any length to this lookbehind, but must
+    itself be checked and have its lengths set. If the maximum lookbehind of
+    any branch is greater than the length so far computed for this branch, we
+    must set an extra value for use when setting the maximum overall
+    lookbehind. */
 
     case META_LOOKBEHIND:
     case META_LOOKBEHINDNOT:
-    if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
+    if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
       return -1;
+    if (max - branchlength > extra) extra = max - branchlength;
     break;
 
     /* Back references and recursions are handled by very similar code. At this
@@ -9264,7 +9272,15 @@ for (;; pptr++)
 
 EXIT:
 *pptrptr = pptr;
-if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
+
+/* The overall maximum lookbehind for any branch in the pattern takes note of
+any extra value that is generated from a nested lookbehind. For example, for
+/(?<=a(?<=ba)c)/ each individual lookbehind has length 2, but the
+max_lookbehind setting is 3 because matching inspects 3 characters before the
+match starting point. */
+
+if (branchlength + extra > cb->max_lookbehind)
+  cb->max_lookbehind = branchlength + extra;
 return branchlength;
 
 PARSED_SKIP_FAILED:
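The arithmetic described in the comment above can be tried out in isolation. What follows is a minimal stand-alone model, not the PCRE2 implementation: the type `sketch_item` and the function `sketch_branch_lookback` are hypothetical names, a branch is reduced to a flat list of items, and each nested lookbehind is represented only by the length of its longest branch (the value the patch returns through `maxptr`). Only one level of nesting is modelled.

```c
#include <stddef.h>

/* Hypothetical model of one item in a lookbehind branch. */
typedef struct {
  int is_lookbehind;  /* 0: ordinary item; 1: nested lookbehind assertion */
  int length;         /* characters matched, or longest nested branch */
} sketch_item;

/* Returns how many characters before the match point one branch may inspect:
   its own fixed length plus any extra reach of a nested lookbehind that
   starts looking back from before the branch has consumed enough characters
   to cover it. Mirrors the "branchlength + extra" rule in the patch. */
int sketch_branch_lookback(const sketch_item *items, size_t count)
{
  int branchlength = 0;
  int extra = 0;
  size_t i;

  for (i = 0; i < count; i++)
    {
    if (items[i].is_lookbehind)
      {
      /* A nested lookbehind adds no length, but may reach further back. */
      int over = items[i].length - branchlength;
      if (over > extra) extra = over;
      }
    else branchlength += items[i].length;
    }

  return branchlength + extra;
}
```

For /(?<=a(?<=ba)c)/ the items are `a` (1 character), a nested lookbehind whose longest branch is 2, and `c` (1 character). The model gives a branch length of 2 and an extra of 1, so 3 in total, which agrees with the value stated in the comment.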
@@ -9285,6 +9301,7 @@ ket.
 
 Arguments:
   pptrptr     pointer to pointer in the parsed pattern
+  maxptr      where to return maximum length for the whole group
   errcodeptr  pointer to error code
   lcptr       pointer to loop counter
   recurses    chain of recurse_check to catch mutual recursion
@@ -9295,11 +9312,12 @@ Returns: TRUE if all is well
 */
 
 static BOOL
-set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
-  parsed_recurse_check *recurses, compile_block *cb)
+set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
+  int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
 {
 PCRE2_SIZE offset;
 int branchlength;
+int max = 0;
 uint32_t *bptr = *pptrptr;
 
 READPLUSOFFSET(offset, bptr);  /* Offset for error messages */
@@ -9316,11 +9334,13 @@ do
     if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
     return FALSE;
     }
+  if (branchlength > max) max = branchlength;
   *bptr |= branchlength;  /* branchlength never more than 65535 */
   bptr = *pptrptr;
   }
 while (*bptr == META_ALT);
 
+*maxptr = max;
 return TRUE;
 }
 
@@ -9344,6 +9364,7 @@ static int
 check_lookbehinds(compile_block *cb)
 {
 uint32_t *pptr;
+int max;
 int errorcode = 0;
 int loopcount = 0;
 
@@ -9446,7 +9467,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
 
     case META_LOOKBEHIND:
     case META_LOOKBEHINDNOT:
-    if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, NULL, cb))
+    if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
       return errorcode;
     break;
     }

@@ -5629,4 +5629,22 @@ a)"xI
     ABC
     AXY
 
+/(?<=(?<=a)b)c.*/I
+    abc\=ph
+\= Expect no match
+    xbc\=ph
+
+/(?<=ab)c.*/I
+    abc\=ph
+\= Expect no match
+    xbc\=ph
+
+/(?<=a(?<=a|a)c)/I
+
+/(?<=a(?<=a|ba)c)/I
+
+/(?<=(?<=a)b)(?<!abcd)/I
+
+/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
+
 # End of testinput2

@@ -17039,6 +17039,54 @@ Subject length lower bound = 1
  0: A
  1: A
 
+/(?<=(?<=a)b)c.*/I
+Capture group count = 0
+Max lookbehind = 2
+First code unit = 'c'
+Subject length lower bound = 1
+    abc\=ph
+Partial match: abc
+               <<
+\= Expect no match
+    xbc\=ph
+No match
+
+/(?<=ab)c.*/I
+Capture group count = 0
+Max lookbehind = 2
+First code unit = 'c'
+Subject length lower bound = 1
+    abc\=ph
+Partial match: abc
+               <<
+\= Expect no match
+    xbc\=ph
+No match
+
+/(?<=a(?<=a|a)c)/I
+Capture group count = 0
+Max lookbehind = 2
+May match empty string
+Subject length lower bound = 0
+
+/(?<=a(?<=a|ba)c)/I
+Capture group count = 0
+Max lookbehind = 3
+May match empty string
+Subject length lower bound = 0
+
+/(?<=(?<=a)b)(?<!abcd)/I
+Capture group count = 0
+Max lookbehind = 4
+May match empty string
+Subject length lower bound = 0
+
+/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
+Capture group count = 0
+Max lookbehind = 5
+May match empty string
+Subject length lower bound = 0
+
 # End of testinput2
 Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data