Improve maximum lookbehind calculation for nested lookbehinds.
parent 7f24a98cfb
commit d21f7daf9b
|
@ -66,6 +66,11 @@ is made possessive and applied to an item in parentheses, because a
|
||||||
parenthesized item may contain multiple branches or other backtracking points,
|
parenthesized item may contain multiple branches or other backtracking points,
|
||||||
for example /(a|ab){1}+c/ or /(a+){1}+a/.
|
for example /(a|ab){1}+c/ or /(a+){1}+a/.
|
||||||
|
|
||||||
|
13. Nested lookbehinds are now taken into account when computing the maximum
|
||||||
|
lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
|
||||||
|
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
|
||||||
|
it to 3, because matching looks back 3 characters.
|
||||||
|
|
||||||
|
|
||||||
Version 10.33 16-April-2019
|
Version 10.33 16-April-2019
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
|
@ -2252,16 +2252,26 @@ defaulted by the caller of the match function.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_MAXLOOKBEHIND
|
PCRE2_INFO_MAXLOOKBEHIND
|
||||||
</pre>
|
</pre>
|
||||||
Return the number of characters (not code units) in the longest lookbehind
|
Return the largest number of characters (not code units) before the current
|
||||||
assertion in the pattern. The third argument should point to a uint32_t
|
matching point that could be inspected while processing a lookbehind assertion
|
||||||
integer. This information is useful when doing multi-segment matching using the
|
in the pattern. Before release 10.34 this request used to give the largest
|
||||||
partial matching facilities. Note that the simple assertions \b and \B
|
value for any individual assertion. Now it takes into account nested
|
||||||
require a one-character lookbehind. \A also registers a one-character
|
lookbehinds, which can mean that the overall value is greater. For example, the
|
||||||
lookbehind, though it does not actually inspect the previous character. This is
|
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
|
||||||
to ensure that at least one character from the old segment is retained when a
|
largest individual lookbehind. Now it returns 3, because matching actually
|
||||||
new segment is processed. Otherwise, if there are no lookbehinds in the
|
looks back 3 characters.
|
||||||
pattern, \A might match incorrectly at the start of a second or subsequent
|
</P>
|
||||||
segment.
|
<P>
|
||||||
|
The third argument should point to a uint32_t integer. This information is
|
||||||
|
useful when doing multi-segment matching using the partial matching facilities.
|
||||||
|
Note that the simple assertions \b and \B require a one-character lookbehind.
|
||||||
|
\A also registers a one-character lookbehind, though it does not actually
|
||||||
|
inspect the previous character. This is to ensure that at least one character
|
||||||
|
from the old segment is retained when a new segment is processed. Otherwise, if
|
||||||
|
there are no lookbehinds in the pattern, \A might match incorrectly at the
|
||||||
|
start of a second or subsequent segment. There are more details in the
|
||||||
|
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||||
|
documentation.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_MINLENGTH
|
PCRE2_INFO_MINLENGTH
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -3836,7 +3846,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 11 June 2019
|
Last updated: 25 June 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -2218,16 +2218,25 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
|
|
||||||
PCRE2_INFO_MAXLOOKBEHIND
|
PCRE2_INFO_MAXLOOKBEHIND
|
||||||
|
|
||||||
Return the number of characters (not code units) in the longest lookbe-
|
Return the largest number of characters (not code units) before the
|
||||||
hind assertion in the pattern. The third argument should point to a
|
current matching point that could be inspected while processing a look-
|
||||||
uint32_t integer. This information is useful when doing multi-segment
|
behind assertion in the pattern. Before release 10.34 this request used
|
||||||
matching using the partial matching facilities. Note that the simple
|
to give the largest value for any individual assertion. Now it takes
|
||||||
assertions \b and \B require a one-character lookbehind. \A also regis-
|
into account nested lookbehinds, which can mean that the overall value
|
||||||
ters a one-character lookbehind, though it does not actually inspect
|
is greater. For example, the pattern (?<=a(?<=ba)c) previously returned
|
||||||
the previous character. This is to ensure that at least one character
|
2, because that is the length of the largest individual lookbehind. Now
|
||||||
from the old segment is retained when a new segment is processed. Oth-
|
it returns 3, because matching actually looks back 3 characters.
|
||||||
erwise, if there are no lookbehinds in the pattern, \A might match in-
|
|
||||||
correctly at the start of a second or subsequent segment.
|
The third argument should point to a uint32_t integer. This information
|
||||||
|
is useful when doing multi-segment matching using the partial matching
|
||||||
|
facilities. Note that the simple assertions \b and \B require a one-
|
||||||
|
character lookbehind. \A also registers a one-character lookbehind,
|
||||||
|
though it does not actually inspect the previous character. This is to
|
||||||
|
ensure that at least one character from the old segment is retained
|
||||||
|
when a new segment is processed. Otherwise, if there are no lookbehinds
|
||||||
|
in the pattern, \A might match incorrectly at the start of a second or
|
||||||
|
subsequent segment. There are more details in the pcre2partial documen-
|
||||||
|
tation.
|
||||||
|
|
||||||
PCRE2_INFO_MINLENGTH
|
PCRE2_INFO_MINLENGTH
|
||||||
|
|
||||||
|
@ -3693,7 +3702,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 11 June 2019
|
Last updated: 25 June 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "11 June 2019" "PCRE2 10.34"
|
.TH PCRE2API 3 "25 June 2019" "PCRE2 10.34"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -2215,16 +2215,27 @@ defaulted by the caller of the match function.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_MAXLOOKBEHIND
|
PCRE2_INFO_MAXLOOKBEHIND
|
||||||
.sp
|
.sp
|
||||||
Return the number of characters (not code units) in the longest lookbehind
|
Return the largest number of characters (not code units) before the current
|
||||||
assertion in the pattern. The third argument should point to a uint32_t
|
matching point that could be inspected while processing a lookbehind assertion
|
||||||
integer. This information is useful when doing multi-segment matching using the
|
in the pattern. Before release 10.34 this request used to give the largest
|
||||||
partial matching facilities. Note that the simple assertions \eb and \eB
|
value for any individual assertion. Now it takes into account nested
|
||||||
require a one-character lookbehind. \eA also registers a one-character
|
lookbehinds, which can mean that the overall value is greater. For example, the
|
||||||
lookbehind, though it does not actually inspect the previous character. This is
|
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
|
||||||
to ensure that at least one character from the old segment is retained when a
|
largest individual lookbehind. Now it returns 3, because matching actually
|
||||||
new segment is processed. Otherwise, if there are no lookbehinds in the
|
looks back 3 characters.
|
||||||
pattern, \eA might match incorrectly at the start of a second or subsequent
|
.P
|
||||||
segment.
|
The third argument should point to a uint32_t integer. This information is
|
||||||
|
useful when doing multi-segment matching using the partial matching facilities.
|
||||||
|
Note that the simple assertions \eb and \eB require a one-character lookbehind.
|
||||||
|
\eA also registers a one-character lookbehind, though it does not actually
|
||||||
|
inspect the previous character. This is to ensure that at least one character
|
||||||
|
from the old segment is retained when a new segment is processed. Otherwise, if
|
||||||
|
there are no lookbehinds in the pattern, \eA might match incorrectly at the
|
||||||
|
start of a second or subsequent segment. There are more details in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2partial\fP
|
||||||
|
.\"
|
||||||
|
documentation.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_MINLENGTH
|
PCRE2_INFO_MINLENGTH
|
||||||
.sp
|
.sp
|
||||||
|
@ -3848,6 +3859,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 11 June 2019
|
Last updated: 25 June 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
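The revised documentation above stresses that PCRE2_INFO_MAXLOOKBEHIND counts characters, not code units, and that the value matters when text from an old segment has to be retained for multi-segment matching. A minimal sketch of that bookkeeping for UTF-8 data follows. The helper name `retain_offset` and its parameters are hypothetical and not part of PCRE2. It assumes the buffer holds valid UTF-8 and that the caller already knows the offset (`point`) before which lookbehind context is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Given a buffer of valid UTF-8 and
   an offset `point` into it, return the offset that lies `max_lookbehind`
   characters (not code units) before `point`, or 0 if the buffer does not
   contain that many characters before `point`. */
static size_t retain_offset(const unsigned char *buf, size_t point,
                            uint32_t max_lookbehind)
{
  size_t pos = point;
  assert(buf != NULL || point == 0);
  while (max_lookbehind > 0 && pos > 0)
    {
    pos--;                                  /* step back one code unit */
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
      pos--;                                /* skip UTF-8 continuation bytes */
    max_lookbehind--;
    }
  return pos;
}
```

With a maximum lookbehind of 3 and `point` at the end of the five-byte buffer "xxabc", the helper returns 2, so "abc" would be kept. Counting in characters is what makes the inner loop necessary: a two-byte character still uses up only one unit of the lookbehind budget.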
@ -132,8 +132,8 @@ static int
|
||||||
compile_block *);
|
compile_block *);
|
||||||
|
|
||||||
static BOOL
|
static BOOL
|
||||||
set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
|
set_lookbehind_lengths(uint32_t **, int *, int *, int *,
|
||||||
compile_block *);
|
parsed_recurse_check *, compile_block *);
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@ -8902,7 +8902,8 @@ return -1;
|
||||||
|
|
||||||
/* Return a fixed length for a branch in a lookbehind, giving an error if the
|
/* Return a fixed length for a branch in a lookbehind, giving an error if the
|
||||||
length is not fixed. If any lookbehinds are encountered on the way, they get
|
length is not fixed. If any lookbehinds are encountered on the way, they get
|
||||||
their length set. On entry, *pptrptr points to the first element inside the
|
their length set, and there is a check for them looking further back than the
|
||||||
|
current lookbehind. On entry, *pptrptr points to the first element inside the
|
||||||
branch. On exit it is set to point to the ALT or KET.
|
branch. On exit it is set to point to the ALT or KET.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
|
@ -8921,6 +8922,8 @@ get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
||||||
{
|
{
|
||||||
int branchlength = 0;
|
int branchlength = 0;
|
||||||
int grouplength;
|
int grouplength;
|
||||||
|
int max;
|
||||||
|
int extra = 0; /* Additional lookbehind from nesting */
|
||||||
uint32_t lastitemlength = 0;
|
uint32_t lastitemlength = 0;
|
||||||
uint32_t *pptr = *pptrptr;
|
uint32_t *pptr = *pptrptr;
|
||||||
PCRE2_SIZE offset;
|
PCRE2_SIZE offset;
|
||||||
|
@ -9067,12 +9070,17 @@ for (;; pptr++)
|
||||||
}
|
}
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* Lookbehinds can be ignored, but must themselves be checked. */
|
/* A lookbehind does not contribute any length to this lookbehind, but must
|
||||||
|
itself be checked and have its lengths set. If the maximum lookbehind of
|
||||||
|
any branch is greater than the length so far computed for this branch, we
|
||||||
|
must set an extra value for use when setting the maximum overall
|
||||||
|
lookbehind. */
|
||||||
|
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
|
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
|
||||||
return -1;
|
return -1;
|
||||||
|
if (max - branchlength > extra) extra = max - branchlength;
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* Back references and recursions are handled by very similar code. At this
|
/* Back references and recursions are handled by very similar code. At this
|
||||||
|
@ -9264,7 +9272,15 @@ for (;; pptr++)
|
||||||
|
|
||||||
EXIT:
|
EXIT:
|
||||||
*pptrptr = pptr;
|
*pptrptr = pptr;
|
||||||
if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
|
|
||||||
|
/* The overall maximum lookbehind for any branch in the pattern takes note of
|
||||||
|
any extra value that is generated from a nested lookbehind. For example, for
|
||||||
|
/(?<=a(?<=ba)c)/ each individual lookbehind has length 2, but the
|
||||||
|
max_lookbehind setting is 3 because matching inspects 3 characters before the
|
||||||
|
match starting point. */
|
||||||
|
|
||||||
|
if (branchlength + extra > cb->max_lookbehind)
|
||||||
|
cb->max_lookbehind = branchlength + extra;
|
||||||
return branchlength;
|
return branchlength;
|
||||||
|
|
||||||
PARSED_SKIP_FAILED:
|
PARSED_SKIP_FAILED:
|
||||||
|
@ -9285,6 +9301,7 @@ ket.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
pptrptr pointer to pointer in the parsed pattern
|
pptrptr pointer to pointer in the parsed pattern
|
||||||
|
maxptr where to return maximum length for the whole group
|
||||||
errcodeptr pointer to error code
|
errcodeptr pointer to error code
|
||||||
lcptr pointer to loop counter
|
lcptr pointer to loop counter
|
||||||
recurses chain of recurse_check to catch mutual recursion
|
recurses chain of recurse_check to catch mutual recursion
|
||||||
|
@ -9295,11 +9312,12 @@ Returns: TRUE if all is well
|
||||||
*/
|
*/
|
||||||
|
|
||||||
static BOOL
|
static BOOL
|
||||||
set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
|
||||||
parsed_recurse_check *recurses, compile_block *cb)
|
int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
|
||||||
{
|
{
|
||||||
PCRE2_SIZE offset;
|
PCRE2_SIZE offset;
|
||||||
int branchlength;
|
int branchlength;
|
||||||
|
int max = 0;
|
||||||
uint32_t *bptr = *pptrptr;
|
uint32_t *bptr = *pptrptr;
|
||||||
|
|
||||||
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
||||||
|
@ -9316,11 +9334,13 @@ do
|
||||||
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
|
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
|
||||||
return FALSE;
|
return FALSE;
|
||||||
}
|
}
|
||||||
|
if (branchlength > max) max = branchlength;
|
||||||
*bptr |= branchlength; /* branchlength never more than 65535 */
|
*bptr |= branchlength; /* branchlength never more than 65535 */
|
||||||
bptr = *pptrptr;
|
bptr = *pptrptr;
|
||||||
}
|
}
|
||||||
while (*bptr == META_ALT);
|
while (*bptr == META_ALT);
|
||||||
|
|
||||||
|
*maxptr = max;
|
||||||
return TRUE;
|
return TRUE;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -9344,6 +9364,7 @@ static int
|
||||||
check_lookbehinds(compile_block *cb)
|
check_lookbehinds(compile_block *cb)
|
||||||
{
|
{
|
||||||
uint32_t *pptr;
|
uint32_t *pptr;
|
||||||
|
int max;
|
||||||
int errorcode = 0;
|
int errorcode = 0;
|
||||||
int loopcount = 0;
|
int loopcount = 0;
|
||||||
|
|
||||||
|
@ -9446,7 +9467,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
|
||||||
|
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, NULL, cb))
|
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
|
||||||
return errorcode;
|
return errorcode;
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
|
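The arithmetic introduced in get_branchlength() and set_lookbehind_lengths() above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the `item` and `branch` types and the function names are hypothetical, and they stand in for PCRE2's real parsed-pattern vector, which is considerably more involved. Only the `extra` and `max` bookkeeping is intended to mirror the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified pattern model (not PCRE2's data structures).
   An item is either a literal run of fixed length or a nested lookbehind
   group made of one or more branches. */
typedef struct branch branch;

typedef struct item {
  int literal_length;     /* used when nested == NULL */
  const branch *nested;   /* branches of a nested lookbehind, or NULL */
  size_t nested_count;    /* number of branches in nested */
} item;

struct branch {
  const item *items;
  size_t item_count;
};

static int group_max(const branch *branches, size_t count, int *max_lookbehind);

/* Return the length of one branch. As in the diff, a nested lookbehind adds
   no length, but may add "extra" reach beyond the start of this branch. */
static int branch_length(const branch *b, int *max_lookbehind)
{
  int branchlength = 0;
  int extra = 0;          /* additional lookbehind from nesting */
  for (size_t i = 0; i < b->item_count; i++)
    {
    const item *it = &b->items[i];
    if (it->nested == NULL)
      {
      assert(it->literal_length >= 0);
      branchlength += it->literal_length;
      }
    else
      {
      int max = group_max(it->nested, it->nested_count, max_lookbehind);
      if (max - branchlength > extra) extra = max - branchlength;
      }
    }
  if (branchlength + extra > *max_lookbehind)
    *max_lookbehind = branchlength + extra;
  return branchlength;
}

/* Return the largest branch length in a lookbehind group. */
static int group_max(const branch *branches, size_t count, int *max_lookbehind)
{
  int max = 0;
  for (size_t i = 0; i < count; i++)
    {
    int len = branch_length(&branches[i], max_lookbehind);
    if (len > max) max = len;
    }
  return max;
}

/* Model (?<=a(?<=X|Y)c), where X and Y are literal alternatives of the given
   lengths, and return the overall maximum lookbehind. */
static int model_max_lookbehind(int alt1_len, int alt2_len)
{
  const item alt1_items[] = { { alt1_len, NULL, 0 } };
  const item alt2_items[] = { { alt2_len, NULL, 0 } };
  const branch inner[] = { { alt1_items, 1 }, { alt2_items, 1 } };
  const item outer_items[] = {
    { 1, NULL, 0 },       /* a */
    { 0, inner, 2 },      /* (?<=X|Y) */
    { 1, NULL, 0 }        /* c */
  };
  const branch outer = { outer_items, 3 };
  int max_lookbehind = 0;
  (void)group_max(&outer, 1, &max_lookbehind);
  return max_lookbehind;
}
```

In this model, `model_max_lookbehind(1, 2)` corresponds to /(?<=a(?<=a|ba)c)/ and yields 3, while `model_max_lookbehind(1, 1)` corresponds to /(?<=a(?<=a|a)c)/ and yields 2. Both agree with the "Max lookbehind" values in the test output added by this commit.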
@ -5629,4 +5629,22 @@ a)"xI
|
||||||
ABC
|
ABC
|
||||||
AXY
|
AXY
|
||||||
|
|
||||||
|
/(?<=(?<=a)b)c.*/I
|
||||||
|
abc\=ph
|
||||||
|
\= Expect no match
|
||||||
|
xbc\=ph
|
||||||
|
|
||||||
|
/(?<=ab)c.*/I
|
||||||
|
abc\=ph
|
||||||
|
\= Expect no match
|
||||||
|
xbc\=ph
|
||||||
|
|
||||||
|
/(?<=a(?<=a|a)c)/I
|
||||||
|
|
||||||
|
/(?<=a(?<=a|ba)c)/I
|
||||||
|
|
||||||
|
/(?<=(?<=a)b)(?<!abcd)/I
|
||||||
|
|
||||||
|
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
|
|
|
@ -17039,6 +17039,54 @@ Subject length lower bound = 1
|
||||||
0: A
|
0: A
|
||||||
1: A
|
1: A
|
||||||
|
|
||||||
|
/(?<=(?<=a)b)c.*/I
|
||||||
|
Capture group count = 0
|
||||||
|
Max lookbehind = 2
|
||||||
|
First code unit = 'c'
|
||||||
|
Subject length lower bound = 1
|
||||||
|
abc\=ph
|
||||||
|
Partial match: abc
|
||||||
|
<<
|
||||||
|
\= Expect no match
|
||||||
|
xbc\=ph
|
||||||
|
No match
|
||||||
|
|
||||||
|
/(?<=ab)c.*/I
|
||||||
|
Capture group count = 0
|
||||||
|
Max lookbehind = 2
|
||||||
|
First code unit = 'c'
|
||||||
|
Subject length lower bound = 1
|
||||||
|
abc\=ph
|
||||||
|
Partial match: abc
|
||||||
|
<<
|
||||||
|
\= Expect no match
|
||||||
|
xbc\=ph
|
||||||
|
No match
|
||||||
|
|
||||||
|
/(?<=a(?<=a|a)c)/I
|
||||||
|
Capture group count = 0
|
||||||
|
Max lookbehind = 2
|
||||||
|
May match empty string
|
||||||
|
Subject length lower bound = 0
|
||||||
|
|
||||||
|
/(?<=a(?<=a|ba)c)/I
|
||||||
|
Capture group count = 0
|
||||||
|
Max lookbehind = 3
|
||||||
|
May match empty string
|
||||||
|
Subject length lower bound = 0
|
||||||
|
|
||||||
|
/(?<=(?<=a)b)(?<!abcd)/I
|
||||||
|
Capture group count = 0
|
||||||
|
Max lookbehind = 4
|
||||||
|
May match empty string
|
||||||
|
Subject length lower bound = 0
|
||||||
|
|
||||||
|
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
|
||||||
|
Capture group count = 0
|
||||||
|
Max lookbehind = 5
|
||||||
|
May match empty string
|
||||||
|
Subject length lower bound = 0
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
||||||
Error -62: bad serialized data
|
Error -62: bad serialized data
|
||||||
|
|