Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current difficulty if the whole first segment cannot be retained.
parent
87bc092222
commit
963b570fd0
37
ChangeLog
|
@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
|
||||||
parenthesized item may contain multiple branches or other backtracking points,
|
parenthesized item may contain multiple branches or other backtracking points,
|
||||||
for example /(a|ab){1}+c/ or /(a+){1}+a/.
|
for example /(a|ab){1}+c/ or /(a+){1}+a/.
|
||||||
|
|
||||||
13. Nested lookbehinds are now taken into account when computing the maximum
|
13. For partial matches, pcre2test was always showing the maximum lookbehind
|
||||||
lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
|
|
||||||
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
|
|
||||||
it to 3, because matching looks back 3 characters.
|
|
||||||
|
|
||||||
14. For partial matches, pcre2test was always showing the maximum lookbehind
|
|
||||||
characters, flagged with "<", which is misleading when the lookbehind didn't
|
characters, flagged with "<", which is misleading when the lookbehind didn't
|
||||||
actually look behind the start (because it was later in the pattern). Showing
|
actually look behind the start (because it was later in the pattern). Showing
|
||||||
all consulted preceding characters for partial matches is now controlled by the
|
all consulted preceding characters for partial matches is now controlled by the
|
||||||
|
@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
|
||||||
available only for non-JIT matching, because JIT does not maintain the first
|
available only for non-JIT matching, because JIT does not maintain the first
|
||||||
and last consulted characters.
|
and last consulted characters.
|
||||||
|
|
||||||
15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
|
14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
|
||||||
if the end of the subject was encountered in a lookahead (conditional or
|
if the end of the subject was encountered in a lookahead (conditional or
|
||||||
otherwise), an atomic group, or a recursion.
|
otherwise), an atomic group, or a recursion.
|
||||||
|
|
||||||
16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
|
15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
|
||||||
|
|
||||||
17. Check for integer overflow when computing lookbehind lengths. Fixes
|
16. Check for integer overflow when computing lookbehind lengths. Fixes
|
||||||
Clusterfuzz issue 15636.
|
Clusterfuzz issue 15636.
|
||||||
|
|
||||||
18. Implemented non-atomic positive lookaround assertions.
|
17. Implemented non-atomic positive lookaround assertions.
|
||||||
|
|
||||||
19. If a lookbehind contained a lookahead that contained another lookbehind
|
18. If a lookbehind contained a lookahead that contained another lookbehind
|
||||||
within it, the nested lookbehind was not correctly processed. For example, if
|
within it, the nested lookbehind was not correctly processed. For example, if
|
||||||
/(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
|
/(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
|
||||||
"b".
|
"b".
|
||||||
|
|
||||||
20. Implemented pcre2_get_match_data_size().
|
19. Implemented pcre2_get_match_data_size().
|
||||||
|
|
||||||
21. Two alterations to partial matching (not yet done by JIT):
|
20. Two alterations to partial matching (not yet done by JIT):
|
||||||
|
|
||||||
(a) The definition of a partial match is slightly changed: if a pattern
|
(a) The definition of a partial match is slightly changed: if a pattern
|
||||||
contains any lookbehinds, an empty partial match may be given, because this
|
contains any lookbehinds, an empty partial match may be given, because this
|
||||||
|
@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
|
||||||
(c) An empty string partial hard match can be returned for \z and \Z as it
|
(c) An empty string partial hard match can be returned for \z and \Z as it
|
||||||
is documented that they shouldn't match.
|
is documented that they shouldn't match.
|
||||||
|
|
||||||
22. A branch that started with (*ACCEPT) was not being recognized as one that
|
21. A branch that started with (*ACCEPT) was not being recognized as one that
|
||||||
could match an empty string.
|
could match an empty string.
|
||||||
|
|
||||||
23. Corrected pcre2_set_character_tables() tables data type: was const unsigned
|
22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
|
||||||
char * instead of const uint8_t *, as generated by pcre2_maketables().
|
char * instead of const uint8_t *, as generated by pcre2_maketables().
|
||||||
|
|
||||||
24. Upgraded to Unicode 12.1.0.
|
23. Upgraded to Unicode 12.1.0.
|
||||||
|
|
||||||
25. Add -jitfast command line option to pcre2test (to make all the jit options
|
24. Add -jitfast command line option to pcre2test (to make all the jit options
|
||||||
available directly).
|
available directly).
|
||||||
|
|
||||||
26. Make pcre2test -C show if libreadline or libedit is supported.
|
25. Make pcre2test -C show if libreadline or libedit is supported.
|
||||||
|
|
||||||
28. If the length of one branch of a group exceeded 65535 (the maximum value
|
26. If the length of one branch of a group exceeded 65535 (the maximum value
|
||||||
that is remembered as a minimum length), the whole group's length was
|
that is remembered as a minimum length), the whole group's length was
|
||||||
incorrectly recorded as 65535, leading to incorrect "no match" when start-up
|
incorrectly recorded as 65535, leading to incorrect "no match" when start-up
|
||||||
optimizations were in force.
|
optimizations were in force.
|
||||||
|
|
||||||
29. The "rightmost consulted character" value was not always correct; in
|
27. The "rightmost consulted character" value was not always correct; in
|
||||||
particular, if a pattern ended with a negative lookahead, characters that were
|
particular, if a pattern ended with a negative lookahead, characters that were
|
||||||
inspected in that lookahead were not included.
|
inspected in that lookahead were not included.
|
||||||
|
|
||||||
30. Add the pcre2_maketables_free() function.
|
28. Add the pcre2_maketables_free() function.
|
||||||
|
|
||||||
|
|
||||||
Version 10.33 16-April-2019
|
Version 10.33 16-April-2019
|
||||||
|
|
|
@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_MAXLOOKBEHIND
|
PCRE2_INFO_MAXLOOKBEHIND
|
||||||
</pre>
|
</pre>
|
||||||
Return the largest number of characters (not code units) before the current
|
A lookbehind assertion moves back a certain number of characters (not code
|
||||||
matching point that could be inspected while processing a lookbehind assertion
|
units) when it starts to process each of its branches. This request returns the
|
||||||
in the pattern. Before release 10.34 this request used to give the largest
|
largest of these backward moves. The third argument should point to a uint32_t
|
||||||
value for any individual assertion. Now it takes into account nested
|
integer. The simple assertions \b and \B require a one-character lookbehind
|
||||||
lookbehinds, which can mean that the overall value is greater. For example, the
|
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
|
||||||
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
|
longer. \A also registers a one-character lookbehind, though it does not
|
||||||
largest individual lookbehind. Now it returns 3, because matching actually
|
actually inspect the previous character.
|
||||||
looks back 3 characters.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The third argument should point to a uint32_t integer. This information is
|
Note that this information is useful for multi-segment matching only
|
||||||
useful when doing multi-segment matching using the partial matching facilities.
|
if the pattern contains no nested lookbehinds. For example, the pattern
|
||||||
Note that the simple assertions \b and \B require a one-character lookbehind.
|
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
|
||||||
\A also registers a one-character lookbehind, though it does not actually
|
first lookbehind moves back by two characters, matches one character, then the
|
||||||
inspect the previous character. This is to ensure that at least one character
|
nested lookbehind also moves back by two characters. This puts the matching
|
||||||
from the old segment is retained when a new segment is processed. Otherwise, if
|
point three characters earlier than it was at the start.
|
||||||
there are no lookbehinds in the pattern, \A might match incorrectly at the
|
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
|
||||||
start of a second or subsequent segment. There are more details in the
|
|
||||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||||
documentation.
|
documentation for a discussion of multi-segment matching.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_MINLENGTH
|
PCRE2_INFO_MINLENGTH
|
||||||
</pre>
|
</pre>
|
||||||
|
|
|
@ -49,7 +49,7 @@ complete match, though the details differ between the two types of matching
|
||||||
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If you want to use partial matching with just-in-time optimized code, as well
|
If you want to use partial matching with just-in-time optimized code, as well
|
||||||
as setting a partial match option for the matching function, you must also call
|
as setting a partial match option for the matching function, you must also call
|
||||||
<b>pcre2_jit_compile()</b> with one or both of these options:
|
<b>pcre2_jit_compile()</b> with one or both of these options:
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -101,7 +101,7 @@ matched string.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
(2) The pattern contains one or more lookbehind assertions. This condition
|
(2) The pattern contains one or more lookbehind assertions. This condition
|
||||||
exists in case there is a lookbehind that inspects characters before the start
|
exists in case there is a lookbehind that inspects characters before the start
|
||||||
of the match.
|
of the match.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -171,7 +171,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
|
||||||
partial match is found, without continuing to search for possible complete
|
partial match is found, without continuing to search for possible complete
|
||||||
matches. This option is "hard" because it prefers an earlier partial match over
|
matches. This option is "hard" because it prefers an earlier partial match over
|
||||||
a later complete match. For this reason, the assumption is made that the end of
|
a later complete match. For this reason, the assumption is made that the end of
|
||||||
the supplied subject string is not the true end of the available data, which is
|
the supplied subject string is not the true end of the available data, which is
|
||||||
why \z, \Z, \b, \B, and $ always give a partial match.
|
why \z, \Z, \b, \B, and $ always give a partial match.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -226,7 +226,7 @@ date:
|
||||||
data> 3juj\=ph
|
data> 3juj\=ph
|
||||||
No match
|
No match
|
||||||
</pre>
|
</pre>
|
||||||
This example gives the same results for both hard and soft partial matching
|
This example gives the same results for both hard and soft partial matching
|
||||||
options. Here is an example where there is a difference:
|
options. Here is an example where there is a difference:
|
||||||
<pre>
|
<pre>
|
||||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
|
@ -234,7 +234,7 @@ options. Here is an example where there is a difference:
|
||||||
0: 25jun04
|
0: 25jun04
|
||||||
1: jun
|
1: jun
|
||||||
data> 25jun04\=ph
|
data> 25jun04\=ph
|
||||||
Partial match: 25jun04
|
Partial match: 25jun04
|
||||||
</pre>
|
</pre>
|
||||||
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||||
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
||||||
|
@ -244,9 +244,12 @@ there is only a partial match.
|
||||||
<P>
|
<P>
|
||||||
PCRE was not originally designed with multi-segment matching in mind. However,
|
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||||
over time, features (including partial matching) that make multi-segment
|
over time, features (including partial matching) that make multi-segment
|
||||||
matching possible have been added. The string is searched segment by segment by
|
matching possible have been added. A very long string can be searched segment
|
||||||
calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
|
by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
|
||||||
results that would happen if the entire string was available for searching.
|
the same results that would happen if the entire string was available for
|
||||||
|
searching all the time. Normally, the strings that are being sought are much
|
||||||
|
shorter than each individual segment, and are in the middle of very long
|
||||||
|
strings, so the pattern is normally not anchored.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Special logic must be implemented to handle a matched substring that spans a
|
Special logic must be implemented to handle a matched substring that spans a
|
||||||
|
@ -256,11 +259,10 @@ changing the match by adding more characters. The PCRE2_NOTBOL option should
|
||||||
also be set for all but the first segment.
|
also be set for all but the first segment.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When a partial match occurs, the next segment must be added to the current
|
When a partial match occurs, the next segment must be added to the current
|
||||||
subject and the match re-run, using the <i>startoffset</i> argument of
|
subject and the match re-run, using the <i>startoffset</i> argument of
|
||||||
<b>pcre2_match()</b> to begin at the point where the partial match started.
|
<b>pcre2_match()</b> to begin at the point where the partial match started.
|
||||||
Multi-segment matching is usually used to search for substrings in the middle
|
For example:
|
||||||
of very long sequences, so the patterns are normally not anchored. For example:
|
|
||||||
<pre>
|
<pre>
|
||||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||||
data> ...the date is 23ja\=ph
|
data> ...the date is 23ja\=ph
|
||||||
|
@ -269,51 +271,52 @@ of very long sequences, so the patterns are normally not anchored. For example:
|
||||||
0: 23jan19
|
0: 23jan19
|
||||||
1: jan
|
1: jan
|
||||||
</pre>
|
</pre>
|
||||||
Note the use of the <b>offset</b> modifier to start the new match where the
|
Note the use of the <b>offset</b> modifier to start the new match where the
|
||||||
partial match was found.
|
partial match was found. In this example, the next segment was added to the one
|
||||||
|
in which the partial match was found. This is the most straightforward
|
||||||
|
approach, typically using a memory buffer that is twice the size of each
|
||||||
|
segment. After a partial match, the first half of the buffer is discarded, the
|
||||||
|
second half is moved to the start of the buffer, and a new segment is added
|
||||||
|
before repeating the match as in the example above. After a no match, the
|
||||||
|
entire buffer can be discarded.
|
||||||
</P>
|
</P>
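The double-size buffer approach described in the added text can be sketched roughly as follows. This is a minimal illustration, not code from PCRE2: the `seg_buffer` type and the `seg_buffer_slide()` function are hypothetical names, and the sketch assumes that any partial match starts in the second half of the buffer (which holds when the strings being sought are much shorter than a segment).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical double-segment buffer; the names are illustrative only and
   are not part of the PCRE2 API. */
typedef struct {
    char  *data;      /* storage with capacity 2 * seg_size */
    size_t seg_size;  /* size of one segment */
    size_t used;      /* number of valid bytes currently in data */
} seg_buffer;

/* If the buffer already holds more than one segment, discard the first
   segment and slide the remainder to the front; then append the new segment.
   Returns the number of bytes discarded, so that the caller can subtract it
   from the partial match offset before using that as startoffset. */
size_t seg_buffer_slide(seg_buffer *b, const char *seg, size_t seg_len)
{
    assert(b != NULL && seg != NULL && seg_len <= b->seg_size);
    size_t drop = 0;
    if (b->used > b->seg_size) {
        drop = b->seg_size;
        memmove(b->data, b->data + drop, b->used - drop);
        b->used -= drop;
    }
    memcpy(b->data + b->used, seg, seg_len);
    b->used += seg_len;
    return drop;
}
```

After a partial match, the caller would call `seg_buffer_slide()` with the next segment, reduce the start of the partial match (ovector[0]) by the returned value, and pass the result as the <i>startoffset</i> argument when re-running the match on the whole buffer.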
|
||||||
<P>
|
<P>
|
||||||
In this simple example, the next segment was just added to the one in which the
|
If there are memory constraints, you may want to discard text that precedes a
|
||||||
partial match was found. However, if there are memory constraints, it may be
|
partial match before adding the next segment. Unfortunately, this is not at
|
||||||
necessary to discard text that precedes the partial match before adding the
|
present straightforward. In cases such as the above, where the pattern does not
|
||||||
next segment. In cases such as the above, where the pattern does not contain
|
contain any lookbehinds, it is sufficient to retain only the partially matched
|
||||||
any lookbehinds, it is sufficient to retain only the partially matched
|
substring. However, if the pattern contains a lookbehind assertion, characters
|
||||||
substring. However, if a pattern contains a lookbehind assertion, characters
|
|
||||||
that precede the start of the partial match may have been inspected during the
|
that precede the start of the partial match may have been inspected during the
|
||||||
matching process.
|
matching process. When <b>pcre2test</b> displays a partial match, it indicates
|
||||||
</P>
|
these characters with '<' if the <b>allusedtext</b> modifier is set:
|
||||||
<P>
|
|
||||||
The only lookbehind information that is available is the length of the longest
|
|
||||||
lookbehind in a pattern. This may not, of course, be at the start of the
|
|
||||||
pattern, but retaining that many characters before the partial match is
|
|
||||||
sufficient, if not always strictly necessary. The way to do this is as follows:
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
Before doing any matching, find the length of the longest lookbehind in the
|
|
||||||
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
|
|
||||||
option. Note that the resulting count is in characters, not code units. After a
|
|
||||||
partial match, moving back from the ovector[0] offset in the subject by the
|
|
||||||
number of characters given for the maximum lookbehind gets you to the earliest
|
|
||||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
|
||||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
|
||||||
while moving back through the code units. Characters before the point you have
|
|
||||||
now reached can be discarded.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
|
||||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
|
||||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
|
||||||
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
|
|
||||||
displays a partial match, it indicates the lookbehind characters with '<'
|
|
||||||
characters if the <b>allusedtext</b> modifier is set:
|
|
||||||
<pre>
|
<pre>
|
||||||
re> "(?<=123)abc"
|
re> "(?<=123)abc"
|
||||||
data> xx123ab\=ph,allusedtext
|
data> xx123ab\=ph,allusedtext
|
||||||
Partial match: 123ab
|
Partial match: 123ab
|
||||||
<<<
|
<<<
|
||||||
</pre>
|
</pre>
|
||||||
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
|
However, the \fPallusedtext\fP modifier is not available for JIT matching,
|
||||||
because JIT matching does not maintain the first and last consulted characters.
|
because JIT matching does not record the first (or last) consulted characters.
|
||||||
|
For this reason, this information is not available via the API. It is therefore
|
||||||
|
not possible in general to obtain the exact number of characters that must be
|
||||||
|
retained in order to get the right match result. If you cannot retain the
|
||||||
|
entire segment, you must find some heuristic way of choosing.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If you know the approximate length of the matching substrings, you can use that
|
||||||
|
to decide how much text to retain. The only lookbehind information that is
|
||||||
|
currently available via the API is the length of the longest individual
|
||||||
|
lookbehind in a pattern, but this can be misleading if there are nested
|
||||||
|
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
|
||||||
|
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
|
||||||
|
units) that any individual lookbehind moves back when it is processed. A
|
||||||
|
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
|
||||||
|
inspects two characters before its starting point.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
|
||||||
|
UTF-8 or UTF-16 you have to count characters while moving back through the code
|
||||||
|
units.
|
||||||
</P>
|
</P>
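Counting characters while moving back through UTF-8 code units, as mentioned in the paragraph above, might look like the sketch below. The helper `utf8_move_back()` is a hypothetical name, not a PCRE2 function, and it assumes the buffer holds valid UTF-8. For UTF-16 the same idea would apply, with a test for low surrogates in place of the continuation-byte test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PCRE2 API): starting at code unit
   offset pos in a valid UTF-8 buffer, move back by up to nchars characters
   and return the resulting code unit offset. The result never goes below 0. */
size_t utf8_move_back(const uint8_t *buf, size_t pos, uint32_t nchars)
{
    assert(buf != NULL);
    while (nchars > 0 && pos > 0) {
        pos--;  /* step onto the previous code unit */
        /* Continuation bytes have the form 10xxxxxx; skip back over them to
           reach the first code unit of the character. */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}
```

In a non-UTF or 32-bit case the equivalent step is a plain subtraction, clamped at zero. Note that, as the text above explains, the value returned for PCRE2_INFO_MAXLOOKBEHIND is not a reliable count to pass as `nchars` when the pattern contains nested lookbehinds, so how many characters to move back remains a heuristic choice.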
|
||||||
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -379,11 +382,11 @@ want.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If you do want to allow for starting again at the next character, one way of
|
If you do want to allow for starting again at the next character, one way of
|
||||||
doing it is to retain the matched part of the segment and try a new complete
|
doing it is to retain some or all of the segment and try a new complete match,
|
||||||
match, as described for <b>pcre2_match()</b> above. Another possibility is to
|
as described for <b>pcre2_match()</b> above. Another possibility is to work with
|
||||||
work with two buffers. If a partial match at offset <i>n</i> in the first buffer
|
two buffers. If a partial match at offset <i>n</i> in the first buffer is
|
||||||
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
|
||||||
you can then try a new match starting at offset <i>n+1</i> in the first buffer.
|
can then try a new match starting at offset <i>n+1</i> in the first buffer.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -396,7 +399,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 07 August 2019
|
Last updated: 04 September 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
1346
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_MAXLOOKBEHIND
|
PCRE2_INFO_MAXLOOKBEHIND
|
||||||
.sp
|
.sp
|
||||||
Return the largest number of characters (not code units) before the current
|
A lookbehind assertion moves back a certain number of characters (not code
|
||||||
matching point that could be inspected while processing a lookbehind assertion
|
units) when it starts to process each of its branches. This request returns the
|
||||||
in the pattern. Before release 10.34 this request used to give the largest
|
largest of these backward moves. The third argument should point to a uint32_t
|
||||||
value for any individual assertion. Now it takes into account nested
|
integer. The simple assertions \eb and \eB require a one-character lookbehind
|
||||||
lookbehinds, which can mean that the overall value is greater. For example, the
|
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
|
||||||
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
|
longer. \eA also registers a one-character lookbehind, though it does not
|
||||||
largest individual lookbehind. Now it returns 3, because matching actually
|
actually inspect the previous character.
|
||||||
looks back 3 characters.
|
|
||||||
.P
|
.P
|
||||||
The third argument should point to a uint32_t integer. This information is
|
Note that this information is useful for multi-segment matching only
|
||||||
useful when doing multi-segment matching using the partial matching facilities.
|
if the pattern contains no nested lookbehinds. For example, the pattern
|
||||||
Note that the simple assertions \eb and \eB require a one-character lookbehind.
|
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
|
||||||
\eA also registers a one-character lookbehind, though it does not actually
|
first lookbehind moves back by two characters, matches one character, then the
|
||||||
inspect the previous character. This is to ensure that at least one character
|
nested lookbehind also moves back by two characters. This puts the matching
|
||||||
from the old segment is retained when a new segment is processed. Otherwise, if
|
point three characters earlier than it was at the start.
|
||||||
there are no lookbehinds in the pattern, \eA might match incorrectly at the
|
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
|
||||||
start of a second or subsequent segment. There are more details in the
|
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2partial\fP
|
\fBpcre2partial\fP
|
||||||
.\"
|
.\"
|
||||||
documentation.
|
documentation for a discussion of multi-segment matching.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_MINLENGTH
|
PCRE2_INFO_MINLENGTH
|
||||||
.sp
|
.sp
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
|
.TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions
|
PCRE2 - Perl-compatible regular expressions
|
||||||
.SH "PARTIAL MATCHING IN PCRE2"
|
.SH "PARTIAL MATCHING IN PCRE2"
|
||||||
|
@ -25,7 +25,7 @@ options is whether or not a partial match is preferred to an alternative
|
||||||
complete match, though the details differ between the two types of matching
|
complete match, though the details differ between the two types of matching
|
||||||
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||||
.P
|
.P
|
||||||
If you want to use partial matching with just-in-time optimized code, as well
|
If you want to use partial matching with just-in-time optimized code, as well
|
||||||
as setting a partial match option for the matching function, you must also call
|
as setting a partial match option for the matching function, you must also call
|
||||||
\fBpcre2_jit_compile()\fP with one or both of these options:
|
\fBpcre2_jit_compile()\fP with one or both of these options:
|
||||||
.sp
|
.sp
|
||||||
|
@ -73,7 +73,7 @@ need not form part of the final matched string; lookbehind assertions and the
|
||||||
matched string.
|
matched string.
|
||||||
.P
|
.P
|
||||||
(2) The pattern contains one or more lookbehind assertions. This condition
|
(2) The pattern contains one or more lookbehind assertions. This condition
|
||||||
exists in case there is a lookbehind that inspects characters before the start
|
exists in case there is a lookbehind that inspects characters before the start
|
||||||
of the match.
|
of the match.
|
||||||
.P
|
.P
|
||||||
(3) There is a special case when the whole pattern can match an empty string.
|
(3) There is a special case when the whole pattern can match an empty string.
|
||||||
|
@ -139,7 +139,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
|
||||||
partial match is found, without continuing to search for possible complete
|
partial match is found, without continuing to search for possible complete
|
||||||
matches. This option is "hard" because it prefers an earlier partial match over
|
matches. This option is "hard" because it prefers an earlier partial match over
|
||||||
a later complete match. For this reason, the assumption is made that the end of
|
a later complete match. For this reason, the assumption is made that the end of
|
||||||
the supplied subject string is not the true end of the available data, which is
|
the supplied subject string is not the true end of the available data, which is
|
||||||
why \ez, \eZ, \eb, \eB, and $ always give a partial match.
|
why \ez, \eZ, \eb, \eB, and $ always give a partial match.
|
||||||
.P
|
.P
|
||||||
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
|
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
|
||||||
|
@ -192,7 +192,7 @@ date:
|
||||||
data> 3juj\e=ph
|
data> 3juj\e=ph
|
||||||
No match
|
No match
|
||||||
.sp
|
.sp
|
||||||
This example gives the same results for both hard and soft partial matching
|
This example gives the same results for both hard and soft partial matching
|
||||||
options. Here is an example where there is a difference:
|
options. Here is an example where there is a difference:
|
||||||
.sp
|
.sp
|
||||||
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||||
|
@ -200,8 +200,8 @@ options. Here is an example where there is a difference:
|
||||||
0: 25jun04
|
0: 25jun04
|
||||||
1: jun
|
1: jun
|
||||||
data> 25jun04\e=ph
|
data> 25jun04\e=ph
|
||||||
Partial match: 25jun04
|
Partial match: 25jun04
|
||||||
.sp
|
.sp
|
||||||
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||||
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
||||||
there is only a partial match.
|
there is only a partial match.
|
||||||
|
@ -213,9 +213,12 @@ there is only a partial match.
|
||||||
.sp
|
.sp
|
||||||
PCRE was not originally designed with multi-segment matching in mind. However,
|
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||||
over time, features (including partial matching) that make multi-segment
|
over time, features (including partial matching) that make multi-segment
|
||||||
matching possible have been added. The string is searched segment by segment by
|
matching possible have been added. A very long string can be searched segment
|
||||||
calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
|
by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
|
||||||
results that would happen if the entire string was available for searching.
|
the same results that would happen if the entire string was available for
|
||||||
|
searching all the time. Normally, the strings that are being sought are much
|
||||||
|
shorter than each individual segment, and are in the middle of very long
|
||||||
|
strings, so the pattern is normally not anchored.
|
||||||
.P
|
.P
|
||||||
Special logic must be implemented to handle a matched substring that spans a
|
Special logic must be implemented to handle a matched substring that spans a
|
||||||
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
||||||
|
@ -223,11 +226,10 @@ partial match at the end of a segment whenever there is the possibility of
|
||||||
changing the match by adding more characters. The PCRE2_NOTBOL option should
|
changing the match by adding more characters. The PCRE2_NOTBOL option should
|
||||||
also be set for all but the first segment.
|
also be set for all but the first segment.
|
||||||
.P
|
.P
|
||||||
When a partial match occurs, the next segment must be added to the current
|
When a partial match occurs, the next segment must be added to the current
|
||||||
subject and the match re-run, using the \fIstartoffset\fP argument of
|
subject and the match re-run, using the \fIstartoffset\fP argument of
|
||||||
\fBpcre2_match()\fP to begin at the point where the partial match started.
|
\fBpcre2_match()\fP to begin at the point where the partial match started.
|
||||||
Multi-segment matching is usually used to search for substrings in the middle
|
For example:
|
||||||
of very long sequences, so the patterns are normally not anchored. For example:
|
|
||||||
.sp
|
.sp
|
||||||
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
|
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
|
||||||
data> ...the date is 23ja\e=ph
|
data> ...the date is 23ja\e=ph
|
||||||
|
@ -236,48 +238,49 @@ of very long sequences, so the patterns are normally not anchored. For example:
|
||||||
0: 23jan19
|
0: 23jan19
|
||||||
1: jan
|
1: jan
|
||||||
.sp
|
.sp
|
||||||
Note the use of the \fBoffset\fP modifier to start the new match where the
|
Note the use of the \fBoffset\fP modifier to start the new match where the
|
||||||
partial match was found.
|
partial match was found. In this example, the next segment was added to the one
|
||||||
|
in which the partial match was found. This is the most straightforward
|
||||||
|
approach, typically using a memory buffer that is twice the size of each
|
||||||
|
segment. After a partial match, the first half of the buffer is discarded, the
|
||||||
|
second half is moved to the start of the buffer, and a new segment is added
|
||||||
|
before repeating the match as in the example above. After a no match, the
|
||||||
|
entire buffer can be discarded.
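The double-size buffer approach described above can be sketched as follows. This is only an illustration under stated assumptions: slide_buffer() is a hypothetical helper, not part of the PCRE2 API. It assumes the buffer holds exactly two segments of seg_size code units each, and that the partial match starts in the second half. That is the usual case when the sought strings are much shorter than a segment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function). The buffer holds two segments
of seg_size bytes each. After a partial match that starts at partial_start,
assumed to lie in the second half, discard the first half by moving the second
half to the start of the buffer. The return value is the offset at which the
partial match now starts, suitable for use as the startoffset of the re-run
match once the caller has copied the next segment to buffer + seg_size. */

size_t slide_buffer(char *buffer, size_t seg_size, size_t partial_start)
{
assert(partial_start >= seg_size && partial_start <= 2 * seg_size);
memmove(buffer, buffer + seg_size, seg_size);
return partial_start - seg_size;
}
```

After the call, the caller would read the next segment into the second half of the buffer and repeat the match with the returned offset, setting PCRE2_NOTBOL as already described.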
|
||||||
.P
|
.P
|
||||||
In this simple example, the next segment was just added to the one in which the
|
If there are memory constraints, you may want to discard text that precedes a
|
||||||
partial match was found. However, if there are memory constraints, it may be
|
partial match before adding the next segment. Unfortunately, this is not at
|
||||||
necessary to discard text that precedes the partial match before adding the
|
present straightforward. In cases such as the above, where the pattern does not
|
||||||
next segment. In cases such as the above, where the pattern does not contain
|
contain any lookbehinds, it is sufficient to retain only the partially matched
|
||||||
any lookbehinds, it is sufficient to retain only the partially matched
|
substring. However, if the pattern contains a lookbehind assertion, characters
|
||||||
substring. However, if a pattern contains a lookbehind assertion, characters
|
|
||||||
that precede the start of the partial match may have been inspected during the
|
that precede the start of the partial match may have been inspected during the
|
||||||
matching process.
|
matching process. When \fBpcre2test\fP displays a partial match, it indicates
|
||||||
.P
|
these characters with '<' if the \fBallusedtext\fP modifier is set:
|
||||||
The only lookbehind information that is available is the length of the longest
|
|
||||||
lookbehind in a pattern. This may not, of course, be at the start of the
|
|
||||||
pattern, but retaining that many characters before the partial match is
|
|
||||||
sufficient, if not always strictly necessary. The way to do this is as follows:
|
|
||||||
.P
|
|
||||||
Before doing any matching, find the length of the longest lookbehind in the
|
|
||||||
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
|
|
||||||
option. Note that the resulting count is in characters, not code units. After a
|
|
||||||
partial match, moving back from the ovector[0] offset in the subject by the
|
|
||||||
number of characters given for the maximum lookbehind gets you to the earliest
|
|
||||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
|
||||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
|
||||||
while moving back through the code units. Characters before the point you have
|
|
||||||
now reached can be discarded.
|
|
||||||
.P
|
|
||||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
|
||||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
|
||||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
|
||||||
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
|
|
||||||
displays a partial match, it indicates the lookbehind characters with '<'
|
|
||||||
characters if the \fBallusedtext\fP modifier is set:
|
|
||||||
.sp
|
.sp
|
||||||
re> "(?<=123)abc"
|
re> "(?<=123)abc"
|
||||||
data> xx123ab\e=ph,allusedtext
|
data> xx123ab\e=ph,allusedtext
|
||||||
Partial match: 123ab
|
Partial match: 123ab
|
||||||
<<<
|
<<<
|
||||||
.sp
|
.sp
|
||||||
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
|
However, the \fPallusedtext\fP modifier is not available for JIT matching,
|
||||||
because JIT matching does not maintain the first and last consulted characters.
|
because JIT matching does not record the first (or last) consulted characters.
|
||||||
.
|
For this reason, this information is not available via the API. It is therefore
|
||||||
|
not possible in general to obtain the exact number of characters that must be
|
||||||
|
retained in order to get the right match result. If you cannot retain the
|
||||||
|
entire segment, you must find some heuristic way of choosing.
|
||||||
|
.P
|
||||||
|
If you know the approximate length of the matching substrings, you can use that
|
||||||
|
to decide how much text to retain. The only lookbehind information that is
|
||||||
|
currently available via the API is the length of the longest individual
|
||||||
|
lookbehind in a pattern, but this can be misleading if there are nested
|
||||||
|
lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
|
||||||
|
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
|
||||||
|
units) that any individual lookbehind moves back when it is processed. A
|
||||||
|
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
|
||||||
|
inspects two characters before its starting point.
|
||||||
|
.P
|
||||||
|
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
|
||||||
|
UTF-8 or UTF-16 you have to count characters while moving back through the code
|
||||||
|
units.
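As a rough sketch of counting characters while moving back through UTF-8 code units, a helper along the following lines could be used. The function name utf8_move_back() and its parameters are hypothetical and are not part of the PCRE2 API. The sketch assumes that the subject is valid UTF-8 and that the given offset lies on a character boundary. It stops at the start of the subject if there are fewer than nchars characters to move over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function). Move back nchars characters
from offset in a valid UTF-8 subject and return the new offset in code units.
Continuation bytes have the form 10xxxxxx, so each step back skips over them
until it reaches the first byte of a character. */

size_t utf8_move_back(const uint8_t *subject, size_t offset, uint32_t nchars)
{
while (nchars > 0 && offset > 0)
  {
  offset--;
  while (offset > 0 && (subject[offset] & 0xC0) == 0x80) offset--;
  nchars--;
  }
return offset;
}
```

In a non-UTF or 32-bit case the same result is just a subtraction, bounded at zero. How many characters to move back is the heuristic choice discussed above, since the maximum lookbehind value alone can be too small when lookbehinds are nested.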
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
|
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
|
||||||
|
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
|
||||||
want.
|
want.
|
||||||
.P
|
.P
|
||||||
If you do want to allow for starting again at the next character, one way of
|
If you do want to allow for starting again at the next character, one way of
|
||||||
doing it is to retain the matched part of the segment and try a new complete
|
doing it is to retain some or all of the segment and try a new complete match,
|
||||||
match, as described for \fBpcre2_match()\fP above. Another possibility is to
|
as described for \fBpcre2_match()\fP above. Another possibility is to work with
|
||||||
work with two buffers. If a partial match at offset \fIn\fP in the first buffer
|
two buffers. If a partial match at offset \fIn\fP in the first buffer is
|
||||||
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
|
||||||
you can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH AUTHOR
|
.SH AUTHOR
|
||||||
|
@ -365,6 +368,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 07 August 2019
|
Last updated: 04 September 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -128,12 +128,12 @@ static int
|
||||||
compile_block *, PCRE2_SIZE *);
|
compile_block *, PCRE2_SIZE *);
|
||||||
|
|
||||||
static int
|
static int
|
||||||
get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *,
|
get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
|
||||||
compile_block *);
|
compile_block *);
|
||||||
|
|
||||||
static BOOL
|
static BOOL
|
||||||
set_lookbehind_lengths(uint32_t **, int *, int *, int *,
|
set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
|
||||||
parsed_recurse_check *, compile_block *);
|
compile_block *);
|
||||||
|
|
||||||
static int
|
static int
|
||||||
check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
|
check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
|
||||||
|
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
|
||||||
#define GI_SET_FIXED_LENGTH 0x80000000u
|
#define GI_SET_FIXED_LENGTH 0x80000000u
|
||||||
#define GI_NOT_FIXED_LENGTH 0x40000000u
|
#define GI_NOT_FIXED_LENGTH 0x40000000u
|
||||||
#define GI_FIXED_LENGTH_MASK 0x0000ffffu
|
#define GI_FIXED_LENGTH_MASK 0x0000ffffu
|
||||||
#define GI_EXTRA_MASK 0x0fff0000u
|
|
||||||
#define GI_EXTRA_MAX 0xfff /* NB not unsigned */
|
|
||||||
#define GI_EXTRA_SHIFT 16
|
|
||||||
|
|
||||||
/* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
|
/* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
|
||||||
and is fast (a good compiler can turn it into a subtraction and unsigned
|
and is fast (a good compiler can turn it into a subtraction and unsigned
|
||||||
|
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
|
||||||
Arguments:
|
Arguments:
|
||||||
pptrptr pointer to pointer in the parsed pattern
|
pptrptr pointer to pointer in the parsed pattern
|
||||||
isinline FALSE if a reference or recursion; TRUE for inline group
|
isinline FALSE if a reference or recursion; TRUE for inline group
|
||||||
extraptr pointer to where to return extra lookbehind length
|
|
||||||
errcodeptr pointer to the errorcode
|
errcodeptr pointer to the errorcode
|
||||||
lcptr pointer to the loop counter
|
lcptr pointer to the loop counter
|
||||||
group number of captured group or -1 for a non-capturing group
|
group number of captured group or -1 for a non-capturing group
|
||||||
|
@ -8908,13 +8904,11 @@ Returns: the group length or a negative number
|
||||||
*/
|
*/
|
||||||
|
|
||||||
static int
|
static int
|
||||||
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr,
|
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
|
||||||
int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses,
|
int group, parsed_recurse_check *recurses, compile_block *cb)
|
||||||
compile_block *cb)
|
|
||||||
{
|
{
|
||||||
int branchlength;
|
int branchlength;
|
||||||
int grouplength = -1;
|
int grouplength = -1;
|
||||||
int extra = 0;
|
|
||||||
|
|
||||||
/* The cache can be used only if there is no possibility of there being two
|
/* The cache can be used only if there is no possibility of there being two
|
||||||
groups with the same number. We do not need to set the end pointer for a group
|
groups with the same number. We do not need to set the end pointer for a group
|
||||||
|
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
|
||||||
if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
|
if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
|
||||||
{
|
{
|
||||||
if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
|
if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
|
||||||
*extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
|
|
||||||
return groupinfo & GI_FIXED_LENGTH_MASK;
|
return groupinfo & GI_FIXED_LENGTH_MASK;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
|
||||||
|
|
||||||
for(;;)
|
for(;;)
|
||||||
{
|
{
|
||||||
int branchextra;
|
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
|
||||||
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
|
|
||||||
recurses, cb);
|
|
||||||
if (branchlength < 0) goto ISNOTFIXED;
|
if (branchlength < 0) goto ISNOTFIXED;
|
||||||
if (grouplength == -1)
|
if (grouplength == -1) grouplength = branchlength;
|
||||||
{
|
else if (grouplength != branchlength) goto ISNOTFIXED;
|
||||||
grouplength = branchlength;
|
|
||||||
extra = branchextra;
|
|
||||||
}
|
|
||||||
else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
|
|
||||||
if (**pptrptr == META_KET) break;
|
if (**pptrptr == META_KET) break;
|
||||||
*pptrptr += 1; /* Skip META_ALT */
|
*pptrptr += 1; /* Skip META_ALT */
|
||||||
}
|
}
|
||||||
|
|
||||||
/* There are only 12 bits for caching the extra value, but a pattern that
|
if (group > 0)
|
||||||
needs more than that is weird indeed. */
|
cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
|
||||||
|
|
||||||
if (group > 0 && extra <= GI_EXTRA_MAX)
|
|
||||||
cb->groupinfo[group] |= (uint32_t)
|
|
||||||
(GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
|
|
||||||
|
|
||||||
*extraptr = extra;
|
|
||||||
return grouplength;
|
return grouplength;
|
||||||
|
|
||||||
ISNOTFIXED:
|
ISNOTFIXED:
|
||||||
|
@ -8973,17 +8954,11 @@ return -1;
|
||||||
*************************************************/
|
*************************************************/
|
||||||
|
|
||||||
/* Return a fixed length for a branch in a lookbehind, giving an error if the
|
/* Return a fixed length for a branch in a lookbehind, giving an error if the
|
||||||
length is not fixed. We also take note of any extra value that is generated
|
length is not fixed. On entry, *pptrptr points to the first element inside the
|
||||||
from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual
|
branch. On exit it is set to point to the ALT or KET.
|
||||||
lookbehind has length 2, but the max_lookbehind setting must be 3 because
|
|
||||||
matching inspects 3 characters before the match starting point.
|
|
||||||
|
|
||||||
On entry, *pptrptr points to the first element inside the branch. On exit it is
|
|
||||||
set to point to the ALT or KET.
|
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
pptrptr pointer to pointer in the parsed pattern
|
pptrptr pointer to pointer in the parsed pattern
|
||||||
extraptr pointer to where to return extra lookbehind length
|
|
||||||
errcodeptr pointer to error code
|
errcodeptr pointer to error code
|
||||||
lcptr pointer to loop counter
|
lcptr pointer to loop counter
|
||||||
recurses chain of recurse_check to catch mutual recursion
|
recurses chain of recurse_check to catch mutual recursion
|
||||||
|
@ -8993,14 +8968,11 @@ Returns: the length, or a negative value on error
|
||||||
*/
|
*/
|
||||||
|
|
||||||
static int
|
static int
|
||||||
get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr,
|
get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
||||||
parsed_recurse_check *recurses, compile_block *cb)
|
parsed_recurse_check *recurses, compile_block *cb)
|
||||||
{
|
{
|
||||||
int branchlength = 0;
|
int branchlength = 0;
|
||||||
int grouplength;
|
int grouplength;
|
||||||
int groupextra;
|
|
||||||
int max;
|
|
||||||
int extra = 0; /* Additional lookbehind from nesting */
|
|
||||||
uint32_t lastitemlength = 0;
|
uint32_t lastitemlength = 0;
|
||||||
uint32_t *pptr = *pptrptr;
|
uint32_t *pptr = *pptrptr;
|
||||||
PCRE2_SIZE offset;
|
PCRE2_SIZE offset;
|
||||||
|
@ -9149,17 +9121,13 @@ for (;; pptr++)
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* A nested lookbehind does not contribute any length to this lookbehind,
|
/* A nested lookbehind does not contribute any length to this lookbehind,
|
||||||
but must itself be checked and have its lengths set. If the maximum
|
but must itself be checked and have its lengths set. */
|
||||||
lookbehind for the nested lookbehind is greater than the length so far
|
|
||||||
computed for this branch, we must compute an extra value and keep the
|
|
||||||
largest encountered for use when setting the maximum overall lookbehind. */
|
|
||||||
|
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
case META_LOOKBEHIND_NA:
|
case META_LOOKBEHIND_NA:
|
||||||
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
|
if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
|
||||||
return -1;
|
return -1;
|
||||||
if (max - branchlength > extra) extra = max - branchlength;
|
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* Back references and recursions are handled by very similar code. At this
|
/* Back references and recursions are handled by very similar code. At this
|
||||||
|
@ -9267,14 +9235,15 @@ for (;; pptr++)
|
||||||
in the cache. */
|
in the cache. */
|
||||||
|
|
||||||
gptr++;
|
gptr++;
|
||||||
grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr,
|
grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group,
|
||||||
group, &this_recurse, cb);
|
&this_recurse, cb);
|
||||||
if (grouplength < 0)
|
if (grouplength < 0)
|
||||||
{
|
{
|
||||||
if (*errcodeptr == 0) goto ISNOTFIXED;
|
if (*errcodeptr == 0) goto ISNOTFIXED;
|
||||||
return -1; /* Error already set */
|
return -1; /* Error already set */
|
||||||
}
|
}
|
||||||
goto OK_GROUP;
|
itemlength = grouplength;
|
||||||
|
break;
|
||||||
|
|
||||||
/* Check nested groups - advance past the initial data for each type and
|
/* Check nested groups - advance past the initial data for each type and
|
||||||
then seek a fixed length with get_grouplength(). */
|
then seek a fixed length with get_grouplength(). */
|
||||||
|
@ -9304,16 +9273,10 @@ for (;; pptr++)
|
||||||
case META_SCRIPT_RUN:
|
case META_SCRIPT_RUN:
|
||||||
pptr++;
|
pptr++;
|
||||||
CHECK_GROUP:
|
CHECK_GROUP:
|
||||||
grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr,
|
grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group,
|
||||||
group, recurses, cb);
|
recurses, cb);
|
||||||
if (grouplength < 0) return -1;
|
if (grouplength < 0) return -1;
|
||||||
|
|
||||||
/* A nested lookbehind within the group may require looking back further
|
|
||||||
than the length of the group. */
|
|
||||||
|
|
||||||
OK_GROUP:
|
|
||||||
itemlength = grouplength;
|
itemlength = grouplength;
|
||||||
if (groupextra - branchlength > extra) extra = groupextra - branchlength;
|
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* Exact repetition is OK; variable repetition is not. A repetition of zero
|
/* Exact repetition is OK; variable repetition is not. A repetition of zero
|
||||||
|
@ -9374,7 +9337,6 @@ for (;; pptr++)
|
||||||
|
|
||||||
EXIT:
|
EXIT:
|
||||||
*pptrptr = pptr;
|
*pptrptr = pptr;
|
||||||
*extraptr = extra;
|
|
||||||
return branchlength;
|
return branchlength;
|
||||||
|
|
||||||
PARSED_SKIP_FAILED:
|
PARSED_SKIP_FAILED:
|
||||||
|
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
pptrptr pointer to pointer in the parsed pattern
|
pptrptr pointer to pointer in the parsed pattern
|
||||||
maxptr where to return maximum lookbehind for the whole group
|
|
||||||
errcodeptr pointer to error code
|
errcodeptr pointer to error code
|
||||||
lcptr pointer to loop counter
|
lcptr pointer to loop counter
|
||||||
recurses chain of recurse_check to catch mutual recursion
|
recurses chain of recurse_check to catch mutual recursion
|
||||||
|
@ -9411,13 +9372,11 @@ Returns: TRUE if all is well
|
||||||
*/
|
*/
|
||||||
|
|
||||||
static BOOL
|
static BOOL
|
||||||
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
|
set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
||||||
int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
|
parsed_recurse_check *recurses, compile_block *cb)
|
||||||
{
|
{
|
||||||
PCRE2_SIZE offset;
|
PCRE2_SIZE offset;
|
||||||
int branchlength;
|
int branchlength;
|
||||||
int branchextra;
|
|
||||||
int max = 0;
|
|
||||||
uint32_t *bptr = *pptrptr;
|
uint32_t *bptr = *pptrptr;
|
||||||
|
|
||||||
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
||||||
|
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
||||||
do
|
do
|
||||||
{
|
{
|
||||||
*pptrptr += 1;
|
*pptrptr += 1;
|
||||||
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
|
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
|
||||||
recurses, cb);
|
|
||||||
if (branchlength < 0)
|
if (branchlength < 0)
|
||||||
{
|
{
|
||||||
/* The errorcode and offset may already be set from a nested lookbehind. */
|
/* The errorcode and offset may already be set from a nested lookbehind. */
|
||||||
|
@ -9435,14 +9393,12 @@ do
|
||||||
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
|
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
|
||||||
return FALSE;
|
return FALSE;
|
||||||
}
|
}
|
||||||
if (branchlength + branchextra > max) max = branchlength + branchextra;
|
if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
|
||||||
*bptr |= branchlength; /* branchlength never more than 65535 */
|
*bptr |= branchlength; /* branchlength never more than 65535 */
|
||||||
bptr = *pptrptr;
|
bptr = *pptrptr;
|
||||||
}
|
}
|
||||||
while (*bptr == META_ALT);
|
while (*bptr == META_ALT);
|
||||||
|
|
||||||
if (max > cb->max_lookbehind) cb->max_lookbehind = max;
|
|
||||||
*maxptr = max;
|
|
||||||
return TRUE;
|
return TRUE;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -9475,7 +9431,6 @@ static int
|
||||||
check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
|
check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
|
||||||
parsed_recurse_check *recurses, compile_block *cb)
|
parsed_recurse_check *recurses, compile_block *cb)
|
||||||
{
|
{
|
||||||
int max;
|
|
||||||
int errorcode = 0;
|
int errorcode = 0;
|
||||||
int loopcount = 0;
|
int loopcount = 0;
|
||||||
int nestlevel = 0;
|
int nestlevel = 0;
|
||||||
|
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
case META_LOOKBEHIND_NA:
|
case META_LOOKBEHIND_NA:
|
||||||
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount,
|
if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
|
||||||
recurses, cb))
|
|
||||||
return errorcode;
|
return errorcode;
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
|
@ -304,7 +304,7 @@ Partial match, mark=xx: 123a
|
||||||
|
|
||||||
/(?<=(?<=a)b)c.*/I
|
/(?<=(?<=a)b)c.*/I
|
||||||
Capture group count = 0
|
Capture group count = 0
|
||||||
Max lookbehind = 2
|
Max lookbehind = 1
|
||||||
First code unit = 'c'
|
First code unit = 'c'
|
||||||
Subject length lower bound = 1
|
Subject length lower bound = 1
|
||||||
abc\=ph
|
abc\=ph
|
||||||
|
@ -337,7 +337,7 @@ Partial match: abcd
|
||||||
|
|
||||||
/(?<=(?<=(?<=a)b)c)./I
|
/(?<=(?<=(?<=a)b)c)./I
|
||||||
Capture group count = 0
|
Capture group count = 0
|
||||||
Max lookbehind = 3
|
Max lookbehind = 1
|
||||||
Subject length lower bound = 1
|
Subject length lower bound = 1
|
||||||
123abcXYZ
|
123abcXYZ
|
||||||
0: abcX
|
0: abcX
|
||||||
|
@ -354,7 +354,7 @@ Subject length lower bound = 1
|
||||||
|
|
||||||
/(?<=ab((?<=...)cd))./I
|
/(?<=ab((?<=...)cd))./I
|
||||||
Capture group count = 1
|
Capture group count = 1
|
||||||
Max lookbehind = 5
|
Max lookbehind = 4
|
||||||
Subject length lower bound = 1
|
Subject length lower bound = 1
|
||||||
ZabcdX
|
ZabcdX
|
||||||
0: ZabcdX
|
0: ZabcdX
|
||||||
|
@ -363,7 +363,7 @@ Subject length lower bound = 1
|
||||||
|
|
||||||
/(?<=((?<=(?<=ab).))(?1)(?1))./I
|
/(?<=((?<=(?<=ab).))(?1)(?1))./I
|
||||||
Capture group count = 1
|
Capture group count = 1
|
||||||
Max lookbehind = 3
|
Max lookbehind = 2
|
||||||
Subject length lower bound = 1
|
Subject length lower bound = 1
|
||||||
abxZ
|
abxZ
|
||||||
0: abxZ
|
0: abxZ
|
||||||
|
|
|
@ -17036,7 +17036,7 @@ Subject length lower bound = 1
|
||||||
|
|
||||||
/(?<=(?<=a)b)c.*/I
|
/(?<=(?<=a)b)c.*/I
|
||||||
Capture group count = 0
|
Capture group count = 0
|
||||||
Max lookbehind = 2
|
Max lookbehind = 1
|
||||||
First code unit = 'c'
|
First code unit = 'c'
|
||||||
Subject length lower bound = 1
|
Subject length lower bound = 1
|
||||||
abc\=ph
|
abc\=ph
|
||||||
|
@ -17064,7 +17064,7 @@ Subject length lower bound = 0
|
||||||
|
|
||||||
/(?<=a(?<=a|ba)c)/I
|
/(?<=a(?<=a|ba)c)/I
|
||||||
Capture group count = 0
|
Capture group count = 0
|
||||||
Max lookbehind = 3
|
Max lookbehind = 2
|
||||||
May match empty string
|
May match empty string
|
||||||
Subject length lower bound = 0
|
Subject length lower bound = 0
|
||||||
|
|
||||||
|
@ -17076,7 +17076,7 @@ Subject length lower bound = 0
|
||||||
|
|
||||||
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
|
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
|
||||||
Capture group count = 0
|
Capture group count = 0
|
||||||
Max lookbehind = 5
|
Max lookbehind = 4
|
||||||
May match empty string
|
May match empty string
|
||||||
Subject length lower bound = 0
|
Subject length lower bound = 0
|
||||||
|
|
||||||
|
|