Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current difficulty if the whole first segment cannot be retained.
parent 87bc092222
commit 963b570fd0
37
ChangeLog
|
@@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
|
|||
parenthesized item may contain multiple branches or other backtracking points,
|
||||
for example /(a|ab){1}+c/ or /(a+){1}+a/.
|
||||
|
||||
13. Nested lookbehinds are now taken into account when computing the maximum
|
||||
lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
|
||||
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
|
||||
it to 3, because matching looks back 3 characters.
|
||||
|
||||
14. For partial matches, pcre2test was always showing the maximum lookbehind
|
||||
13. For partial matches, pcre2test was always showing the maximum lookbehind
|
||||
characters, flagged with "<", which is misleading when the lookbehind didn't
|
||||
actually look behind the start (because it was later in the pattern). Showing
|
||||
all consulted preceding characters for partial matches is now controlled by the
|
||||
|
@@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
|
|||
available only for non-JIT matching, because JIT does not maintain the first
|
||||
and last consulted characters.
|
||||
|
||||
15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
|
||||
14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
|
||||
if the end of the subject was encountered in a lookahead (conditional or
|
||||
otherwise), an atomic group, or a recursion.
|
||||
|
||||
16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
|
||||
15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
|
||||
|
||||
17. Check for integer overflow when computing lookbehind lengths. Fixes
|
||||
16. Check for integer overflow when computing lookbehind lengths. Fixes
|
||||
Clusterfuzz issue 15636.
|
||||
|
||||
18. Implemented non-atomic positive lookaround assertions.
|
||||
17. Implemented non-atomic positive lookaround assertions.
|
||||
|
||||
19. If a lookbehind contained a lookahead that contained another lookbehind
|
||||
18. If a lookbehind contained a lookahead that contained another lookbehind
|
||||
within it, the nested lookbehind was not correctly processed. For example, if
|
||||
/(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
|
||||
"b".
|
||||
|
||||
20. Implemented pcre2_get_match_data_size().
|
||||
19. Implemented pcre2_get_match_data_size().
|
||||
|
||||
21. Two alterations to partial matching (not yet done by JIT):
|
||||
20. Two alterations to partial matching (not yet done by JIT):
|
||||
|
||||
(a) The definition of a partial match is slightly changed: if a pattern
|
||||
contains any lookbehinds, an empty partial match may be given, because this
|
||||
|
@@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
|
|||
(c) An empty string partial hard match can be returned for \z and \Z as it
|
||||
is documented that they shouldn't match.
|
||||
|
||||
22. A branch that started with (*ACCEPT) was not being recognized as one that
|
||||
21. A branch that started with (*ACCEPT) was not being recognized as one that
|
||||
could match an empty string.
|
||||
|
||||
23. Corrected pcre2_set_character_tables() tables data type: was const unsigned
|
||||
22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
|
||||
char * instead of const uint8_t *, as generated by pcre2_maketables().
|
||||
|
||||
24. Upgraded to Unicode 12.1.0.
|
||||
23. Upgraded to Unicode 12.1.0.
|
||||
|
||||
25. Add -jitfast command line option to pcre2test (to make all the jit options
|
||||
24. Add -jitfast command line option to pcre2test (to make all the jit options
|
||||
available directly).
|
||||
|
||||
26. Make pcre2test -C show if libreadline or libedit is supported.
|
||||
25. Make pcre2test -C show if libreadline or libedit is supported.
|
||||
|
||||
28. If the length of one branch of a group exceeded 65535 (the maximum value
|
||||
26. If the length of one branch of a group exceeded 65535 (the maximum value
|
||||
that is remembered as a minimum length), the whole group's length was
|
||||
incorrectly recorded as 65535, leading to incorrect "no match" when start-up
|
||||
optimizations were in force.
|
||||
|
||||
29. The "rightmost consulted character" value was not always correct; in
|
||||
27. The "rightmost consulted character" value was not always correct; in
|
||||
particular, if a pattern ended with a negative lookahead, characters that were
|
||||
inspected in that lookahead were not included.
|
||||
|
||||
30. Add the pcre2_maketables_free() function.
|
||||
28. Add the pcre2_maketables_free() function.
|
||||
|
||||
|
||||
Version 10.33 16-April-2019
|
||||
|
|
|
@@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
|
|||
<pre>
|
||||
PCRE2_INFO_MAXLOOKBEHIND
|
||||
</pre>
|
||||
Return the largest number of characters (not code units) before the current
|
||||
matching point that could be inspected while processing a lookbehind assertion
|
||||
in the pattern. Before release 10.34 this request used to give the largest
|
||||
value for any individual assertion. Now it takes into account nested
|
||||
lookbehinds, which can mean that the overall value is greater. For example, the
|
||||
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
|
||||
largest individual lookbehind. Now it returns 3, because matching actually
|
||||
looks back 3 characters.
|
||||
A lookbehind assertion moves back a certain number of characters (not code
|
||||
units) when it starts to process each of its branches. This request returns the
|
||||
largest of these backward moves. The third argument should point to a uint32_t
|
||||
integer. The simple assertions \b and \B require a one-character lookbehind
|
||||
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
|
||||
longer. \A also registers a one-character lookbehind, though it does not
|
||||
actually inspect the previous character.
|
||||
</P>
|
||||
<P>
|
||||
The third argument should point to a uint32_t integer. This information is
|
||||
useful when doing multi-segment matching using the partial matching facilities.
|
||||
Note that the simple assertions \b and \B require a one-character lookbehind.
|
||||
\A also registers a one-character lookbehind, though it does not actually
|
||||
inspect the previous character. This is to ensure that at least one character
|
||||
from the old segment is retained when a new segment is processed. Otherwise, if
|
||||
there are no lookbehinds in the pattern, \A might match incorrectly at the
|
||||
start of a second or subsequent segment. There are more details in the
|
||||
Note that this information is useful for multi-segment matching only
|
||||
if the pattern contains no nested lookbehinds. For example, the pattern
|
||||
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
|
||||
first lookbehind moves back by two characters, matches one character, then the
|
||||
nested lookbehind also moves back by two characters. This puts the matching
|
||||
point three characters earlier than it was at the start.
|
||||
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation.
|
||||
documentation for a discussion of multi-segment matching.
|
||||
<pre>
|
||||
PCRE2_INFO_MINLENGTH
|
||||
</pre>
|
||||
|
|
|
@@ -244,9 +244,12 @@ there is only a partial match.
|
|||
<P>
|
||||
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||
over time, features (including partial matching) that make multi-segment
|
||||
matching possible have been added. The string is searched segment by segment by
|
||||
calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
|
||||
results that would happen if the entire string was available for searching.
|
||||
matching possible have been added. A very long string can be searched segment
|
||||
by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
|
||||
the same results that would happen if the entire string was available for
|
||||
searching all the time. Normally, the strings that are being sought are much
|
||||
shorter than each individual segment, and are in the middle of very long
|
||||
strings, so the pattern is normally not anchored.
|
||||
</P>
|
||||
<P>
|
||||
Special logic must be implemented to handle a matched substring that spans a
|
||||
|
@@ -259,8 +262,7 @@ also be set for all but the first segment.
|
|||
When a partial match occurs, the next segment must be added to the current
|
||||
subject and the match re-run, using the <i>startoffset</i> argument of
|
||||
<b>pcre2_match()</b> to begin at the point where the partial match started.
|
||||
Multi-segment matching is usually used to search for substrings in the middle
|
||||
of very long sequences, so the patterns are normally not anchored. For example:
|
||||
For example:
|
||||
<pre>
|
||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||
data> ...the date is 23ja\=ph
|
||||
|
@@ -270,50 +272,51 @@ of very long sequences, so the patterns are normally not anchored. For example:
|
|||
1: jan
|
||||
</pre>
|
||||
Note the use of the <b>offset</b> modifier to start the new match where the
|
||||
partial match was found.
|
||||
partial match was found. In this example, the next segment was added to the one
|
||||
in which the partial match was found. This is the most straightforward
|
||||
approach, typically using a memory buffer that is twice the size of each
|
||||
segment. After a partial match, the first half of the buffer is discarded, the
|
||||
second half is moved to the start of the buffer, and a new segment is added
|
||||
before repeating the match as in the example above. After a no match, the
|
||||
entire buffer can be discarded.
|
||||
</P>
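The double-size buffer scheme described above can be sketched in C roughly as follows. This is a minimal illustration and not code from PCRE2: the seg_buffer type and the seg_buffer_slide() and seg_buffer_append() helpers are hypothetical names introduced here, and the sketch assumes 8-bit code units and that a partial match always starts in the half of the buffer that is kept.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer holding up to two segments; not part of the PCRE2 API. */
typedef struct {
    char  *data;      /* storage with room for 2 * seg_size code units */
    size_t seg_size;  /* size of one segment */
    size_t used;      /* number of code units currently held */
} seg_buffer;

/* Discard the first half of the buffer and move the second half to the
   start. Returns the number of code units dropped, so that the caller can
   reduce any saved offsets by the same amount. If only one segment (or
   less) is present, nothing is dropped. */
static size_t seg_buffer_slide(seg_buffer *b)
{
    assert(b->used <= 2 * b->seg_size);
    if (b->used <= b->seg_size) return 0;
    size_t keep = b->used - b->seg_size;
    memmove(b->data, b->data + b->seg_size, keep);
    b->used = keep;
    return b->seg_size;
}

/* Append as much of the next segment as fits. Returns the number of code
   units actually copied. */
static size_t seg_buffer_append(seg_buffer *b, const char *seg, size_t len)
{
    size_t room = 2 * b->seg_size - b->used;
    if (len > room) len = room;
    memcpy(b->data + b->used, seg, len);
    b->used += len;
    return len;
}
```

After a partial match, an application using this sketch would call seg_buffer_slide(), subtract the returned count from the offset at which the partial match started, call seg_buffer_append() with the next segment, and then pass the adjusted offset as the startoffset argument of pcre2_match().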
|
||||
<P>
|
||||
In this simple example, the next segment was just added to the one in which the
|
||||
partial match was found. However, if there are memory constraints, it may be
|
||||
necessary to discard text that precedes the partial match before adding the
|
||||
next segment. In cases such as the above, where the pattern does not contain
|
||||
any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if a pattern contains a lookbehind assertion, characters
|
||||
If there are memory constraints, you may want to discard text that precedes a
|
||||
partial match before adding the next segment. Unfortunately, this is not at
|
||||
present straightforward. In cases such as the above, where the pattern does not
|
||||
contain any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if the pattern contains a lookbehind assertion, characters
|
||||
that precede the start of the partial match may have been inspected during the
|
||||
matching process.
|
||||
</P>
|
||||
<P>
|
||||
The only lookbehind information that is available is the length of the longest
|
||||
lookbehind in a pattern. This may not, of course, be at the start of the
|
||||
pattern, but retaining that many characters before the partial match is
|
||||
sufficient, if not always strictly necessary. The way to do this is as follows:
|
||||
</P>
|
||||
<P>
|
||||
Before doing any matching, find the length of the longest lookbehind in the
|
||||
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
|
||||
option. Note that the resulting count is in characters, not code units. After a
|
||||
partial match, moving back from the ovector[0] offset in the subject by the
|
||||
number of characters given for the maximum lookbehind gets you to the earliest
|
||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||
while moving back through the code units. Characters before the point you have
|
||||
now reached can be discarded.
|
||||
</P>
|
||||
<P>
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
|
||||
displays a partial match, it indicates the lookbehind characters with '<'
|
||||
characters if the <b>allusedtext</b> modifier is set:
|
||||
matching process. When <b>pcre2test</b> displays a partial match, it indicates
|
||||
these characters with '<' if the <b>allusedtext</b> modifier is set:
|
||||
<pre>
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
</pre>
|
||||
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
|
||||
because JIT matching does not maintain the first and last consulted characters.
|
||||
However, the \fPallusedtext\fP modifier is not available for JIT matching,
|
||||
because JIT matching does not record the first (or last) consulted characters.
|
||||
For this reason, this information is not available via the API. It is therefore
|
||||
not possible in general to obtain the exact number of characters that must be
|
||||
retained in order to get the right match result. If you cannot retain the
|
||||
entire segment, you must find some heuristic way of choosing.
|
||||
</P>
|
||||
<P>
|
||||
If you know the approximate length of the matching substrings, you can use that
|
||||
to decide how much text to retain. The only lookbehind information that is
|
||||
currently available via the API is the length of the longest individual
|
||||
lookbehind in a pattern, but this can be misleading if there are nested
|
||||
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
|
||||
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
|
||||
units) that any individual lookbehind moves back when it is processed. A
|
||||
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
|
||||
inspects two characters before its starting point.
|
||||
</P>
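One possible heuristic for choosing how much text to keep is sketched below. It is only an illustration under stated assumptions: retain_from() is a hypothetical helper, not a PCRE2 function; it assumes a non-UTF subject, so that characters and code units coincide; and the slack value is an application-chosen safety margin. Because nested lookbehinds can inspect more text than the PCRE2_INFO_MAXLOOKBEHIND value suggests, the result is not guaranteed to retain enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heuristic: given the offset at which a partial match started
   (ovector[0]), the value reported for PCRE2_INFO_MAXLOOKBEHIND, and an
   application-chosen slack, return the offset of the earliest code unit to
   keep. Everything before the returned offset would be discarded. */
static size_t retain_from(size_t partial_start, uint32_t max_lookbehind,
                          size_t slack)
{
    size_t want = (size_t)max_lookbehind + slack;
    assert(want >= slack);  /* the sum must not have wrapped around */
    return (want >= partial_start) ? 0 : partial_start - want;
}
```

For the pattern "(?<=123)abc" partially matched against "xx123ab", the partial match starts at offset 5 and the maximum lookbehind is 3, so with no slack the function returns 2. A pattern such as "(?<=(?<!b)a)" would need a slack of at least one to cover the extra character that the nested lookbehind inspects.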
|
||||
<P>
|
||||
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
|
||||
UTF-8 or UTF-16 you have to count characters while moving back through the code
|
||||
units.
|
||||
</P>
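Counting characters while moving back through UTF-8 code units might look like the sketch below. The function name utf8_back_up() is hypothetical and is not part of the PCRE2 API. The sketch assumes the 8-bit library and a subject that is already known to be valid UTF-8, in which every continuation byte has the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: starting at code unit offset `offset` in a valid
   UTF-8 subject, move back by up to `nchars` characters and return the new
   offset. The scan stops early if the start of the subject is reached. */
static size_t utf8_back_up(const uint8_t *subject, size_t offset,
                           uint32_t nchars)
{
    assert(subject != NULL);
    while (nchars > 0 && offset > 0) {
        offset--;
        /* Step over continuation bytes to reach the first byte of the
           character. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80) offset--;
        nchars--;
    }
    return offset;
}
```

In a non-UTF or 32-bit case the same result is obtained by a plain subtraction, clamped at zero.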
|
||||
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
||||
<P>
|
||||
|
@@ -379,11 +382,11 @@ want.
|
|||
</P>
|
||||
<P>
|
||||
If you do want to allow for starting again at the next character, one way of
|
||||
doing it is to retain the matched part of the segment and try a new complete
|
||||
match, as described for <b>pcre2_match()</b> above. Another possibility is to
|
||||
work with two buffers. If a partial match at offset <i>n</i> in the first buffer
|
||||
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
||||
you can then try a new match starting at offset <i>n+1</i> in the first buffer.
|
||||
doing it is to retain some or all of the segment and try a new complete match,
|
||||
as described for <b>pcre2_match()</b> above. Another possibility is to work with
|
||||
two buffers. If a partial match at offset <i>n</i> in the first buffer is
|
||||
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
|
||||
can then try a new match starting at offset <i>n+1</i> in the first buffer.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
|
@@ -396,7 +399,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 07 August 2019
|
||||
Last updated: 04 September 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
138
doc/pcre2.txt
|
@@ -2234,25 +2234,24 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
|
||||
PCRE2_INFO_MAXLOOKBEHIND
|
||||
|
||||
Return the largest number of characters (not code units) before the
|
||||
current matching point that could be inspected while processing a look-
|
||||
behind assertion in the pattern. Before release 10.34 this request used
|
||||
to give the largest value for any individual assertion. Now it takes
|
||||
into account nested lookbehinds, which can mean that the overall value
|
||||
is greater. For example, the pattern (?<=a(?<=ba)c) previously returned
|
||||
2, because that is the length of the largest individual lookbehind. Now
|
||||
it returns 3, because matching actually looks back 3 characters.
|
||||
A lookbehind assertion moves back a certain number of characters (not
|
||||
code units) when it starts to process each of its branches. This re-
|
||||
quest returns the largest of these backward moves. The third argument
|
||||
should point to a uint32_t integer. The simple assertions \b and \B re-
|
||||
quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
|
||||
return 1 in the absence of anything longer. \A also registers a one-
|
||||
character lookbehind, though it does not actually inspect the previous
|
||||
character.
|
||||
|
||||
The third argument should point to a uint32_t integer. This information
|
||||
is useful when doing multi-segment matching using the partial matching
|
||||
facilities. Note that the simple assertions \b and \B require a one-
|
||||
character lookbehind. \A also registers a one-character lookbehind,
|
||||
though it does not actually inspect the previous character. This is to
|
||||
ensure that at least one character from the old segment is retained
|
||||
when a new segment is processed. Otherwise, if there are no lookbehinds
|
||||
in the pattern, \A might match incorrectly at the start of a second or
|
||||
subsequent segment. There are more details in the pcre2partial documen-
|
||||
tation.
|
||||
Note that this information is useful for multi-segment matching only if
|
||||
the pattern contains no nested lookbehinds. For example, the pattern
|
||||
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is pro-
|
||||
cessed, the first lookbehind moves back by two characters, matches one
|
||||
character, then the nested lookbehind also moves back by two charac-
|
||||
ters. This puts the matching point three characters earlier than it was
|
||||
at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de-
|
||||
bugging tool. See the pcre2partial documentation for a discussion of
|
||||
multi-segment matching.
|
||||
|
||||
PCRE2_INFO_MINLENGTH
|
||||
|
||||
|
@@ -5877,10 +5876,13 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
|
|||
|
||||
PCRE was not originally designed with multi-segment matching in mind.
|
||||
However, over time, features (including partial matching) that make
|
||||
multi-segment matching possible have been added. The string is searched
|
||||
segment by segment by calling pcre2_match() repeatedly, with the aim of
|
||||
achieving the same results that would happen if the entire string was
|
||||
available for searching.
|
||||
multi-segment matching possible have been added. A very long string can
|
||||
be searched segment by segment by calling pcre2_match() repeatedly,
|
||||
with the aim of achieving the same results that would happen if the en-
|
||||
tire string was available for searching all the time. Normally, the
|
||||
strings that are being sought are much shorter than each individual
|
||||
segment, and are in the middle of very long strings, so the pattern is
|
||||
normally not anchored.
|
||||
|
||||
Special logic must be implemented to handle a matched substring that
|
||||
spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
|
||||
|
@@ -5891,9 +5893,7 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
|
|||
When a partial match occurs, the next segment must be added to the cur-
|
||||
rent subject and the match re-run, using the startoffset argument of
|
||||
pcre2_match() to begin at the point where the partial match started.
|
||||
Multi-segment matching is usually used to search for substrings in the
|
||||
middle of very long sequences, so the patterns are normally not an-
|
||||
chored. For example:
|
||||
For example:
|
||||
|
||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||
data> ...the date is 23ja\=ph
|
||||
|
@@ -5903,49 +5903,51 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
|
|||
1: jan
|
||||
|
||||
Note the use of the offset modifier to start the new match where the
|
||||
partial match was found.
|
||||
partial match was found. In this example, the next segment was added to
|
||||
the one in which the partial match was found. This is the most
|
||||
straightforward approach, typically using a memory buffer that is twice
|
||||
the size of each segment. After a partial match, the first half of the
|
||||
buffer is discarded, the second half is moved to the start of the buf-
|
||||
fer, and a new segment is added before repeating the match as in the
|
||||
example above. After a no match, the entire buffer can be discarded.
|
||||
|
||||
In this simple example, the next segment was just added to the one in
|
||||
which the partial match was found. However, if there are memory con-
|
||||
straints, it may be necessary to discard text that precedes the partial
|
||||
match before adding the next segment. In cases such as the above, where
|
||||
the pattern does not contain any lookbehinds, it is sufficient to re-
|
||||
tain only the partially matched substring. However, if a pattern con-
|
||||
tains a lookbehind assertion, characters that precede the start of the
|
||||
partial match may have been inspected during the matching process.
|
||||
|
||||
The only lookbehind information that is available is the length of the
|
||||
longest lookbehind in a pattern. This may not, of course, be at the
|
||||
start of the pattern, but retaining that many characters before the
|
||||
partial match is sufficient, if not always strictly necessary. The way
|
||||
to do this is as follows:
|
||||
|
||||
Before doing any matching, find the length of the longest lookbehind in
|
||||
the pattern by calling pcre2_pattern_info() with the
|
||||
PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
|
||||
characters, not code units. After a partial match, moving back from the
|
||||
ovector[0] offset in the subject by the number of characters given for
|
||||
the maximum lookbehind gets you to the earliest character that must be
|
||||
retained. In a non-UTF or a 32-bit situation, moving back is just a
|
||||
subtraction, but in UTF-8 or UTF-16 you have to count characters while
|
||||
moving back through the code units. Characters before the point you
|
||||
have now reached can be discarded.
|
||||
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against
|
||||
the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
|
||||
mum lookbehind count is 3, so all characters before offset 2 can be
|
||||
discarded. The value of startoffset for the next match should be 3.
|
||||
When pcre2test displays a partial match, it indicates the lookbehind
|
||||
characters with '<' characters if the allusedtext modifier is set:
|
||||
If there are memory constraints, you may want to discard text that pre-
|
||||
cedes a partial match before adding the next segment. Unfortunately,
|
||||
this is not at present straightforward. In cases such as the above,
|
||||
where the pattern does not contain any lookbehinds, it is sufficient to
|
||||
retain only the partially matched substring. However, if the pattern
|
||||
contains a lookbehind assertion, characters that precede the start of
|
||||
the partial match may have been inspected during the matching process.
|
||||
When pcre2test displays a partial match, it indicates these characters
|
||||
with '<' if the allusedtext modifier is set:
|
||||
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
|
||||
Note that the allusedtext modifier is not available for JIT matching,
|
||||
because JIT matching does not maintain the first and last consulted
|
||||
characters.
|
||||
However, the allusedtext modifier is not available for JIT matching,
|
||||
because JIT matching does not record the first (or last) consulted
|
||||
characters. For this reason, this information is not available via the
|
||||
API. It is therefore not possible in general to obtain the exact number
|
||||
of characters that must be retained in order to get the right match re-
|
||||
sult. If you cannot retain the entire segment, you must find some
|
||||
heuristic way of choosing.
|
||||
|
||||
If you know the approximate length of the matching substrings, you can
|
||||
use that to decide how much text to retain. The only lookbehind infor-
|
||||
mation that is currently available via the API is the length of the
|
||||
longest individual lookbehind in a pattern, but this can be misleading
|
||||
if there are nested lookbehinds. The value returned by calling
|
||||
pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND option is the
|
||||
maximum number of characters (not code units) that any individual look-
|
||||
behind moves back when it is processed. A pattern such as
|
||||
"(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects two
|
||||
characters before its starting point.
|
||||
|
||||
In a non-UTF or a 32-bit case, moving back is just a subtraction, but
|
||||
in UTF-8 or UTF-16 you have to count characters while moving back
|
||||
through the code units.
|
||||
|
||||
|
||||
PARTIAL MATCHING USING pcre2_dfa_match()
|
||||
|
@@ -6012,12 +6014,12 @@ MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
|||
plication, this may or may not be what you want.
|
||||
|
||||
If you do want to allow for starting again at the next character, one
|
||||
way of doing it is to retain the matched part of the segment and try a
|
||||
new complete match, as described for pcre2_match() above. Another pos-
|
||||
sibility is to work with two buffers. If a partial match at offset n in
|
||||
the first buffer is followed by "no match" when PCRE2_DFA_RESTART is
|
||||
used on the second buffer, you can then try a new match starting at
|
||||
offset n+1 in the first buffer.
|
||||
way of doing it is to retain some or all of the segment and try a new
|
||||
complete match, as described for pcre2_match() above. Another possibil-
|
||||
ity is to work with two buffers. If a partial match at offset n in the
|
||||
first buffer is followed by "no match" when PCRE2_DFA_RESTART is used
|
||||
on the second buffer, you can then try a new match starting at offset
|
||||
n+1 in the first buffer.
|
||||
|
||||
|
||||
AUTHOR
|
||||
|
@@ -6029,7 +6031,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 07 August 2019
|
||||
Last updated: 04 September 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
|
|||
.sp
|
||||
PCRE2_INFO_MAXLOOKBEHIND
|
||||
.sp
|
||||
Return the largest number of characters (not code units) before the current
|
||||
matching point that could be inspected while processing a lookbehind assertion
|
||||
in the pattern. Before release 10.34 this request used to give the largest
|
||||
value for any individual assertion. Now it takes into account nested
|
||||
lookbehinds, which can mean that the overall value is greater. For example, the
|
||||
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
|
||||
largest individual lookbehind. Now it returns 3, because matching actually
|
||||
looks back 3 characters.
|
||||
A lookbehind assertion moves back a certain number of characters (not code
|
||||
units) when it starts to process each of its branches. This request returns the
|
||||
largest of these backward moves. The third argument should point to a uint32_t
|
||||
integer. The simple assertions \eb and \eB require a one-character lookbehind
|
||||
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
|
||||
longer. \eA also registers a one-character lookbehind, though it does not
|
||||
actually inspect the previous character.
|
||||
.P
|
||||
The third argument should point to a uint32_t integer. This information is
|
||||
useful when doing multi-segment matching using the partial matching facilities.
|
||||
Note that the simple assertions \eb and \eB require a one-character lookbehind.
|
||||
\eA also registers a one-character lookbehind, though it does not actually
|
||||
inspect the previous character. This is to ensure that at least one character
|
||||
from the old segment is retained when a new segment is processed. Otherwise, if
|
||||
there are no lookbehinds in the pattern, \eA might match incorrectly at the
|
||||
start of a second or subsequent segment. There are more details in the
|
||||
Note that this information is useful for multi-segment matching only
|
||||
if the pattern contains no nested lookbehinds. For example, the pattern
|
||||
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
|
||||
first lookbehind moves back by two characters, matches one character, then the
|
||||
nested lookbehind also moves back by two characters. This puts the matching
|
||||
point three characters earlier than it was at the start.
|
||||
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
|
||||
.\" HREF
|
||||
\fBpcre2partial\fP
|
||||
.\"
|
||||
documentation.
|
||||
documentation for a discussion of multi-segment matching.
|
||||
.sp
|
||||
PCRE2_INFO_MINLENGTH
|
||||
.sp
|
||||
|
|
|
@@ -1,4 +1,4 @@
|
|||
.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
|
||||
.TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions
|
||||
.SH "PARTIAL MATCHING IN PCRE2"
|
||||
|
@@ -213,9 +213,12 @@ there is only a partial match.
|
|||
.sp
|
||||
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||
over time, features (including partial matching) that make multi-segment
|
||||
matching possible have been added. The string is searched segment by segment by
|
||||
calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
|
||||
results that would happen if the entire string was available for searching.
|
||||
matching possible have been added. A very long string can be searched segment
|
||||
by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
|
||||
the same results that would happen if the entire string was available for
|
||||
searching all the time. Normally, the strings that are being sought are much
|
||||
shorter than each individual segment, and are in the middle of very long
|
||||
strings, so the pattern is normally not anchored.
|
||||
.P
|
||||
Special logic must be implemented to handle a matched substring that spans a
|
||||
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
||||
|
@@ -226,8 +229,7 @@ also be set for all but the first segment.
|
|||
When a partial match occurs, the next segment must be added to the current
|
||||
subject and the match re-run, using the \fIstartoffset\fP argument of
|
||||
\fBpcre2_match()\fP to begin at the point where the partial match started.
|
||||
Multi-segment matching is usually used to search for substrings in the middle
|
||||
of very long sequences, so the patterns are normally not anchored. For example:
|
||||
For example:
|
||||
.sp
|
||||
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
|
||||
data> ...the date is 23ja\e=ph
|
||||
|
@ -237,47 +239,48 @@ of very long sequences, so the patterns are normally not anchored. For example:
|
|||
1: jan
|
||||
.sp
|
||||
Note the use of the \fBoffset\fP modifier to start the new match where the
|
||||
partial match was found.
|
||||
partial match was found. In this example, the next segment was added to the one
|
||||
in which the partial match was found. This is the most straightforward
|
||||
approach, typically using a memory buffer that is twice the size of each
|
||||
segment. After a partial match, the first half of the buffer is discarded, the
|
||||
second half is moved to the start of the buffer, and a new segment is added
|
||||
before repeating the match as in the example above. After a no match, the
|
||||
entire buffer can be discarded.
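The double-size buffer handling described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: SEGMENT_SIZE and slide_buffer() are hypothetical names that are not part of PCRE2, every segment is assumed to fill its half of the buffer exactly, and the partial match is assumed to begin in the second half (which is the usual case when the sought strings are much shorter than a segment).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical segment size, chosen small for illustration only. */
#define SEGMENT_SIZE 8

/* After a partial match in a buffer holding two full segments, discard the
first half, move the second half to the start, and append the next segment.
The return value is the partial match offset adjusted to the new layout, which
can then be passed as the startoffset for the repeated match. If the partial
match began in the first half, nothing is changed and (size_t)(-1) is returned,
because this simple scheme cannot retain the start of such a match. */

static size_t
slide_buffer(char *buf, const char *next_segment, size_t partial_start)
{
if (partial_start < SEGMENT_SIZE) return (size_t)(-1);
memmove(buf, buf + SEGMENT_SIZE, SEGMENT_SIZE);          /* Keep second half */
memcpy(buf + SEGMENT_SIZE, next_segment, SEGMENT_SIZE);  /* Add new segment */
return partial_start - SEGMENT_SIZE;
}
```

A caller would re-run the match on the whole buffer with the returned offset. How to handle a partial match that starts in the first half is left open here, since it depends on the application.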
|
||||
.P
|
||||
In this simple example, the next segment was just added to the one in which the
|
||||
partial match was found. However, if there are memory constraints, it may be
|
||||
necessary to discard text that precedes the partial match before adding the
|
||||
next segment. In cases such as the above, where the pattern does not contain
|
||||
any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if a pattern contains a lookbehind assertion, characters
|
||||
If there are memory constraints, you may want to discard text that precedes a
|
||||
partial match before adding the next segment. Unfortunately, this is not at
|
||||
present straightforward. In cases such as the above, where the pattern does not
|
||||
contain any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if the pattern contains a lookbehind assertion, characters
|
||||
that precede the start of the partial match may have been inspected during the
|
||||
matching process.
|
||||
.P
|
||||
The only lookbehind information that is available is the length of the longest
|
||||
lookbehind in a pattern. This may not, of course, be at the start of the
|
||||
pattern, but retaining that many characters before the partial match is
|
||||
sufficient, if not always strictly necessary. The way to do this is as follows:
|
||||
.P
|
||||
Before doing any matching, find the length of the longest lookbehind in the
|
||||
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
|
||||
option. Note that the resulting count is in characters, not code units. After a
|
||||
partial match, moving back from the ovector[0] offset in the subject by the
|
||||
number of characters given for the maximum lookbehind gets you to the earliest
|
||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||
while moving back through the code units. Characters before the point you have
|
||||
now reached can be discarded.
|
||||
.P
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
|
||||
displays a partial match, it indicates the lookbehind characters with '<'
|
||||
characters if the \fBallusedtext\fP modifier is set:
|
||||
matching process. When \fBpcre2test\fP displays a partial match, it indicates
|
||||
these characters with '<' if the \fBallusedtext\fP modifier is set:
|
||||
.sp
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\e=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
.sp
|
||||
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
|
||||
because JIT matching does not maintain the first and last consulted characters.
|
||||
.
|
||||
However, the \fBallusedtext\fP modifier is not available for JIT matching,
|
||||
because JIT matching does not record the first (or last) consulted characters.
|
||||
For this reason, this information is not available via the API. It is therefore
|
||||
not possible in general to obtain the exact number of characters that must be
|
||||
retained in order to get the right match result. If you cannot retain the
|
||||
entire segment, you must find some heuristic way of choosing.
|
||||
.P
|
||||
If you know the approximate length of the matching substrings, you can use that
|
||||
to decide how much text to retain. The only lookbehind information that is
|
||||
currently available via the API is the length of the longest individual
|
||||
lookbehind in a pattern, but this can be misleading if there are nested
|
||||
lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
|
||||
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
|
||||
units) that any individual lookbehind moves back when it is processed. A
|
||||
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
|
||||
inspects two characters before its starting point.
|
||||
.P
|
||||
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
|
||||
UTF-8 or UTF-16 you have to count characters while moving back through the code
|
||||
units.
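Counting characters while moving back through UTF-8 code units might be done along the following lines. This is a minimal sketch, not code from PCRE2: utf8_move_back() is a hypothetical helper, and it assumes that the subject is valid UTF-8. Bear in mind that the number of characters to move back is itself a heuristic choice, because, as noted above, the maximum lookbehind value can understate how far nested lookbehinds inspect.

```c
#include <stddef.h>
#include <stdint.h>

/* Move back from code unit offset pos by up to nchars characters in a UTF-8
subject, stopping at the start of the subject. The subject is assumed to be
valid UTF-8. Returns the offset of the first code unit of the character that
is reached. */

static size_t
utf8_move_back(const uint8_t *subject, size_t pos, uint32_t nchars)
{
while (nchars > 0 && pos > 0)
  {
  pos--;
  /* Skip continuation bytes (10xxxxxx) to reach the character's first byte */
  while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
  nchars--;
  }
return pos;
}
```

In a non-UTF or 32-bit case the loop is unnecessary, because the same result is obtained by subtracting the character count from the offset, taking care not to go below zero.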
|
||||
.
|
||||
.
|
||||
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
|
||||
|
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
|
|||
want.
|
||||
.P
|
||||
If you do want to allow for starting again at the next character, one way of
|
||||
doing it is to retain the matched part of the segment and try a new complete
|
||||
match, as described for \fBpcre2_match()\fP above. Another possibility is to
|
||||
work with two buffers. If a partial match at offset \fIn\fP in the first buffer
|
||||
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
||||
you can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
||||
doing it is to retain some or all of the segment and try a new complete match,
|
||||
as described for \fBpcre2_match()\fP above. Another possibility is to work with
|
||||
two buffers. If a partial match at offset \fIn\fP in the first buffer is
|
||||
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
|
||||
can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
|
@ -365,6 +368,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 07 August 2019
|
||||
Last updated: 04 September 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -128,12 +128,12 @@ static int
|
|||
compile_block *, PCRE2_SIZE *);
|
||||
|
||||
static int
|
||||
get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *,
|
||||
get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
|
||||
compile_block *);
|
||||
|
||||
static BOOL
|
||||
set_lookbehind_lengths(uint32_t **, int *, int *, int *,
|
||||
parsed_recurse_check *, compile_block *);
|
||||
set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
|
||||
compile_block *);
|
||||
|
||||
static int
|
||||
check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
|
||||
|
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
|
|||
#define GI_SET_FIXED_LENGTH 0x80000000u
|
||||
#define GI_NOT_FIXED_LENGTH 0x40000000u
|
||||
#define GI_FIXED_LENGTH_MASK 0x0000ffffu
|
||||
#define GI_EXTRA_MASK 0x0fff0000u
|
||||
#define GI_EXTRA_MAX 0xfff /* NB not unsigned */
|
||||
#define GI_EXTRA_SHIFT 16
|
||||
|
||||
/* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
|
||||
and is fast (a good compiler can turn it into a subtraction and unsigned
|
||||
|
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
|
|||
Arguments:
|
||||
pptrptr pointer to pointer in the parsed pattern
|
||||
isinline FALSE if a reference or recursion; TRUE for inline group
|
||||
extraptr pointer to where to return extra lookbehind length
|
||||
errcodeptr pointer to the errorcode
|
||||
lcptr pointer to the loop counter
|
||||
group number of captured group or -1 for a non-capturing group
|
||||
|
@ -8908,13 +8904,11 @@ Returns: the group length or a negative number
|
|||
*/
|
||||
|
||||
static int
|
||||
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr,
|
||||
int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses,
|
||||
compile_block *cb)
|
||||
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
|
||||
int group, parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
int branchlength;
|
||||
int grouplength = -1;
|
||||
int extra = 0;
|
||||
|
||||
/* The cache can be used only if there is no possibility of there being two
|
||||
groups with the same number. We do not need to set the end pointer for a group
|
||||
|
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
|
|||
if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
|
||||
{
|
||||
if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
|
||||
*extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
|
||||
return groupinfo & GI_FIXED_LENGTH_MASK;
|
||||
}
|
||||
}
|
||||
|
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
|
|||
|
||||
for(;;)
|
||||
{
|
||||
int branchextra;
|
||||
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
|
||||
recurses, cb);
|
||||
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
|
||||
if (branchlength < 0) goto ISNOTFIXED;
|
||||
if (grouplength == -1)
|
||||
{
|
||||
grouplength = branchlength;
|
||||
extra = branchextra;
|
||||
}
|
||||
else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
|
||||
if (grouplength == -1) grouplength = branchlength;
|
||||
else if (grouplength != branchlength) goto ISNOTFIXED;
|
||||
if (**pptrptr == META_KET) break;
|
||||
*pptrptr += 1; /* Skip META_ALT */
|
||||
}
|
||||
|
||||
/* There are only 12 bits for caching the extra value, but a pattern that
|
||||
needs more than that is weird indeed. */
|
||||
|
||||
if (group > 0 && extra <= GI_EXTRA_MAX)
|
||||
cb->groupinfo[group] |= (uint32_t)
|
||||
(GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
|
||||
|
||||
*extraptr = extra;
|
||||
if (group > 0)
|
||||
cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
|
||||
return grouplength;
|
||||
|
||||
ISNOTFIXED:
|
||||
|
@ -8973,17 +8954,11 @@ return -1;
|
|||
*************************************************/
|
||||
|
||||
/* Return a fixed length for a branch in a lookbehind, giving an error if the
|
||||
length is not fixed. We also take note of any extra value that is generated
|
||||
from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual
|
||||
lookbehind has length 2, but the max_lookbehind setting must be 3 because
|
||||
matching inspects 3 characters before the match starting point.
|
||||
|
||||
On entry, *pptrptr points to the first element inside the branch. On exit it is
|
||||
set to point to the ALT or KET.
|
||||
length is not fixed. On entry, *pptrptr points to the first element inside the
|
||||
branch. On exit it is set to point to the ALT or KET.
|
||||
|
||||
Arguments:
|
||||
pptrptr pointer to pointer in the parsed pattern
|
||||
extraptr pointer to where to return extra lookbehind length
|
||||
errcodeptr pointer to error code
|
||||
lcptr pointer to loop counter
|
||||
recurses chain of recurse_check to catch mutual recursion
|
||||
|
@ -8993,14 +8968,11 @@ Returns: the length, or a negative value on error
|
|||
*/
|
||||
|
||||
static int
|
||||
get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr,
|
||||
get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
||||
parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
int branchlength = 0;
|
||||
int grouplength;
|
||||
int groupextra;
|
||||
int max;
|
||||
int extra = 0; /* Additional lookbehind from nesting */
|
||||
uint32_t lastitemlength = 0;
|
||||
uint32_t *pptr = *pptrptr;
|
||||
PCRE2_SIZE offset;
|
||||
|
@ -9149,17 +9121,13 @@ for (;; pptr++)
|
|||
break;
|
||||
|
||||
/* A nested lookbehind does not contribute any length to this lookbehind,
|
||||
but must itself be checked and have its lengths set. If the maximum
|
||||
lookbehind for the nested lookbehind is greater than the length so far
|
||||
computed for this branch, we must compute an extra value and keep the
|
||||
largest encountered for use when setting the maximum overall lookbehind. */
|
||||
but must itself be checked and have its lengths set. */
|
||||
|
||||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
|
||||
if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
|
||||
return -1;
|
||||
if (max - branchlength > extra) extra = max - branchlength;
|
||||
break;
|
||||
|
||||
/* Back references and recursions are handled by very similar code. At this
|
||||
|
@ -9267,14 +9235,15 @@ for (;; pptr++)
|
|||
in the cache. */
|
||||
|
||||
gptr++;
|
||||
grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr,
|
||||
group, &this_recurse, cb);
|
||||
grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group,
|
||||
&this_recurse, cb);
|
||||
if (grouplength < 0)
|
||||
{
|
||||
if (*errcodeptr == 0) goto ISNOTFIXED;
|
||||
return -1; /* Error already set */
|
||||
}
|
||||
goto OK_GROUP;
|
||||
itemlength = grouplength;
|
||||
break;
|
||||
|
||||
/* Check nested groups - advance past the initial data for each type and
|
||||
then seek a fixed length with get_grouplength(). */
|
||||
|
@ -9304,16 +9273,10 @@ for (;; pptr++)
|
|||
case META_SCRIPT_RUN:
|
||||
pptr++;
|
||||
CHECK_GROUP:
|
||||
grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr,
|
||||
group, recurses, cb);
|
||||
grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group,
|
||||
recurses, cb);
|
||||
if (grouplength < 0) return -1;
|
||||
|
||||
/* A nested lookbehind within the group may require looking back further
|
||||
than the length of the group. */
|
||||
|
||||
OK_GROUP:
|
||||
itemlength = grouplength;
|
||||
if (groupextra - branchlength > extra) extra = groupextra - branchlength;
|
||||
break;
|
||||
|
||||
/* Exact repetition is OK; variable repetition is not. A repetition of zero
|
||||
|
@ -9374,7 +9337,6 @@ for (;; pptr++)
|
|||
|
||||
EXIT:
|
||||
*pptrptr = pptr;
|
||||
*extraptr = extra;
|
||||
return branchlength;
|
||||
|
||||
PARSED_SKIP_FAILED:
|
||||
|
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.
|
|||
|
||||
Arguments:
|
||||
pptrptr pointer to pointer in the parsed pattern
|
||||
maxptr where to return maximum lookbehind for the whole group
|
||||
errcodeptr pointer to error code
|
||||
lcptr pointer to loop counter
|
||||
recurses chain of recurse_check to catch mutual recursion
|
||||
|
@ -9411,13 +9372,11 @@ Returns: TRUE if all is well
|
|||
*/
|
||||
|
||||
static BOOL
|
||||
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
|
||||
int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
|
||||
set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
||||
parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
PCRE2_SIZE offset;
|
||||
int branchlength;
|
||||
int branchextra;
|
||||
int max = 0;
|
||||
uint32_t *bptr = *pptrptr;
|
||||
|
||||
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
||||
|
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
|||
do
|
||||
{
|
||||
*pptrptr += 1;
|
||||
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
|
||||
recurses, cb);
|
||||
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
|
||||
if (branchlength < 0)
|
||||
{
|
||||
/* The errorcode and offset may already be set from a nested lookbehind. */
|
||||
|
@ -9435,14 +9393,12 @@ do
|
|||
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
|
||||
return FALSE;
|
||||
}
|
||||
if (branchlength + branchextra > max) max = branchlength + branchextra;
|
||||
if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
|
||||
*bptr |= branchlength; /* branchlength never more than 65535 */
|
||||
bptr = *pptrptr;
|
||||
}
|
||||
while (*bptr == META_ALT);
|
||||
|
||||
if (max > cb->max_lookbehind) cb->max_lookbehind = max;
|
||||
*maxptr = max;
|
||||
return TRUE;
|
||||
}
|
||||
|
||||
|
@ -9475,7 +9431,6 @@ static int
|
|||
check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
|
||||
parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
int max;
|
||||
int errorcode = 0;
|
||||
int loopcount = 0;
|
||||
int nestlevel = 0;
|
||||
|
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
|
|||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount,
|
||||
recurses, cb))
|
||||
if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
|
||||
return errorcode;
|
||||
break;
|
||||
}
|
||||
|
|
|
@ -304,7 +304,7 @@ Partial match, mark=xx: 123a
|
|||
|
||||
/(?<=(?<=a)b)c.*/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 2
|
||||
Max lookbehind = 1
|
||||
First code unit = 'c'
|
||||
Subject length lower bound = 1
|
||||
abc\=ph
|
||||
|
@ -337,7 +337,7 @@ Partial match: abcd
|
|||
|
||||
/(?<=(?<=(?<=a)b)c)./I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 3
|
||||
Max lookbehind = 1
|
||||
Subject length lower bound = 1
|
||||
123abcXYZ
|
||||
0: abcX
|
||||
|
@ -354,7 +354,7 @@ Subject length lower bound = 1
|
|||
|
||||
/(?<=ab((?<=...)cd))./I
|
||||
Capture group count = 1
|
||||
Max lookbehind = 5
|
||||
Max lookbehind = 4
|
||||
Subject length lower bound = 1
|
||||
ZabcdX
|
||||
0: ZabcdX
|
||||
|
@ -363,7 +363,7 @@ Subject length lower bound = 1
|
|||
|
||||
/(?<=((?<=(?<=ab).))(?1)(?1))./I
|
||||
Capture group count = 1
|
||||
Max lookbehind = 3
|
||||
Max lookbehind = 2
|
||||
Subject length lower bound = 1
|
||||
abxZ
|
||||
0: abxZ
|
||||
|
|
|
@ -17036,7 +17036,7 @@ Subject length lower bound = 1
|
|||
|
||||
/(?<=(?<=a)b)c.*/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 2
|
||||
Max lookbehind = 1
|
||||
First code unit = 'c'
|
||||
Subject length lower bound = 1
|
||||
abc\=ph
|
||||
|
@ -17064,7 +17064,7 @@ Subject length lower bound = 0
|
|||
|
||||
/(?<=a(?<=a|ba)c)/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 3
|
||||
Max lookbehind = 2
|
||||
May match empty string
|
||||
Subject length lower bound = 0
|
||||
|
||||
|
@ -17076,7 +17076,7 @@ Subject length lower bound = 0
|
|||
|
||||
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 5
|
||||
Max lookbehind = 4
|
||||
May match empty string
|
||||
Subject length lower bound = 0
|
||||
|
||||
|
|