Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current
difficulty if the whole first segment cannot be retained.
Philip.Hazel 2019-09-04 18:14:54 +00:00
parent 87bc092222
commit 963b570fd0
9 changed files with 868 additions and 915 deletions

@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
parenthesized item may contain multiple branches or other backtracking points,
for example /(a|ab){1}+c/ or /(a+){1}+a/.
13. Nested lookbehinds are now taken into account when computing the maximum
lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
it to 3, because matching looks back 3 characters.
14. For partial matches, pcre2test was always showing the maximum lookbehind
13. For partial matches, pcre2test was always showing the maximum lookbehind
characters, flagged with "<", which is misleading when the lookbehind didn't
actually look behind the start (because it was later in the pattern). Showing
all consulted preceding characters for partial matches is now controlled by the
@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
available only for non-JIT matching, because JIT does not maintain the first
and last consulted characters.
15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
if the end of the subject was encountered in a lookahead (conditional or
otherwise), an atomic group, or a recursion.
16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
17. Check for integer overflow when computing lookbehind lengths. Fixes
16. Check for integer overflow when computing lookbehind lengths. Fixes
Clusterfuzz issue 15636.
18. Implemented non-atomic positive lookaround assertions.
17. Implemented non-atomic positive lookaround assertions.
19. If a lookbehind contained a lookahead that contained another lookbehind
18. If a lookbehind contained a lookahead that contained another lookbehind
within it, the nested lookbehind was not correctly processed. For example, if
/(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
"b".
20. Implemented pcre2_get_match_data_size().
19. Implemented pcre2_get_match_data_size().
21. Two alterations to partial matching (not yet done by JIT):
20. Two alterations to partial matching (not yet done by JIT):
(a) The definition of a partial match is slightly changed: if a pattern
contains any lookbehinds, an empty partial match may be given, because this
@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
(c) An empty string partial hard match can be returned for \z and \Z as it
is documented that they shouldn't match.
22. A branch that started with (*ACCEPT) was not being recognized as one that
21. A branch that started with (*ACCEPT) was not being recognized as one that
could match an empty string.
23. Corrected pcre2_set_character_tables() tables data type: was const unsigned
22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
char * instead of const uint8_t *, as generated by pcre2_maketables().
24. Upgraded to Unicode 12.1.0.
23. Upgraded to Unicode 12.1.0.
25. Add -jitfast command line option to pcre2test (to make all the jit options
24. Add -jitfast command line option to pcre2test (to make all the jit options
available directly).
26. Make pcre2test -C show if libreadline or libedit is supported.
25. Make pcre2test -C show if libreadline or libedit is supported.
28. If the length of one branch of a group exceeded 65535 (the maximum value
26. If the length of one branch of a group exceeded 65535 (the maximum value
that is remembered as a minimum length), the whole group's length was
incorrectly recorded as 65535, leading to incorrect "no match" when start-up
optimizations were in force.
29. The "rightmost consulted character" value was not always correct; in
27. The "rightmost consulted character" value was not always correct; in
particular, if a pattern ended with a negative lookahead, characters that were
inspected in that lookahead were not included.
30. Add the pcre2_maketables_free() function.
28. Add the pcre2_maketables_free() function.
Version 10.33 16-April-2019

@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
<pre>
PCRE2_INFO_MAXLOOKBEHIND
</pre>
Return the largest number of characters (not code units) before the current
matching point that could be inspected while processing a lookbehind assertion
in the pattern. Before release 10.34 this request used to give the largest
value for any individual assertion. Now it takes into account nested
lookbehinds, which can mean that the overall value is greater. For example, the
pattern (?&#60;=a(?&#60;=ba)c) previously returned 2, because that is the length of the
largest individual lookbehind. Now it returns 3, because matching actually
looks back 3 characters.
A lookbehind assertion moves back a certain number of characters (not code
units) when it starts to process each of its branches. This request returns the
largest of these backward moves. The third argument should point to a uint32_t
integer. The simple assertions \b and \B require a one-character lookbehind
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
longer. \A also registers a one-character lookbehind, though it does not
actually inspect the previous character.
</P>
<P>
The third argument should point to a uint32_t integer. This information is
useful when doing multi-segment matching using the partial matching facilities.
Note that the simple assertions \b and \B require a one-character lookbehind.
\A also registers a one-character lookbehind, though it does not actually
inspect the previous character. This is to ensure that at least one character
from the old segment is retained when a new segment is processed. Otherwise, if
there are no lookbehinds in the pattern, \A might match incorrectly at the
start of a second or subsequent segment. There are more details in the
Note that this information is useful for multi-segment matching only
if the pattern contains no nested lookbehinds. For example, the pattern
(?&#60;=a(?&#60;=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
first lookbehind moves back by two characters, matches one character, then the
nested lookbehind also moves back by two characters. This puts the matching
point three characters earlier than it was at the start.
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
documentation for a discussion of multi-segment matching.
<pre>
PCRE2_INFO_MINLENGTH
</pre>

@ -244,9 +244,12 @@ there is only a partial match.
<P>
PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment
matching possible have been added. The string is searched segment by segment by
calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
results that would happen if the entire string was available for searching.
matching possible have been added. A very long string can be searched segment
by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
</P>
<P>
Special logic must be implemented to handle a matched substring that spans a
@ -259,8 +262,7 @@ also be set for all but the first segment.
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the <i>startoffset</i> argument of
<b>pcre2_match()</b> to begin at the point where the partial match started.
Multi-segment matching is usually used to search for substrings in the middle
of very long sequences, so the patterns are normally not anchored. For example:
For example:
<pre>
re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
data&#62; ...the date is 23ja\=ph
@ -270,50 +272,51 @@ of very long sequences, so the patterns are normally not anchored. For example:
1: jan
</pre>
Note the use of the <b>offset</b> modifier to start the new match where the
partial match was found.
partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
</P>
<P>
In this simple example, the next segment was just added to the one in which the
partial match was found. However, if there are memory constraints, it may be
necessary to discard text that precedes the partial match before adding the
next segment. In cases such as the above, where the pattern does not contain
any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if a pattern contains a lookbehind assertion, characters
If there are memory constraints, you may want to discard text that precedes a
partial match before adding the next segment. Unfortunately, this is not at
present straightforward. In cases such as the above, where the pattern does not
contain any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if the pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the
matching process.
</P>
<P>
The only lookbehind information that is available is the length of the longest
lookbehind in a pattern. This may not, of course, be at the start of the
pattern, but retaining that many characters before the partial match is
sufficient, if not always strictly necessary. The way to do this is as follows:
</P>
<P>
Before doing any matching, find the length of the longest lookbehind in the
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
option. Note that the resulting count is in characters, not code units. After a
partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
while moving back through the code units. Characters before the point you have
now reached can be discarded.
</P>
<P>
For example, if the pattern "(?&#60;=123)abc" is partially matched against the
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
displays a partial match, it indicates the lookbehind characters with '&#60;'
characters if the <b>allusedtext</b> modifier is set:
matching process. When <b>pcre2test</b> displays a partial match, it indicates
these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
<pre>
re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab
&#60;&#60;&#60;
</pre>
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted characters.
However, the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not record the first (or last) consulted characters.
For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
</P>
<P>
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
</P>
<P>
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
</P>
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
<P>
@ -379,11 +382,11 @@ want.
</P>
<P>
If you do want to allow for starting again at the next character, one way of
doing it is to retain the matched part of the segment and try a new complete
match, as described for <b>pcre2_match()</b> above. Another possibility is to
work with two buffers. If a partial match at offset <i>n</i> in the first buffer
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
you can then try a new match starting at offset <i>n+1</i> in the first buffer.
doing it is to retain some or all of the segment and try a new complete match,
as described for <b>pcre2_match()</b> above. Another possibility is to work with
two buffers. If a partial match at offset <i>n</i> in the first buffer is
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
can then try a new match starting at offset <i>n+1</i> in the first buffer.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
@ -396,7 +399,7 @@ Cambridge, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 07 August 2019
Last updated: 04 September 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
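
The two-segment buffer approach described in the pcre2partial changes above can be sketched in C as follows. This is a minimal illustration, not code from PCRE2: the helper name `slide_and_append` and its parameters are hypothetical, and it assumes fixed-size segments and a partial match that began in the second half of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the PCRE2 API).
   `buf` holds two segments of `seg_len` bytes each. Discard the first
   half, move the second half to the start, and copy the next segment in
   behind it. Returns the new subject length in bytes. */
static size_t slide_and_append(char *buf, size_t seg_len,
                               const char *next, size_t next_len)
{
  assert(buf != NULL && next != NULL);
  assert(next_len <= seg_len);             /* the last segment may be short */

  memmove(buf, buf + seg_len, seg_len);    /* second half -> start of buffer */
  memcpy(buf + seg_len, next, next_len);   /* new segment fills second half */
  return seg_len + next_len;
}
```

Because the retained text moves down by `seg_len` bytes, a partial match start offset recorded against the old buffer contents would need `seg_len` subtracted from it before being passed as the startoffset for the repeated match.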

@ -2234,25 +2234,24 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_MAXLOOKBEHIND
Return the largest number of characters (not code units) before the
current matching point that could be inspected while processing a look-
behind assertion in the pattern. Before release 10.34 this request used
to give the largest value for any individual assertion. Now it takes
into account nested lookbehinds, which can mean that the overall value
is greater. For example, the pattern (?<=a(?<=ba)c) previously returned
2, because that is the length of the largest individual lookbehind. Now
it returns 3, because matching actually looks back 3 characters.
A lookbehind assertion moves back a certain number of characters (not
code units) when it starts to process each of its branches. This re-
quest returns the largest of these backward moves. The third argument
should point to a uint32_t integer. The simple assertions \b and \B re-
quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
return 1 in the absence of anything longer. \A also registers a one-
character lookbehind, though it does not actually inspect the previous
character.
The third argument should point to a uint32_t integer. This information
is useful when doing multi-segment matching using the partial matching
facilities. Note that the simple assertions \b and \B require a one-
character lookbehind. \A also registers a one-character lookbehind,
though it does not actually inspect the previous character. This is to
ensure that at least one character from the old segment is retained
when a new segment is processed. Otherwise, if there are no lookbehinds
in the pattern, \A might match incorrectly at the start of a second or
subsequent segment. There are more details in the pcre2partial documen-
tation.
Note that this information is useful for multi-segment matching only if
the pattern contains no nested lookbehinds. For example, the pattern
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is pro-
cessed, the first lookbehind moves back by two characters, matches one
character, then the nested lookbehind also moves back by two charac-
ters. This puts the matching point three characters earlier than it was
at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de-
bugging tool. See the pcre2partial documentation for a discussion of
multi-segment matching.
PCRE2_INFO_MINLENGTH
@ -5877,10 +5876,13 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
PCRE was not originally designed with multi-segment matching in mind.
However, over time, features (including partial matching) that make
multi-segment matching possible have been added. The string is searched
segment by segment by calling pcre2_match() repeatedly, with the aim of
achieving the same results that would happen if the entire string was
available for searching.
multi-segment matching possible have been added. A very long string can
be searched segment by segment by calling pcre2_match() repeatedly,
with the aim of achieving the same results that would happen if the en-
tire string was available for searching all the time. Normally, the
strings that are being sought are much shorter than each individual
segment, and are in the middle of very long strings, so the pattern is
normally not anchored.
Special logic must be implemented to handle a matched substring that
spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
@ -5891,9 +5893,7 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
When a partial match occurs, the next segment must be added to the cur-
rent subject and the match re-run, using the startoffset argument of
pcre2_match() to begin at the point where the partial match started.
Multi-segment matching is usually used to search for substrings in the
middle of very long sequences, so the patterns are normally not an-
chored. For example:
For example:
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
data> ...the date is 23ja\=ph
@ -5903,49 +5903,51 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
1: jan
Note the use of the offset modifier to start the new match where the
partial match was found.
partial match was found. In this example, the next segment was added to
the one in which the partial match was found. This is the most
straightforward approach, typically using a memory buffer that is twice
the size of each segment. After a partial match, the first half of the
buffer is discarded, the second half is moved to the start of the buf-
fer, and a new segment is added before repeating the match as in the
example above. After a no match, the entire buffer can be discarded.
In this simple example, the next segment was just added to the one in
which the partial match was found. However, if there are memory con-
straints, it may be necessary to discard text that precedes the partial
match before adding the next segment. In cases such as the above, where
the pattern does not contain any lookbehinds, it is sufficient to re-
tain only the partially matched substring. However, if a pattern con-
tains a lookbehind assertion, characters that precede the start of the
partial match may have been inspected during the matching process.
The only lookbehind information that is available is the length of the
longest lookbehind in a pattern. This may not, of course, be at the
start of the pattern, but retaining that many characters before the
partial match is sufficient, if not always strictly necessary. The way
to do this is as follows:
Before doing any matching, find the length of the longest lookbehind in
the pattern by calling pcre2_pattern_info() with the
PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
characters, not code units. After a partial match, moving back from the
ovector[0] offset in the subject by the number of characters given for
the maximum lookbehind gets you to the earliest character that must be
retained. In a non-UTF or a 32-bit situation, moving back is just a
subtraction, but in UTF-8 or UTF-16 you have to count characters while
moving back through the code units. Characters before the point you
have now reached can be discarded.
For example, if the pattern "(?<=123)abc" is partially matched against
the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
mum lookbehind count is 3, so all characters before offset 2 can be
discarded. The value of startoffset for the next match should be 3.
When pcre2test displays a partial match, it indicates the lookbehind
characters with '<' characters if the allusedtext modifier is set:
If there are memory constraints, you may want to discard text that pre-
cedes a partial match before adding the next segment. Unfortunately,
this is not at present straightforward. In cases such as the above,
where the pattern does not contain any lookbehinds, it is sufficient to
retain only the partially matched substring. However, if the pattern
contains a lookbehind assertion, characters that precede the start of
the partial match may have been inspected during the matching process.
When pcre2test displays a partial match, it indicates these characters
with '<' if the allusedtext modifier is set:
re> "(?<=123)abc"
data> xx123ab\=ph,allusedtext
Partial match: 123ab
<<<
Note that the allusedtext modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted
characters.
However, the allusedtext modifier is not available for JIT matching,
because JIT matching does not record the first (or last) consulted
characters. For this reason, this information is not available via the
API. It is therefore not possible in general to obtain the exact number
of characters that must be retained in order to get the right match re-
sult. If you cannot retain the entire segment, you must find some
heuristic way of choosing.
If you know the approximate length of the matching substrings, you can
use that to decide how much text to retain. The only lookbehind infor-
mation that is currently available via the API is the length of the
longest individual lookbehind in a pattern, but this can be misleading
if there are nested lookbehinds. The value returned by calling
pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND option is the
maximum number of characters (not code units) that any individual look-
behind moves back when it is processed. A pattern such as
"(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects two
characters before its starting point.
In a non-UTF or a 32-bit case, moving back is just a subtraction, but
in UTF-8 or UTF-16 you have to count characters while moving back
through the code units.
PARTIAL MATCHING USING pcre2_dfa_match()
@ -6012,12 +6014,12 @@ MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
plication, this may or may not be what you want.
If you do want to allow for starting again at the next character, one
way of doing it is to retain the matched part of the segment and try a
new complete match, as described for pcre2_match() above. Another pos-
sibility is to work with two buffers. If a partial match at offset n in
the first buffer is followed by "no match" when PCRE2_DFA_RESTART is
used on the second buffer, you can then try a new match starting at
offset n+1 in the first buffer.
way of doing it is to retain some or all of the segment and try a new
complete match, as described for pcre2_match() above. Another possibil-
ity is to work with two buffers. If a partial match at offset n in the
first buffer is followed by "no match" when PCRE2_DFA_RESTART is used
on the second buffer, you can then try a new match starting at offset
n+1 in the first buffer.
AUTHOR
@ -6029,7 +6031,7 @@ AUTHOR
REVISION
Last updated: 07 August 2019
Last updated: 04 September 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
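
The remark above that, in UTF-8, moving back means counting characters while stepping through code units can be illustrated with a short C sketch. The function name `utf8_move_back` is hypothetical and not part of PCRE2, and the sketch assumes the buffer holds valid UTF-8. As the revised documentation points out, the number of characters to move back is a heuristic choice, because PCRE2_INFO_MAXLOOKBEHIND can understate the distance when lookbehinds are nested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PCRE2 API).
   Starting at code unit offset `pos` in a valid UTF-8 buffer, step back
   over up to `nchars` characters and return the resulting code unit
   offset. Stops early if the start of the buffer is reached. */
static size_t utf8_move_back(const uint8_t *buf, size_t pos, uint32_t nchars)
{
  assert(buf != NULL);
  while (nchars > 0 && pos > 0)
    {
    pos--;                                 /* onto the previous code unit */
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
      pos--;                               /* skip continuation bytes 10xxxxxx */
    nchars--;
    }
  return pos;
}
```

For a subject that contains only single-byte characters this reduces to a plain subtraction clamped at zero, which matches the non-UTF case described in the text.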

@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
.sp
PCRE2_INFO_MAXLOOKBEHIND
.sp
Return the largest number of characters (not code units) before the current
matching point that could be inspected while processing a lookbehind assertion
in the pattern. Before release 10.34 this request used to give the largest
value for any individual assertion. Now it takes into account nested
lookbehinds, which can mean that the overall value is greater. For example, the
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
largest individual lookbehind. Now it returns 3, because matching actually
looks back 3 characters.
A lookbehind assertion moves back a certain number of characters (not code
units) when it starts to process each of its branches. This request returns the
largest of these backward moves. The third argument should point to a uint32_t
integer. The simple assertions \eb and \eB require a one-character lookbehind
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
longer. \eA also registers a one-character lookbehind, though it does not
actually inspect the previous character.
.P
The third argument should point to a uint32_t integer. This information is
useful when doing multi-segment matching using the partial matching facilities.
Note that the simple assertions \eb and \eB require a one-character lookbehind.
\eA also registers a one-character lookbehind, though it does not actually
inspect the previous character. This is to ensure that at least one character
from the old segment is retained when a new segment is processed. Otherwise, if
there are no lookbehinds in the pattern, \eA might match incorrectly at the
start of a second or subsequent segment. There are more details in the
Note that this information is useful for multi-segment matching only
if the pattern contains no nested lookbehinds. For example, the pattern
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
first lookbehind moves back by two characters, matches one character, then the
nested lookbehind also moves back by two characters. This puts the matching
point three characters earlier than it was at the start.
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
.\" HREF
\fBpcre2partial\fP
.\"
documentation.
documentation for a discussion of multi-segment matching.
.sp
PCRE2_INFO_MINLENGTH
.sp

@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
.TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2"
@ -213,9 +213,12 @@ there is only a partial match.
.sp
PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment
matching possible have been added. The string is searched segment by segment by
calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
results that would happen if the entire string was available for searching.
matching possible have been added. A very long string can be searched segment
by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
.P
Special logic must be implemented to handle a matched substring that spans a
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
@ -226,8 +229,7 @@ also be set for all but the first segment.
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the \fIstartoffset\fP argument of
\fBpcre2_match()\fP to begin at the point where the partial match started.
Multi-segment matching is usually used to search for substrings in the middle
of very long sequences, so the patterns are normally not anchored. For example:
For example:
.sp
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
data> ...the date is 23ja\e=ph
@ -237,47 +239,48 @@ of very long sequences, so the patterns are normally not anchored. For example:
1: jan
.sp
Note the use of the \fBoffset\fP modifier to start the new match where the
partial match was found.
partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
.P
In this simple example, the next segment was just added to the one in which the
partial match was found. However, if there are memory constraints, it may be
necessary to discard text that precedes the partial match before adding the
next segment. In cases such as the above, where the pattern does not contain
any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if a pattern contains a lookbehind assertion, characters
If there are memory constraints, you may want to discard text that precedes a
partial match before adding the next segment. Unfortunately, this is not at
present straightforward. In cases such as the above, where the pattern does not
contain any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if the pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the
matching process.
.P
The only lookbehind information that is available is the length of the longest
lookbehind in a pattern. This may not, of course, be at the start of the
pattern, but retaining that many characters before the partial match is
sufficient, if not always strictly necessary. The way to do this is as follows:
.P
Before doing any matching, find the length of the longest lookbehind in the
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
option. Note that the resulting count is in characters, not code units. After a
partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
while moving back through the code units. Characters before the point you have
now reached can be discarded.
.P
For example, if the pattern "(?<=123)abc" is partially matched against the
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<'
characters if the \fBallusedtext\fP modifier is set:
matching process. When \fBpcre2test\fP displays a partial match, it indicates
these characters with '<' if the \fBallusedtext\fP modifier is set:
.sp
re> "(?<=123)abc"
data> xx123ab\e=ph,allusedtext
Partial match: 123ab
<<<
.sp
Note that the \fBallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted characters.
.
However, the \fBallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not record the first (or last) consulted characters.
For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
.P
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
.P
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
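.P
As an illustration of the counting described above, the following is a minimal
sketch of moving back a given number of characters in a UTF-8 subject. The
function name utf8_move_back() is hypothetical and is not part of the PCRE2
API. The sketch assumes that the subject is valid UTF-8 and that the starting
offset is on a character boundary. Bear in mind that a count taken from
PCRE2_INFO_MAXLOOKBEHIND may be too small when lookbehinds are nested, so the
offset this returns is only a heuristic lower bound on what to retain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): starting at code unit offset
"offset" in a valid UTF-8 subject, move back by up to "nchars" characters and
return the resulting code unit offset. Moving back stops at offset zero. */

static size_t
utf8_move_back(const uint8_t *subject, size_t offset, uint32_t nchars)
{
assert(subject != NULL);
while (nchars > 0 && offset > 0)
  {
  offset--;
  /* Skip continuation bytes (binary 10xxxxxx) to reach the lead byte. */
  while (offset > 0 && (subject[offset] & 0xC0) == 0x80) offset--;
  nchars--;
  }
return offset;
}
```

For a pure ASCII subject the result is the same as a subtraction that is
clamped at zero. In a 32-bit or non-UTF case no such loop is needed.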
.
.
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
want.
.P
If you do want to allow for starting again at the next character, one way of
doing it is to retain the matched part of the segment and try a new complete
match, as described for \fBpcre2_match()\fP above. Another possibility is to
work with two buffers. If a partial match at offset \fIn\fP in the first buffer
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
you can then try a new match starting at offset \fIn+1\fP in the first buffer.
doing it is to retain some or all of the segment and try a new complete match,
as described for \fBpcre2_match()\fP above. Another possibility is to work with
two buffers. If a partial match at offset \fIn\fP in the first buffer is
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
can then try a new match starting at offset \fIn+1\fP in the first buffer.
.
.
.SH AUTHOR
@ -365,6 +368,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 07 August 2019
Last updated: 04 September 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -128,12 +128,12 @@ static int
compile_block *, PCRE2_SIZE *);
static int
get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *,
get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
compile_block *);
static BOOL
set_lookbehind_lengths(uint32_t **, int *, int *, int *,
parsed_recurse_check *, compile_block *);
set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
compile_block *);
static int
check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
#define GI_SET_FIXED_LENGTH 0x80000000u
#define GI_NOT_FIXED_LENGTH 0x40000000u
#define GI_FIXED_LENGTH_MASK 0x0000ffffu
#define GI_EXTRA_MASK 0x0fff0000u
#define GI_EXTRA_MAX 0xfff /* NB not unsigned */
#define GI_EXTRA_SHIFT 16
/* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
and is fast (a good compiler can turn it into a subtraction and unsigned
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
Arguments:
pptrptr pointer to pointer in the parsed pattern
isinline FALSE if a reference or recursion; TRUE for inline group
extraptr pointer to where to return extra lookbehind length
errcodeptr pointer to the errorcode
lcptr pointer to the loop counter
group number of captured group or -1 for a non-capturing group
@ -8908,13 +8904,11 @@ Returns: the group length or a negative number
*/
static int
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr,
int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses,
compile_block *cb)
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
int group, parsed_recurse_check *recurses, compile_block *cb)
{
int branchlength;
int grouplength = -1;
int extra = 0;
/* The cache can be used only if there is no possibility of there being two
groups with the same number. We do not need to set the end pointer for a group
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
{
if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
*extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
return groupinfo & GI_FIXED_LENGTH_MASK;
}
}
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
for(;;)
{
int branchextra;
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
recurses, cb);
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
if (branchlength < 0) goto ISNOTFIXED;
if (grouplength == -1)
{
grouplength = branchlength;
extra = branchextra;
}
else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
if (grouplength == -1) grouplength = branchlength;
else if (grouplength != branchlength) goto ISNOTFIXED;
if (**pptrptr == META_KET) break;
*pptrptr += 1; /* Skip META_ALT */
}
/* There are only 12 bits for caching the extra value, but a pattern that
needs more than that is weird indeed. */
if (group > 0 && extra <= GI_EXTRA_MAX)
cb->groupinfo[group] |= (uint32_t)
(GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
*extraptr = extra;
if (group > 0)
cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
return grouplength;
ISNOTFIXED:
@ -8973,17 +8954,11 @@ return -1;
*************************************************/
/* Return a fixed length for a branch in a lookbehind, giving an error if the
length is not fixed. We also take note of any extra value that is generated
from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual
lookbehind has length 2, but the max_lookbehind setting must be 3 because
matching inspects 3 characters before the match starting point.
On entry, *pptrptr points to the first element inside the branch. On exit it is
set to point to the ALT or KET.
length is not fixed. On entry, *pptrptr points to the first element inside the
branch. On exit it is set to point to the ALT or KET.
Arguments:
pptrptr pointer to pointer in the parsed pattern
extraptr pointer to where to return extra lookbehind length
errcodeptr pointer to error code
lcptr pointer to loop counter
recurses chain of recurse_check to catch mutual recursion
@ -8993,14 +8968,11 @@ Returns: the length, or a negative value on error
*/
static int
get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr,
get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
parsed_recurse_check *recurses, compile_block *cb)
{
int branchlength = 0;
int grouplength;
int groupextra;
int max;
int extra = 0; /* Additional lookbehind from nesting */
uint32_t lastitemlength = 0;
uint32_t *pptr = *pptrptr;
PCRE2_SIZE offset;
@ -9149,17 +9121,13 @@ for (;; pptr++)
break;
/* A nested lookbehind does not contribute any length to this lookbehind,
but must itself be checked and have its lengths set. If the maximum
lookbehind for the nested lookbehind is greater than the length so far
computed for this branch, we must compute an extra value and keep the
largest encountered for use when setting the maximum overall lookbehind. */
but must itself be checked and have its lengths set. */
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
return -1;
if (max - branchlength > extra) extra = max - branchlength;
break;
/* Back references and recursions are handled by very similar code. At this
@ -9267,14 +9235,15 @@ for (;; pptr++)
in the cache. */
gptr++;
grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr,
group, &this_recurse, cb);
grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group,
&this_recurse, cb);
if (grouplength < 0)
{
if (*errcodeptr == 0) goto ISNOTFIXED;
return -1; /* Error already set */
}
goto OK_GROUP;
itemlength = grouplength;
break;
/* Check nested groups - advance past the initial data for each type and
then seek a fixed length with get_grouplength(). */
@ -9304,16 +9273,10 @@ for (;; pptr++)
case META_SCRIPT_RUN:
pptr++;
CHECK_GROUP:
grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr,
group, recurses, cb);
grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group,
recurses, cb);
if (grouplength < 0) return -1;
/* A nested lookbehind within the group may require looking back further
than the length of the group. */
OK_GROUP:
itemlength = grouplength;
if (groupextra - branchlength > extra) extra = groupextra - branchlength;
break;
/* Exact repetition is OK; variable repetition is not. A repetition of zero
@ -9374,7 +9337,6 @@ for (;; pptr++)
EXIT:
*pptrptr = pptr;
*extraptr = extra;
return branchlength;
PARSED_SKIP_FAILED:
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.
Arguments:
pptrptr pointer to pointer in the parsed pattern
maxptr where to return maximum lookbehind for the whole group
errcodeptr pointer to error code
lcptr pointer to loop counter
recurses chain of recurse_check to catch mutual recursion
@ -9411,13 +9372,11 @@ Returns: TRUE if all is well
*/
static BOOL
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
parsed_recurse_check *recurses, compile_block *cb)
{
PCRE2_SIZE offset;
int branchlength;
int branchextra;
int max = 0;
uint32_t *bptr = *pptrptr;
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr); /* Offset for error messages */
do
{
*pptrptr += 1;
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
recurses, cb);
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
if (branchlength < 0)
{
/* The errorcode and offset may already be set from a nested lookbehind. */
@ -9435,14 +9393,12 @@ do
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
return FALSE;
}
if (branchlength + branchextra > max) max = branchlength + branchextra;
if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
*bptr |= branchlength; /* branchlength never more than 65535 */
bptr = *pptrptr;
}
while (*bptr == META_ALT);
if (max > cb->max_lookbehind) cb->max_lookbehind = max;
*maxptr = max;
return TRUE;
}
@ -9475,7 +9431,6 @@ static int
check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
parsed_recurse_check *recurses, compile_block *cb)
{
int max;
int errorcode = 0;
int loopcount = 0;
int nestlevel = 0;
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount,
recurses, cb))
if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
return errorcode;
break;
}


@ -304,7 +304,7 @@ Partial match, mark=xx: 123a
/(?<=(?<=a)b)c.*/I
Capture group count = 0
Max lookbehind = 2
Max lookbehind = 1
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
@ -337,7 +337,7 @@ Partial match: abcd
/(?<=(?<=(?<=a)b)c)./I
Capture group count = 0
Max lookbehind = 3
Max lookbehind = 1
Subject length lower bound = 1
123abcXYZ
0: abcX
@ -354,7 +354,7 @@ Subject length lower bound = 1
/(?<=ab((?<=...)cd))./I
Capture group count = 1
Max lookbehind = 5
Max lookbehind = 4
Subject length lower bound = 1
ZabcdX
0: ZabcdX
@ -363,7 +363,7 @@ Subject length lower bound = 1
/(?<=((?<=(?<=ab).))(?1)(?1))./I
Capture group count = 1
Max lookbehind = 3
Max lookbehind = 2
Subject length lower bound = 1
abxZ
0: abxZ


@ -17036,7 +17036,7 @@ Subject length lower bound = 1
/(?<=(?<=a)b)c.*/I
Capture group count = 0
Max lookbehind = 2
Max lookbehind = 1
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
@ -17064,7 +17064,7 @@ Subject length lower bound = 0
/(?<=a(?<=a|ba)c)/I
Capture group count = 0
Max lookbehind = 3
Max lookbehind = 2
May match empty string
Subject length lower bound = 0
@ -17076,7 +17076,7 @@ Subject length lower bound = 0
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
Capture group count = 0
Max lookbehind = 5
Max lookbehind = 4
May match empty string
Subject length lower bound = 0