Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current
difficulty if the whole first segment cannot be retained.
Philip.Hazel 2019-09-04 18:14:54 +00:00
parent 87bc092222
commit 963b570fd0
9 changed files with 868 additions and 915 deletions

@@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
parenthesized item may contain multiple branches or other backtracking points, parenthesized item may contain multiple branches or other backtracking points,
for example /(a|ab){1}+c/ or /(a+){1}+a/. for example /(a|ab){1}+c/ or /(a+){1}+a/.
13. Nested lookbehinds are now taken into account when computing the maximum 13. For partial matches, pcre2test was always showing the maximum lookbehind
lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
it to 3, because matching looks back 3 characters.
14. For partial matches, pcre2test was always showing the maximum lookbehind
characters, flagged with "<", which is misleading when the lookbehind didn't characters, flagged with "<", which is misleading when the lookbehind didn't
actually look behind the start (because it was later in the pattern). Showing actually look behind the start (because it was later in the pattern). Showing
all consulted preceding characters for partial matches is now controlled by the all consulted preceding characters for partial matches is now controlled by the
@@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
available only for non-JIT matching, because JIT does not maintain the first available only for non-JIT matching, because JIT does not maintain the first
and last consulted characters. and last consulted characters.
15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match 14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
if the end of the subject was encountered in a lookahead (conditional or if the end of the subject was encountered in a lookahead (conditional or
otherwise), an atomic group, or a recursion. otherwise), an atomic group, or a recursion.
16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero. 15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
17. Check for integer overflow when computing lookbehind lengths. Fixes 16. Check for integer overflow when computing lookbehind lengths. Fixes
Clusterfuzz issue 15636. Clusterfuzz issue 15636.
18. Implemented non-atomic positive lookaround assertions. 17. Implemented non-atomic positive lookaround assertions.
19. If a lookbehind contained a lookahead that contained another lookbehind 18. If a lookbehind contained a lookahead that contained another lookbehind
within it, the nested lookbehind was not correctly processed. For example, if within it, the nested lookbehind was not correctly processed. For example, if
/(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching /(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
"b". "b".
20. Implemented pcre2_get_match_data_size(). 19. Implemented pcre2_get_match_data_size().
21. Two alterations to partial matching (not yet done by JIT): 20. Two alterations to partial matching (not yet done by JIT):
(a) The definition of a partial match is slightly changed: if a pattern (a) The definition of a partial match is slightly changed: if a pattern
contains any lookbehinds, an empty partial match may be given, because this contains any lookbehinds, an empty partial match may be given, because this
@@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
(c) An empty string partial hard match can be returned for \z and \Z as it (c) An empty string partial hard match can be returned for \z and \Z as it
is documented that they shouldn't match. is documented that they shouldn't match.
22. A branch that started with (*ACCEPT) was not being recognized as one that 21. A branch that started with (*ACCEPT) was not being recognized as one that
could match an empty string. could match an empty string.
23. Corrected pcre2_set_character_tables() tables data type: was const unsigned 22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
char * instead of const uint8_t *, as generated by pcre2_maketables(). char * instead of const uint8_t *, as generated by pcre2_maketables().
24. Upgraded to Unicode 12.1.0. 23. Upgraded to Unicode 12.1.0.
25. Add -jitfast command line option to pcre2test (to make all the jit options 24. Add -jitfast command line option to pcre2test (to make all the jit options
available directly). available directly).
26. Make pcre2test -C show if libreadline or libedit is supported. 25. Make pcre2test -C show if libreadline or libedit is supported.
28. If the length of one branch of a group exceeded 65535 (the maximum value 26. If the length of one branch of a group exceeded 65535 (the maximum value
that is remembered as a minimum length), the whole group's length was that is remembered as a minimum length), the whole group's length was
incorrectly recorded as 65535, leading to incorrect "no match" when start-up incorrectly recorded as 65535, leading to incorrect "no match" when start-up
optimizations were in force. optimizations were in force.
29. The "rightmost consulted character" value was not always correct; in 27. The "rightmost consulted character" value was not always correct; in
particular, if a pattern ended with a negative lookahead, characters that were particular, if a pattern ended with a negative lookahead, characters that were
inspected in that lookahead were not included. inspected in that lookahead were not included.
30. Add the pcre2_maketables_free() function. 28. Add the pcre2_maketables_free() function.
Version 10.33 16-April-2019 Version 10.33 16-April-2019

@@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
<pre> <pre>
PCRE2_INFO_MAXLOOKBEHIND PCRE2_INFO_MAXLOOKBEHIND
</pre> </pre>
Return the largest number of characters (not code units) before the current A lookbehind assertion moves back a certain number of characters (not code
matching point that could be inspected while processing a lookbehind assertion units) when it starts to process each of its branches. This request returns the
in the pattern. Before release 10.34 this request used to give the largest largest of these backward moves. The third argument should point to a uint32_t
value for any individual assertion. Now it takes into account nested integer. The simple assertions \b and \B require a one-character lookbehind
lookbehinds, which can mean that the overall value is greater. For example, the and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
pattern (?&#60;=a(?&#60;=ba)c) previously returned 2, because that is the length of the longer. \A also registers a one-character lookbehind, though it does not
largest individual lookbehind. Now it returns 3, because matching actually actually inspect the previous character.
looks back 3 characters.
</P> </P>
<P> <P>
The third argument should point to a uint32_t integer. This information is Note that this information is useful for multi-segment matching only
useful when doing multi-segment matching using the partial matching facilities. if the pattern contains no nested lookbehinds. For example, the pattern
Note that the simple assertions \b and \B require a one-character lookbehind. (?&#60;=a(?&#60;=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
\A also registers a one-character lookbehind, though it does not actually first lookbehind moves back by two characters, matches one character, then the
inspect the previous character. This is to ensure that at least one character nested lookbehind also moves back by two characters. This puts the matching
from the old segment is retained when a new segment is processed. Otherwise, if point three characters earlier than it was at the start.
there are no lookbehinds in the pattern, \A might match incorrectly at the PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
start of a second or subsequent segment. There are more details in the
<a href="pcre2partial.html"><b>pcre2partial</b></a> <a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation. documentation for a discussion of multi-segment matching.
<pre> <pre>
PCRE2_INFO_MINLENGTH PCRE2_INFO_MINLENGTH
</pre> </pre>

@@ -49,7 +49,7 @@ complete match, though the details differ between the two types of matching
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence. function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
</P> </P>
<P> <P>
If you want to use partial matching with just-in-time optimized code, as well If you want to use partial matching with just-in-time optimized code, as well
as setting a partial match option for the matching function, you must also call as setting a partial match option for the matching function, you must also call
<b>pcre2_jit_compile()</b> with one or both of these options: <b>pcre2_jit_compile()</b> with one or both of these options:
<pre> <pre>
@@ -101,7 +101,7 @@ matched string.
</P> </P>
<P> <P>
(2) The pattern contains one or more lookbehind assertions. This condition (2) The pattern contains one or more lookbehind assertions. This condition
exists in case there is a lookbehind that inspects characters before the start exists in case there is a lookbehind that inspects characters before the start
of the match. of the match.
</P> </P>
<P> <P>
@@ -171,7 +171,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
partial match is found, without continuing to search for possible complete partial match is found, without continuing to search for possible complete
matches. This option is "hard" because it prefers an earlier partial match over matches. This option is "hard" because it prefers an earlier partial match over
a later complete match. For this reason, the assumption is made that the end of a later complete match. For this reason, the assumption is made that the end of
the supplied subject string is not the true end of the available data, which is the supplied subject string is not the true end of the available data, which is
why \z, \Z, \b, \B, and $ always give a partial match. why \z, \Z, \b, \B, and $ always give a partial match.
</P> </P>
<P> <P>
@@ -226,7 +226,7 @@ date:
data&#62; 3juj\=ph data&#62; 3juj\=ph
No match No match
</pre> </pre>
This example gives the same results for both hard and soft partial matching This example gives the same results for both hard and soft partial matching
options. Here is an example where there is a difference: options. Here is an example where there is a difference:
<pre> <pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -234,7 +234,7 @@ options. Here is an example where there is a difference:
0: 25jun04 0: 25jun04
1: jun 1: jun
data&#62; 25jun04\=ph data&#62; 25jun04\=ph
Partial match: 25jun04 Partial match: 25jun04
</pre> </pre>
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
@@ -244,9 +244,12 @@ there is only a partial match.
<P> <P>
PCRE was not originally designed with multi-segment matching in mind. However, PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment over time, features (including partial matching) that make multi-segment
matching possible have been added. The string is searched segment by segment by matching possible have been added. A very long string can be searched segment
calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
results that would happen if the entire string was available for searching. the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
</P> </P>
<P> <P>
Special logic must be implemented to handle a matched substring that spans a Special logic must be implemented to handle a matched substring that spans a
@@ -256,11 +259,10 @@ changing the match by adding more characters. The PCRE2_NOTBOL option should
also be set for all but the first segment. also be set for all but the first segment.
</P> </P>
<P> <P>
When a partial match occurs, the next segment must be added to the current When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the <i>startoffset</i> argument of subject and the match re-run, using the <i>startoffset</i> argument of
<b>pcre2_match()</b> to begin at the point where the partial match started. <b>pcre2_match()</b> to begin at the point where the partial match started.
Multi-segment matching is usually used to search for substrings in the middle For example:
of very long sequences, so the patterns are normally not anchored. For example:
<pre> <pre>
re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
data&#62; ...the date is 23ja\=ph data&#62; ...the date is 23ja\=ph
@@ -269,51 +271,52 @@ of very long sequences, so the patterns are normally not anchored. For example:
0: 23jan19 0: 23jan19
1: jan 1: jan
</pre> </pre>
Note the use of the <b>offset</b> modifier to start the new match where the Note the use of the <b>offset</b> modifier to start the new match where the
partial match was found. partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
</P> </P>
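As a minimal sketch of the double-size buffer approach described in the new text above, the C fragment below shows only the buffer manipulation that follows a partial match. The helper name slide_and_append() is hypothetical and is not part of the PCRE2 API. It assumes that every segment has the same length and that the buffer holds exactly two segments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. buf holds 2*seg_len code units.
   Discard the first half, move the second half to the start, and copy the
   next segment into the freed second half. The return value is the amount
   by which every offset into the buffer has shifted downwards. */
static size_t slide_and_append(char *buf, size_t seg_len, const char *next_seg)
{
  assert(buf != NULL && next_seg != NULL);
  memmove(buf, buf + seg_len, seg_len);     /* second half -> first half */
  memcpy(buf + seg_len, next_seg, seg_len); /* new segment -> second half */
  return seg_len;
}
```

If the partial match started at offset n in the old buffer (with n at least seg_len), it now starts at n minus the returned shift, and that adjusted value is what would be passed as the startoffset argument when the match is repeated.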
<P> <P>
In this simple example, the next segment was just added to the one in which the If there are memory constraints, you may want to discard text that precedes a
partial match was found. However, if there are memory constraints, it may be partial match before adding the next segment. Unfortunately, this is not at
necessary to discard text that precedes the partial match before adding the present straightforward. In cases such as the above, where the pattern does not
next segment. In cases such as the above, where the pattern does not contain contain any lookbehinds, it is sufficient to retain only the partially matched
any lookbehinds, it is sufficient to retain only the partially matched substring. However, if the pattern contains a lookbehind assertion, characters
substring. However, if a pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the that precede the start of the partial match may have been inspected during the
matching process. matching process. When <b>pcre2test</b> displays a partial match, it indicates
</P> these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
<P>
The only lookbehind information that is available is the length of the longest
lookbehind in a pattern. This may not, of course, be at the start of the
pattern, but retaining that many characters before the partial match is
sufficient, if not always strictly necessary. The way to do this is as follows:
</P>
<P>
Before doing any matching, find the length of the longest lookbehind in the
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
option. Note that the resulting count is in characters, not code units. After a
partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
while moving back through the code units. Characters before the point you have
now reached can be discarded.
</P>
<P>
For example, if the pattern "(?&#60;=123)abc" is partially matched against the
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
displays a partial match, it indicates the lookbehind characters with '&#60;'
characters if the <b>allusedtext</b> modifier is set:
<pre> <pre>
re&#62; "(?&#60;=123)abc" re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph,allusedtext data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab Partial match: 123ab
&#60;&#60;&#60; &#60;&#60;&#60;
</pre> </pre>
Note that the \fPallusedtext\fP modifier is not available for JIT matching, However, the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted characters. because JIT matching does not record the first (or last) consulted characters.
For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
</P>
<P>
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
</P>
<P>
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
</P> </P>
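The remark above about counting characters while moving back through UTF-8 code units can be sketched as follows. The function utf8_move_back() is a hypothetical helper, not a PCRE2 function, and it assumes that the subject is valid UTF-8. As the surrounding text warns, a character count such as the one from PCRE2_INFO_MAXLOOKBEHIND is only a heuristic when lookbehinds are nested, so the resulting offset should not be treated as an exact retention point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Starting at code-unit offset pos
   in a valid UTF-8 subject, move back by up to nchars characters and return
   the resulting code-unit offset. The result never goes below zero. */
static size_t utf8_move_back(const uint8_t *subject, size_t pos,
  uint32_t nchars)
{
  assert(subject != NULL);
  while (nchars > 0 && pos > 0)
    {
    pos--;
    /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
    while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
    nchars--;
    }
  return pos;
}
```

In a non-UTF or 32-bit case the same step reduces to a bounded subtraction: the offset minus the count, or zero if the count is larger.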
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br> <br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
<P> <P>
@@ -379,11 +382,11 @@ want.
</P> </P>
<P> <P>
If you do want to allow for starting again at the next character, one way of If you do want to allow for starting again at the next character, one way of
doing it is to retain the matched part of the segment and try a new complete doing it is to retain some or all of the segment and try a new complete match,
match, as described for <b>pcre2_match()</b> above. Another possibility is to as described for <b>pcre2_match()</b> above. Another possibility is to work with
work with two buffers. If a partial match at offset <i>n</i> in the first buffer two buffers. If a partial match at offset <i>n</i> in the first buffer is
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
you can then try a new match starting at offset <i>n+1</i> in the first buffer. can then try a new match starting at offset <i>n+1</i> in the first buffer.
</P> </P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br> <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P> <P>
@@ -396,7 +399,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br> <br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 07 August 2019 Last updated: 04 September 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

File diff suppressed because it is too large

@@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
.sp .sp
PCRE2_INFO_MAXLOOKBEHIND PCRE2_INFO_MAXLOOKBEHIND
.sp .sp
Return the largest number of characters (not code units) before the current A lookbehind assertion moves back a certain number of characters (not code
matching point that could be inspected while processing a lookbehind assertion units) when it starts to process each of its branches. This request returns the
in the pattern. Before release 10.34 this request used to give the largest largest of these backward moves. The third argument should point to a uint32_t
value for any individual assertion. Now it takes into account nested integer. The simple assertions \eb and \eB require a one-character lookbehind
lookbehinds, which can mean that the overall value is greater. For example, the and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the longer. \eA also registers a one-character lookbehind, though it does not
largest individual lookbehind. Now it returns 3, because matching actually actually inspect the previous character.
looks back 3 characters.
.P .P
The third argument should point to a uint32_t integer. This information is Note that this information is useful for multi-segment matching only
useful when doing multi-segment matching using the partial matching facilities. if the pattern contains no nested lookbehinds. For example, the pattern
Note that the simple assertions \eb and \eB require a one-character lookbehind. (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
\eA also registers a one-character lookbehind, though it does not actually first lookbehind moves back by two characters, matches one character, then the
inspect the previous character. This is to ensure that at least one character nested lookbehind also moves back by two characters. This puts the matching
from the old segment is retained when a new segment is processed. Otherwise, if point three characters earlier than it was at the start.
there are no lookbehinds in the pattern, \eA might match incorrectly at the PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
start of a second or subsequent segment. There are more details in the
.\" HREF .\" HREF
\fBpcre2partial\fP \fBpcre2partial\fP
.\" .\"
documentation. documentation for a discussion of multi-segment matching.
.sp .sp
PCRE2_INFO_MINLENGTH PCRE2_INFO_MINLENGTH
.sp .sp

@@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34" .TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2" .SH "PARTIAL MATCHING IN PCRE2"
@@ -25,7 +25,7 @@ options is whether or not a partial match is preferred to an alternative
complete match, though the details differ between the two types of matching complete match, though the details differ between the two types of matching
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence. function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
.P .P
If you want to use partial matching with just-in-time optimized code, as well If you want to use partial matching with just-in-time optimized code, as well
as setting a partial match option for the matching function, you must also call as setting a partial match option for the matching function, you must also call
\fBpcre2_jit_compile()\fP with one or both of these options: \fBpcre2_jit_compile()\fP with one or both of these options:
.sp .sp
@@ -73,7 +73,7 @@ need not form part of the final matched string; lookbehind assertions and the
matched string. matched string.
.P .P
(2) The pattern contains one or more lookbehind assertions. This condition (2) The pattern contains one or more lookbehind assertions. This condition
exists in case there is a lookbehind that inspects characters before the start exists in case there is a lookbehind that inspects characters before the start
of the match. of the match.
.P .P
(3) There is a special case when the whole pattern can match an empty string. (3) There is a special case when the whole pattern can match an empty string.
@@ -139,7 +139,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
partial match is found, without continuing to search for possible complete partial match is found, without continuing to search for possible complete
matches. This option is "hard" because it prefers an earlier partial match over matches. This option is "hard" because it prefers an earlier partial match over
a later complete match. For this reason, the assumption is made that the end of a later complete match. For this reason, the assumption is made that the end of
the supplied subject string is not the true end of the available data, which is the supplied subject string is not the true end of the available data, which is
why \ez, \eZ, \eb, \eB, and $ always give a partial match. why \ez, \eZ, \eb, \eB, and $ always give a partial match.
.P .P
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
@@ -192,7 +192,7 @@ date:
data> 3juj\e=ph data> 3juj\e=ph
No match No match
.sp .sp
This example gives the same results for both hard and soft partial matching This example gives the same results for both hard and soft partial matching
options. Here is an example where there is a difference: options. Here is an example where there is a difference:
.sp .sp
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/ re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
@@ -200,8 +200,8 @@ options. Here is an example where there is a difference:
0: 25jun04 0: 25jun04
1: jun 1: jun
data> 25jun04\e=ph data> 25jun04\e=ph
Partial match: 25jun04 Partial match: 25jun04
.sp .sp
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
there is only a partial match. there is only a partial match.
@@ -213,9 +213,12 @@ there is only a partial match.
.sp .sp
PCRE was not originally designed with multi-segment matching in mind. However, PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment over time, features (including partial matching) that make multi-segment
matching possible have been added. The string is searched segment by segment by matching possible have been added. A very long string can be searched segment
calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
results that would happen if the entire string was available for searching. the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
.P .P
Special logic must be implemented to handle a matched substring that spans a Special logic must be implemented to handle a matched substring that spans a
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
@@ -223,11 +226,10 @@ partial match at the end of a segment whenever there is the possibility of
changing the match by adding more characters. The PCRE2_NOTBOL option should changing the match by adding more characters. The PCRE2_NOTBOL option should
also be set for all but the first segment. also be set for all but the first segment.
.P .P
When a partial match occurs, the next segment must be added to the current When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the \fIstartoffset\fP argument of subject and the match re-run, using the \fIstartoffset\fP argument of
\fBpcre2_match()\fP to begin at the point where the partial match started. \fBpcre2_match()\fP to begin at the point where the partial match started.
Multi-segment matching is usually used to search for substrings in the middle For example:
of very long sequences, so the patterns are normally not anchored. For example:
.sp .sp
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/ re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
data> ...the date is 23ja\e=ph data> ...the date is 23ja\e=ph
@@ -236,48 +238,49 @@ of very long sequences, so the patterns are normally not anchored. For example:
0: 23jan19 0: 23jan19
1: jan 1: jan
.sp .sp
Note the use of the \fBoffset\fP modifier to start the new match where the Note the use of the \fBoffset\fP modifier to start the new match where the
partial match was found. partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
.P .P
In this simple example, the next segment was just added to the one in which the If there are memory constraints, you may want to discard text that precedes a
partial match was found. However, if there are memory constraints, it may be partial match before adding the next segment. Unfortunately, this is not at
necessary to discard text that precedes the partial match before adding the present straightforward. In cases such as the above, where the pattern does not
next segment. In cases such as the above, where the pattern does not contain contain any lookbehinds, it is sufficient to retain only the partially matched
any lookbehinds, it is sufficient to retain only the partially matched substring. However, if the pattern contains a lookbehind assertion, characters
substring. However, if a pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the that precede the start of the partial match may have been inspected during the
matching process. matching process. When \fBpcre2test\fP displays a partial match, it indicates
.P these characters with '<' if the \fBallusedtext\fP modifier is set:
The only lookbehind information that is available is the length of the longest
lookbehind in a pattern. This may not, of course, be at the start of the
pattern, but retaining that many characters before the partial match is
sufficient, if not always strictly necessary. The way to do this is as follows:
.P
Before doing any matching, find the length of the longest lookbehind in the
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
option. Note that the resulting count is in characters, not code units. After a
partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
while moving back through the code units. Characters before the point you have
now reached can be discarded.
.P
For example, if the pattern "(?<=123)abc" is partially matched against the
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<'
characters if the \fBallusedtext\fP modifier is set:
.sp .sp
re> "(?<=123)abc" re> "(?<=123)abc"
data> xx123ab\e=ph,allusedtext data> xx123ab\e=ph,allusedtext
Partial match: 123ab Partial match: 123ab
<<< <<<
.sp .sp
Note that the \fPallusedtext\fP modifier is not available for JIT matching, However, the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted characters. because JIT matching does not record the first (or last) consulted characters.
. For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
.P
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
.P
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
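The character counting mentioned above can be sketched in C for the UTF-8 case. The function utf8_move_back is a hypothetical helper, not part of the PCRE2 API, and it assumes that the subject is valid UTF-8. The count passed in would typically be a value chosen by the application, for example one based on PCRE2_INFO_MAXLOOKBEHIND, bearing in mind that this value can underestimate what nested lookbehinds inspect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Starting at code unit "offset" in a
valid UTF-8 subject, move back by up to "nchars" characters and return the
resulting code unit offset. The result never goes below zero. */

static size_t
utf8_move_back(const uint8_t *subject, size_t offset, uint32_t nchars)
{
assert(subject != NULL);
while (nchars-- > 0 && offset > 0)
  {
  offset--;
  /* Continuation bytes have the form 10xxxxxx; skip back over them to reach
  the first byte of the character. */
  while (offset > 0 && (subject[offset] & 0xc0) == 0x80) offset--;
  }
return offset;
}
```

For an ASCII-only subject this reduces to a subtraction that is clamped at zero. In a non-UTF or 32-bit case no such loop is needed.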
. .
. .
.SH "PARTIAL MATCHING USING pcre2_dfa_match()" .SH "PARTIAL MATCHING USING pcre2_dfa_match()"
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
want. want.
.P .P
If you do want to allow for starting again at the next character, one way of If you do want to allow for starting again at the next character, one way of
doing it is to retain the matched part of the segment and try a new complete doing it is to retain some or all of the segment and try a new complete match,
match, as described for \fBpcre2_match()\fP above. Another possibility is to as described for \fBpcre2_match()\fP above. Another possibility is to work with
work with two buffers. If a partial match at offset \fIn\fP in the first buffer two buffers. If a partial match at offset \fIn\fP in the first buffer is
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
you can then try a new match starting at offset \fIn+1\fP in the first buffer. can then try a new match starting at offset \fIn+1\fP in the first buffer.
. .
. .
.SH AUTHOR .SH AUTHOR
@ -365,6 +368,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 07 August 2019 Last updated: 04 September 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi


@ -128,12 +128,12 @@ static int
compile_block *, PCRE2_SIZE *); compile_block *, PCRE2_SIZE *);
static int static int
get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *, get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
compile_block *); compile_block *);
static BOOL static BOOL
set_lookbehind_lengths(uint32_t **, int *, int *, int *, set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
parsed_recurse_check *, compile_block *); compile_block *);
static int static int
check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *, check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
#define GI_SET_FIXED_LENGTH 0x80000000u #define GI_SET_FIXED_LENGTH 0x80000000u
#define GI_NOT_FIXED_LENGTH 0x40000000u #define GI_NOT_FIXED_LENGTH 0x40000000u
#define GI_FIXED_LENGTH_MASK 0x0000ffffu #define GI_FIXED_LENGTH_MASK 0x0000ffffu
#define GI_EXTRA_MASK 0x0fff0000u
#define GI_EXTRA_MAX 0xfff /* NB not unsigned */
#define GI_EXTRA_SHIFT 16
/* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC /* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
and is fast (a good compiler can turn it into a subtraction and unsigned and is fast (a good compiler can turn it into a subtraction and unsigned
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
Arguments: Arguments:
pptrptr pointer to pointer in the parsed pattern pptrptr pointer to pointer in the parsed pattern
isinline FALSE if a reference or recursion; TRUE for inline group isinline FALSE if a reference or recursion; TRUE for inline group
extraptr pointer to where to return extra lookbehind length
errcodeptr pointer to the errorcode errcodeptr pointer to the errorcode
lcptr pointer to the loop counter lcptr pointer to the loop counter
group number of captured group or -1 for a non-capturing group group number of captured group or -1 for a non-capturing group
@ -8908,13 +8904,11 @@ Returns: the group length or a negative number
*/ */
static int static int
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr, get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses, int group, parsed_recurse_check *recurses, compile_block *cb)
compile_block *cb)
{ {
int branchlength; int branchlength;
int grouplength = -1; int grouplength = -1;
int extra = 0;
/* The cache can be used only if there is no possibility of there being two /* The cache can be used only if there is no possibility of there being two
groups with the same number. We do not need to set the end pointer for a group groups with the same number. We do not need to set the end pointer for a group
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
if ((groupinfo & GI_SET_FIXED_LENGTH) != 0) if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
{ {
if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET); if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
*extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
return groupinfo & GI_FIXED_LENGTH_MASK; return groupinfo & GI_FIXED_LENGTH_MASK;
} }
} }
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
for(;;) for(;;)
{ {
int branchextra; branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
recurses, cb);
if (branchlength < 0) goto ISNOTFIXED; if (branchlength < 0) goto ISNOTFIXED;
if (grouplength == -1) if (grouplength == -1) grouplength = branchlength;
{ else if (grouplength != branchlength) goto ISNOTFIXED;
grouplength = branchlength;
extra = branchextra;
}
else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
if (**pptrptr == META_KET) break; if (**pptrptr == META_KET) break;
*pptrptr += 1; /* Skip META_ALT */ *pptrptr += 1; /* Skip META_ALT */
} }
/* There are only 12 bits for caching the extra value, but a pattern that if (group > 0)
needs more than that is weird indeed. */ cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
if (group > 0 && extra <= GI_EXTRA_MAX)
cb->groupinfo[group] |= (uint32_t)
(GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
*extraptr = extra;
return grouplength; return grouplength;
ISNOTFIXED: ISNOTFIXED:
@ -8973,17 +8954,11 @@ return -1;
*************************************************/ *************************************************/
/* Return a fixed length for a branch in a lookbehind, giving an error if the /* Return a fixed length for a branch in a lookbehind, giving an error if the
length is not fixed. We also take note of any extra value that is generated length is not fixed. On entry, *pptrptr points to the first element inside the
from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual branch. On exit it is set to point to the ALT or KET.
lookbehind has length 2, but the max_lookbehind setting must be 3 because
matching inspects 3 characters before the match starting point.
On entry, *pptrptr points to the first element inside the branch. On exit it is
set to point to the ALT or KET.
Arguments: Arguments:
pptrptr pointer to pointer in the parsed pattern pptrptr pointer to pointer in the parsed pattern
extraptr pointer to where to return extra lookbehind length
errcodeptr pointer to error code errcodeptr pointer to error code
lcptr pointer to loop counter lcptr pointer to loop counter
recurses chain of recurse_check to catch mutual recursion recurses chain of recurse_check to catch mutual recursion
@ -8993,14 +8968,11 @@ Returns: the length, or a negative value on error
*/ */
static int static int
get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr, get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
parsed_recurse_check *recurses, compile_block *cb) parsed_recurse_check *recurses, compile_block *cb)
{ {
int branchlength = 0; int branchlength = 0;
int grouplength; int grouplength;
int groupextra;
int max;
int extra = 0; /* Additional lookbehind from nesting */
uint32_t lastitemlength = 0; uint32_t lastitemlength = 0;
uint32_t *pptr = *pptrptr; uint32_t *pptr = *pptrptr;
PCRE2_SIZE offset; PCRE2_SIZE offset;
@ -9149,17 +9121,13 @@ for (;; pptr++)
break; break;
/* A nested lookbehind does not contribute any length to this lookbehind, /* A nested lookbehind does not contribute any length to this lookbehind,
but must itself be checked and have its lengths set. If the maximum but must itself be checked and have its lengths set. */
lookbehind for the nested lookbehind is greater than the length so far
computed for this branch, we must compute an extra value and keep the
largest encountered for use when setting the maximum overall lookbehind. */
case META_LOOKBEHIND: case META_LOOKBEHIND:
case META_LOOKBEHINDNOT: case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA: case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb)) if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
return -1; return -1;
if (max - branchlength > extra) extra = max - branchlength;
break; break;
/* Back references and recursions are handled by very similar code. At this /* Back references and recursions are handled by very similar code. At this
@ -9267,14 +9235,15 @@ for (;; pptr++)
in the cache. */ in the cache. */
gptr++; gptr++;
grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr, grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group,
group, &this_recurse, cb); &this_recurse, cb);
if (grouplength < 0) if (grouplength < 0)
{ {
if (*errcodeptr == 0) goto ISNOTFIXED; if (*errcodeptr == 0) goto ISNOTFIXED;
return -1; /* Error already set */ return -1; /* Error already set */
} }
goto OK_GROUP; itemlength = grouplength;
break;
/* Check nested groups - advance past the initial data for each type and /* Check nested groups - advance past the initial data for each type and
then seek a fixed length with get_grouplength(). */ then seek a fixed length with get_grouplength(). */
@ -9304,16 +9273,10 @@ for (;; pptr++)
case META_SCRIPT_RUN: case META_SCRIPT_RUN:
pptr++; pptr++;
CHECK_GROUP: CHECK_GROUP:
grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr, grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group,
group, recurses, cb); recurses, cb);
if (grouplength < 0) return -1; if (grouplength < 0) return -1;
/* A nested lookbehind within the group may require looking back further
than the length of the group. */
OK_GROUP:
itemlength = grouplength; itemlength = grouplength;
if (groupextra - branchlength > extra) extra = groupextra - branchlength;
break; break;
/* Exact repetition is OK; variable repetition is not. A repetition of zero /* Exact repetition is OK; variable repetition is not. A repetition of zero
@ -9374,7 +9337,6 @@ for (;; pptr++)
EXIT: EXIT:
*pptrptr = pptr; *pptrptr = pptr;
*extraptr = extra;
return branchlength; return branchlength;
PARSED_SKIP_FAILED: PARSED_SKIP_FAILED:
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.
Arguments: Arguments:
pptrptr pointer to pointer in the parsed pattern pptrptr pointer to pointer in the parsed pattern
maxptr where to return maximum lookbehind for the whole group
errcodeptr pointer to error code errcodeptr pointer to error code
lcptr pointer to loop counter lcptr pointer to loop counter
recurses chain of recurse_check to catch mutual recursion recurses chain of recurse_check to catch mutual recursion
@ -9411,13 +9372,11 @@ Returns: TRUE if all is well
*/ */
static BOOL static BOOL
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr, set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
int *lcptr, parsed_recurse_check *recurses, compile_block *cb) parsed_recurse_check *recurses, compile_block *cb)
{ {
PCRE2_SIZE offset; PCRE2_SIZE offset;
int branchlength; int branchlength;
int branchextra;
int max = 0;
uint32_t *bptr = *pptrptr; uint32_t *bptr = *pptrptr;
READPLUSOFFSET(offset, bptr); /* Offset for error messages */ READPLUSOFFSET(offset, bptr); /* Offset for error messages */
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr); /* Offset for error messages */
do do
{ {
*pptrptr += 1; *pptrptr += 1;
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr, branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
recurses, cb);
if (branchlength < 0) if (branchlength < 0)
{ {
/* The errorcode and offset may already be set from a nested lookbehind. */ /* The errorcode and offset may already be set from a nested lookbehind. */
@ -9435,14 +9393,12 @@ do
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset; if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
return FALSE; return FALSE;
} }
if (branchlength + branchextra > max) max = branchlength + branchextra; if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
*bptr |= branchlength; /* branchlength never more than 65535 */ *bptr |= branchlength; /* branchlength never more than 65535 */
bptr = *pptrptr; bptr = *pptrptr;
} }
while (*bptr == META_ALT); while (*bptr == META_ALT);
if (max > cb->max_lookbehind) cb->max_lookbehind = max;
*maxptr = max;
return TRUE; return TRUE;
} }
@ -9475,7 +9431,6 @@ static int
check_lookbehinds(uint32_t *pptr, uint32_t **retptr, check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
parsed_recurse_check *recurses, compile_block *cb) parsed_recurse_check *recurses, compile_block *cb)
{ {
int max;
int errorcode = 0; int errorcode = 0;
int loopcount = 0; int loopcount = 0;
int nestlevel = 0; int nestlevel = 0;
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
case META_LOOKBEHIND: case META_LOOKBEHIND:
case META_LOOKBEHINDNOT: case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA: case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
recurses, cb))
return errorcode; return errorcode;
break; break;
} }


@ -304,7 +304,7 @@ Partial match, mark=xx: 123a
/(?<=(?<=a)b)c.*/I /(?<=(?<=a)b)c.*/I
Capture group count = 0 Capture group count = 0
Max lookbehind = 2 Max lookbehind = 1
First code unit = 'c' First code unit = 'c'
Subject length lower bound = 1 Subject length lower bound = 1
abc\=ph abc\=ph
@ -337,7 +337,7 @@ Partial match: abcd
/(?<=(?<=(?<=a)b)c)./I /(?<=(?<=(?<=a)b)c)./I
Capture group count = 0 Capture group count = 0
Max lookbehind = 3 Max lookbehind = 1
Subject length lower bound = 1 Subject length lower bound = 1
123abcXYZ 123abcXYZ
0: abcX 0: abcX
@ -354,7 +354,7 @@ Subject length lower bound = 1
/(?<=ab((?<=...)cd))./I /(?<=ab((?<=...)cd))./I
Capture group count = 1 Capture group count = 1
Max lookbehind = 5 Max lookbehind = 4
Subject length lower bound = 1 Subject length lower bound = 1
ZabcdX ZabcdX
0: ZabcdX 0: ZabcdX
@ -363,7 +363,7 @@ Subject length lower bound = 1
/(?<=((?<=(?<=ab).))(?1)(?1))./I /(?<=((?<=(?<=ab).))(?1)(?1))./I
Capture group count = 1 Capture group count = 1
Max lookbehind = 3 Max lookbehind = 2
Subject length lower bound = 1 Subject length lower bound = 1
abxZ abxZ
0: abxZ 0: abxZ


@ -17036,7 +17036,7 @@ Subject length lower bound = 1
/(?<=(?<=a)b)c.*/I /(?<=(?<=a)b)c.*/I
Capture group count = 0 Capture group count = 0
Max lookbehind = 2 Max lookbehind = 1
First code unit = 'c' First code unit = 'c'
Subject length lower bound = 1 Subject length lower bound = 1
abc\=ph abc\=ph
@ -17064,7 +17064,7 @@ Subject length lower bound = 0
/(?<=a(?<=a|ba)c)/I /(?<=a(?<=a|ba)c)/I
Capture group count = 0 Capture group count = 0
Max lookbehind = 3 Max lookbehind = 2
May match empty string May match empty string
Subject length lower bound = 0 Subject length lower bound = 0
@ -17076,7 +17076,7 @@ Subject length lower bound = 0
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I /(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
Capture group count = 0 Capture group count = 0
Max lookbehind = 5 Max lookbehind = 4
May match empty string May match empty string
Subject length lower bound = 0 Subject length lower bound = 0