Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current
difficulty if the whole first segment cannot be retained.
Philip.Hazel 2019-09-04 18:14:54 +00:00
parent 87bc092222
commit 963b570fd0
9 changed files with 868 additions and 915 deletions

@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
parenthesized item may contain multiple branches or other backtracking points,
for example /(a|ab){1}+c/ or /(a+){1}+a/.
13. Nested lookbehinds are now taken into account when computing the maximum
lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
it to 3, because matching looks back 3 characters.
14. For partial matches, pcre2test was always showing the maximum lookbehind
13. For partial matches, pcre2test was always showing the maximum lookbehind
characters, flagged with "<", which is misleading when the lookbehind didn't
actually look behind the start (because it was later in the pattern). Showing
all consulted preceding characters for partial matches is now controlled by the
@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
available only for non-JIT matching, because JIT does not maintain the first
and last consulted characters.
15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
if the end of the subject was encountered in a lookahead (conditional or
otherwise), an atomic group, or a recursion.
16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
17. Check for integer overflow when computing lookbehind lengths. Fixes
16. Check for integer overflow when computing lookbehind lengths. Fixes
Clusterfuzz issue 15636.
18. Implemented non-atomic positive lookaround assertions.
17. Implemented non-atomic positive lookaround assertions.
19. If a lookbehind contained a lookahead that contained another lookbehind
18. If a lookbehind contained a lookahead that contained another lookbehind
within it, the nested lookbehind was not correctly processed. For example, if
/(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
"b".
20. Implemented pcre2_get_match_data_size().
19. Implemented pcre2_get_match_data_size().
21. Two alterations to partial matching (not yet done by JIT):
20. Two alterations to partial matching (not yet done by JIT):
(a) The definition of a partial match is slightly changed: if a pattern
contains any lookbehinds, an empty partial match may be given, because this
@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
(c) An empty string partial hard match can be returned for \z and \Z as it
is documented that they shouldn't match.
22. A branch that started with (*ACCEPT) was not being recognized as one that
21. A branch that started with (*ACCEPT) was not being recognized as one that
could match an empty string.
23. Corrected pcre2_set_character_tables() tables data type: was const unsigned
22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
char * instead of const uint8_t *, as generated by pcre2_maketables().
24. Upgraded to Unicode 12.1.0.
23. Upgraded to Unicode 12.1.0.
25. Add -jitfast command line option to pcre2test (to make all the jit options
24. Add -jitfast command line option to pcre2test (to make all the jit options
available directly).
26. Make pcre2test -C show if libreadline or libedit is supported.
25. Make pcre2test -C show if libreadline or libedit is supported.
28. If the length of one branch of a group exceeded 65535 (the maximum value
26. If the length of one branch of a group exceeded 65535 (the maximum value
that is remembered as a minimum length), the whole group's length was
incorrectly recorded as 65535, leading to incorrect "no match" when start-up
optimizations were in force.
29. The "rightmost consulted character" value was not always correct; in
27. The "rightmost consulted character" value was not always correct; in
particular, if a pattern ended with a negative lookahead, characters that were
inspected in that lookahead were not included.
30. Add the pcre2_maketables_free() function.
28. Add the pcre2_maketables_free() function.
Version 10.33 16-April-2019

@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
<pre>
PCRE2_INFO_MAXLOOKBEHIND
</pre>
Return the largest number of characters (not code units) before the current
matching point that could be inspected while processing a lookbehind assertion
in the pattern. Before release 10.34 this request used to give the largest
value for any individual assertion. Now it takes into account nested
lookbehinds, which can mean that the overall value is greater. For example, the
pattern (?&#60;=a(?&#60;=ba)c) previously returned 2, because that is the length of the
largest individual lookbehind. Now it returns 3, because matching actually
looks back 3 characters.
A lookbehind assertion moves back a certain number of characters (not code
units) when it starts to process each of its branches. This request returns the
largest of these backward moves. The third argument should point to a uint32_t
integer. The simple assertions \b and \B require a one-character lookbehind
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
longer. \A also registers a one-character lookbehind, though it does not
actually inspect the previous character.
</P>
<P>
The third argument should point to a uint32_t integer. This information is
useful when doing multi-segment matching using the partial matching facilities.
Note that the simple assertions \b and \B require a one-character lookbehind.
\A also registers a one-character lookbehind, though it does not actually
inspect the previous character. This is to ensure that at least one character
from the old segment is retained when a new segment is processed. Otherwise, if
there are no lookbehinds in the pattern, \A might match incorrectly at the
start of a second or subsequent segment. There are more details in the
Note that this information is useful for multi-segment matching only
if the pattern contains no nested lookbehinds. For example, the pattern
(?&#60;=a(?&#60;=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
first lookbehind moves back by two characters, matches one character, then the
nested lookbehind also moves back by two characters. This puts the matching
point three characters earlier than it was at the start.
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
documentation for a discussion of multi-segment matching.
<pre>
PCRE2_INFO_MINLENGTH
</pre>

@ -49,7 +49,7 @@ complete match, though the details differ between the two types of matching
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
</P>
<P>
If you want to use partial matching with just-in-time optimized code, as well
If you want to use partial matching with just-in-time optimized code, as well
as setting a partial match option for the matching function, you must also call
<b>pcre2_jit_compile()</b> with one or both of these options:
<pre>
@ -101,7 +101,7 @@ matched string.
</P>
<P>
(2) The pattern contains one or more lookbehind assertions. This condition
exists in case there is a lookbehind that inspects characters before the start
exists in case there is a lookbehind that inspects characters before the start
of the match.
</P>
<P>
@ -171,7 +171,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
partial match is found, without continuing to search for possible complete
matches. This option is "hard" because it prefers an earlier partial match over
a later complete match. For this reason, the assumption is made that the end of
the supplied subject string is not the true end of the available data, which is
the supplied subject string is not the true end of the available data, which is
why \z, \Z, \b, \B, and $ always give a partial match.
</P>
<P>
@ -226,7 +226,7 @@ date:
data&#62; 3juj\=ph
No match
</pre>
This example gives the same results for both hard and soft partial matching
This example gives the same results for both hard and soft partial matching
options. Here is an example where there is a difference:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@ -234,7 +234,7 @@ options. Here is an example where there is a difference:
0: 25jun04
1: jun
data&#62; 25jun04\=ph
Partial match: 25jun04
Partial match: 25jun04
</pre>
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
@ -244,9 +244,12 @@ there is only a partial match.
<P>
PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment
matching possible have been added. The string is searched segment by segment by
calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
results that would happen if the entire string was available for searching.
matching possible have been added. A very long string can be searched segment
by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
</P>
<P>
Special logic must be implemented to handle a matched substring that spans a
@ -256,11 +259,10 @@ changing the match by adding more characters. The PCRE2_NOTBOL option should
also be set for all but the first segment.
</P>
<P>
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the <i>startoffset</i> argument of
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the <i>startoffset</i> argument of
<b>pcre2_match()</b> to begin at the point where the partial match started.
Multi-segment matching is usually used to search for substrings in the middle
of very long sequences, so the patterns are normally not anchored. For example:
For example:
<pre>
re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
data&#62; ...the date is 23ja\=ph
@ -269,51 +271,52 @@ of very long sequences, so the patterns are normally not anchored. For example:
0: 23jan19
1: jan
</pre>
Note the use of the <b>offset</b> modifier to start the new match where the
partial match was found.
Note the use of the <b>offset</b> modifier to start the new match where the
partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
</P>
<P>
In this simple example, the next segment was just added to the one in which the
partial match was found. However, if there are memory constraints, it may be
necessary to discard text that precedes the partial match before adding the
next segment. In cases such as the above, where the pattern does not contain
any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if a pattern contains a lookbehind assertion, characters
If there are memory constraints, you may want to discard text that precedes a
partial match before adding the next segment. Unfortunately, this is not at
present straightforward. In cases such as the above, where the pattern does not
contain any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if the pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the
matching process.
</P>
<P>
The only lookbehind information that is available is the length of the longest
lookbehind in a pattern. This may not, of course, be at the start of the
pattern, but retaining that many characters before the partial match is
sufficient, if not always strictly necessary. The way to do this is as follows:
</P>
<P>
Before doing any matching, find the length of the longest lookbehind in the
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
option. Note that the resulting count is in characters, not code units. After a
partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
while moving back through the code units. Characters before the point you have
now reached can be discarded.
</P>
<P>
For example, if the pattern "(?&#60;=123)abc" is partially matched against the
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
displays a partial match, it indicates the lookbehind characters with '&#60;'
characters if the <b>allusedtext</b> modifier is set:
matching process. When <b>pcre2test</b> displays a partial match, it indicates
these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
<pre>
re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab
&#60;&#60;&#60;
</pre>
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted characters.
However, the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not record the first (or last) consulted characters.
For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
</P>
<P>
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
</P>
<P>
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
</P>
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
<P>
@ -379,11 +382,11 @@ want.
</P>
<P>
If you do want to allow for starting again at the next character, one way of
doing it is to retain the matched part of the segment and try a new complete
match, as described for <b>pcre2_match()</b> above. Another possibility is to
work with two buffers. If a partial match at offset <i>n</i> in the first buffer
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
you can then try a new match starting at offset <i>n+1</i> in the first buffer.
doing it is to retain some or all of the segment and try a new complete match,
as described for <b>pcre2_match()</b> above. Another possibility is to work with
two buffers. If a partial match at offset <i>n</i> in the first buffer is
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
can then try a new match starting at offset <i>n+1</i> in the first buffer.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
@ -396,7 +399,7 @@ Cambridge, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 07 August 2019
Last updated: 04 September 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
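
The pcre2partial text above describes the straightforward multi-segment scheme: a buffer twice the size of a segment, where the first half is discarded after a partial match, the second half is moved to the start, and a new segment is appended. The following is a minimal sketch of only that buffer handling, with no PCRE2 calls. The names `seg_buffer`, `buffer_append`, `buffer_slide` and the tiny `SEG_SIZE` are hypothetical, chosen for illustration. It assumes, as the documentation does, that the partial match starts inside the retained half.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 8   /* hypothetical segment size, kept tiny for illustration */

/* Hypothetical helper type: a buffer that holds at most two segments. */
typedef struct {
    char data[2 * SEG_SIZE];
    size_t used;     /* number of valid bytes in data */
} seg_buffer;

/* Append one segment of at most SEG_SIZE bytes.
   Returns 0 on success, -1 if the segment is too long or does not fit. */
static int buffer_append(seg_buffer *b, const char *seg, size_t len)
{
    assert(b != NULL && seg != NULL);
    if (len > SEG_SIZE || b->used + len > sizeof b->data) return -1;
    memcpy(b->data + b->used, seg, len);
    b->used += len;
    return 0;
}

/* After a partial match: keep only the last SEG_SIZE bytes and move them to
   the start of the buffer. Returns the number of bytes dropped, which the
   caller must subtract from any saved offset (for example the start of the
   partial match that will be passed as startoffset). */
static size_t buffer_slide(seg_buffer *b)
{
    assert(b != NULL);
    size_t drop = b->used > SEG_SIZE ? b->used - SEG_SIZE : 0;
    memmove(b->data, b->data + drop, b->used - drop);
    b->used -= drop;
    return drop;
}
```

In use, the caller would append two segments, run the match with PCRE2_PARTIAL_HARD, and on a partial match call `buffer_slide()`, subtract the returned count from the partial match offset, append the next segment, and match again from the adjusted offset. After a "no match" result the whole buffer can simply be reset by setting `used` to zero.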

File diff suppressed because it is too large

@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
.sp
PCRE2_INFO_MAXLOOKBEHIND
.sp
Return the largest number of characters (not code units) before the current
matching point that could be inspected while processing a lookbehind assertion
in the pattern. Before release 10.34 this request used to give the largest
value for any individual assertion. Now it takes into account nested
lookbehinds, which can mean that the overall value is greater. For example, the
pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
largest individual lookbehind. Now it returns 3, because matching actually
looks back 3 characters.
A lookbehind assertion moves back a certain number of characters (not code
units) when it starts to process each of its branches. This request returns the
largest of these backward moves. The third argument should point to a uint32_t
integer. The simple assertions \eb and \eB require a one-character lookbehind
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
longer. \eA also registers a one-character lookbehind, though it does not
actually inspect the previous character.
.P
The third argument should point to a uint32_t integer. This information is
useful when doing multi-segment matching using the partial matching facilities.
Note that the simple assertions \eb and \eB require a one-character lookbehind.
\eA also registers a one-character lookbehind, though it does not actually
inspect the previous character. This is to ensure that at least one character
from the old segment is retained when a new segment is processed. Otherwise, if
there are no lookbehinds in the pattern, \eA might match incorrectly at the
start of a second or subsequent segment. There are more details in the
Note that this information is useful for multi-segment matching only
if the pattern contains no nested lookbehinds. For example, the pattern
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
first lookbehind moves back by two characters, matches one character, then the
nested lookbehind also moves back by two characters. This puts the matching
point three characters earlier than it was at the start.
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
.\" HREF
\fBpcre2partial\fP
.\"
documentation.
documentation for a discussion of multi-segment matching.
.sp
PCRE2_INFO_MINLENGTH
.sp

@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
.TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2"
@ -25,7 +25,7 @@ options is whether or not a partial match is preferred to an alternative
complete match, though the details differ between the two types of matching
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
.P
If you want to use partial matching with just-in-time optimized code, as well
If you want to use partial matching with just-in-time optimized code, as well
as setting a partial match option for the matching function, you must also call
\fBpcre2_jit_compile()\fP with one or both of these options:
.sp
@ -73,7 +73,7 @@ need not form part of the final matched string; lookbehind assertions and the
matched string.
.P
(2) The pattern contains one or more lookbehind assertions. This condition
exists in case there is a lookbehind that inspects characters before the start
exists in case there is a lookbehind that inspects characters before the start
of the match.
.P
(3) There is a special case when the whole pattern can match an empty string.
@ -139,7 +139,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
partial match is found, without continuing to search for possible complete
matches. This option is "hard" because it prefers an earlier partial match over
a later complete match. For this reason, the assumption is made that the end of
the supplied subject string is not the true end of the available data, which is
the supplied subject string is not the true end of the available data, which is
why \ez, \eZ, \eb, \eB, and $ always give a partial match.
.P
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
@ -192,7 +192,7 @@ date:
data> 3juj\e=ph
No match
.sp
This example gives the same results for both hard and soft partial matching
This example gives the same results for both hard and soft partial matching
options. Here is an example where there is a difference:
.sp
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
@ -200,8 +200,8 @@ options. Here is an example where there is a difference:
0: 25jun04
1: jun
data> 25jun04\e=ph
Partial match: 25jun04
.sp
Partial match: 25jun04
.sp
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
there is only a partial match.
@ -213,9 +213,12 @@ there is only a partial match.
.sp
PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment
matching possible have been added. The string is searched segment by segment by
calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
results that would happen if the entire string was available for searching.
matching possible have been added. A very long string can be searched segment
by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
.P
Special logic must be implemented to handle a matched substring that spans a
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
@ -223,11 +226,10 @@ partial match at the end of a segment whenever there is the possibility of
changing the match by adding more characters. The PCRE2_NOTBOL option should
also be set for all but the first segment.
.P
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the \fIstartoffset\fP argument of
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the \fIstartoffset\fP argument of
\fBpcre2_match()\fP to begin at the point where the partial match started.
Multi-segment matching is usually used to search for substrings in the middle
of very long sequences, so the patterns are normally not anchored. For example:
For example:
.sp
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
data> ...the date is 23ja\e=ph
@ -236,48 +238,49 @@ of very long sequences, so the patterns are normally not anchored. For example:
0: 23jan19
1: jan
.sp
Note the use of the \fBoffset\fP modifier to start the new match where the
partial match was found.
Note the use of the \fBoffset\fP modifier to start the new match where the
partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
.P
In this simple example, the next segment was just added to the one in which the
partial match was found. However, if there are memory constraints, it may be
necessary to discard text that precedes the partial match before adding the
next segment. In cases such as the above, where the pattern does not contain
any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if a pattern contains a lookbehind assertion, characters
If there are memory constraints, you may want to discard text that precedes a
partial match before adding the next segment. Unfortunately, this is not at
present straightforward. In cases such as the above, where the pattern does not
contain any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if the pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the
matching process.
.P
The only lookbehind information that is available is the length of the longest
lookbehind in a pattern. This may not, of course, be at the start of the
pattern, but retaining that many characters before the partial match is
sufficient, if not always strictly necessary. The way to do this is as follows:
.P
Before doing any matching, find the length of the longest lookbehind in the
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
option. Note that the resulting count is in characters, not code units. After a
partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
while moving back through the code units. Characters before the point you have
now reached can be discarded.
.P
For example, if the pattern "(?<=123)abc" is partially matched against the
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<'
characters if the \fBallusedtext\fP modifier is set:
matching process. When \fBpcre2test\fP displays a partial match, it indicates
these characters with '<' if the \fBallusedtext\fP modifier is set:
.sp
re> "(?<=123)abc"
data> xx123ab\e=ph,allusedtext
Partial match: 123ab
<<<
.sp
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted characters.
.
.sp
However, the \fPallusedtext\fP modifier is not available for JIT matching,
because JIT matching does not record the first (or last) consulted characters.
For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
.P
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
.P
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
.
.
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
want.
.P
If you do want to allow for starting again at the next character, one way of
doing it is to retain the matched part of the segment and try a new complete
match, as described for \fBpcre2_match()\fP above. Another possibility is to
work with two buffers. If a partial match at offset \fIn\fP in the first buffer
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
you can then try a new match starting at offset \fIn+1\fP in the first buffer.
doing it is to retain some or all of the segment and try a new complete match,
as described for \fBpcre2_match()\fP above. Another possibility is to work with
two buffers. If a partial match at offset \fIn\fP in the first buffer is
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
can then try a new match starting at offset \fIn+1\fP in the first buffer.
.
.
.SH AUTHOR
@ -365,6 +368,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 07 August 2019
Last updated: 04 September 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi
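
The pcre2partial text above notes that in a non-UTF or 32-bit case moving back by a number of characters is just a subtraction, whereas in UTF-8 or UTF-16 the characters have to be counted while stepping back through the code units. The following is a minimal sketch of the UTF-8 case only. The function name `utf8_move_back` is hypothetical and not part of the PCRE2 API, and the subject is assumed to be valid UTF-8. As the revised documentation warns, a count taken from PCRE2_INFO_MAXLOOKBEHIND can underestimate how far a pattern with nested lookbehinds really looks back, so this shows the mechanics, not a guarantee of how much text is safe to discard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: starting at code-unit offset 'offset' in a valid UTF-8
   subject, move back by up to 'nchars' characters and return the new offset.
   The result never goes below zero. */
static size_t utf8_move_back(const uint8_t *subject, size_t offset,
                             uint32_t nchars)
{
    assert(subject != NULL);
    while (nchars > 0 && offset > 0) {
        offset--;                                /* step back one code unit */
        /* Continuation bytes have the form 10xxxxxx; keep stepping back
           until the first byte of the character is reached. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80)
            offset--;
        nchars--;
    }
    return offset;
}
```

For a non-UTF 8-bit subject the equivalent operation is simply `offset > nchars ? offset - nchars : 0`.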

@ -128,12 +128,12 @@ static int
compile_block *, PCRE2_SIZE *);
static int
get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *,
get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
compile_block *);
static BOOL
set_lookbehind_lengths(uint32_t **, int *, int *, int *,
parsed_recurse_check *, compile_block *);
set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
compile_block *);
static int
check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
#define GI_SET_FIXED_LENGTH 0x80000000u
#define GI_NOT_FIXED_LENGTH 0x40000000u
#define GI_FIXED_LENGTH_MASK 0x0000ffffu
#define GI_EXTRA_MASK 0x0fff0000u
#define GI_EXTRA_MAX 0xfff /* NB not unsigned */
#define GI_EXTRA_SHIFT 16
/* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
and is fast (a good compiler can turn it into a subtraction and unsigned
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
Arguments:
pptrptr pointer to pointer in the parsed pattern
isinline FALSE if a reference or recursion; TRUE for inline group
extraptr pointer to where to return extra lookbehind length
errcodeptr pointer to the errorcode
lcptr pointer to the loop counter
group number of captured group or -1 for a non-capturing group
@ -8908,13 +8904,11 @@ Returns: the group length or a negative number
*/
static int
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr,
int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses,
compile_block *cb)
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
int group, parsed_recurse_check *recurses, compile_block *cb)
{
int branchlength;
int grouplength = -1;
int extra = 0;
/* The cache can be used only if there is no possibility of there being two
groups with the same number. We do not need to set the end pointer for a group
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
{
if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
*extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
return groupinfo & GI_FIXED_LENGTH_MASK;
}
}
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
for(;;)
{
int branchextra;
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
recurses, cb);
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
if (branchlength < 0) goto ISNOTFIXED;
if (grouplength == -1)
{
grouplength = branchlength;
extra = branchextra;
}
else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
if (grouplength == -1) grouplength = branchlength;
else if (grouplength != branchlength) goto ISNOTFIXED;
if (**pptrptr == META_KET) break;
*pptrptr += 1; /* Skip META_ALT */
}
/* There are only 12 bits for caching the extra value, but a pattern that
needs more than that is weird indeed. */
if (group > 0 && extra <= GI_EXTRA_MAX)
cb->groupinfo[group] |= (uint32_t)
(GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
*extraptr = extra;
if (group > 0)
cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
return grouplength;
ISNOTFIXED:
@ -8973,17 +8954,11 @@ return -1;
*************************************************/
/* Return a fixed length for a branch in a lookbehind, giving an error if the
length is not fixed. We also take note of any extra value that is generated
from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual
lookbehind has length 2, but the max_lookbehind setting must be 3 because
matching inspects 3 characters before the match starting point.
On entry, *pptrptr points to the first element inside the branch. On exit it is
set to point to the ALT or KET.
length is not fixed. On entry, *pptrptr points to the first element inside the
branch. On exit it is set to point to the ALT or KET.
Arguments:
pptrptr pointer to pointer in the parsed pattern
extraptr pointer to where to return extra lookbehind length
errcodeptr pointer to error code
lcptr pointer to loop counter
recurses chain of recurse_check to catch mutual recursion
@ -8993,14 +8968,11 @@ Returns: the length, or a negative value on error
*/
static int
get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr,
get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
parsed_recurse_check *recurses, compile_block *cb)
{
int branchlength = 0;
int grouplength;
int groupextra;
int max;
int extra = 0; /* Additional lookbehind from nesting */
uint32_t lastitemlength = 0;
uint32_t *pptr = *pptrptr;
PCRE2_SIZE offset;
@ -9149,17 +9121,13 @@ for (;; pptr++)
break;
/* A nested lookbehind does not contribute any length to this lookbehind,
but must itself be checked and have its lengths set. If the maximum
lookbehind for the nested lookbehind is greater than the length so far
computed for this branch, we must compute an extra value and keep the
largest encountered for use when setting the maximum overall lookbehind. */
but must itself be checked and have its lengths set. */
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
return -1;
if (max - branchlength > extra) extra = max - branchlength;
break;
/* Back references and recursions are handled by very similar code. At this
@ -9267,14 +9235,15 @@ for (;; pptr++)
in the cache. */
gptr++;
grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr,
group, &this_recurse, cb);
grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group,
&this_recurse, cb);
if (grouplength < 0)
{
if (*errcodeptr == 0) goto ISNOTFIXED;
return -1; /* Error already set */
}
goto OK_GROUP;
itemlength = grouplength;
break;
/* Check nested groups - advance past the initial data for each type and
then seek a fixed length with get_grouplength(). */
@ -9304,16 +9273,10 @@ for (;; pptr++)
case META_SCRIPT_RUN:
pptr++;
CHECK_GROUP:
grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr,
group, recurses, cb);
grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group,
recurses, cb);
if (grouplength < 0) return -1;
/* A nested lookbehind within the group may require looking back further
than the length of the group. */
OK_GROUP:
itemlength = grouplength;
if (groupextra - branchlength > extra) extra = groupextra - branchlength;
break;
/* Exact repetition is OK; variable repetition is not. A repetition of zero
@ -9374,7 +9337,6 @@ for (;; pptr++)
EXIT:
*pptrptr = pptr;
*extraptr = extra;
return branchlength;
PARSED_SKIP_FAILED:
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.
Arguments:
pptrptr pointer to pointer in the parsed pattern
maxptr where to return maximum lookbehind for the whole group
errcodeptr pointer to error code
lcptr pointer to loop counter
recurses chain of recurse_check to catch mutual recursion
@ -9411,13 +9372,11 @@ Returns: TRUE if all is well
*/
static BOOL
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
parsed_recurse_check *recurses, compile_block *cb)
{
PCRE2_SIZE offset;
int branchlength;
int branchextra;
int max = 0;
uint32_t *bptr = *pptrptr;
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr); /* Offset for error messages */
do
{
*pptrptr += 1;
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
recurses, cb);
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
if (branchlength < 0)
{
/* The errorcode and offset may already be set from a nested lookbehind. */
@ -9435,14 +9393,12 @@ do
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
return FALSE;
}
if (branchlength + branchextra > max) max = branchlength + branchextra;
if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
*bptr |= branchlength; /* branchlength never more than 65535 */
bptr = *pptrptr;
}
while (*bptr == META_ALT);
if (max > cb->max_lookbehind) cb->max_lookbehind = max;
*maxptr = max;
return TRUE;
}
@ -9475,7 +9431,6 @@ static int
check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
parsed_recurse_check *recurses, compile_block *cb)
{
int max;
int errorcode = 0;
int loopcount = 0;
int nestlevel = 0;
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount,
recurses, cb))
if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
return errorcode;
break;
}
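
After this change the per-group cache word holds only a "set" flag and the 16-bit fixed length, with no extra field. A minimal sketch of that packing follows. The flag and mask values are copied from the diff above; the helper names `cache_fixed_length` and `cached_fixed_length` are hypothetical and do not appear in the PCRE2 source, and the plain array stands in for `cb->groupinfo`.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the diff above. */
#define GI_SET_FIXED_LENGTH    0x80000000u
#define GI_FIXED_LENGTH_MASK   0x0000ffffu

/* Hypothetical helper: record a fixed length for a numbered group.
The assert reflects the assumption that a length always fits in the
16 bits covered by GI_FIXED_LENGTH_MASK. */
static void
cache_fixed_length(uint32_t *groupinfo, int group, int grouplength)
{
assert(grouplength >= 0 && grouplength <= 0xffff);
groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | (uint32_t)grouplength);
}

/* Hypothetical helper: return the cached length, or -1 if nothing has
been cached for this group yet. */
static int
cached_fixed_length(const uint32_t *groupinfo, int group)
{
uint32_t gi = groupinfo[group];
if ((gi & GI_SET_FIXED_LENGTH) == 0) return -1;
return (int)(gi & GI_FIXED_LENGTH_MASK);
}
```

With a zero-initialised array, `cached_fixed_length()` returns -1 for a group until `cache_fixed_length()` has been called for it, after which it returns the stored length.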


@ -304,7 +304,7 @@ Partial match, mark=xx: 123a
/(?<=(?<=a)b)c.*/I
Capture group count = 0
Max lookbehind = 2
Max lookbehind = 1
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
@ -337,7 +337,7 @@ Partial match: abcd
/(?<=(?<=(?<=a)b)c)./I
Capture group count = 0
Max lookbehind = 3
Max lookbehind = 1
Subject length lower bound = 1
123abcXYZ
0: abcX
@ -354,7 +354,7 @@ Subject length lower bound = 1
/(?<=ab((?<=...)cd))./I
Capture group count = 1
Max lookbehind = 5
Max lookbehind = 4
Subject length lower bound = 1
ZabcdX
0: ZabcdX
@ -363,7 +363,7 @@ Subject length lower bound = 1
/(?<=((?<=(?<=ab).))(?1)(?1))./I
Capture group count = 1
Max lookbehind = 3
Max lookbehind = 2
Subject length lower bound = 1
abxZ
0: abxZ


@ -17036,7 +17036,7 @@ Subject length lower bound = 1
/(?<=(?<=a)b)c.*/I
Capture group count = 0
Max lookbehind = 2
Max lookbehind = 1
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
@ -17064,7 +17064,7 @@ Subject length lower bound = 0
/(?<=a(?<=a|ba)c)/I
Capture group count = 0
Max lookbehind = 3
Max lookbehind = 2
May match empty string
Subject length lower bound = 0
@ -17076,7 +17076,7 @@ Subject length lower bound = 0
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
Capture group count = 0
Max lookbehind = 5
Max lookbehind = 4
May match empty string
Subject length lower bound = 0