Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current difficulty if the whole first segment cannot be retained.
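For context, the documentation changed by this commit recommends, when memory allows, a buffer twice the size of a segment: discard the first half, move the second half to the start, then append the next segment. A minimal sketch of that bookkeeping follows. The names `seg_window`, `window_push` and `SEG_SIZE` are hypothetical and not part of PCRE2, and the call to pcre2_match() is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 8  /* hypothetical segment size, kept tiny for illustration */

/* Window that holds at most two segments: the previous one and the newest. */
typedef struct {
    char   buf[2 * SEG_SIZE];
    size_t len;             /* number of valid bytes in buf */
} seg_window;

/*
 * Add the next segment to the window. If the window already holds more than
 * one segment, the oldest bytes are discarded and the most recent SEG_SIZE
 * bytes are moved to the start before the new segment is appended.
 * Returns the number of bytes discarded, so that the caller can subtract it
 * from any offset (such as the start of a partial match) it has remembered.
 */
static size_t window_push(seg_window *w, const char *seg, size_t seg_len)
{
    size_t dropped = 0;

    assert(seg_len <= SEG_SIZE);
    if (w->len > SEG_SIZE) {
        dropped = w->len - SEG_SIZE;
        memmove(w->buf, w->buf + dropped, SEG_SIZE);
        w->len = SEG_SIZE;
    }
    memcpy(w->buf + w->len, seg, seg_len);
    w->len += seg_len;
    return dropped;
}
```

Pushing the 8-byte segments "the date", " is 23ja" and "n19 ok.." in turn leaves " is 23jan19 ok.." in the window, and the third call reports 8 discarded bytes, so a partial match that started at offset 12 now starts at offset 4. The sketch assumes that a match never spans more than two segments.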
parent 87bc092222
commit 963b570fd0
37
ChangeLog
@@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
 parenthesized item may contain multiple branches or other backtracking points,
 for example /(a|ab){1}+c/ or /(a+){1}+a/.

-13. Nested lookbehinds are now taken into account when computing the maximum
-lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
-lookbehind of 2, because that is the largest individual lookbehind. Now it sets
-it to 3, because matching looks back 3 characters.
-
-14. For partial matches, pcre2test was always showing the maximum lookbehind
+13. For partial matches, pcre2test was always showing the maximum lookbehind
 characters, flagged with "<", which is misleading when the lookbehind didn't
 actually look behind the start (because it was later in the pattern). Showing
 all consulted preceding characters for partial matches is now controlled by the
@@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
 available only for non-JIT matching, because JIT does not maintain the first
 and last consulted characters.

-15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
+14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
 if the end of the subject was encountered in a lookahead (conditional or
 otherwise), an atomic group, or a recursion.

-16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
+15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.

-17. Check for integer overflow when computing lookbehind lengths. Fixes
+16. Check for integer overflow when computing lookbehind lengths. Fixes
 Clusterfuzz issue 15636.

-18. Implemented non-atomic positive lookaround assertions.
+17. Implemented non-atomic positive lookaround assertions.

-19. If a lookbehind contained a lookahead that contained another lookbehind
+18. If a lookbehind contained a lookahead that contained another lookbehind
 within it, the nested lookbehind was not correctly processed. For example, if
 /(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
 "b".

-20. Implemented pcre2_get_match_data_size().
+19. Implemented pcre2_get_match_data_size().

-21. Two alterations to partial matching (not yet done by JIT):
+20. Two alterations to partial matching (not yet done by JIT):

 (a) The definition of a partial match is slightly changed: if a pattern
 contains any lookbehinds, an empty partial match may be given, because this
@@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
 (c) An empty string partial hard match can be returned for \z and \Z as it
 is documented that they shouldn't match.

-22. A branch that started with (*ACCEPT) was not being recognized as one that
+21. A branch that started with (*ACCEPT) was not being recognized as one that
 could match an empty string.

-23. Corrected pcre2_set_character_tables() tables data type: was const unsigned
+22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
 char * instead of const uint8_t *, as generated by pcre2_maketables().

-24. Upgraded to Unicode 12.1.0.
+23. Upgraded to Unicode 12.1.0.

-25. Add -jitfast command line option to pcre2test (to make all the jit options
+24. Add -jitfast command line option to pcre2test (to make all the jit options
 available directly).

-26. Make pcre2test -C show if libreadline or libedit is supported.
+25. Make pcre2test -C show if libreadline or libedit is supported.

-28. If the length of one branch of a group exceeded 65535 (the maximum value
+26. If the length of one branch of a group exceeded 65535 (the maximum value
 that is remembered as a minimum length), the whole group's length was
 incorrectly recorded as 65535, leading to incorrect "no match" when start-up
 optimizations were in force.

-29. The "rightmost consulted character" value was not always correct; in
+27. The "rightmost consulted character" value was not always correct; in
 particular, if a pattern ended with a negative lookahead, characters that were
 inspected in that lookahead were not included.

-30. Add the pcre2_maketables_free() function.
+28. Add the pcre2_maketables_free() function.


 Version 10.33 16-April-2019
@@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
 <pre>
 PCRE2_INFO_MAXLOOKBEHIND
 </pre>
-Return the largest number of characters (not code units) before the current
-matching point that could be inspected while processing a lookbehind assertion
-in the pattern. Before release 10.34 this request used to give the largest
-value for any individual assertion. Now it takes into account nested
-lookbehinds, which can mean that the overall value is greater. For example, the
-pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
-largest individual lookbehind. Now it returns 3, because matching actually
-looks back 3 characters.
+A lookbehind assertion moves back a certain number of characters (not code
+units) when it starts to process each of its branches. This request returns the
+largest of these backward moves. The third argument should point to a uint32_t
+integer. The simple assertions \b and \B require a one-character lookbehind
+and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
+longer. \A also registers a one-character lookbehind, though it does not
+actually inspect the previous character.
 </P>
 <P>
-The third argument should point to a uint32_t integer. This information is
-useful when doing multi-segment matching using the partial matching facilities.
-Note that the simple assertions \b and \B require a one-character lookbehind.
-\A also registers a one-character lookbehind, though it does not actually
-inspect the previous character. This is to ensure that at least one character
-from the old segment is retained when a new segment is processed. Otherwise, if
-there are no lookbehinds in the pattern, \A might match incorrectly at the
-start of a second or subsequent segment. There are more details in the
+Note that this information is useful for multi-segment matching only
+if the pattern contains no nested lookbehinds. For example, the pattern
+(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
+first lookbehind moves back by two characters, matches one character, then the
+nested lookbehind also moves back by two characters. This puts the matching
+point three characters earlier than it was at the start.
+PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
 <a href="pcre2partial.html"><b>pcre2partial</b></a>
-documentation.
+documentation for a discussion of multi-segment matching.
 <pre>
 PCRE2_INFO_MINLENGTH
 </pre>
@@ -49,7 +49,7 @@ complete match, though the details differ between the two types of matching
 function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
 </P>
 <P>
-If you want to use partial matching with just-in-time optimized code, as well
+If you want to use partial matching with just-in-time optimized code, as well
 as setting a partial match option for the matching function, you must also call
 <b>pcre2_jit_compile()</b> with one or both of these options:
 <pre>
@@ -101,7 +101,7 @@ matched string.
 </P>
 <P>
 (2) The pattern contains one or more lookbehind assertions. This condition
-exists in case there is a lookbehind that inspects characters before the start
+exists in case there is a lookbehind that inspects characters before the start
 of the match.
 </P>
 <P>
@@ -171,7 +171,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
 partial match is found, without continuing to search for possible complete
 matches. This option is "hard" because it prefers an earlier partial match over
 a later complete match. For this reason, the assumption is made that the end of
-the supplied subject string is not the true end of the available data, which is
+the supplied subject string is not the true end of the available data, which is
 why \z, \Z, \b, \B, and $ always give a partial match.
 </P>
 <P>
@@ -226,7 +226,7 @@ date:
 data> 3juj\=ph
 No match
 </pre>
-This example gives the same results for both hard and soft partial matching
+This example gives the same results for both hard and soft partial matching
 options. Here is an example where there is a difference:
 <pre>
 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -234,7 +234,7 @@ options. Here is an example where there is a difference:
 0: 25jun04
 1: jun
 data> 25jun04\=ph
-Partial match: 25jun04
+Partial match: 25jun04
 </pre>
 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
@@ -244,9 +244,12 @@ there is only a partial match.
 <P>
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 </P>
 <P>
 Special logic must be implemented to handle a matched substring that spans a
@@ -256,11 +259,10 @@ changing the match by adding more characters. The PCRE2_NOTBOL option should
 also be set for all but the first segment.
 </P>
 <P>
-When a partial match occurs, the next segment must be added to the current
-subject and the match re-run, using the <i>startoffset</i> argument of
+When a partial match occurs, the next segment must be added to the current
+subject and the match re-run, using the <i>startoffset</i> argument of
 <b>pcre2_match()</b> to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 <pre>
 re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
 data> ...the date is 23ja\=ph
@@ -269,51 +271,52 @@ of very long sequences, so the patterns are normally not anchored. For example:
 0: 23jan19
 1: jan
 </pre>
-Note the use of the <b>offset</b> modifier to start the new match where the
-partial match was found.
+Note the use of the <b>offset</b> modifier to start the new match where the
+partial match was found. In this example, the next segment was added to the one
+in which the partial match was found. This is the most straightforward
+approach, typically using a memory buffer that is twice the size of each
+segment. After a partial match, the first half of the buffer is discarded, the
+second half is moved to the start of the buffer, and a new segment is added
+before repeating the match as in the example above. After a no match, the
+entire buffer can be discarded.
 </P>
 <P>
-In this simple example, the next segment was just added to the one in which the
-partial match was found. However, if there are memory constraints, it may be
-necessary to discard text that precedes the partial match before adding the
-next segment. In cases such as the above, where the pattern does not contain
-any lookbehinds, it is sufficient to retain only the partially matched
-substring. However, if a pattern contains a lookbehind assertion, characters
+If there are memory constraints, you may want to discard text that precedes a
+partial match before adding the next segment. Unfortunately, this is not at
+present straightforward. In cases such as the above, where the pattern does not
+contain any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if the pattern contains a lookbehind assertion, characters
 that precede the start of the partial match may have been inspected during the
-matching process.
-</P>
-<P>
-The only lookbehind information that is available is the length of the longest
-lookbehind in a pattern. This may not, of course, be at the start of the
-pattern, but retaining that many characters before the partial match is
-sufficient, if not always strictly necessary. The way to do this is as follows:
-</P>
-<P>
-Before doing any matching, find the length of the longest lookbehind in the
-pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
-option. Note that the resulting count is in characters, not code units. After a
-partial match, moving back from the ovector[0] offset in the subject by the
-number of characters given for the maximum lookbehind gets you to the earliest
-character that must be retained. In a non-UTF or a 32-bit situation, moving
-back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units. Characters before the point you have
-now reached can be discarded.
-</P>
-<P>
-For example, if the pattern "(?<=123)abc" is partially matched against the
-string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
-lookbehind count is 3, so all characters before offset 2 can be discarded. The
-value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
-displays a partial match, it indicates the lookbehind characters with '<'
-characters if the <b>allusedtext</b> modifier is set:
+matching process. When <b>pcre2test</b> displays a partial match, it indicates
+these characters with '<' if the <b>allusedtext</b> modifier is set:
 <pre>
 re> "(?<=123)abc"
 data> xx123ab\=ph,allusedtext
 Partial match: 123ab
 <<<
 </pre>
-Note that the \fPallusedtext\fP modifier is not available for JIT matching,
-because JIT matching does not maintain the first and last consulted characters.
+However, the \fPallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not record the first (or last) consulted characters.
+For this reason, this information is not available via the API. It is therefore
+not possible in general to obtain the exact number of characters that must be
+retained in order to get the right match result. If you cannot retain the
+entire segment, you must find some heuristic way of choosing.
+</P>
+<P>
+If you know the approximate length of the matching substrings, you can use that
+to decide how much text to retain. The only lookbehind information that is
+currently available via the API is the length of the longest individual
+lookbehind in a pattern, but this can be misleading if there are nested
+lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
+PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
+units) that any individual lookbehind moves back when it is processed. A
+pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
+inspects two characters before its starting point.
+</P>
+<P>
+In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
+UTF-8 or UTF-16 you have to count characters while moving back through the code
+units.
 </P>
 <br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
 <P>
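The documentation text in this hunk notes that, in UTF-8, moving back by a number of characters means counting characters while stepping through code units. A minimal sketch of that step, assuming a valid UTF-8 subject, might look like the following. `utf8_move_back` is a hypothetical helper, not a PCRE2 function, and the character count passed to it remains a heuristic for the reasons the documentation gives about nested lookbehinds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Move back by up to nchars characters from code-unit offset `offset` in a
 * valid UTF-8 subject and return the resulting code-unit offset. The walk
 * stops at offset 0 if the subject has fewer preceding characters.
 */
static size_t utf8_move_back(const uint8_t *subject, size_t offset,
                             uint32_t nchars)
{
    while (nchars > 0 && offset > 0) {
        offset--;
        /* Skip continuation bytes (10xxxxxx) to reach the first byte of
           the character. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80)
            offset--;
        nchars--;
    }
    return offset;
}
```

For pure ASCII data this reduces to a subtraction: moving back 3 characters from offset 5 in "xx123ab" gives offset 2. In UTF-16 the analogous loop would skip low surrogates instead of continuation bytes.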
@@ -379,11 +382,11 @@ want.
 </P>
 <P>
 If you do want to allow for starting again at the next character, one way of
-doing it is to retain the matched part of the segment and try a new complete
-match, as described for <b>pcre2_match()</b> above. Another possibility is to
-work with two buffers. If a partial match at offset <i>n</i> in the first buffer
-is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
-you can then try a new match starting at offset <i>n+1</i> in the first buffer.
+doing it is to retain some or all of the segment and try a new complete match,
+as described for <b>pcre2_match()</b> above. Another possibility is to work with
+two buffers. If a partial match at offset <i>n</i> in the first buffer is
+followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
+can then try a new match starting at offset <i>n+1</i> in the first buffer.
 </P>
 <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
 <P>
@@ -396,7 +399,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 07 August 2019
+Last updated: 04 September 2019
 <br>
 Copyright © 1997-2019 University of Cambridge.
 <br>
1346
doc/pcre2.txt
File diff suppressed because it is too large
@@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
 .sp
 PCRE2_INFO_MAXLOOKBEHIND
 .sp
-Return the largest number of characters (not code units) before the current
-matching point that could be inspected while processing a lookbehind assertion
-in the pattern. Before release 10.34 this request used to give the largest
-value for any individual assertion. Now it takes into account nested
-lookbehinds, which can mean that the overall value is greater. For example, the
-pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
-largest individual lookbehind. Now it returns 3, because matching actually
-looks back 3 characters.
+A lookbehind assertion moves back a certain number of characters (not code
+units) when it starts to process each of its branches. This request returns the
+largest of these backward moves. The third argument should point to a uint32_t
+integer. The simple assertions \eb and \eB require a one-character lookbehind
+and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
+longer. \eA also registers a one-character lookbehind, though it does not
+actually inspect the previous character.
 .P
-The third argument should point to a uint32_t integer. This information is
-useful when doing multi-segment matching using the partial matching facilities.
-Note that the simple assertions \eb and \eB require a one-character lookbehind.
-\eA also registers a one-character lookbehind, though it does not actually
-inspect the previous character. This is to ensure that at least one character
-from the old segment is retained when a new segment is processed. Otherwise, if
-there are no lookbehinds in the pattern, \eA might match incorrectly at the
-start of a second or subsequent segment. There are more details in the
+Note that this information is useful for multi-segment matching only
+if the pattern contains no nested lookbehinds. For example, the pattern
+(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
+first lookbehind moves back by two characters, matches one character, then the
+nested lookbehind also moves back by two characters. This puts the matching
+point three characters earlier than it was at the start.
+PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
 .\" HREF
 \fBpcre2partial\fP
 .\"
-documentation.
+documentation for a discussion of multi-segment matching.
 .sp
 PCRE2_INFO_MINLENGTH
 .sp
@@ -1,4 +1,4 @@
-.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
+.TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions
 .SH "PARTIAL MATCHING IN PCRE2"
@@ -25,7 +25,7 @@ options is whether or not a partial match is preferred to an alternative
 complete match, though the details differ between the two types of matching
 function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
 .P
-If you want to use partial matching with just-in-time optimized code, as well
+If you want to use partial matching with just-in-time optimized code, as well
 as setting a partial match option for the matching function, you must also call
 \fBpcre2_jit_compile()\fP with one or both of these options:
 .sp
@@ -73,7 +73,7 @@ need not form part of the final matched string; lookbehind assertions and the
 matched string.
 .P
 (2) The pattern contains one or more lookbehind assertions. This condition
-exists in case there is a lookbehind that inspects characters before the start
+exists in case there is a lookbehind that inspects characters before the start
 of the match.
 .P
 (3) There is a special case when the whole pattern can match an empty string.
@@ -139,7 +139,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
 partial match is found, without continuing to search for possible complete
 matches. This option is "hard" because it prefers an earlier partial match over
 a later complete match. For this reason, the assumption is made that the end of
-the supplied subject string is not the true end of the available data, which is
+the supplied subject string is not the true end of the available data, which is
 why \ez, \eZ, \eb, \eB, and $ always give a partial match.
 .P
 If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
@@ -192,7 +192,7 @@ date:
 data> 3juj\e=ph
 No match
 .sp
-This example gives the same results for both hard and soft partial matching
+This example gives the same results for both hard and soft partial matching
 options. Here is an example where there is a difference:
 .sp
 re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
@@ -200,8 +200,8 @@ options. Here is an example where there is a difference:
 0: 25jun04
 1: jun
 data> 25jun04\e=ph
-Partial match: 25jun04
-.sp
+Partial match: 25jun04
+.sp
 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
 there is only a partial match.
@@ -213,9 +213,12 @@ there is only a partial match.
 .sp
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 .P
 Special logic must be implemented to handle a matched substring that spans a
 segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
@@ -223,11 +226,10 @@ partial match at the end of a segment whenever there is the possibility of
 changing the match by adding more characters. The PCRE2_NOTBOL option should
 also be set for all but the first segment.
 .P
-When a partial match occurs, the next segment must be added to the current
-subject and the match re-run, using the \fIstartoffset\fP argument of
+When a partial match occurs, the next segment must be added to the current
+subject and the match re-run, using the \fIstartoffset\fP argument of
 \fBpcre2_match()\fP to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 .sp
 re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
 data> ...the date is 23ja\e=ph
@ -236,48 +238,49 @@ of very long sequences, so the patterns are normally not anchored. For example:
|
|||
0: 23jan19
|
||||
1: jan
|
||||
.sp
|
||||
Note the use of the \fBoffset\fP modifier to start the new match where the
|
||||
partial match was found.
|
||||
Note the use of the \fBoffset\fP modifier to start the new match where the
|
||||
partial match was found. In this example, the next segment was added to the one
|
||||
in which the partial match was found. This is the most straightforward
|
||||
approach, typically using a memory buffer that is twice the size of each
|
||||
segment. After a partial match, the first half of the buffer is discarded, the
|
||||
second half is moved to the start of the buffer, and a new segment is added
|
||||
before repeating the match as in the example above. After a no match, the
|
||||
entire buffer can be discarded.
|
||||
.P
|
||||
In this simple example, the next segment was just added to the one in which the
|
||||
partial match was found. However, if there are memory constraints, it may be
|
||||
necessary to discard text that precedes the partial match before adding the
|
||||
next segment. In cases such as the above, where the pattern does not contain
|
||||
any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if a pattern contains a lookbehind assertion, characters
|
||||
If there are memory constraints, you may want to discard text that precedes a
|
||||
partial match before adding the next segment. Unfortunately, this is not at
|
||||
present straightforward. In cases such as the above, where the pattern does not
|
||||
contain any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if the pattern contains a lookbehind assertion, characters
|
||||
that precede the start of the partial match may have been inspected during the
|
||||
matching process.
|
||||
.P
|
||||
The only lookbehind information that is available is the length of the longest
|
||||
lookbehind in a pattern. This may not, of course, be at the start of the
|
||||
pattern, but retaining that many characters before the partial match is
|
||||
sufficient, if not always strictly necessary. The way to do this is as follows:
|
||||
.P
|
||||
Before doing any matching, find the length of the longest lookbehind in the
|
||||
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
|
||||
option. Note that the resulting count is in characters, not code units. After a
|
||||
partial match, moving back from the ovector[0] offset in the subject by the
|
||||
number of characters given for the maximum lookbehind gets you to the earliest
|
||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||
while moving back through the code units. Characters before the point you have
|
||||
now reached can be discarded.
|
||||
.P
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
|
||||
displays a partial match, it indicates the lookbehind characters with '<'
|
||||
characters if the \fBallusedtext\fP modifier is set:
|
||||
matching process. When \fBpcre2test\fP displays a partial match, it indicates
|
||||
these characters with '<' if the \fBallusedtext\fP modifier is set:
|
||||
.sp
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\e=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
.sp
|
||||
Note that the \fPallusedtext\fP modifier is not available for JIT matching,
|
||||
because JIT matching does not maintain the first and last consulted characters.
|
||||
.
|
||||
.sp
|
||||
However, the \fPallusedtext\fP modifier is not available for JIT matching,
|
||||
because JIT matching does not record the first (or last) consulted characters.
|
||||
For this reason, this information is not available via the API. It is therefore
|
||||
not possible in general to obtain the exact number of characters that must be
|
||||
retained in order to get the right match result. If you cannot retain the
|
||||
entire segment, you must find some heuristic way of choosing.
|
||||
.P
|
||||
If you know the approximate length of the matching substrings, you can use that
|
||||
to decide how much text to retain. The only lookbehind information that is
|
||||
currently available via the API is the length of the longest individual
|
||||
lookbehind in a pattern, but this can be misleading if there are nested
|
||||
lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
|
||||
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
|
||||
units) that any individual lookbehind moves back when it is processed. A
|
||||
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
|
||||
inspects two characters before its starting point.
|
||||
.P
|
||||
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
|
||||
UTF-8 or UTF-16 you have to count characters while moving back through the code
|
||||
units.
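As an illustration of the counting described above, here is a minimal sketch of stepping back a given number of characters in a UTF-8 subject. The function name utf8_step_back and its parameters are hypothetical; it is not part of the PCRE2 API, and it assumes the subject is valid UTF-8. The character count passed in would typically be whatever amount of context the application has decided to retain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Steps back up to nchars characters
from byte offset "offset" in a valid UTF-8 subject, never going below offset
zero. Returns the resulting byte offset. */

static size_t
utf8_step_back(const uint8_t *subject, size_t offset, uint32_t nchars)
{
assert(subject != NULL);
while (nchars > 0 && offset > 0)
  {
  offset--;
  /* Continuation bytes have the form 10xxxxxx; skip them to reach the first
  byte of the character. */
  while (offset > 0 && (subject[offset] & 0xC0u) == 0x80u) offset--;
  nchars--;
  }
return offset;
}
```

For the subject "xx123ab" with a partial match starting at offset 5, stepping back 3 characters gives offset 2, so everything before offset 2 could be discarded. In a non-UTF case the same result is simply the subtraction 5 - 3.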
|
||||
.
|
||||
.
|
||||
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
|
||||
|
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
|
|||
want.
|
||||
.P
|
||||
If you do want to allow for starting again at the next character, one way of
|
||||
doing it is to retain the matched part of the segment and try a new complete
|
||||
match, as described for \fBpcre2_match()\fP above. Another possibility is to
|
||||
work with two buffers. If a partial match at offset \fIn\fP in the first buffer
|
||||
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
||||
you can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
||||
doing it is to retain some or all of the segment and try a new complete match,
|
||||
as described for \fBpcre2_match()\fP above. Another possibility is to work with
|
||||
two buffers. If a partial match at offset \fIn\fP in the first buffer is
|
||||
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
|
||||
can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
|
@ -365,6 +368,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 07 August 2019
|
||||
Last updated: 04 September 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -128,12 +128,12 @@ static int
|
|||
compile_block *, PCRE2_SIZE *);
|
||||
|
||||
static int
|
||||
get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *,
|
||||
get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
|
||||
compile_block *);
|
||||
|
||||
static BOOL
|
||||
set_lookbehind_lengths(uint32_t **, int *, int *, int *,
|
||||
parsed_recurse_check *, compile_block *);
|
||||
set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *,
|
||||
compile_block *);
|
||||
|
||||
static int
|
||||
check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
|
||||
|
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
|
|||
#define GI_SET_FIXED_LENGTH 0x80000000u
|
||||
#define GI_NOT_FIXED_LENGTH 0x40000000u
|
||||
#define GI_FIXED_LENGTH_MASK 0x0000ffffu
|
||||
#define GI_EXTRA_MASK 0x0fff0000u
|
||||
#define GI_EXTRA_MAX 0xfff /* NB not unsigned */
|
||||
#define GI_EXTRA_SHIFT 16
|
||||
|
||||
/* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
|
||||
and is fast (a good compiler can turn it into a subtraction and unsigned
|
||||
|
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
|
|||
Arguments:
|
||||
pptrptr pointer to pointer in the parsed pattern
|
||||
isinline FALSE if a reference or recursion; TRUE for inline group
|
||||
extraptr pointer to where to return extra lookbehind length
|
||||
errcodeptr pointer to the errorcode
|
||||
lcptr pointer to the loop counter
|
||||
group number of captured group or -1 for a non-capturing group
|
||||
|
@ -8908,13 +8904,11 @@ Returns: the group length or a negative number
|
|||
*/
|
||||
|
||||
static int
|
||||
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr,
|
||||
int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses,
|
||||
compile_block *cb)
|
||||
get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
|
||||
int group, parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
int branchlength;
|
||||
int grouplength = -1;
|
||||
int extra = 0;
|
||||
|
||||
/* The cache can be used only if there is no possibility of there being two
|
||||
groups with the same number. We do not need to set the end pointer for a group
|
||||
|
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
|
|||
if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
|
||||
{
|
||||
if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
|
||||
*extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
|
||||
return groupinfo & GI_FIXED_LENGTH_MASK;
|
||||
}
|
||||
}
|
||||
|
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
|
|||
|
||||
for(;;)
|
||||
{
|
||||
int branchextra;
|
||||
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
|
||||
recurses, cb);
|
||||
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
|
||||
if (branchlength < 0) goto ISNOTFIXED;
|
||||
if (grouplength == -1)
|
||||
{
|
||||
grouplength = branchlength;
|
||||
extra = branchextra;
|
||||
}
|
||||
else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
|
||||
if (grouplength == -1) grouplength = branchlength;
|
||||
else if (grouplength != branchlength) goto ISNOTFIXED;
|
||||
if (**pptrptr == META_KET) break;
|
||||
*pptrptr += 1; /* Skip META_ALT */
|
||||
}
|
||||
|
||||
/* There are only 12 bits for caching the extra value, but a pattern that
|
||||
needs more than that is weird indeed. */
|
||||
|
||||
if (group > 0 && extra <= GI_EXTRA_MAX)
|
||||
cb->groupinfo[group] |= (uint32_t)
|
||||
(GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
|
||||
|
||||
*extraptr = extra;
|
||||
if (group > 0)
|
||||
cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
|
||||
return grouplength;
|
||||
|
||||
ISNOTFIXED:
|
||||
|
@ -8973,17 +8954,11 @@ return -1;
|
|||
*************************************************/
|
||||
|
||||
/* Return a fixed length for a branch in a lookbehind, giving an error if the
|
||||
length is not fixed. We also take note of any extra value that is generated
|
||||
from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual
|
||||
lookbehind has length 2, but the max_lookbehind setting must be 3 because
|
||||
matching inspects 3 characters before the match starting point.
|
||||
|
||||
On entry, *pptrptr points to the first element inside the branch. On exit it is
|
||||
set to point to the ALT or KET.
|
||||
length is not fixed. On entry, *pptrptr points to the first element inside the
|
||||
branch. On exit it is set to point to the ALT or KET.
|
||||
|
||||
Arguments:
|
||||
pptrptr pointer to pointer in the parsed pattern
|
||||
extraptr pointer to where to return extra lookbehind length
|
||||
errcodeptr pointer to error code
|
||||
lcptr pointer to loop counter
|
||||
recurses chain of recurse_check to catch mutual recursion
|
||||
|
@ -8993,14 +8968,11 @@ Returns: the length, or a negative value on error
|
|||
*/
|
||||
|
||||
static int
|
||||
get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr,
|
||||
get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
||||
parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
int branchlength = 0;
|
||||
int grouplength;
|
||||
int groupextra;
|
||||
int max;
|
||||
int extra = 0; /* Additional lookbehind from nesting */
|
||||
uint32_t lastitemlength = 0;
|
||||
uint32_t *pptr = *pptrptr;
|
||||
PCRE2_SIZE offset;
|
||||
|
@ -9149,17 +9121,13 @@ for (;; pptr++)
|
|||
break;
|
||||
|
||||
/* A nested lookbehind does not contribute any length to this lookbehind,
|
||||
but must itself be checked and have its lengths set. If the maximum
|
||||
lookbehind for the nested lookbehind is greater than the length so far
|
||||
computed for this branch, we must compute an extra value and keep the
|
||||
largest encountered for use when setting the maximum overall lookbehind. */
|
||||
but must itself be checked and have its lengths set. */
|
||||
|
||||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
|
||||
if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
|
||||
return -1;
|
||||
if (max - branchlength > extra) extra = max - branchlength;
|
||||
break;
|
||||
|
||||
/* Back references and recursions are handled by very similar code. At this
|
||||
|
@ -9267,14 +9235,15 @@ for (;; pptr++)
|
|||
in the cache. */
|
||||
|
||||
gptr++;
|
||||
grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr,
|
||||
group, &this_recurse, cb);
|
||||
grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group,
|
||||
&this_recurse, cb);
|
||||
if (grouplength < 0)
|
||||
{
|
||||
if (*errcodeptr == 0) goto ISNOTFIXED;
|
||||
return -1; /* Error already set */
|
||||
}
|
||||
goto OK_GROUP;
|
||||
itemlength = grouplength;
|
||||
break;
|
||||
|
||||
/* Check nested groups - advance past the initial data for each type and
|
||||
then seek a fixed length with get_grouplength(). */
|
||||
|
@ -9304,16 +9273,10 @@ for (;; pptr++)
|
|||
case META_SCRIPT_RUN:
|
||||
pptr++;
|
||||
CHECK_GROUP:
|
||||
grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr,
|
||||
group, recurses, cb);
|
||||
grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group,
|
||||
recurses, cb);
|
||||
if (grouplength < 0) return -1;
|
||||
|
||||
/* A nested lookbehind within the group may require looking back further
|
||||
than the length of the group. */
|
||||
|
||||
OK_GROUP:
|
||||
itemlength = grouplength;
|
||||
if (groupextra - branchlength > extra) extra = groupextra - branchlength;
|
||||
break;
|
||||
|
||||
/* Exact repetition is OK; variable repetition is not. A repetition of zero
|
||||
|
@ -9374,7 +9337,6 @@ for (;; pptr++)
|
|||
|
||||
EXIT:
|
||||
*pptrptr = pptr;
|
||||
*extraptr = extra;
|
||||
return branchlength;
|
||||
|
||||
PARSED_SKIP_FAILED:
|
||||
|
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.
|
|||
|
||||
Arguments:
|
||||
pptrptr pointer to pointer in the parsed pattern
|
||||
maxptr where to return maximum lookbehind for the whole group
|
||||
errcodeptr pointer to error code
|
||||
lcptr pointer to loop counter
|
||||
recurses chain of recurse_check to catch mutual recursion
|
||||
|
@ -9411,13 +9372,11 @@ Returns: TRUE if all is well
|
|||
*/
|
||||
|
||||
static BOOL
|
||||
set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
|
||||
int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
|
||||
set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
|
||||
parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
PCRE2_SIZE offset;
|
||||
int branchlength;
|
||||
int branchextra;
|
||||
int max = 0;
|
||||
uint32_t *bptr = *pptrptr;
|
||||
|
||||
READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
||||
|
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr); /* Offset for error messages */
|
|||
do
|
||||
{
|
||||
*pptrptr += 1;
|
||||
branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
|
||||
recurses, cb);
|
||||
branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
|
||||
if (branchlength < 0)
|
||||
{
|
||||
/* The errorcode and offset may already be set from a nested lookbehind. */
|
||||
|
@ -9435,14 +9393,12 @@ do
|
|||
if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
|
||||
return FALSE;
|
||||
}
|
||||
if (branchlength + branchextra > max) max = branchlength + branchextra;
|
||||
if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
|
||||
*bptr |= branchlength; /* branchlength never more than 65535 */
|
||||
bptr = *pptrptr;
|
||||
}
|
||||
while (*bptr == META_ALT);
|
||||
|
||||
if (max > cb->max_lookbehind) cb->max_lookbehind = max;
|
||||
*maxptr = max;
|
||||
return TRUE;
|
||||
}
|
||||
|
||||
|
@ -9475,7 +9431,6 @@ static int
|
|||
check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
|
||||
parsed_recurse_check *recurses, compile_block *cb)
|
||||
{
|
||||
int max;
|
||||
int errorcode = 0;
|
||||
int loopcount = 0;
|
||||
int nestlevel = 0;
|
||||
|
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
|
|||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount,
|
||||
recurses, cb))
|
||||
if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
|
||||
return errorcode;
|
||||
break;
|
||||
}
|
||||
|
|
|
@ -304,7 +304,7 @@ Partial match, mark=xx: 123a
|
|||
|
||||
/(?<=(?<=a)b)c.*/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 2
|
||||
Max lookbehind = 1
|
||||
First code unit = 'c'
|
||||
Subject length lower bound = 1
|
||||
abc\=ph
|
||||
|
@ -337,7 +337,7 @@ Partial match: abcd
|
|||
|
||||
/(?<=(?<=(?<=a)b)c)./I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 3
|
||||
Max lookbehind = 1
|
||||
Subject length lower bound = 1
|
||||
123abcXYZ
|
||||
0: abcX
|
||||
|
@ -354,7 +354,7 @@ Subject length lower bound = 1
|
|||
|
||||
/(?<=ab((?<=...)cd))./I
|
||||
Capture group count = 1
|
||||
Max lookbehind = 5
|
||||
Max lookbehind = 4
|
||||
Subject length lower bound = 1
|
||||
ZabcdX
|
||||
0: ZabcdX
|
||||
|
@ -363,7 +363,7 @@ Subject length lower bound = 1
|
|||
|
||||
/(?<=((?<=(?<=ab).))(?1)(?1))./I
|
||||
Capture group count = 1
|
||||
Max lookbehind = 3
|
||||
Max lookbehind = 2
|
||||
Subject length lower bound = 1
|
||||
abxZ
|
||||
0: abxZ
|
||||
|
|
|
@ -17036,7 +17036,7 @@ Subject length lower bound = 1
|
|||
|
||||
/(?<=(?<=a)b)c.*/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 2
|
||||
Max lookbehind = 1
|
||||
First code unit = 'c'
|
||||
Subject length lower bound = 1
|
||||
abc\=ph
|
||||
|
@ -17064,7 +17064,7 @@ Subject length lower bound = 0
|
|||
|
||||
/(?<=a(?<=a|ba)c)/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 3
|
||||
Max lookbehind = 2
|
||||
May match empty string
|
||||
Subject length lower bound = 0
|
||||
|
||||
|
@ -17076,7 +17076,7 @@ Subject length lower bound = 0
|
|||
|
||||
/(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
|
||||
Capture group count = 0
|
||||
Max lookbehind = 5
|
||||
Max lookbehind = 4
|
||||
May match empty string
|
||||
Subject length lower bound = 0
|
||||
|
||||
|
|