Update definition of partial match and fix \z and \Z (as documented).

Philip.Hazel 2019-07-21 16:48:13 +00:00
parent 344056baf8
commit c84a06c96e
13 changed files with 715 additions and 604 deletions


@ -97,6 +97,16 @@ within it, the nested lookbehind was not correctly processed. For example, if
20. Implemented pcre2_get_match_data_size(). 20. Implemented pcre2_get_match_data_size().
21. Two alterations to partial matching (not yet done by JIT):
(a) The definition of a partial match is slightly changed: if a pattern
contains any lookbehinds, an empty partial match may be given, because this
is another situation where adding characters to the current subject can
lead to a full match. Example: /c*+(?<=[bc])/ with subject "ab".
(b) An empty string partial hard match can be returned for \z and \Z as it
is documented that they shouldn't match.
Version 10.33 16-April-2019 Version 10.33 16-April-2019
--------------------------- ---------------------------
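
Note on item 21(a): the revised definition can be read as a simple predicate. The sketch below only models the documented wording and is not taken from the PCRE2 source. The function and parameter names are hypothetical, and it deliberately leaves out the PCRE2_PARTIAL_HARD handling of \z and \Z described in item 21(b).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the documented rule; not a PCRE2 function.
   reached_end_needing_more: the matcher reached the end of the subject
     and could only continue if more characters were supplied.
   chars_inspected: number of subject characters examined in this attempt.
   has_lookbehind: the pattern contains at least one lookbehind. */
bool partial_match_allowed(bool reached_end_needing_more,
                           size_t chars_inspected,
                           bool has_lookbehind)
{
  if (!reached_end_needing_more) return false;
  return chars_inspected > 0 || has_lookbehind;
}

/* Illustrative checks of the rule as modelled above. */
void partial_rule_examples(void)
{
  /* Nothing inspected, but the pattern has a lookbehind: allowed. */
  assert(partial_match_allowed(true, 0, true));
  /* Nothing inspected and no lookbehind: not a partial match. */
  assert(!partial_match_allowed(true, 0, false));
  /* At least one character inspected: allowed without a lookbehind. */
  assert(partial_match_allowed(true, 2, false));
  /* The end of the subject was not the obstacle: never a partial match. */
  assert(!partial_match_allowed(false, 2, true));
}
```

The second check is the case the documentation rules out: without either condition, an empty partial match would be reported at the end of the subject for every unanchored pattern.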


@ -2725,12 +2725,16 @@ Your program may crash or loop indefinitely or give wrong results.
</pre> </pre>
These options turn on the partial matching feature. A partial match occurs if These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. If this happens when subject characters to complete the match. In addition, either at least one
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by character must have been inspected or the pattern must contain a lookbehind.
testing any remaining alternatives. Only if no complete match can be found is </P>
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, <P>
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
match, but only if no complete match can be found. is set, matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
caller is prepared to handle a partial match, but only if no complete match can
be found.
</P> </P>
<P> <P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
@ -3846,7 +3850,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 25 June 2019 Last updated: 20 July 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -45,7 +45,7 @@ as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very entered. Partial matching can also be useful when the subject string is very
long and is not all available at once. long and is not all available at once, as discussed below.
</P> </P>
<P> <P>
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
@ -79,13 +79,18 @@ is also disabled for partial matching.
<P> <P>
A partial match occurs during a call to <b>pcre2_match()</b> when the end of the A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
subject string is reached successfully, but matching cannot continue because subject string is reached successfully, but matching cannot continue because
more characters are needed. However, at least one character in the subject must more characters are needed, and in addition, either at least one character in
have been inspected. This character need not form part of the final matched the subject has been inspected or the pattern contains a lookbehind. An
string; lookbehind assertions and the \K escape sequence provide ways of inspected character need not form part of the final matched string; lookbehind
inspecting characters before the start of a matched string. The requirement for assertions and the \K escape sequence provide ways of inspecting characters
inspecting at least one character exists because an empty string can always be before the start of a matched string.
matched; without such a restriction there would always be a partial match of an </P>
empty string at the end of the subject. <P>
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
</P> </P>
<P> <P>
When a partial match is returned, the first two elements in the ovector point When a partial match is returned, the first two elements in the ovector point
@ -104,7 +109,7 @@ characters.
</P> </P>
<P> <P>
What happens when a partial match is identified depends on which of the two What happens when a partial match is identified depends on which of the two
partial matching options are set. partial matching options is set.
</P> </P>
<br><b> <br><b>
PCRE2_PARTIAL_SOFT WITH pcre2_match() PCRE2_PARTIAL_SOFT WITH pcre2_match()
@ -128,12 +133,12 @@ the data that is returned. Consider this pattern:
<pre> <pre>
/123\w+X|dogY/ /123\w+X|dogY/
</pre> </pre>
If this is matched against the subject string "abc123dog", both If this is matched against the subject string "abc123dog", both alternatives
alternatives fail to match, but the end of the subject is reached during fail to match, but the end of the subject is reached during matching, so
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
identifying "123dog" as the first partial match that was found. (In this "123dog" as the first partial match that was found. (In this example, there are
example, there are two partial matches, because "dog" on its own partially two partial matches, because "dog" on its own partially matches the second
matches the second alternative.) alternative.)
</P> </P>
<br><b> <br><b>
PCRE2_PARTIAL_HARD WITH pcre2_match() PCRE2_PARTIAL_HARD WITH pcre2_match()
@ -145,8 +150,8 @@ possible complete matches. This option is "hard" because it prefers an earlier
partial match over a later complete match. For this reason, the assumption is partial match over a later complete match. For this reason, the assumption is
made that the end of the supplied subject string may not be the true end of the made that the end of the supplied subject string may not be the true end of the
available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end
of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
character in the subject has been inspected. characters have been inspected.
</P> </P>
<br><b> <br><b>
Comparing hard and soft partial matching Comparing hard and soft partial matching
@ -346,44 +351,25 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b> value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
displays a partial match, it indicates the lookbehind characters with '&#60;' displays a partial match, it indicates the lookbehind characters with '&#60;'
characters: characters if the "allusedtext" modifier is set:
<pre> <pre>
re&#62; "(?&#60;=123)abc" re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab Partial match: 123ab
&#60;&#60;&#60; &#60;&#60;&#60;
</PRE>
</P>
<P>
3. The maximum lookbehind count is also important when the result of a partial
match attempt is "no match". In this case, the maximum lookbehind characters
from the end of the current segment must be retained at the start of the next
segment, in case the lookbehind is at the start of the pattern. Matching the
next segment must then start at the appropriate offset.
</P>
<P>
4. Because a partial match must always contain at least one character, what
might be considered a partial match of an empty string actually gives a "no
match" result. For example:
<pre>
re&#62; /c(?&#60;=abc)x/
data&#62; ab\=ps
No match
</pre> </pre>
If the next segment begins "cx", a match should be found, but this will only However, the "allusedtext" modifier is not available for JIT matching, because
happen if characters from the previous segment are retained. For this reason, a JIT matching does not maintain the first and last consulted characters.
"no match" result should be interpreted as "partial match of an empty string"
when the pattern contains lookbehinds.
</P> </P>
<P> <P>
5. Matching a subject string that is split into multiple segments may not 3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string, always produce exactly the same result as matching over one single long string
especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
Word Boundaries" above describes an issue that arises if the pattern ends with Boundaries" above describes an issue that arises if the pattern ends with \b
\b or \B. Another kind of difference may occur when there are multiple or \B. Another kind of difference may occur when there are multiple matching
matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
is given only when there are no completed matches. This means that as soon as only when there are no completed matches. This means that as soon as the
the shortest match has been found, continuation to a new subject segment is no shortest match has been found, continuation to a new subject segment is no
longer possible. Consider this <b>pcre2test</b> example: longer possible. Consider this <b>pcre2test</b> example:
<pre> <pre>
re&#62; /dog(sbody)?/ re&#62; /dog(sbody)?/
@ -418,7 +404,7 @@ multi-segment data. The example above then behaves differently:
data&#62; gsb\=ph,dfa,dfa_restart data&#62; gsb\=ph,dfa,dfa_restart
Partial match: gsb Partial match: gsb
</pre> </pre>
6. Patterns that contain alternatives at the top level which do not all start 4. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
used. For example, consider this pattern: used. For example, consider this pattern:
<pre> <pre>
@ -463,7 +449,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br> <br><a name="SEC10" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 21 June 2019 Last updated: 21 July 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>
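
Note on the multi-segment text in the hunk above: the worked figures (ovector start 5, maximum lookbehind 3, discard before offset 2, next startoffset 3) follow from a small piece of arithmetic. The sketch below is a hypothetical helper, not part of the PCRE2 API. It assumes one code unit per character (no UTF mode), and the clamping for a match start smaller than the lookbehind count is an assumption added here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: what to drop from the old segment and where
   the next match attempt should start in the retained buffer. */
typedef struct {
  size_t discard_before;    /* characters before this offset can be dropped */
  size_t next_startoffset;  /* startoffset for the next match attempt */
} segment_carry;

/* match_start is ovector[0] from the partial match; max_lookbehind is the
   pattern's maximum lookbehind count. Assumes one code unit per character. */
segment_carry carry_after_partial(size_t match_start, size_t max_lookbehind)
{
  segment_carry c;
  /* Keep at most max_lookbehind characters before the partial match;
     if fewer exist, keep them all (an assumption of this sketch). */
  size_t keep_back = max_lookbehind < match_start ? max_lookbehind : match_start;
  c.discard_before = match_start - keep_back;
  c.next_startoffset = keep_back;
  /* The retained lookbehind plus the discarded prefix cover the old start. */
  assert(c.discard_before + c.next_startoffset == match_start);
  return c;
}
```

With match_start 5 and max_lookbehind 3 this gives discard_before 2 and next_startoffset 3, the figures quoted for the subject "xx123ab".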


@ -2661,13 +2661,16 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
These options turn on the partial matching feature. A partial match oc- These options turn on the partial matching feature. A partial match oc-
curs if the end of the subject string is reached successfully, but curs if the end of the subject string is reached successfully, but
there are not enough subject characters to complete the match. If this there are not enough subject characters to complete the match. In addi-
happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, tion, either at least one character must have been inspected or the
matching continues by testing any remaining alternatives. Only if no pattern must contain a lookbehind.
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR-
the caller is prepared to handle a partial match, but only if no com- TIAL_HARD) is set, matching continues by testing any remaining alterna-
plete match can be found. tives. Only if no complete match can be found is PCRE2_ERROR_PARTIAL
returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PAR-
TIAL_SOFT specifies that the caller is prepared to handle a partial
match, but only if no complete match can be found.
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
case, if a partial match is found, pcre2_match() immediately returns case, if a partial match is found, pcre2_match() immediately returns
@ -3702,7 +3705,7 @@ AUTHOR
REVISION REVISION
Last updated: 25 June 2019 Last updated: 20 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -5665,7 +5668,7 @@ PARTIAL MATCHING IN PCRE2
feedback is likely to be a better user interface than a check that is feedback is likely to be a better user interface than a check that is
delayed until the entire string has been entered. Partial matching can delayed until the entire string has been entered. Partial matching can
also be useful when the subject string is very long and is not all also be useful when the subject string is very long and is not all
available at once. available at once, as discussed below.
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
PCRE2_PARTIAL_HARD options, which can be set when calling a matching PCRE2_PARTIAL_HARD options, which can be set when calling a matching
@ -5698,14 +5701,18 @@ PARTIAL MATCHING USING pcre2_match()
A partial match occurs during a call to pcre2_match() when the end of A partial match occurs during a call to pcre2_match() when the end of
the subject string is reached successfully, but matching cannot con- the subject string is reached successfully, but matching cannot con-
tinue because more characters are needed. However, at least one charac- tinue because more characters are needed, and in addition, either at
ter in the subject must have been inspected. This character need not least one character in the subject has been inspected or the pattern
form part of the final matched string; lookbehind assertions and the \K contains a lookbehind. An inspected character need not form part of the
escape sequence provide ways of inspecting characters before the start final matched string; lookbehind assertions and the \K escape sequence
of a matched string. The requirement for inspecting at least one char- provide ways of inspecting characters before the start of a matched
acter exists because an empty string can always be matched; without string.
such a restriction there would always be a partial match of an empty
string at the end of the subject. The two additional requirements define the cases where adding more
characters to the existing subject may complete the match. Without
these conditions there would be a partial match of an empty string at
the end of the subject for all unanchored patterns (and also for an-
chored patterns if the subject itself is empty).
When a partial match is returned, the first two elements in the ovector When a partial match is returned, the first two elements in the ovector
point to the portion of the subject that was matched, but the values in point to the portion of the subject that was matched, but the values in
@ -5722,7 +5729,7 @@ PARTIAL MATCHING USING pcre2_match()
quent re-match with additional characters. quent re-match with additional characters.
What happens when a partial match is identified depends on which of the What happens when a partial match is identified depends on which of the
two partial matching options are set. two partial matching options is set.
PCRE2_PARTIAL_SOFT WITH pcre2_match() PCRE2_PARTIAL_SOFT WITH pcre2_match()
@ -5759,8 +5766,8 @@ PARTIAL MATCHING USING pcre2_match()
reason, the assumption is made that the end of the supplied subject reason, the assumption is made that the end of the supplied subject
string may not be the true end of the available data, and so, if \z, string may not be the true end of the available data, and so, if \z,
\Z, \b, \B, or $ are encountered at the end of the subject, the result \Z, \b, \B, or $ are encountered at the end of the subject, the result
is PCRE2_ERROR_PARTIAL, provided that at least one character in the is PCRE2_ERROR_PARTIAL, whether or not any characters have been in-
subject has been inspected. spected.
Comparing hard and soft partial matching Comparing hard and soft partial matching
@ -5963,43 +5970,25 @@ ISSUES WITH MULTI-SEGMENT MATCHING
mum lookbehind count is 3, so all characters before offset 2 can be mum lookbehind count is 3, so all characters before offset 2 can be
discarded. The value of startoffset for the next match should be 3. discarded. The value of startoffset for the next match should be 3.
When pcre2test displays a partial match, it indicates the lookbehind When pcre2test displays a partial match, it indicates the lookbehind
characters with '<' characters: characters with '<' characters if the "allusedtext" modifier is set:
re> "(?<=123)abc" re> "(?<=123)abc"
data> xx123ab\=ph data> xx123ab\=ph,allusedtext
Partial match: 123ab Partial match: 123ab
<<< <<< However, the "allusedtext" modifier is not avail-
able for JIT matching, because JIT matching does not maintain the first
and last consulted characters.
3. The maximum lookbehind count is also important when the result of a 3. Matching a subject string that is split into multiple segments may
partial match attempt is "no match". In this case, the maximum lookbe-
hind characters from the end of the current segment must be retained at
the start of the next segment, in case the lookbehind is at the start
of the pattern. Matching the next segment must then start at the appro-
priate offset.
4. Because a partial match must always contain at least one character,
what might be considered a partial match of an empty string actually
gives a "no match" result. For example:
re> /c(?<=abc)x/
data> ab\=ps
No match
If the next segment begins "cx", a match should be found, but this will
only happen if characters from the previous segment are retained. For
this reason, a "no match" result should be interpreted as "partial
match of an empty string" when the pattern contains lookbehinds.
5. Matching a subject string that is split into multiple segments may
not always produce exactly the same result as matching over one single not always produce exactly the same result as matching over one single
long string, especially when PCRE2_PARTIAL_SOFT is used. The section long string when PCRE2_PARTIAL_SOFT is used. The section "Partial
"Partial Matching and Word Boundaries" above describes an issue that Matching and Word Boundaries" above describes an issue that arises if
arises if the pattern ends with \b or \B. Another kind of difference the pattern ends with \b or \B. Another kind of difference may occur
may occur when there are multiple matching possibilities, because (for when there are multiple matching possibilities, because (for PCRE2_PAR-
PCRE2_PARTIAL_SOFT) a partial match result is given only when there are TIAL_SOFT) a partial match result is given only when there are no com-
no completed matches. This means that as soon as the shortest match has pleted matches. This means that as soon as the shortest match has been
been found, continuation to a new subject segment is no longer possi- found, continuation to a new subject segment is no longer possible.
ble. Consider this pcre2test example: Consider this pcre2test example:
re> /dog(sbody)?/ re> /dog(sbody)?/
data> dogsb\=ps data> dogsb\=ps
@ -6034,7 +6023,7 @@ ISSUES WITH MULTI-SEGMENT MATCHING
data> gsb\=ph,dfa,dfa_restart data> gsb\=ph,dfa,dfa_restart
Partial match: gsb Partial match: gsb
6. Patterns that contain alternatives at the top level which do not all 4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when start with the same pattern item may not work as expected when
PCRE2_DFA_RESTART is used. For example, consider this pattern: PCRE2_DFA_RESTART is used. For example, consider this pattern:
@ -6079,7 +6068,7 @@ AUTHOR
REVISION REVISION
Last updated: 21 June 2019 Last updated: 21 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2API 3 "25 June 2019" "PCRE2 10.34" .TH PCRE2API 3 "20 July 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -2719,12 +2719,15 @@ Your program may crash or loop indefinitely or give wrong results.
.sp .sp
These options turn on the partial matching feature. A partial match occurs if These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. If this happens when subject characters to complete the match. In addition, either at least one
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by character must have been inspected or the pattern must contain a lookbehind.
testing any remaining alternatives. Only if no complete match can be found is .P
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial is set, matching continues by testing any remaining alternatives. Only if no
match, but only if no complete match can be found. complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
caller is prepared to handle a partial match, but only if no complete match can
be found.
.P .P
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, \fBpcre2_match()\fP immediately returns a partial match is found, \fBpcre2_match()\fP immediately returns
@ -3859,6 +3862,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 25 June 2019 Last updated: 20 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "21 June 2019" "PCRE2 10.34" .TH PCRE2PARTIAL 3 "21 July 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2" .SH "PARTIAL MATCHING IN PCRE2"
@ -22,7 +22,7 @@ as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very entered. Partial matching can also be useful when the subject string is very
long and is not all available at once. long and is not all available at once, as discussed below.
.P .P
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function. PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
@ -55,13 +55,17 @@ is also disabled for partial matching.
.sp .sp
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
subject string is reached successfully, but matching cannot continue because subject string is reached successfully, but matching cannot continue because
more characters are needed. However, at least one character in the subject must more characters are needed, and in addition, either at least one character in
have been inspected. This character need not form part of the final matched the subject has been inspected or the pattern contains a lookbehind. An
string; lookbehind assertions and the \eK escape sequence provide ways of inspected character need not form part of the final matched string; lookbehind
inspecting characters before the start of a matched string. The requirement for assertions and the \eK escape sequence provide ways of inspecting characters
inspecting at least one character exists because an empty string can always be before the start of a matched string.
matched; without such a restriction there would always be a partial match of an .P
empty string at the end of the subject. The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
.P .P
When a partial match is returned, the first two elements in the ovector point When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched, but the values in the rest of to the portion of the subject that was matched, but the values in the rest of
@ -78,7 +82,7 @@ these characters are needed for a subsequent re-match with additional
characters. characters.
.P .P
What happens when a partial match is identified depends on which of the two What happens when a partial match is identified depends on which of the two
partial matching options are set. partial matching options is set.
. .
. .
.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()" .SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
@ -100,12 +104,12 @@ the data that is returned. Consider this pattern:
.sp .sp
/123\ew+X|dogY/ /123\ew+X|dogY/
.sp .sp
If this is matched against the subject string "abc123dog", both If this is matched against the subject string "abc123dog", both alternatives
alternatives fail to match, but the end of the subject is reached during fail to match, but the end of the subject is reached during matching, so
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
identifying "123dog" as the first partial match that was found. (In this "123dog" as the first partial match that was found. (In this example, there are
example, there are two partial matches, because "dog" on its own partially two partial matches, because "dog" on its own partially matches the second
matches the second alternative.) alternative.)
. .
. .
.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()" .SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
@ -117,8 +121,8 @@ possible complete matches. This option is "hard" because it prefers an earlier
partial match over a later complete match. For this reason, the assumption is partial match over a later complete match. For this reason, the assumption is
made that the end of the supplied subject string may not be the true end of the made that the end of the supplied subject string may not be the true end of the
available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
character in the subject has been inspected. characters have been inspected.
. .
. .
.SS "Comparing hard and soft partial matching" .SS "Comparing hard and soft partial matching"
@ -319,40 +323,23 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<' displays a partial match, it indicates the lookbehind characters with '<'
characters: characters if the "allusedtext" modifier is set:
.sp .sp
re> "(?<=123)abc" re> "(?<=123)abc"
data> xx123ab\e=ph data> xx123ab\e=ph,allusedtext
Partial match: 123ab Partial match: 123ab
<<< <<<
However, the "allusedtext" modifier is not available for JIT matching, because
JIT matching does not maintain the first and last consulted characters.
.P .P
3. The maximum lookbehind count is also important when the result of a partial 3. Matching a subject string that is split into multiple segments may not
match attempt is "no match". In this case, the maximum lookbehind characters always produce exactly the same result as matching over one single long string
from the end of the current segment must be retained at the start of the next when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
segment, in case the lookbehind is at the start of the pattern. Matching the Boundaries" above describes an issue that arises if the pattern ends with \eb
next segment must then start at the appropriate offset. or \eB. Another kind of difference may occur when there are multiple matching
.P possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
4. Because a partial match must always contain at least one character, what only when there are no completed matches. This means that as soon as the
might be considered a partial match of an empty string actually gives a "no shortest match has been found, continuation to a new subject segment is no
match" result. For example:
.sp
re> /c(?<=abc)x/
data> ab\e=ps
No match
.sp
If the next segment begins "cx", a match should be found, but this will only
happen if characters from the previous segment are retained. For this reason, a
"no match" result should be interpreted as "partial match of an empty string"
when the pattern contains lookbehinds.
.P
5. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\eb or \eB. Another kind of difference may occur when there are multiple
matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result
is given only when there are no completed matches. This means that as soon as
the shortest match has been found, continuation to a new subject segment is no
longer possible. Consider this \fBpcre2test\fP example: longer possible. Consider this \fBpcre2test\fP example:
.sp .sp
re> /dog(sbody)?/ re> /dog(sbody)?/
@ -386,7 +373,7 @@ multi-segment data. The example above then behaves differently:
data> gsb\e=ph,dfa,dfa_restart data> gsb\e=ph,dfa,dfa_restart
Partial match: gsb Partial match: gsb
.sp .sp
6. Patterns that contain alternatives at the top level which do not all start 4. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
used. For example, consider this pattern: used. For example, consider this pattern:
.sp .sp
@ -435,6 +422,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 21 June 2019 Last updated: 21 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi


@ -966,7 +966,7 @@ for (;;)
if (ptr >= end_subject) if (ptr >= end_subject)
{ {
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0) if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
could_continue = TRUE; return PCRE2_ERROR_PARTIAL;
else { ADD_ACTIVE(state_offset + 1, 0); } else { ADD_ACTIVE(state_offset + 1, 0); }
} }
break; break;
@ -1015,10 +1015,12 @@ for (;;)
/*-----------------------------------------------------------------*/ /*-----------------------------------------------------------------*/
case OP_EODN: case OP_EODN:
if (clen == 0 && (mb->moptions & PCRE2_PARTIAL_HARD) != 0) if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - mb->nllen))
could_continue = TRUE; {
else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - mb->nllen)) if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
{ ADD_ACTIVE(state_offset + 1, 0); } return PCRE2_ERROR_PARTIAL;
ADD_ACTIVE(state_offset + 1, 0);
}
break; break;
/*-----------------------------------------------------------------*/ /*-----------------------------------------------------------------*/
@ -3181,9 +3183,12 @@ for (;;)
partial_newline || /* Either partial NL */ partial_newline || /* Either partial NL */
( /* or ... */ ( /* or ... */
ptr >= end_subject && /* End of subject and */ ptr >= end_subject && /* End of subject and */
ptr > mb->start_used_ptr) /* Inspected non-empty string */ ( /* either */
ptr > mb->start_used_ptr || /* Inspected non-empty string */
mb->haslookbehind /* or pattern has lookbehind */
) )
) )
))
match_count = PCRE2_ERROR_PARTIAL; match_count = PCRE2_ERROR_PARTIAL;
break; /* Exit from loop along the subject string */ break; /* Exit from loop along the subject string */
} }
@ -3412,6 +3417,7 @@ mb->tables = re->tables;
mb->start_subject = subject; mb->start_subject = subject;
mb->end_subject = end_subject; mb->end_subject = end_subject;
mb->start_offset = start_offset; mb->start_offset = start_offset;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->moptions = options; mb->moptions = options;
mb->poptions = re->overall_options; mb->poptions = re->overall_options;
mb->match_call_count = 0; mb->match_call_count = 0;


@ -854,6 +854,7 @@ typedef struct match_block {
uint32_t match_call_count; /* Number of times a new frame is created */ uint32_t match_call_count; /* Number of times a new frame is created */
BOOL hitend; /* Hit the end of the subject at some point */ BOOL hitend; /* Hit the end of the subject at some point */
BOOL hasthen; /* Pattern contains (*THEN) */ BOOL hasthen; /* Pattern contains (*THEN) */
BOOL haslookbehind; /* Pattern contains significant lookbehind */
const uint8_t *lcc; /* Points to lower casing table */ const uint8_t *lcc; /* Points to lower casing table */
const uint8_t *fcc; /* Points to case-flipping table */ const uint8_t *fcc; /* Points to case-flipping table */
const uint8_t *ctypes; /* Points to table of type maps */ const uint8_t *ctypes; /* Points to table of type maps */
@ -909,6 +910,7 @@ typedef struct dfa_match_block {
uint32_t poptions; /* Pattern options */ uint32_t poptions; /* Pattern options */
uint32_t nltype; /* Newline type */ uint32_t nltype; /* Newline type */
uint32_t nllen; /* Newline string length */ uint32_t nllen; /* Newline string length */
BOOL haslookbehind; /* Pattern contains significant lookbehind */
PCRE2_UCHAR nl[4]; /* Newline string when fixed */ PCRE2_UCHAR nl[4]; /* Newline string when fixed */
uint16_t bsr_convention; /* \R interpretation */ uint16_t bsr_convention; /* \R interpretation */
pcre2_callout_block *cb; /* Points to a callout block */ pcre2_callout_block *cb; /* Points to a callout block */


@ -416,7 +416,6 @@ if (caseless)
#endif #endif
/* Not in UTF mode */ /* Not in UTF mode */
{ {
for (; length > 0; length--) for (; length > 0; length--)
{ {
@ -491,11 +490,16 @@ heap is used for a larger vector.
*************************************************/ *************************************************/
/* These macros pack up tests that are used for partial matching several times /* These macros pack up tests that are used for partial matching several times
in the code. We set the "hit end" flag if the pointer is at the end of the in the code. The second one is used when we already know we are past the end of
subject and also past the earliest inspected character (i.e. something has been the subject. We set the "hit end" flag if the pointer is at the end of the
matched, even if not part of the actual matched string). For hard partial subject and either (a) the pointer is past the earliest inspected character
matching, we then return immediately. The second one is used when we already (i.e. something has been matched, even if not part of the actual matched
know we are past the end of the subject. */ string), or (b) the pattern contains a lookbehind. These are the conditions for
which adding more characters may allow the current match to continue.
For hard partial matching, we immediately return a partial match. Otherwise,
carrying on means that a complete match on the current subject will be sought.
A partial match is returned only if no complete match can be found. */
#define CHECK_PARTIAL()\ #define CHECK_PARTIAL()\
if (Feptr >= mb->end_subject) \ if (Feptr >= mb->end_subject) \
@ -503,31 +507,13 @@ know we are past the end of the subject. */
SCHECK_PARTIAL(); \ SCHECK_PARTIAL(); \
} }
/* Original version that allows hard partial to continue if no inspected
characters. */
#define SCHECK_PARTIAL()\ #define SCHECK_PARTIAL()\
if (mb->partial != 0 && Feptr > mb->start_used_ptr) \ if (mb->partial != 0 && (Feptr > mb->start_used_ptr || mb->haslookbehind)) \
{ \ { \
mb->hitend = TRUE; \ mb->hitend = TRUE; \
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \ if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
} }
/* Experimental version that makes hard partial give no match instead of
continuing if no characters have been inspected. */
#ifdef NEVERNEVER
#define SCHECK_PARTIAL()\
if (mb->partial != 0) \
{ \
if (Feptr > mb->start_used_ptr) \
{ \
mb->hitend = TRUE; \
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
} \
else if (mb->partial > 1) RRETURN(MATCH_NOMATCH); \
}
#endif /* NEVERNEVER */
/* These macros are used to implement backtracking. They simulate a recursive /* These macros are used to implement backtracking. They simulate a recursive
call to the match() function by means of a local vector of frames which call to the match() function by means of a local vector of frames which
@ -5670,7 +5656,11 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_EOD: case OP_EOD:
if (Feptr < mb->end_subject) RRETURN(MATCH_NOMATCH); if (Feptr < mb->end_subject) RRETURN(MATCH_NOMATCH);
SCHECK_PARTIAL(); if (mb->partial != 0)
{
mb->hitend = TRUE;
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL;
}
Fecode++; Fecode++;
break; break;
@ -5695,7 +5685,11 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
/* Either at end of string or \n before end. */ /* Either at end of string or \n before end. */
SCHECK_PARTIAL(); if (mb->partial != 0)
{
mb->hitend = TRUE;
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL;
}
Fecode++; Fecode++;
break; break;
@ -6457,6 +6451,7 @@ mb->start_subject = subject;
mb->start_offset = start_offset; mb->start_offset = start_offset;
mb->end_subject = end_subject; mb->end_subject = end_subject;
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0; mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->poptions = re->overall_options; /* Pattern options */ mb->poptions = re->overall_options; /* Pattern options */
mb->ignore_skip_arg = 0; mb->ignore_skip_arg = 0;
mb->mark = mb->nomatch_mark = NULL; /* In case never set */ mb->mark = mb->nomatch_mark = NULL; /* In case never set */
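
The revised partial-match rule in SCHECK_PARTIAL() and in the \z and \Z cases can be illustrated with a small standalone model. This is a sketch, not PCRE2 code: the types `model_block`, `partial_mode` and `step_result` and the functions `model_scheck_partial()` and `model_eod()` are hypothetical names invented here, and plain offsets stand in for the real subject pointers. The numeric modes assume that 1 means soft and 2 means hard, which is what the macro's `mb->partial > 1` test suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; these are not PCRE2's real types. */
typedef enum { PARTIAL_NONE = 0, PARTIAL_SOFT = 1, PARTIAL_HARD = 2 } partial_mode;
typedef enum { STEP_CONTINUE, STEP_RETURN_PARTIAL } step_result;

typedef struct {
  size_t end_subject;    /* Offset one past the last subject character */
  size_t start_used;     /* Earliest offset inspected so far */
  partial_mode partial;  /* Requested kind of partial matching */
  bool has_lookbehind;   /* Pattern contains a lookbehind */
  bool hitend;           /* Set when the end of the subject was hit */
} model_block;

/* Models the revised SCHECK_PARTIAL(): the caller already knows that pos is
at the end of the subject. The end counts as "hit" if something before pos was
inspected or if the pattern has a lookbehind. A hard request stops at once. */
step_result model_scheck_partial(model_block *mb, size_t pos)
{
  assert(mb != NULL);
  assert(mb->start_used <= pos && pos == mb->end_subject);
  if (mb->partial != PARTIAL_NONE &&
      (pos > mb->start_used || mb->has_lookbehind))
    {
    mb->hitend = true;
    if (mb->partial == PARTIAL_HARD) return STEP_RETURN_PARTIAL;
    }
  return STEP_CONTINUE;
}

/* Models the revised handling of \z and \Z once the end-of-subject test has
passed: the inspected-character and lookbehind conditions no longer apply. */
step_result model_eod(model_block *mb)
{
  assert(mb != NULL);
  if (mb->partial != PARTIAL_NONE)
    {
    mb->hitend = true;
    if (mb->partial == PARTIAL_HARD) return STEP_RETURN_PARTIAL;
    }
  return STEP_CONTINUE;
}
```

In this model, a hard request at an end position equal to the earliest inspected offset gives a partial result only when the lookbehind flag is set, which roughly corresponds to the `ab\=ph` case for /c*+(?<=[bc])/ in the test data. The \z model gives a partial result for any hard request, in line with the `abc\=ph` expectation for /\z/, while a soft request only records that the end was hit and carries on.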

testdata/testinput2 vendored

@ -5690,10 +5690,33 @@ a)"xI
# ---- # ----
/(?<=(?=.(?<=x)))/
ab\=ph
# Expect error (recursion => not fixed length) # Expect error (recursion => not fixed length)
/(\2)((?=(?<=\1)))/ /(\2)((?=(?<=\1)))/
/c*+(?<=[bc])/
abc\=ph,no_jit
ab\=ph,no_jit
abc\=ps,no_jit
ab\=ps,no_jit
/c++(?<=[bc])/
abc\=ph,no_jit
ab\=ph,no_jit
/(?<=(?=.(?<=x)))/
abx
ab\=ph,no_jit
bxyz
xyz
/\z/
abc\=ph,no_jit
abc\=ps
/\Z/
abc\=ph,no_jit
abc\=ps
abc\n\=ph,no_jit
abc\n\=ps
# End of testinput2 # End of testinput2

testdata/testinput6 vendored

@ -4994,4 +4994,30 @@
ab\=ps ab\=ps
abcx abcx
/\z/
abc\=ph
abc\=ps
/\Z/
abc\=ph
abc\=ps
abc\n\=ph
abc\n\=ps
/c*+(?<=[bc])/
abc\=ph
ab\=ph
abc\=ps
ab\=ps
/c++(?<=[bc])/
abc\=ph
ab\=ph
/(?<=(?=.(?<=x)))/
abx
ab\=ph
bxyz
xyz
# End of testinput6 # End of testinput6

testdata/testoutput2 vendored

@ -17185,14 +17185,52 @@ Subject length lower bound = 1
# ---- # ----
/(?<=(?=.(?<=x)))/
ab\=ph
No match
# Expect error (recursion => not fixed length) # Expect error (recursion => not fixed length)
/(\2)((?=(?<=\1)))/ /(\2)((?=(?<=\1)))/
Failed: error 125 at offset 8: lookbehind assertion is not fixed length Failed: error 125 at offset 8: lookbehind assertion is not fixed length
/c*+(?<=[bc])/
abc\=ph,no_jit
Partial match: c
ab\=ph,no_jit
Partial match:
abc\=ps,no_jit
0: c
ab\=ps,no_jit
0:
/c++(?<=[bc])/
abc\=ph,no_jit
Partial match: c
ab\=ph,no_jit
Partial match:
/(?<=(?=.(?<=x)))/
abx
0:
ab\=ph,no_jit
Partial match:
bxyz
0:
xyz
0:
/\z/
abc\=ph,no_jit
Partial match:
abc\=ps
0:
/\Z/
abc\=ph,no_jit
Partial match:
abc\=ps
0:
abc\n\=ph,no_jit
Partial match: \x0a
abc\n\=ps
0:
# End of testinput2 # End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number) Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data Error -62: bad serialized data

testdata/testoutput6 vendored

@ -7845,4 +7845,46 @@ Partial match: ab
abcx abcx
0: abcx 0: abcx
/\z/
abc\=ph
Partial match:
abc\=ps
0:
/\Z/
abc\=ph
Partial match:
abc\=ps
0:
abc\n\=ph
Partial match: \x0a
abc\n\=ps
0:
/c*+(?<=[bc])/
abc\=ph
Partial match: c
ab\=ph
Partial match:
abc\=ps
0: c
ab\=ps
0:
/c++(?<=[bc])/
abc\=ph
Partial match: c
ab\=ph
Partial match:
/(?<=(?=.(?<=x)))/
abx
0:
ab\=ph
Partial match:
bxyz
0:
xyz
0:
# End of testinput6 # End of testinput6