Update definition of partial match and fix \z and \Z (as documented).

Philip.Hazel 2019-07-21 16:48:13 +00:00
parent 344056baf8
commit c84a06c96e
13 changed files with 715 additions and 604 deletions


@ -97,6 +97,16 @@ within it, the nested lookbehind was not correctly processed. For example, if
20. Implemented pcre2_get_match_data_size().
21. Two alterations to partial matching (not yet done by JIT):
(a) The definition of a partial match is slightly changed: if a pattern
contains any lookbehinds, an empty partial match may be given, because this
is another situation where adding characters to the current subject can
lead to a full match. Example: /c*+(?<=[bc])/ with subject "ab".
(b) An empty string partial hard match can now be returned for \z and \Z,
because it is documented that they should not match at the end of the
subject when PCRE2_PARTIAL_HARD is set.
Version 10.33 16-April-2019
---------------------------


@ -2725,12 +2725,16 @@ Your program may crash or loop indefinitely or give wrong results.
</pre>
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. If this happens when
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
testing any remaining alternatives. Only if no complete match can be found is
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
match, but only if no complete match can be found.
subject characters to complete the match. In addition, either at least one
character must have been inspected or the pattern must contain a lookbehind.
</P>
<P>
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
is set, matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
caller is prepared to handle a partial match, but only if no complete match can
be found.
</P>
<P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
@ -3846,7 +3850,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 25 June 2019
Last updated: 20 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -45,7 +45,7 @@ as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very
long and is not all available at once.
long and is not all available at once, as discussed below.
</P>
<P>
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
@ -79,13 +79,18 @@ is also disabled for partial matching.
<P>
A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
subject string is reached successfully, but matching cannot continue because
more characters are needed. However, at least one character in the subject must
have been inspected. This character need not form part of the final matched
string; lookbehind assertions and the \K escape sequence provide ways of
inspecting characters before the start of a matched string. The requirement for
inspecting at least one character exists because an empty string can always be
matched; without such a restriction there would always be a partial match of an
empty string at the end of the subject.
more characters are needed, and in addition, either at least one character in
the subject has been inspected or the pattern contains a lookbehind. An
inspected character need not form part of the final matched string; lookbehind
assertions and the \K escape sequence provide ways of inspecting characters
before the start of a matched string.
</P>
<P>
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
</P>
<P>
When a partial match is returned, the first two elements in the ovector point
@ -104,7 +109,7 @@ characters.
</P>
<P>
What happens when a partial match is identified depends on which of the two
partial matching options are set.
partial matching options is set.
</P>
<br><b>
PCRE2_PARTIAL_SOFT WITH pcre2_match()
@ -128,12 +133,12 @@ the data that is returned. Consider this pattern:
<pre>
/123\w+X|dogY/
</pre>
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached during
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
identifying "123dog" as the first partial match that was found. (In this
example, there are two partial matches, because "dog" on its own partially
matches the second alternative.)
If this is matched against the subject string "abc123dog", both alternatives
fail to match, but the end of the subject is reached during matching, so
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
"123dog" as the first partial match that was found. (In this example, there are
two partial matches, because "dog" on its own partially matches the second
alternative.)
</P>
<br><b>
PCRE2_PARTIAL_HARD WITH pcre2_match()
@ -145,8 +150,8 @@ possible complete matches. This option is "hard" because it prefers an earlier
partial match over a later complete match. For this reason, the assumption is
made that the end of the supplied subject string may not be the true end of the
available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end
of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one
character in the subject has been inspected.
of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
characters have been inspected.
</P>
<br><b>
Comparing hard and soft partial matching
@ -346,44 +351,25 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
displays a partial match, it indicates the lookbehind characters with '&#60;'
characters:
characters if the "allusedtext" modifier is set:
<pre>
re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph
data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab
&#60;&#60;&#60;
</PRE>
</P>
<P>
3. The maximum lookbehind count is also important when the result of a partial
match attempt is "no match". In this case, the maximum lookbehind characters
from the end of the current segment must be retained at the start of the next
segment, in case the lookbehind is at the start of the pattern. Matching the
next segment must then start at the appropriate offset.
</P>
<P>
4. Because a partial match must always contain at least one character, what
might be considered a partial match of an empty string actually gives a "no
match" result. For example:
<pre>
re&#62; /c(?&#60;=abc)x/
data&#62; ab\=ps
No match
</pre>
If the next segment begins "cx", a match should be found, but this will only
happen if characters from the previous segment are retained. For this reason, a
"no match" result should be interpreted as "partial match of an empty string"
when the pattern contains lookbehinds.
However, the "allusedtext" modifier is not available for JIT matching, because
JIT matching does not maintain the first and last consulted characters.
</P>
<P>
5. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\b or \B. Another kind of difference may occur when there are multiple
matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result
is given only when there are no completed matches. This means that as soon as
the shortest match has been found, continuation to a new subject segment is no
3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string
when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
Boundaries" above describes an issue that arises if the pattern ends with \b
or \B. Another kind of difference may occur when there are multiple matching
possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
only when there are no completed matches. This means that as soon as the
shortest match has been found, continuation to a new subject segment is no
longer possible. Consider this <b>pcre2test</b> example:
<pre>
re&#62; /dog(sbody)?/
@ -418,7 +404,7 @@ multi-segment data. The example above then behaves differently:
data&#62; gsb\=ph,dfa,dfa_restart
Partial match: gsb
</pre>
6. Patterns that contain alternatives at the top level which do not all start
4. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
used. For example, consider this pattern:
<pre>
@ -463,7 +449,7 @@ Cambridge, England.
</P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 June 2019
Last updated: 21 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
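
The multi-segment hunk above works through one case by hand: a partial match starting at offset 5 with a maximum lookbehind of 3 means that everything before offset 2 can be discarded and the next match should start at offset 3. A minimal sketch of that arithmetic follows. The names `segment_plan` and `plan_next_segment` are hypothetical and are not part of the PCRE2 API, and the sketch assumes non-UTF matching, where one character is one code unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the PCRE2 API. */
typedef struct {
  size_t discard;      /* leading code units that can be dropped */
  size_t startoffset;  /* startoffset to use for the next match attempt */
} segment_plan;

/* partial_start is the first ovector element of the partial match;
   max_lookbehind is the pattern's maximum lookbehind count.
   Assumes one character per code unit (non-UTF mode). */
static segment_plan plan_next_segment(size_t partial_start,
                                      size_t max_lookbehind)
{
  segment_plan plan;

  /* Keep at most max_lookbehind code units before the partial match. */
  size_t retained =
    (partial_start < max_lookbehind) ? partial_start : max_lookbehind;

  plan.discard = partial_start - retained;
  plan.startoffset = retained;

  /* Discarded plus retained code units always add up to partial_start. */
  assert(plan.discard + plan.startoffset == partial_start);
  return plan;
}
```

For the "xx123ab" example, `plan_next_segment(5, 3)` gives a discard count of 2 and a start offset of 3, which agrees with the figures in the documentation text. When fewer characters than the lookbehind count precede the partial match, nothing is discarded.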


@ -2661,13 +2661,16 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
These options turn on the partial matching feature. A partial match oc-
curs if the end of the subject string is reached successfully, but
there are not enough subject characters to complete the match. If this
happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set,
matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that
the caller is prepared to handle a partial match, but only if no com-
plete match can be found.
there are not enough subject characters to complete the match. In addi-
tion, either at least one character must have been inspected or the
pattern must contain a lookbehind.
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR-
TIAL_HARD) is set, matching continues by testing any remaining alterna-
tives. Only if no complete match can be found is PCRE2_ERROR_PARTIAL
returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PAR-
TIAL_SOFT specifies that the caller is prepared to handle a partial
match, but only if no complete match can be found.
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
case, if a partial match is found, pcre2_match() immediately returns
@ -3702,7 +3705,7 @@ AUTHOR
REVISION
Last updated: 25 June 2019
Last updated: 20 July 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
@ -5665,7 +5668,7 @@ PARTIAL MATCHING IN PCRE2
feedback is likely to be a better user interface than a check that is
delayed until the entire string has been entered. Partial matching can
also be useful when the subject string is very long and is not all
available at once.
available at once, as discussed below.
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
PCRE2_PARTIAL_HARD options, which can be set when calling a matching
@ -5698,14 +5701,18 @@ PARTIAL MATCHING USING pcre2_match()
A partial match occurs during a call to pcre2_match() when the end of
the subject string is reached successfully, but matching cannot con-
tinue because more characters are needed. However, at least one charac-
ter in the subject must have been inspected. This character need not
form part of the final matched string; lookbehind assertions and the \K
escape sequence provide ways of inspecting characters before the start
of a matched string. The requirement for inspecting at least one char-
acter exists because an empty string can always be matched; without
such a restriction there would always be a partial match of an empty
string at the end of the subject.
tinue because more characters are needed, and in addition, either at
least one character in the subject has been inspected or the pattern
contains a lookbehind. An inspected character need not form part of the
final matched string; lookbehind assertions and the \K escape sequence
provide ways of inspecting characters before the start of a matched
string.
The two additional requirements define the cases where adding more
characters to the existing subject may complete the match. Without
these conditions there would be a partial match of an empty string at
the end of the subject for all unanchored patterns (and also for an-
chored patterns if the subject itself is empty).
When a partial match is returned, the first two elements in the ovector
point to the portion of the subject that was matched, but the values in
@ -5722,7 +5729,7 @@ PARTIAL MATCHING USING pcre2_match()
quent re-match with additional characters.
What happens when a partial match is identified depends on which of the
two partial matching options are set.
two partial matching options is set.
PCRE2_PARTIAL_SOFT WITH pcre2_match()
@ -5759,8 +5766,8 @@ PARTIAL MATCHING USING pcre2_match()
reason, the assumption is made that the end of the supplied subject
string may not be the true end of the available data, and so, if \z,
\Z, \b, \B, or $ are encountered at the end of the subject, the result
is PCRE2_ERROR_PARTIAL, provided that at least one character in the
subject has been inspected.
is PCRE2_ERROR_PARTIAL, whether or not any characters have been in-
spected.
Comparing hard and soft partial matching
@ -5963,43 +5970,25 @@ ISSUES WITH MULTI-SEGMENT MATCHING
mum lookbehind count is 3, so all characters before offset 2 can be
discarded. The value of startoffset for the next match should be 3.
When pcre2test displays a partial match, it indicates the lookbehind
characters with '<' characters:
characters with '<' characters if the "allusedtext" modifier is set:
re> "(?<=123)abc"
data> xx123ab\=ph
data> xx123ab\=ph,allusedtext
Partial match: 123ab
<<<
<<<
However, the "allusedtext" modifier is not available for JIT matching,
because JIT matching does not maintain the first and last consulted
characters.
3. The maximum lookbehind count is also important when the result of a
partial match attempt is "no match". In this case, the maximum lookbe-
hind characters from the end of the current segment must be retained at
the start of the next segment, in case the lookbehind is at the start
of the pattern. Matching the next segment must then start at the appro-
priate offset.
4. Because a partial match must always contain at least one character,
what might be considered a partial match of an empty string actually
gives a "no match" result. For example:
re> /c(?<=abc)x/
data> ab\=ps
No match
If the next segment begins "cx", a match should be found, but this will
only happen if characters from the previous segment are retained. For
this reason, a "no match" result should be interpreted as "partial
match of an empty string" when the pattern contains lookbehinds.
5. Matching a subject string that is split into multiple segments may
3. Matching a subject string that is split into multiple segments may
not always produce exactly the same result as matching over one single
long string, especially when PCRE2_PARTIAL_SOFT is used. The section
"Partial Matching and Word Boundaries" above describes an issue that
arises if the pattern ends with \b or \B. Another kind of difference
may occur when there are multiple matching possibilities, because (for
PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
no completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possi-
ble. Consider this pcre2test example:
long string when PCRE2_PARTIAL_SOFT is used. The section "Partial
Matching and Word Boundaries" above describes an issue that arises if
the pattern ends with \b or \B. Another kind of difference may occur
when there are multiple matching possibilities, because (for PCRE2_PAR-
TIAL_SOFT) a partial match result is given only when there are no com-
pleted matches. This means that as soon as the shortest match has been
found, continuation to a new subject segment is no longer possible.
Consider this pcre2test example:
re> /dog(sbody)?/
data> dogsb\=ps
@ -6034,7 +6023,7 @@ ISSUES WITH MULTI-SEGMENT MATCHING
data> gsb\=ph,dfa,dfa_restart
Partial match: gsb
6. Patterns that contain alternatives at the top level which do not all
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE2_DFA_RESTART is used. For example, consider this pattern:
@ -6079,7 +6068,7 @@ AUTHOR
REVISION
Last updated: 21 June 2019
Last updated: 21 July 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2API 3 "25 June 2019" "PCRE2 10.34"
.TH PCRE2API 3 "20 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -2719,12 +2719,15 @@ Your program may crash or loop indefinitely or give wrong results.
.sp
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. If this happens when
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
testing any remaining alternatives. Only if no complete match can be found is
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
match, but only if no complete match can be found.
subject characters to complete the match. In addition, either at least one
character must have been inspected or the pattern must contain a lookbehind.
.P
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
is set, matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
caller is prepared to handle a partial match, but only if no complete match can
be found.
.P
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, \fBpcre2_match()\fP immediately returns
@ -3859,6 +3862,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 25 June 2019
Last updated: 20 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "21 June 2019" "PCRE2 10.34"
.TH PCRE2PARTIAL 3 "21 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2"
@ -22,7 +22,7 @@ as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very
long and is not all available at once.
long and is not all available at once, as discussed below.
.P
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
@ -55,13 +55,17 @@ is also disabled for partial matching.
.sp
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
subject string is reached successfully, but matching cannot continue because
more characters are needed. However, at least one character in the subject must
have been inspected. This character need not form part of the final matched
string; lookbehind assertions and the \eK escape sequence provide ways of
inspecting characters before the start of a matched string. The requirement for
inspecting at least one character exists because an empty string can always be
matched; without such a restriction there would always be a partial match of an
empty string at the end of the subject.
more characters are needed, and in addition, either at least one character in
the subject has been inspected or the pattern contains a lookbehind. An
inspected character need not form part of the final matched string; lookbehind
assertions and the \eK escape sequence provide ways of inspecting characters
before the start of a matched string.
.P
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
.P
When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched, but the values in the rest of
@ -78,7 +82,7 @@ these characters are needed for a subsequent re-match with additional
characters.
.P
What happens when a partial match is identified depends on which of the two
partial matching options are set.
partial matching options is set.
.
.
.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
@ -100,12 +104,12 @@ the data that is returned. Consider this pattern:
.sp
/123\ew+X|dogY/
.sp
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached during
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
identifying "123dog" as the first partial match that was found. (In this
example, there are two partial matches, because "dog" on its own partially
matches the second alternative.)
If this is matched against the subject string "abc123dog", both alternatives
fail to match, but the end of the subject is reached during matching, so
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
"123dog" as the first partial match that was found. (In this example, there are
two partial matches, because "dog" on its own partially matches the second
alternative.)
.
.
.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
@ -117,8 +121,8 @@ possible complete matches. This option is "hard" because it prefers an earlier
partial match over a later complete match. For this reason, the assumption is
made that the end of the supplied subject string may not be the true end of the
available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one
character in the subject has been inspected.
of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
characters have been inspected.
.
.
.SS "Comparing hard and soft partial matching"
@ -319,40 +323,23 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<'
characters:
characters if the "allusedtext" modifier is set:
.sp
re> "(?<=123)abc"
data> xx123ab\e=ph
data> xx123ab\e=ph,allusedtext
Partial match: 123ab
<<<
However, the "allusedtext" modifier is not available for JIT matching, because
JIT matching does not maintain the first and last consulted characters.
.P
3. The maximum lookbehind count is also important when the result of a partial
match attempt is "no match". In this case, the maximum lookbehind characters
from the end of the current segment must be retained at the start of the next
segment, in case the lookbehind is at the start of the pattern. Matching the
next segment must then start at the appropriate offset.
.P
4. Because a partial match must always contain at least one character, what
might be considered a partial match of an empty string actually gives a "no
match" result. For example:
.sp
re> /c(?<=abc)x/
data> ab\e=ps
No match
.sp
If the next segment begins "cx", a match should be found, but this will only
happen if characters from the previous segment are retained. For this reason, a
"no match" result should be interpreted as "partial match of an empty string"
when the pattern contains lookbehinds.
.P
5. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\eb or \eB. Another kind of difference may occur when there are multiple
matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result
is given only when there are no completed matches. This means that as soon as
the shortest match has been found, continuation to a new subject segment is no
3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string
when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
Boundaries" above describes an issue that arises if the pattern ends with \eb
or \eB. Another kind of difference may occur when there are multiple matching
possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
only when there are no completed matches. This means that as soon as the
shortest match has been found, continuation to a new subject segment is no
longer possible. Consider this \fBpcre2test\fP example:
.sp
re> /dog(sbody)?/
@ -386,7 +373,7 @@ multi-segment data. The example above then behaves differently:
data> gsb\e=ph,dfa,dfa_restart
Partial match: gsb
.sp
6. Patterns that contain alternatives at the top level which do not all start
4. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
used. For example, consider this pattern:
.sp
@ -435,6 +422,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 21 June 2019
Last updated: 21 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -966,7 +966,7 @@ for (;;)
if (ptr >= end_subject)
{
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
could_continue = TRUE;
return PCRE2_ERROR_PARTIAL;
else { ADD_ACTIVE(state_offset + 1, 0); }
}
break;
@ -1015,10 +1015,12 @@ for (;;)
/*-----------------------------------------------------------------*/
case OP_EODN:
if (clen == 0 && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
could_continue = TRUE;
else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - mb->nllen))
{ ADD_ACTIVE(state_offset + 1, 0); }
if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - mb->nllen))
{
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
return PCRE2_ERROR_PARTIAL;
ADD_ACTIVE(state_offset + 1, 0);
}
break;
/*-----------------------------------------------------------------*/
@ -3181,9 +3183,12 @@ for (;;)
partial_newline || /* Either partial NL */
( /* or ... */
ptr >= end_subject && /* End of subject and */
ptr > mb->start_used_ptr) /* Inspected non-empty string */
( /* either */
ptr > mb->start_used_ptr || /* Inspected non-empty string */
mb->haslookbehind /* or pattern has lookbehind */
)
)
))
match_count = PCRE2_ERROR_PARTIAL;
break; /* Exit from loop along the subject string */
}
@ -3412,6 +3417,7 @@ mb->tables = re->tables;
mb->start_subject = subject;
mb->end_subject = end_subject;
mb->start_offset = start_offset;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->moptions = options;
mb->poptions = re->overall_options;
mb->match_call_count = 0;


@ -854,6 +854,7 @@ typedef struct match_block {
uint32_t match_call_count; /* Number of times a new frame is created */
BOOL hitend; /* Hit the end of the subject at some point */
BOOL hasthen; /* Pattern contains (*THEN) */
BOOL haslookbehind; /* Pattern contains significant lookbehind */
const uint8_t *lcc; /* Points to lower casing table */
const uint8_t *fcc; /* Points to case-flipping table */
const uint8_t *ctypes; /* Points to table of type maps */
@ -909,6 +910,7 @@ typedef struct dfa_match_block {
uint32_t poptions; /* Pattern options */
uint32_t nltype; /* Newline type */
uint32_t nllen; /* Newline string length */
BOOL haslookbehind; /* Pattern contains significant lookbehind */
PCRE2_UCHAR nl[4]; /* Newline string when fixed */
uint16_t bsr_convention; /* \R interpretation */
pcre2_callout_block *cb; /* Points to a callout block */


@ -416,7 +416,6 @@ if (caseless)
#endif
/* Not in UTF mode */
{
for (; length > 0; length--)
{
@ -491,11 +490,16 @@ heap is used for a larger vector.
*************************************************/
/* These macros pack up tests that are used for partial matching several times
in the code. We set the "hit end" flag if the pointer is at the end of the
subject and also past the earliest inspected character (i.e. something has been
matched, even if not part of the actual matched string). For hard partial
matching, we then return immediately. The second one is used when we already
know we are past the end of the subject. */
in the code. The second one is used when we already know we are past the end of
the subject. We set the "hit end" flag if the pointer is at the end of the
subject and either (a) the pointer is past the earliest inspected character
(i.e. something has been matched, even if not part of the actual matched
string), or (b) the pattern contains a lookbehind. These are the conditions for
which adding more characters may allow the current match to continue.
For hard partial matching, we immediately return a partial match. Otherwise,
carrying on means that a complete match on the current subject will be sought.
A partial match is returned only if no complete match can be found. */
#define CHECK_PARTIAL()\
if (Feptr >= mb->end_subject) \
@ -503,31 +507,13 @@ know we are past the end of the subject. */
SCHECK_PARTIAL(); \
}
/* Original version that allows hard partial to continue if no inspected
characters. */
#define SCHECK_PARTIAL()\
if (mb->partial != 0 && Feptr > mb->start_used_ptr) \
if (mb->partial != 0 && (Feptr > mb->start_used_ptr || mb->haslookbehind)) \
{ \
mb->hitend = TRUE; \
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
}
/* Experimental version that makes hard partial give no match instead of
continuing if no characters have been inspected. */
#ifdef NEVERNEVER
#define SCHECK_PARTIAL()\
if (mb->partial != 0) \
{ \
if (Feptr > mb->start_used_ptr) \
{ \
mb->hitend = TRUE; \
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
} \
else if (mb->partial > 1) RRETURN(MATCH_NOMATCH); \
}
#endif /* NEVERNEVER */
/* These macros are used to implement backtracking. They simulate a recursive
call to the match() function by means of a local vector of frames which
@ -5670,7 +5656,11 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_EOD:
if (Feptr < mb->end_subject) RRETURN(MATCH_NOMATCH);
SCHECK_PARTIAL();
if (mb->partial != 0)
{
mb->hitend = TRUE;
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL;
}
Fecode++;
break;
@ -5695,7 +5685,11 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
/* Either at end of string or \n before end. */
SCHECK_PARTIAL();
if (mb->partial != 0)
{
mb->hitend = TRUE;
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL;
}
Fecode++;
break;
@ -6457,6 +6451,7 @@ mb->start_subject = subject;
mb->start_offset = start_offset;
mb->end_subject = end_subject;
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->poptions = re->overall_options; /* Pattern options */
mb->ignore_skip_arg = 0;
mb->mark = mb->nomatch_mark = NULL; /* In case never set */
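
The decision logic changed in this file can be summarised outside the matcher. The following is a simplified model, not code from PCRE2: the enum and function names are hypothetical, and positions are plain offsets rather than pointers. It assumes the caller has already established that the current position is at the end of the subject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are not PCRE2 identifiers. */
typedef enum { PARTIAL_OFF = 0, PARTIAL_SOFT = 1, PARTIAL_HARD = 2 } partial_mode;
typedef enum {
  KEEP_MATCHING,          /* no partial match is recorded */
  KEEP_MATCHING_HIT_END,  /* "hit end" is noted; a full match is still sought */
  RETURN_PARTIAL          /* hard partial: give a partial match immediately */
} partial_action;

/* Models the revised SCHECK_PARTIAL() test for a position that is already
   at the end of the subject. first_inspected is the offset of the earliest
   inspected character. */
static partial_action check_partial_at_end(partial_mode mode, size_t pos,
                                           size_t first_inspected,
                                           bool has_lookbehind)
{
  /* Precondition of this model only. */
  assert(first_inspected <= pos);

  if (mode == PARTIAL_OFF) return KEEP_MATCHING;
  if (pos > first_inspected || has_lookbehind)
    return (mode == PARTIAL_HARD) ? RETURN_PARTIAL : KEEP_MATCHING_HIT_END;
  return KEEP_MATCHING;
}

/* Models the new handling of \z and \Z at the end of the subject: the
   inspected-character test is not applied. */
static partial_action check_partial_at_eod(partial_mode mode)
{
  if (mode == PARTIAL_OFF) return KEEP_MATCHING;
  return (mode == PARTIAL_HARD) ? RETURN_PARTIAL : KEEP_MATCHING_HIT_END;
}
```

In this model, a hard partial match with no inspected characters is reported only when the pattern has a lookbehind, whereas \z and \Z report it unconditionally. This is in line with the "Partial match:" results expected for /c*+(?<=[bc])/ and /\z/ in the test data of this commit.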

29
testdata/testinput2 vendored

@ -5690,10 +5690,33 @@ a)"xI
# ----
/(?<=(?=.(?<=x)))/
ab\=ph
# Expect error (recursion => not fixed length)
/(\2)((?=(?<=\1)))/
/c*+(?<=[bc])/
abc\=ph,no_jit
ab\=ph,no_jit
abc\=ps,no_jit
ab\=ps,no_jit
/c++(?<=[bc])/
abc\=ph,no_jit
ab\=ph,no_jit
/(?<=(?=.(?<=x)))/
abx
ab\=ph,no_jit
bxyz
xyz
/\z/
abc\=ph,no_jit
abc\=ps
/\Z/
abc\=ph,no_jit
abc\=ps
abc\n\=ph,no_jit
abc\n\=ps
# End of testinput2

26
testdata/testinput6 vendored

@ -4994,4 +4994,30 @@
ab\=ps
abcx
/\z/
abc\=ph
abc\=ps
/\Z/
abc\=ph
abc\=ps
abc\n\=ph
abc\n\=ps
/c*+(?<=[bc])/
abc\=ph
ab\=ph
abc\=ps
ab\=ps
/c++(?<=[bc])/
abc\=ph
ab\=ph
/(?<=(?=.(?<=x)))/
abx
ab\=ph
bxyz
xyz
# End of testinput6

46
testdata/testoutput2 vendored

@ -17185,14 +17185,52 @@ Subject length lower bound = 1
# ----
/(?<=(?=.(?<=x)))/
ab\=ph
No match
# Expect error (recursion => not fixed length)
/(\2)((?=(?<=\1)))/
Failed: error 125 at offset 8: lookbehind assertion is not fixed length
/c*+(?<=[bc])/
abc\=ph,no_jit
Partial match: c
ab\=ph,no_jit
Partial match:
abc\=ps,no_jit
0: c
ab\=ps,no_jit
0:
/c++(?<=[bc])/
abc\=ph,no_jit
Partial match: c
ab\=ph,no_jit
Partial match:
/(?<=(?=.(?<=x)))/
abx
0:
ab\=ph,no_jit
Partial match:
bxyz
0:
xyz
0:
/\z/
abc\=ph,no_jit
Partial match:
abc\=ps
0:
/\Z/
abc\=ph,no_jit
Partial match:
abc\=ps
0:
abc\n\=ph,no_jit
Partial match: \x0a
abc\n\=ps
0:
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data

42
testdata/testoutput6 vendored

@ -7845,4 +7845,46 @@ Partial match: ab
abcx
0: abcx
/\z/
abc\=ph
Partial match:
abc\=ps
0:
/\Z/
abc\=ph
Partial match:
abc\=ps
0:
abc\n\=ph
Partial match: \x0a
abc\n\=ps
0:
/c*+(?<=[bc])/
abc\=ph
Partial match: c
ab\=ph
Partial match:
abc\=ps
0: c
ab\=ps
0:
/c++(?<=[bc])/
abc\=ph
Partial match: c
ab\=ph
Partial match:
/(?<=(?=.(?<=x)))/
abx
0:
ab\=ph
Partial match:
bxyz
0:
xyz
0:
# End of testinput6