Update definition of partial match and fix \z and \Z (as documented).

Philip.Hazel 2019-07-21 16:48:13 +00:00
parent 344056baf8
commit c84a06c96e
13 changed files with 715 additions and 604 deletions


@ -97,6 +97,16 @@ within it, the nested lookbehind was not correctly processed. For example, if
 20. Implemented pcre2_get_match_data_size().
+21. Two alterations to partial matching (not yet done by JIT):
+    (a) The definition of a partial match is slightly changed: if a pattern
+    contains any lookbehinds, an empty partial match may be given, because this
+    is another situation where adding characters to the current subject can
+    lead to a full match. Example: /c*+(?<=[bc])/ with subject "ab".
+    (b) An empty-string partial hard match can now be returned for \z and \Z,
+    because it is documented that they should not match at the end of the
+    subject when PCRE2_PARTIAL_HARD is set.
 Version 10.33 16-April-2019
 ---------------------------


@ -2725,12 +2725,16 @@ Your program may crash or loop indefinitely or give wrong results.
</pre> </pre>
These options turn on the partial matching feature. A partial match occurs if These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. If this happens when subject characters to complete the match. In addition, either at least one
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by character must have been inspected or the pattern must contain a lookbehind.
testing any remaining alternatives. Only if no complete match can be found is </P>
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, <P>
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
match, but only if no complete match can be found. is set, matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
caller is prepared to handle a partial match, but only if no complete match can
be found.
</P> </P>
<P> <P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
@ -3846,7 +3850,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 25 June 2019 Last updated: 20 July 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -45,7 +45,7 @@ as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very entered. Partial matching can also be useful when the subject string is very
long and is not all available at once. long and is not all available at once, as discussed below.
</P> </P>
<P> <P>
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
@ -79,13 +79,18 @@ is also disabled for partial matching.
<P> <P>
A partial match occurs during a call to <b>pcre2_match()</b> when the end of the A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
subject string is reached successfully, but matching cannot continue because subject string is reached successfully, but matching cannot continue because
more characters are needed. However, at least one character in the subject must more characters are needed, and in addition, either at least one character in
have been inspected. This character need not form part of the final matched the subject has been inspected or the pattern contains a lookbehind. An
string; lookbehind assertions and the \K escape sequence provide ways of inspected character need not form part of the final matched string; lookbehind
inspecting characters before the start of a matched string. The requirement for assertions and the \K escape sequence provide ways of inspecting characters
inspecting at least one character exists because an empty string can always be before the start of a matched string.
matched; without such a restriction there would always be a partial match of an </P>
empty string at the end of the subject. <P>
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
</P> </P>
<P> <P>
When a partial match is returned, the first two elements in the ovector point When a partial match is returned, the first two elements in the ovector point
@ -104,7 +109,7 @@ characters.
</P> </P>
<P> <P>
What happens when a partial match is identified depends on which of the two What happens when a partial match is identified depends on which of the two
partial matching options are set. partial matching options is set.
</P> </P>
<br><b> <br><b>
PCRE2_PARTIAL_SOFT WITH pcre2_match() PCRE2_PARTIAL_SOFT WITH pcre2_match()
@ -128,12 +133,12 @@ the data that is returned. Consider this pattern:
<pre> <pre>
/123\w+X|dogY/ /123\w+X|dogY/
</pre> </pre>
If this is matched against the subject string "abc123dog", both If this is matched against the subject string "abc123dog", both alternatives
alternatives fail to match, but the end of the subject is reached during fail to match, but the end of the subject is reached during matching, so
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
identifying "123dog" as the first partial match that was found. (In this "123dog" as the first partial match that was found. (In this example, there are
example, there are two partial matches, because "dog" on its own partially two partial matches, because "dog" on its own partially matches the second
matches the second alternative.) alternative.)
</P> </P>
<br><b> <br><b>
PCRE2_PARTIAL_HARD WITH pcre2_match() PCRE2_PARTIAL_HARD WITH pcre2_match()
@ -145,8 +150,8 @@ possible complete matches. This option is "hard" because it prefers an earlier
partial match over a later complete match. For this reason, the assumption is partial match over a later complete match. For this reason, the assumption is
made that the end of the supplied subject string may not be the true end of the made that the end of the supplied subject string may not be the true end of the
available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end
of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
character in the subject has been inspected. characters have been inspected.
</P> </P>
<br><b> <br><b>
Comparing hard and soft partial matching Comparing hard and soft partial matching
@ -346,44 +351,25 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b> value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
displays a partial match, it indicates the lookbehind characters with '&#60;' displays a partial match, it indicates the lookbehind characters with '&#60;'
characters: characters if the "allusedtext" modifier is set:
<pre> <pre>
re&#62; "(?&#60;=123)abc" re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab Partial match: 123ab
&#60;&#60;&#60; &#60;&#60;&#60;
</PRE>
</P>
<P>
3. The maximum lookbehind count is also important when the result of a partial
match attempt is "no match". In this case, the maximum lookbehind characters
from the end of the current segment must be retained at the start of the next
segment, in case the lookbehind is at the start of the pattern. Matching the
next segment must then start at the appropriate offset.
</P>
<P>
4. Because a partial match must always contain at least one character, what
might be considered a partial match of an empty string actually gives a "no
match" result. For example:
<pre>
re&#62; /c(?&#60;=abc)x/
data&#62; ab\=ps
No match
</pre> </pre>
If the next segment begins "cx", a match should be found, but this will only However, the "allusedtext" modifier is not available for JIT matching, because
happen if characters from the previous segment are retained. For this reason, a JIT matching does not maintain the first and last consulted characters.
"no match" result should be interpreted as "partial match of an empty string"
when the pattern contains lookbehinds.
</P> </P>
<P> <P>
5. Matching a subject string that is split into multiple segments may not 3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string, always produce exactly the same result as matching over one single long string
especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
Word Boundaries" above describes an issue that arises if the pattern ends with Boundaries" above describes an issue that arises if the pattern ends with \b
\b or \B. Another kind of difference may occur when there are multiple or \B. Another kind of difference may occur when there are multiple matching
matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
is given only when there are no completed matches. This means that as soon as only when there are no completed matches. This means that as soon as the
the shortest match has been found, continuation to a new subject segment is no shortest match has been found, continuation to a new subject segment is no
longer possible. Consider this <b>pcre2test</b> example: longer possible. Consider this <b>pcre2test</b> example:
<pre> <pre>
re&#62; /dog(sbody)?/ re&#62; /dog(sbody)?/
@ -418,7 +404,7 @@ multi-segment data. The example above then behaves differently:
data&#62; gsb\=ph,dfa,dfa_restart data&#62; gsb\=ph,dfa,dfa_restart
Partial match: gsb Partial match: gsb
</pre> </pre>
6. Patterns that contain alternatives at the top level which do not all start 4. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
used. For example, consider this pattern: used. For example, consider this pattern:
<pre> <pre>
@ -463,7 +449,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br> <br><a name="SEC10" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 21 June 2019 Last updated: 21 July 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

File diff suppressed because it is too large.


@ -1,4 +1,4 @@
.TH PCRE2API 3 "25 June 2019" "PCRE2 10.34" .TH PCRE2API 3 "20 July 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -2719,12 +2719,15 @@ Your program may crash or loop indefinitely or give wrong results.
.sp .sp
These options turn on the partial matching feature. A partial match occurs if These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. If this happens when subject characters to complete the match. In addition, either at least one
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by character must have been inspected or the pattern must contain a lookbehind.
testing any remaining alternatives. Only if no complete match can be found is .P
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial is set, matching continues by testing any remaining alternatives. Only if no
match, but only if no complete match can be found. complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
caller is prepared to handle a partial match, but only if no complete match can
be found.
.P .P
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, \fBpcre2_match()\fP immediately returns a partial match is found, \fBpcre2_match()\fP immediately returns
@ -3859,6 +3862,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 25 June 2019 Last updated: 20 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "21 June 2019" "PCRE2 10.34" .TH PCRE2PARTIAL 3 "21 July 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2" .SH "PARTIAL MATCHING IN PCRE2"
@ -22,7 +22,7 @@ as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very entered. Partial matching can also be useful when the subject string is very
long and is not all available at once. long and is not all available at once, as discussed below.
.P .P
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function. PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
@ -55,13 +55,17 @@ is also disabled for partial matching.
.sp .sp
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
subject string is reached successfully, but matching cannot continue because subject string is reached successfully, but matching cannot continue because
more characters are needed. However, at least one character in the subject must more characters are needed, and in addition, either at least one character in
have been inspected. This character need not form part of the final matched the subject has been inspected or the pattern contains a lookbehind. An
string; lookbehind assertions and the \eK escape sequence provide ways of inspected character need not form part of the final matched string; lookbehind
inspecting characters before the start of a matched string. The requirement for assertions and the \eK escape sequence provide ways of inspecting characters
inspecting at least one character exists because an empty string can always be before the start of a matched string.
matched; without such a restriction there would always be a partial match of an .P
empty string at the end of the subject. The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
.P .P
When a partial match is returned, the first two elements in the ovector point When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched, but the values in the rest of to the portion of the subject that was matched, but the values in the rest of
@ -78,7 +82,7 @@ these characters are needed for a subsequent re-match with additional
characters. characters.
.P .P
What happens when a partial match is identified depends on which of the two What happens when a partial match is identified depends on which of the two
partial matching options are set. partial matching options is set.
. .
. .
.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()" .SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
@ -100,12 +104,12 @@ the data that is returned. Consider this pattern:
.sp .sp
/123\ew+X|dogY/ /123\ew+X|dogY/
.sp .sp
If this is matched against the subject string "abc123dog", both If this is matched against the subject string "abc123dog", both alternatives
alternatives fail to match, but the end of the subject is reached during fail to match, but the end of the subject is reached during matching, so
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
identifying "123dog" as the first partial match that was found. (In this "123dog" as the first partial match that was found. (In this example, there are
example, there are two partial matches, because "dog" on its own partially two partial matches, because "dog" on its own partially matches the second
matches the second alternative.) alternative.)
. .
. .
.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()" .SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
@ -117,8 +121,8 @@ possible complete matches. This option is "hard" because it prefers an earlier
partial match over a later complete match. For this reason, the assumption is partial match over a later complete match. For this reason, the assumption is
made that the end of the supplied subject string may not be the true end of the made that the end of the supplied subject string may not be the true end of the
available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
character in the subject has been inspected. characters have been inspected.
. .
. .
.SS "Comparing hard and soft partial matching" .SS "Comparing hard and soft partial matching"
@ -319,40 +323,23 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<' displays a partial match, it indicates the lookbehind characters with '<'
characters: characters if the "allusedtext" modifier is set:
.sp .sp
re> "(?<=123)abc" re> "(?<=123)abc"
data> xx123ab\e=ph data> xx123ab\e=ph,allusedtext
Partial match: 123ab Partial match: 123ab
<<< <<<
However, the "allusedtext" modifier is not available for JIT matching, because
JIT matching does not maintain the first and last consulted characters.
.P .P
3. The maximum lookbehind count is also important when the result of a partial 3. Matching a subject string that is split into multiple segments may not
match attempt is "no match". In this case, the maximum lookbehind characters always produce exactly the same result as matching over one single long string
from the end of the current segment must be retained at the start of the next when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
segment, in case the lookbehind is at the start of the pattern. Matching the Boundaries" above describes an issue that arises if the pattern ends with \eb
next segment must then start at the appropriate offset. or \eB. Another kind of difference may occur when there are multiple matching
.P possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
4. Because a partial match must always contain at least one character, what only when there are no completed matches. This means that as soon as the
might be considered a partial match of an empty string actually gives a "no shortest match has been found, continuation to a new subject segment is no
match" result. For example:
.sp
re> /c(?<=abc)x/
data> ab\e=ps
No match
.sp
If the next segment begins "cx", a match should be found, but this will only
happen if characters from the previous segment are retained. For this reason, a
"no match" result should be interpreted as "partial match of an empty string"
when the pattern contains lookbehinds.
.P
5. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\eb or \eB. Another kind of difference may occur when there are multiple
matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result
is given only when there are no completed matches. This means that as soon as
the shortest match has been found, continuation to a new subject segment is no
longer possible. Consider this \fBpcre2test\fP example: longer possible. Consider this \fBpcre2test\fP example:
.sp .sp
re> /dog(sbody)?/ re> /dog(sbody)?/
@ -386,7 +373,7 @@ multi-segment data. The example above then behaves differently:
data> gsb\e=ph,dfa,dfa_restart data> gsb\e=ph,dfa,dfa_restart
Partial match: gsb Partial match: gsb
.sp .sp
6. Patterns that contain alternatives at the top level which do not all start 4. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
used. For example, consider this pattern: used. For example, consider this pattern:
.sp .sp
@ -435,6 +422,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 21 June 2019 Last updated: 21 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi
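
The segment arithmetic in the pcre2partial text above (a partial match at offsets 5 and 7 with a maximum lookbehind count of 3, so that everything before offset 2 can be discarded and the next startoffset is 3) can be sketched as follows. This is only an illustration: the struct and function names are hypothetical and not part of PCRE2, and the sketch assumes one code unit per character, so it ignores UTF modes.

```c
#include <stddef.h>

/* Hypothetical helper type, not part of the PCRE2 API. */
typedef struct {
  size_t keep_from;     /* first offset of the old segment that must be kept */
  size_t start_offset;  /* startoffset for the next match, relative to the
                           retained data */
} sketch_retain;

/* partial_start is ovector[0] from the partial match; max_lookbehind is the
   pattern's maximum lookbehind count. Assumes 1 code unit == 1 character. */
static sketch_retain sketch_retain_after_partial(size_t partial_start,
                                                 size_t max_lookbehind)
{
  sketch_retain r;
  /* Keep the lookbehind characters that precede the partial match, but never
     reach back past the start of the segment. */
  r.keep_from = (partial_start > max_lookbehind)
                  ? partial_start - max_lookbehind : 0;
  /* After discarding [0, keep_from), the partial match starts here. */
  r.start_offset = partial_start - r.keep_from;
  return r;
}
```

For the "xx123ab" example, calling sketch_retain_after_partial(5, 3) gives keep_from 2 and start_offset 3, which agrees with the documented values.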


@ -966,7 +966,7 @@ for (;;)
       if (ptr >= end_subject)
         {
         if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
-          could_continue = TRUE;
+          return PCRE2_ERROR_PARTIAL;
         else { ADD_ACTIVE(state_offset + 1, 0); }
         }
       break;
@ -1015,10 +1015,12 @@ for (;;)
       /*-----------------------------------------------------------------*/
       case OP_EODN:
-      if (clen == 0 && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
-        could_continue = TRUE;
-      else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - mb->nllen))
-        { ADD_ACTIVE(state_offset + 1, 0); }
+      if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - mb->nllen))
+        {
+        if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
+          return PCRE2_ERROR_PARTIAL;
+        ADD_ACTIVE(state_offset + 1, 0);
+        }
       break;
       /*-----------------------------------------------------------------*/
@ -3175,15 +3177,18 @@ for (;;)
         (mb->moptions & PCRE2_PARTIAL_HARD) != 0      /* Hard partial */
         ||                                           /* or... */
         ((mb->moptions & PCRE2_PARTIAL_SOFT) != 0 &&  /* Soft partial and */
          match_count < 0)                             /* no matches */
         ) &&                                         /* And... */
         (
         partial_newline ||                   /* Either partial NL */
           (                                  /* or ... */
           ptr >= end_subject &&              /* End of subject and */
-          ptr > mb->start_used_ptr)          /* Inspected non-empty string */
+            (                                /* either */
+            ptr > mb->start_used_ptr ||      /* Inspected non-empty string */
+            mb->haslookbehind                /* or pattern has lookbehind */
+            )
           )
-        )
+        ))
       match_count = PCRE2_ERROR_PARTIAL;
     break;  /* Exit from loop along the subject string */
     }
@ -3412,6 +3417,7 @@ mb->tables = re->tables;
 mb->start_subject = subject;
 mb->end_subject = end_subject;
 mb->start_offset = start_offset;
+mb->haslookbehind = (re->max_lookbehind > 0);
 mb->moptions = options;
 mb->poptions = re->overall_options;
 mb->match_call_count = 0;
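
As a reading aid, the part of the DFA end-of-loop condition visible in the hunk above can be modelled as a standalone predicate. This is a minimal sketch, not the library's code: the function name and the option constants are hypothetical stand-ins, pointers are replaced by offsets, and any terms of the real condition that lie outside the hunk are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCRE2 option bits; the values are arbitrary. */
enum { SKETCH_PARTIAL_SOFT = 1, SKETCH_PARTIAL_HARD = 2 };

/* Returns true when the visible part of the condition would set
   match_count = PCRE2_ERROR_PARTIAL. ptr, end and start_used are offsets
   into the subject rather than pointers. */
static bool sketch_dfa_is_partial(unsigned options, int match_count,
                                  bool partial_newline, size_t ptr,
                                  size_t end, size_t start_used,
                                  bool has_lookbehind)
{
  /* Sketch assumption: the earliest inspected offset never lies beyond the
     current position. */
  assert(start_used <= ptr);

  /* Hard partial, or soft partial with no complete match so far. */
  bool wanted = (options & SKETCH_PARTIAL_HARD) != 0 ||
                ((options & SKETCH_PARTIAL_SOFT) != 0 && match_count < 0);

  /* Partial newline, or end of subject with either an inspected character
     or a lookbehind in the pattern (the term added by this commit). */
  bool reached = partial_newline ||
                 (ptr >= end && (ptr > start_used || has_lookbehind));

  return wanted && reached;
}
```

With hard partial matching and nothing inspected at the end of the subject (ptr == end == start_used), the predicate is false without a lookbehind and true with one, which is the behavioural change described in the ChangeLog entry.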


@ -854,6 +854,7 @@ typedef struct match_block {
   uint32_t match_call_count;      /* Number of times a new frame is created */
   BOOL hitend;                    /* Hit the end of the subject at some point */
   BOOL hasthen;                   /* Pattern contains (*THEN) */
+  BOOL haslookbehind;             /* Pattern contains significant lookbehind */
   const uint8_t *lcc;             /* Points to lower casing table */
   const uint8_t *fcc;             /* Points to case-flipping table */
   const uint8_t *ctypes;          /* Points to table of type maps */
@ -909,6 +910,7 @@ typedef struct dfa_match_block {
   uint32_t poptions;              /* Pattern options */
   uint32_t nltype;                /* Newline type */
   uint32_t nllen;                 /* Newline string length */
+  BOOL haslookbehind;             /* Pattern contains significant lookbehind */
   PCRE2_UCHAR nl[4];              /* Newline string when fixed */
   uint16_t bsr_convention;        /* \R interpretation */
   pcre2_callout_block *cb;        /* Points to a callout block */


@ -415,8 +415,7 @@ if (caseless)
else else
#endif #endif
/* Not in UTF mode */ /* Not in UTF mode */
{ {
for (; length > 0; length--) for (; length > 0; length--)
{ {
@ -491,11 +490,16 @@ heap is used for a larger vector.
 *************************************************/
 /* These macros pack up tests that are used for partial matching several times
-in the code. We set the "hit end" flag if the pointer is at the end of the
-subject and also past the earliest inspected character (i.e. something has been
-matched, even if not part of the actual matched string). For hard partial
-matching, we then return immediately. The second one is used when we already
-know we are past the end of the subject. */
+in the code. The second one is used when we already know we are past the end of
+the subject. We set the "hit end" flag if the pointer is at the end of the
+subject and either (a) the pointer is past the earliest inspected character
+(i.e. something has been matched, even if not part of the actual matched
+string), or (b) the pattern contains a lookbehind. These are the conditions for
+which adding more characters may allow the current match to continue.
+
+For hard partial matching, we immediately return a partial match. Otherwise,
+carrying on means that a complete match on the current subject will be sought.
+A partial match is returned only if no complete match can be found. */
 #define CHECK_PARTIAL()\
   if (Feptr >= mb->end_subject) \
@ -503,31 +507,13 @@ know we are past the end of the subject. */
     SCHECK_PARTIAL(); \
     }
-/* Original version that allows hard partial to continue if no inspected
-characters. */
 #define SCHECK_PARTIAL()\
-  if (mb->partial != 0 && Feptr > mb->start_used_ptr) \
+  if (mb->partial != 0 && (Feptr > mb->start_used_ptr || mb->haslookbehind)) \
     { \
     mb->hitend = TRUE; \
     if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
     }
-/* Experimental version that makes hard partial give no match instead of
-continuing if no characters have been inspected. */
-#ifdef NEVERNEVER
-#define SCHECK_PARTIAL()\
-  if (mb->partial != 0) \
-    { \
-    if (Feptr > mb->start_used_ptr) \
-      { \
-      mb->hitend = TRUE; \
-      if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
-      } \
-    else if (mb->partial > 1) RRETURN(MATCH_NOMATCH); \
-    }
-#endif  /* NEVERNEVER */
 /* These macros are used to implement backtracking. They simulate a recursive
 call to the match() function by means of a local vector of frames which
@ -5670,7 +5656,11 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
     case OP_EOD:
     if (Feptr < mb->end_subject) RRETURN(MATCH_NOMATCH);
-    SCHECK_PARTIAL();
+    if (mb->partial != 0)
+      {
+      mb->hitend = TRUE;
+      if (mb->partial > 1) return PCRE2_ERROR_PARTIAL;
+      }
     Fecode++;
     break;
@ -5695,7 +5685,11 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
     /* Either at end of string or \n before end. */
-    SCHECK_PARTIAL();
+    if (mb->partial != 0)
+      {
+      mb->hitend = TRUE;
+      if (mb->partial > 1) return PCRE2_ERROR_PARTIAL;
+      }
     Fecode++;
     break;
@ -6457,9 +6451,10 @@ mb->start_subject = subject;
mb->start_offset = start_offset; mb->start_offset = start_offset;
mb->end_subject = end_subject; mb->end_subject = end_subject;
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0; mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
mb->poptions = re->overall_options; /* Pattern options */ mb->haslookbehind = (re->max_lookbehind > 0);
mb->poptions = re->overall_options; /* Pattern options */
mb->ignore_skip_arg = 0; mb->ignore_skip_arg = 0;
mb->mark = mb->nomatch_mark = NULL; /* In case never set */ mb->mark = mb->nomatch_mark = NULL; /* In case never set */
/* The name table is needed for finding all the numbers associated with a /* The name table is needed for finding all the numbers associated with a
given name, for condition testing. The code follows the name table. */ given name, for condition testing. The code follows the name table. */
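
The difference between the revised SCHECK_PARTIAL() test and the new handling of \z and \Z in the hunks above can be modelled in isolation. The following is a minimal sketch under stated assumptions, not the matcher's code: the type and function names are hypothetical, pointers are replaced by offsets, and the partial field is taken to be 0 (off), 1 (soft) or 2 (hard), which is what the "mb->partial > 1" test suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: either matching carries on, or the matcher
   returns PCRE2_ERROR_PARTIAL at once. */
typedef enum { SKETCH_CONTINUE, SKETCH_RETURN_PARTIAL } sketch_action;

/* Hypothetical cut-down stand-in for the match block. */
typedef struct {
  int partial;          /* assumed: 0 = off, 1 = soft, 2 = hard */
  bool hitend;          /* set when the end of the subject was hit */
  bool has_lookbehind;  /* pattern contains a lookbehind */
  size_t start_used;    /* offset of the earliest inspected character */
} sketch_block;

/* Models SCHECK_PARTIAL(): the caller already knows that eptr is at the end
   of the subject. */
static sketch_action sketch_scheck_partial(sketch_block *mb, size_t eptr)
{
  assert(mb != NULL);
  if (mb->partial != 0 && (eptr > mb->start_used || mb->has_lookbehind)) {
    mb->hitend = true;
    if (mb->partial > 1) return SKETCH_RETURN_PARTIAL;
  }
  return SKETCH_CONTINUE;
}

/* Models the new \z and \Z handling at the end of the subject: there is no
   inspected-character or lookbehind test. */
static sketch_action sketch_eod(sketch_block *mb)
{
  assert(mb != NULL);
  if (mb->partial != 0) {
    mb->hitend = true;
    if (mb->partial > 1) return SKETCH_RETURN_PARTIAL;
  }
  return SKETCH_CONTINUE;
}
```

In this model, hard partial matching with no inspected characters and no lookbehind carries on through sketch_scheck_partial() but returns a partial match from sketch_eod(), which corresponds to the "Partial match:" results expected for /\z/ and /\Z/ with \=ph in the test output.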

testdata/testinput2

@ -5690,10 +5690,33 @@ a)"xI
# ---- # ----
/(?<=(?=.(?<=x)))/
ab\=ph
# Expect error (recursion => not fixed length) # Expect error (recursion => not fixed length)
/(\2)((?=(?<=\1)))/ /(\2)((?=(?<=\1)))/
/c*+(?<=[bc])/
abc\=ph,no_jit
ab\=ph,no_jit
abc\=ps,no_jit
ab\=ps,no_jit
/c++(?<=[bc])/
abc\=ph,no_jit
ab\=ph,no_jit
/(?<=(?=.(?<=x)))/
abx
ab\=ph,no_jit
bxyz
xyz
/\z/
abc\=ph,no_jit
abc\=ps
/\Z/
abc\=ph,no_jit
abc\=ps
abc\n\=ph,no_jit
abc\n\=ps
# End of testinput2 # End of testinput2

testdata/testinput6

@ -4994,4 +4994,30 @@
ab\=ps ab\=ps
abcx abcx
/\z/
abc\=ph
abc\=ps
/\Z/
abc\=ph
abc\=ps
abc\n\=ph
abc\n\=ps
/c*+(?<=[bc])/
abc\=ph
ab\=ph
abc\=ps
ab\=ps
/c++(?<=[bc])/
abc\=ph
ab\=ph
/(?<=(?=.(?<=x)))/
abx
ab\=ph
bxyz
xyz
# End of testinput6 # End of testinput6

testdata/testoutput2

@ -17185,14 +17185,52 @@ Subject length lower bound = 1
# ---- # ----
/(?<=(?=.(?<=x)))/
ab\=ph
No match
# Expect error (recursion => not fixed length) # Expect error (recursion => not fixed length)
/(\2)((?=(?<=\1)))/ /(\2)((?=(?<=\1)))/
Failed: error 125 at offset 8: lookbehind assertion is not fixed length Failed: error 125 at offset 8: lookbehind assertion is not fixed length
/c*+(?<=[bc])/
abc\=ph,no_jit
Partial match: c
ab\=ph,no_jit
Partial match:
abc\=ps,no_jit
0: c
ab\=ps,no_jit
0:
/c++(?<=[bc])/
abc\=ph,no_jit
Partial match: c
ab\=ph,no_jit
Partial match:
/(?<=(?=.(?<=x)))/
abx
0:
ab\=ph,no_jit
Partial match:
bxyz
0:
xyz
0:
/\z/
abc\=ph,no_jit
Partial match:
abc\=ps
0:
/\Z/
abc\=ph,no_jit
Partial match:
abc\=ps
0:
abc\n\=ph,no_jit
Partial match: \x0a
abc\n\=ps
0:
# End of testinput2 # End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number) Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data Error -62: bad serialized data

testdata/testoutput6

@ -7845,4 +7845,46 @@ Partial match: ab
abcx abcx
0: abcx 0: abcx
/\z/
abc\=ph
Partial match:
abc\=ps
0:
/\Z/
abc\=ph
Partial match:
abc\=ps
0:
abc\n\=ph
Partial match: \x0a
abc\n\=ps
0:
/c*+(?<=[bc])/
abc\=ph
Partial match: c
ab\=ph
Partial match:
abc\=ps
0: c
ab\=ps
0:
/c++(?<=[bc])/
abc\=ph
Partial match: c
ab\=ph
Partial match:
/(?<=(?=.(?<=x)))/
abx
0:
ab\=ph
Partial match:
bxyz
0:
xyz
0:
# End of testinput6 # End of testinput6