More partial match tweaks.

Philip.Hazel 2019-07-22 16:30:44 +00:00
parent f7e21162fa
commit 3572634086
13 changed files with 501 additions and 457 deletions


@@ -104,7 +104,11 @@ within it, the nested lookbehind was not correctly processed. For example, if
 is another situation where adding characters to the current subject can
 lead to a full match. Example: /c*+(?<=[bc])/ with subject "ab".
-(b) An empty string partial hard match can be returned for \z and \Z as it
+(b) Similarly, if a pattern could match an empty string, an empty partial
+match may be given. Example: /(?![ab]).*/ with subject "ab". This case
+applies only to PCRE2_PARTIAL_HARD.
+(c) An empty string partial hard match can be returned for \z and \Z as it
 is documented that they shouldn't match.


@@ -2057,7 +2057,7 @@ the following negative numbers:
 PCRE2_ERROR_BADOPTION the value of <i>what</i> was invalid
 PCRE2_ERROR_UNSET the requested field is not set
 </pre>
-The "magic number" is placed at the start of each compiled pattern as an simple
+The "magic number" is placed at the start of each compiled pattern as a simple
 check against passing an arbitrary memory pointer. Here is a typical call of
 <b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
 <pre>
@@ -2114,7 +2114,7 @@ options returned for PCRE2_INFO_ALLOPTIONS.
 PCRE2_INFO_BACKREFMAX
 </pre>
 Return the number of the highest backreference in the pattern. The third
-argument should point to an <b>uint32_t</b> variable. Named capture groups
+argument should point to a <b>uint32_t</b> variable. Named capture groups
 acquire numbers as well as names, and these count towards the highest
 backreference. Backreferences such as \4 or \g{12} match the captured
 characters of the given group, but in addition, the check that a capture
@@ -2132,7 +2132,7 @@ that \R matches only CR, LF, or CRLF.
 </pre>
 Return the highest capture group number in the pattern. In patterns where (?|
 is not used, this is also the total number of capture groups. The third
-argument should point to an <b>uint32_t</b> variable.
+argument should point to a <b>uint32_t</b> variable.
 <pre>
 PCRE2_INFO_DEPTHLIMIT
 </pre>
@@ -2157,7 +2157,7 @@ returned. Otherwise NULL is returned. The third argument should point to a
 PCRE2_INFO_FIRSTCODETYPE
 </pre>
 Return information about the first code unit of any matched string, for a
-non-anchored pattern. The third argument should point to an <b>uint32_t</b>
+non-anchored pattern. The third argument should point to a <b>uint32_t</b>
 variable. If there is a fixed first value, for example, the letter "c" from a
 pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
 using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
@@ -2169,7 +2169,7 @@ is returned.
 </pre>
 Return the value of the first code unit of any matched string for a pattern
 where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
-argument should point to an <b>uint32_t</b> variable. In the 8-bit library, the
+argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
 value is always less than 256. In the 16-bit library the value can be up to
 0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
 and up to 0xffffffff when not using UTF-32 mode.
@@ -2185,12 +2185,12 @@ pattern. Each additional capture group adds two PCRE2_SIZE variables.
 PCRE2_INFO_HASBACKSLASHC
 </pre>
 Return 1 if the pattern contains any instances of \C, otherwise 0. The third
-argument should point to an <b>uint32_t</b> variable.
+argument should point to a <b>uint32_t</b> variable.
 <pre>
 PCRE2_INFO_HASCRORLF
 </pre>
 Return 1 if the pattern contains any explicit matches for CR or LF characters,
-otherwise 0. The third argument should point to an <b>uint32_t</b> variable. An
+otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
 explicit match is either a literal CR or LF character, or \r or \n or one of
 the equivalent hexadecimal or octal escape sequences.
 <pre>
@@ -2206,7 +2206,7 @@ defaulted by the caller of the match function.
 PCRE2_INFO_JCHANGED
 </pre>
 Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
-0. The third argument should point to an <b>uint32_t</b> variable. (?J) and
+0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
 (?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
 <pre>
 PCRE2_INFO_JITSIZE
@@ -2218,7 +2218,7 @@ return zero. The third argument should point to a <b>size_t</b> variable.
 PCRE2_INFO_LASTCODETYPE
 </pre>
 Returns 1 if there is a rightmost literal code unit that must exist in any
-matched string, other than at its start. The third argument should point to an
+matched string, other than at its start. The third argument should point to a
 <b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
 returned, the code unit value itself can be retrieved using
 PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
@@ -2231,12 +2231,12 @@ PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
 Return the value of the rightmost literal code unit that must exist in any
 matched string, other than at its start, for a pattern where
 PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument
-should point to an <b>uint32_t</b> variable.
+should point to a <b>uint32_t</b> variable.
 <pre>
 PCRE2_INFO_MATCHEMPTY
 </pre>
 Return 1 if the pattern might match an empty string, otherwise 0. The third
-argument should point to an <b>uint32_t</b> variable. When a pattern contains
+argument should point to a <b>uint32_t</b> variable. When a pattern contains
 recursive subroutine calls it is not always possible to determine whether or
 not it can match an empty string. PCRE2 takes a cautious approach and returns 1
 in such cases.
@@ -2279,7 +2279,7 @@ If a minimum length for matching subject strings was computed, its value is
 returned. Otherwise the returned value is 0. This value is not computed when
 PCRE2_NO_START_OPTIMIZE is set. The value is a number of characters, which in
 UTF mode may be different from the number of code units. The third argument
-should point to an <b>uint32_t</b> variable. The value is a lower bound to the
+should point to a <b>uint32_t</b> variable. The value is a lower bound to the
 length of any matching string. There may not be any strings of that length that
 do actually match, but every string that does match is at least that long.
 <pre>
@@ -2726,7 +2726,8 @@ Your program may crash or loop indefinitely or give wrong results.
 These options turn on the partial matching feature. A partial match occurs if
 the end of the subject string is reached successfully, but there are not enough
 subject characters to complete the match. In addition, either at least one
-character must have been inspected or the pattern must contain a lookbehind.
+character must have been inspected or the pattern must contain a lookbehind, or
+the pattern must be one that could match an empty string.
 </P>
 <P>
 If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
@@ -3850,7 +3851,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 20 July 2019
+Last updated: 22 July 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>


@@ -80,16 +80,18 @@ is also disabled for partial matching.
 A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
 subject string is reached successfully, but matching cannot continue because
 more characters are needed, and in addition, either at least one character in
-the subject has been inspected or the pattern contains a lookbehind. An
+the subject has been inspected or the pattern contains a lookbehind, or (when
+PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
 inspected character need not form part of the final matched string; lookbehind
 assertions and the \K escape sequence provide ways of inspecting characters
 before the start of a matched string.
 </P>
 <P>
-The two additional requirements define the cases where adding more characters
-to the existing subject may complete the match. Without these conditions there
-would be a partial match of an empty string at the end of the subject for all
-unanchored patterns (and also for anchored patterns if the subject itself is
+The three additional requirements define the cases where adding more characters
+to the existing subject may complete the same match that would occur if they
+had all been present in the first place. Without these conditions there would
+be a partial match of an empty string at the end of the subject for all
+unanchored patterns (and also for anchored patterns if the subject itself is
 empty).
 </P>
 <P>
@@ -449,7 +451,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC10" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 21 July 2019
+Last updated: 22 July 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>

File diff suppressed because it is too large


@@ -2720,7 +2720,8 @@ Your program may crash or loop indefinitely or give wrong results.
 These options turn on the partial matching feature. A partial match occurs if
 the end of the subject string is reached successfully, but there are not enough
 subject characters to complete the match. In addition, either at least one
-character must have been inspected or the pattern must contain a lookbehind.
+character must have been inspected or the pattern must contain a lookbehind, or
+the pattern must be one that could match an empty string.
 .P
 If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
 is set, matching continues by testing any remaining alternatives. Only if no


@@ -1,4 +1,4 @@
-.TH PCRE2PARTIAL 3 "21 July 2019" "PCRE2 10.34"
+.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions
 .SH "PARTIAL MATCHING IN PCRE2"
@@ -56,15 +56,17 @@ is also disabled for partial matching.
 A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
 subject string is reached successfully, but matching cannot continue because
 more characters are needed, and in addition, either at least one character in
-the subject has been inspected or the pattern contains a lookbehind. An
+the subject has been inspected or the pattern contains a lookbehind, or (when
+PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
 inspected character need not form part of the final matched string; lookbehind
 assertions and the \eK escape sequence provide ways of inspecting characters
 before the start of a matched string.
 .P
-The two additional requirements define the cases where adding more characters
-to the existing subject may complete the match. Without these conditions there
-would be a partial match of an empty string at the end of the subject for all
-unanchored patterns (and also for anchored patterns if the subject itself is
+The three additional requirements define the cases where adding more characters
+to the existing subject may complete the same match that would occur if they
+had all been present in the first place. Without these conditions there would
+be a partial match of an empty string at the end of the subject for all
+unanchored patterns (and also for anchored patterns if the subject itself is
 empty).
 .P
 When a partial match is returned, the first two elements in the ovector point
@@ -422,6 +424,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 21 July 2019
+Last updated: 22 July 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi


@@ -3185,8 +3185,8 @@ for (;;)
 ptr >= end_subject && /* End of subject and */
 ( /* either */
 ptr > mb->start_used_ptr || /* Inspected non-empty string */
-mb->haslookbehind /* or pattern has lookbehind */
-)
+mb->allowemptypartial /* or pattern has lookbehind */
+) /* or could match empty */
 )
 ))
 match_count = PCRE2_ERROR_PARTIAL;
@@ -3417,7 +3417,8 @@ mb->tables = re->tables;
 mb->start_subject = subject;
 mb->end_subject = end_subject;
 mb->start_offset = start_offset;
-mb->haslookbehind = (re->max_lookbehind > 0);
+mb->allowemptypartial = (re->max_lookbehind > 0) ||
+(re->flags & PCRE2_MATCH_EMPTY) != 0;
 mb->moptions = options;
 mb->poptions = re->overall_options;
 mb->match_call_count = 0;


@@ -854,7 +854,7 @@ typedef struct match_block {
 uint32_t match_call_count; /* Number of times a new frame is created */
 BOOL hitend; /* Hit the end of the subject at some point */
 BOOL hasthen; /* Pattern contains (*THEN) */
-BOOL haslookbehind; /* Pattern contains sigificant lookbehind */
+BOOL allowemptypartial; /* Allow empty hard partial */
 const uint8_t *lcc; /* Points to lower casing table */
 const uint8_t *fcc; /* Points to case-flipping table */
 const uint8_t *ctypes; /* Points to table of type maps */
@@ -910,7 +910,7 @@ typedef struct dfa_match_block {
 uint32_t poptions; /* Pattern options */
 uint32_t nltype; /* Newline type */
 uint32_t nllen; /* Newline string length */
-BOOL haslookbehind; /* Pattern contains significant lookbehind */
+BOOL allowemptypartial; /* Allow empty hard partial */
 PCRE2_UCHAR nl[4]; /* Newline string when fixed */
 uint16_t bsr_convention; /* \R interpretation */
 pcre2_callout_block *cb; /* Points to a callout block */


@@ -508,7 +508,8 @@ A partial match is returned only if no complete match can be found. */
 }
 #define SCHECK_PARTIAL()\
-if (mb->partial != 0 && (Feptr > mb->start_used_ptr || mb->haslookbehind)) \
+if (mb->partial != 0 && \
+(Feptr > mb->start_used_ptr || mb->allowemptypartial)) \
 { \
 mb->hitend = TRUE; \
 if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
@@ -6451,7 +6452,8 @@ mb->start_subject = subject;
 mb->start_offset = start_offset;
 mb->end_subject = end_subject;
 mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
-mb->haslookbehind = (re->max_lookbehind > 0);
+mb->allowemptypartial = (re->max_lookbehind > 0) ||
+(re->flags & PCRE2_MATCH_EMPTY) != 0;
 mb->poptions = re->overall_options; /* Pattern options */
 mb->ignore_skip_arg = 0;
 mb->mark = mb->nomatch_mark = NULL; /* In case never set */
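
The effect of the renamed field can be sketched outside PCRE2 as a pair of small functions. This is only an illustrative model, not the library's code: the `sketch_` names, the `SKETCH_MATCH_EMPTY` bit value and the outcome enum are all hypothetical, and the encoding of the partial mode as 0 (off), 1 (soft) and 2 (hard) is an assumption consistent with the `mb->partial > 1` test in `SCHECK_PARTIAL()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the internal PCRE2_MATCH_EMPTY flag bit; the
   real value is defined in PCRE2's internal headers. */
#define SKETCH_MATCH_EMPTY 0x00000001u

/* Outcomes of the simplified end-of-subject check (illustrative only). */
typedef enum {
  SKETCH_NO_PARTIAL,   /* check did not fire */
  SKETCH_HIT_END,      /* soft partial: hitend noted, matching continues */
  SKETCH_HARD_PARTIAL  /* hard partial: PCRE2_ERROR_PARTIAL would be returned */
} sketch_outcome;

/* Models the new initialisation of mb->allowemptypartial. */
static bool sketch_allow_empty_partial(uint32_t max_lookbehind, uint32_t flags)
{
  return max_lookbehind > 0 || (flags & SKETCH_MATCH_EMPTY) != 0;
}

/* Models the condition in SCHECK_PARTIAL(). eptr and start_used are offsets
   into the subject rather than pointers. */
static sketch_outcome sketch_check_partial(int partial, size_t eptr,
  size_t start_used, bool allowemptypartial)
{
  if (partial != 0 && (eptr > start_used || allowemptypartial))
    return (partial > 1) ? SKETCH_HARD_PARTIAL : SKETCH_HIT_END;
  return SKETCH_NO_PARTIAL;
}

/* Self-check: nothing inspected (eptr == start_used), no lookbehind. */
static void sketch_self_test(void)
{
  /* Pattern flagged as able to match empty: hard partial is now allowed. */
  bool allow = sketch_allow_empty_partial(0, SKETCH_MATCH_EMPTY);
  assert(sketch_check_partial(2, 2, 2, allow) == SKETCH_HARD_PARTIAL);

  /* With the old lookbehind-only condition the check would not fire. */
  assert(sketch_check_partial(2, 2, 2, false) == SKETCH_NO_PARTIAL);
}
```

Calling `sketch_self_test()` exercises the situation the new test cases target: the end of the subject is reached with no character inspected, and the only reason a hard partial match is reported is that the pattern could match an empty string.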

testdata/testinput2

@@ -5719,4 +5719,10 @@ a)"xI
 abc\n\=ph,no_jit
 abc\n\=ps
+/(?![ab]).*/
+ab\=ph,no_jit
+/c*+/
+ab\=ph,offset=2,no_jit
 # End of testinput2

testdata/testinput6

@@ -5020,4 +5020,10 @@
 bxyz
 xyz
+/(?![ab]).*/
+ab\=ph
+/c*+/
+ab\=ph,offset=2
 # End of testinput6


@@ -17231,6 +17231,14 @@ Partial match: \x0a
 abc\n\=ps
 0:
+/(?![ab]).*/
+ab\=ph,no_jit
+Partial match:
+/c*+/
+ab\=ph,offset=2,no_jit
+Partial match:
 # End of testinput2
 Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data


@@ -7887,4 +7887,12 @@ Partial match:
 xyz
 0:
+/(?![ab]).*/
+ab\=ph
+Partial match:
+/c*+/
+ab\=ph,offset=2
+Partial match:
 # End of testinput6