More partial match tweaks.

Philip.Hazel 2019-07-22 16:30:44 +00:00
parent f7e21162fa
commit 3572634086
13 changed files with 501 additions and 457 deletions


@ -104,7 +104,11 @@ within it, the nested lookbehind was not correctly processed. For example, if
is another situation where adding characters to the current subject can
lead to a full match. Example: /c*+(?<=[bc])/ with subject "ab".
(b) An empty string partial hard match can be returned for \z and \Z as it
(b) Similarly, if a pattern could match an empty string, an empty partial
match may be given. Example: /(?![ab]).*/ with subject "ab". This case
applies only to PCRE2_PARTIAL_HARD.
(c) An empty string partial hard match can be returned for \z and \Z as it
is documented that they shouldn't match.


@ -2057,7 +2057,7 @@ the following negative numbers:
PCRE2_ERROR_BADOPTION the value of <i>what</i> was invalid
PCRE2_ERROR_UNSET the requested field is not set
</pre>
The "magic number" is placed at the start of each compiled pattern as an simple
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. Here is a typical call of
<b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
<pre>
@ -2114,7 +2114,7 @@ options returned for PCRE2_INFO_ALLOPTIONS.
PCRE2_INFO_BACKREFMAX
</pre>
Return the number of the highest backreference in the pattern. The third
argument should point to an <b>uint32_t</b> variable. Named capture groups
argument should point to a <b>uint32_t</b> variable. Named capture groups
acquire numbers as well as names, and these count towards the highest
backreference. Backreferences such as \4 or \g{12} match the captured
characters of the given group, but in addition, the check that a capture
@ -2132,7 +2132,7 @@ that \R matches only CR, LF, or CRLF.
</pre>
Return the highest capture group number in the pattern. In patterns where (?|
is not used, this is also the total number of capture groups. The third
argument should point to an <b>uint32_t</b> variable.
argument should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_DEPTHLIMIT
</pre>
@ -2157,7 +2157,7 @@ returned. Otherwise NULL is returned. The third argument should point to a
PCRE2_INFO_FIRSTCODETYPE
</pre>
Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to an <b>uint32_t</b>
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
variable. If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
@ -2169,7 +2169,7 @@ is returned.
</pre>
Return the value of the first code unit of any matched string for a pattern
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to an <b>uint32_t</b> variable. In the 8-bit library, the
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
@ -2185,12 +2185,12 @@ pattern. Each additional capture group adds two PCRE2_SIZE variables.
PCRE2_INFO_HASBACKSLASHC
</pre>
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
argument should point to an <b>uint32_t</b> variable.
argument should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_HASCRORLF
</pre>
Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to an <b>uint32_t</b> variable. An
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
explicit match is either a literal CR or LF character, or \r or \n or one of
the equivalent hexadecimal or octal escape sequences.
<pre>
@ -2206,7 +2206,7 @@ defaulted by the caller of the match function.
PCRE2_INFO_JCHANGED
</pre>
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
0. The third argument should point to an <b>uint32_t</b> variable. (?J) and
0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
(?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
<pre>
PCRE2_INFO_JITSIZE
@ -2218,7 +2218,7 @@ return zero. The third argument should point to a <b>size_t</b> variable.
PCRE2_INFO_LASTCODETYPE
</pre>
Returns 1 if there is a rightmost literal code unit that must exist in any
matched string, other than at its start. The third argument should point to an
matched string, other than at its start. The third argument should point to a
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
returned, the code unit value itself can be retrieved using
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
@ -2231,12 +2231,12 @@ PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
Return the value of the rightmost literal code unit that must exist in any
matched string, other than at its start, for a pattern where
PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument
should point to an <b>uint32_t</b> variable.
should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_MATCHEMPTY
</pre>
Return 1 if the pattern might match an empty string, otherwise 0. The third
argument should point to an <b>uint32_t</b> variable. When a pattern contains
argument should point to a <b>uint32_t</b> variable. When a pattern contains
recursive subroutine calls it is not always possible to determine whether or
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
in such cases.
@ -2279,7 +2279,7 @@ If a minimum length for matching subject strings was computed, its value is
returned. Otherwise the returned value is 0. This value is not computed when
PCRE2_NO_START_OPTIMIZE is set. The value is a number of characters, which in
UTF mode may be different from the number of code units. The third argument
should point to an <b>uint32_t</b> variable. The value is a lower bound to the
should point to a <b>uint32_t</b> variable. The value is a lower bound to the
length of any matching string. There may not be any strings of that length that
do actually match, but every string that does match is at least that long.
<pre>
@ -2726,7 +2726,8 @@ Your program may crash or loop indefinitely or give wrong results.
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. In addition, either at least one
character must have been inspected or the pattern must contain a lookbehind.
character must have been inspected or the pattern must contain a lookbehind, or
the pattern must be one that could match an empty string.
</P>
<P>
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
@ -3850,7 +3851,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 July 2019
Last updated: 22 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -80,16 +80,18 @@ is also disabled for partial matching.
A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
subject string is reached successfully, but matching cannot continue because
more characters are needed, and in addition, either at least one character in
the subject has been inspected or the pattern contains a lookbehind. An
the subject has been inspected or the pattern contains a lookbehind, or (when
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
inspected character need not form part of the final matched string; lookbehind
assertions and the \K escape sequence provide ways of inspecting characters
before the start of a matched string.
</P>
<P>
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
The three additional requirements define the cases where adding more characters
to the existing subject may complete the same match that would occur if they
had all been present in the first place. Without these conditions there would
be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
</P>
<P>
@ -449,7 +451,7 @@ Cambridge, England.
</P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 July 2019
Last updated: 22 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

File diff suppressed because it is too large


@ -2720,7 +2720,8 @@ Your program may crash or loop indefinitely or give wrong results.
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. In addition, either at least one
character must have been inspected or the pattern must contain a lookbehind.
character must have been inspected or the pattern must contain a lookbehind, or
the pattern must be one that could match an empty string.
.P
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
is set, matching continues by testing any remaining alternatives. Only if no


@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "21 July 2019" "PCRE2 10.34"
.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2"
@ -56,15 +56,17 @@ is also disabled for partial matching.
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
subject string is reached successfully, but matching cannot continue because
more characters are needed, and in addition, either at least one character in
the subject has been inspected or the pattern contains a lookbehind. An
the subject has been inspected or the pattern contains a lookbehind, or (when
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
inspected character need not form part of the final matched string; lookbehind
assertions and the \eK escape sequence provide ways of inspecting characters
before the start of a matched string.
.P
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
The three additional requirements define the cases where adding more characters
to the existing subject may complete the same match that would occur if they
had all been present in the first place. Without these conditions there would
be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
.P
When a partial match is returned, the first two elements in the ovector point
@ -422,6 +424,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 21 July 2019
Last updated: 22 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -3185,8 +3185,8 @@ for (;;)
ptr >= end_subject && /* End of subject and */
( /* either */
ptr > mb->start_used_ptr || /* Inspected non-empty string */
mb->haslookbehind /* or pattern has lookbehind */
)
mb->allowemptypartial /* or pattern has lookbehind */
) /* or could match empty */
)
))
match_count = PCRE2_ERROR_PARTIAL;
@ -3417,7 +3417,8 @@ mb->tables = re->tables;
mb->start_subject = subject;
mb->end_subject = end_subject;
mb->start_offset = start_offset;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->allowemptypartial = (re->max_lookbehind > 0) ||
(re->flags & PCRE2_MATCH_EMPTY) != 0;
mb->moptions = options;
mb->poptions = re->overall_options;
mb->match_call_count = 0;
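
The end-of-subject test changed by the two hunks above can be modelled in isolation. The following is a minimal sketch, not the PCRE2 source: `model_pattern`, `MODEL_MATCH_EMPTY`, `allow_empty_partial()` and `partial_at_end()` are hypothetical stand-ins, subject positions are modelled as offsets rather than pointers, and the checks that the real code makes before reaching this condition (for example, that a partial option is set) are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the PCRE2_MATCH_EMPTY pattern flag. */
#define MODEL_MATCH_EMPTY 0x1u

/* Hypothetical, cut-down view of a compiled pattern. */
typedef struct {
  uint32_t flags;          /* pattern flags, modelled on re->flags */
  uint16_t max_lookbehind; /* longest lookbehind, modelled on re->max_lookbehind */
} model_pattern;

/* Mirrors the new initialisation of mb->allowemptypartial: an empty
   partial match is allowed if the pattern has a lookbehind or could
   match an empty string. */
static bool allow_empty_partial(const model_pattern *re)
{
  return re->max_lookbehind > 0 || (re->flags & MODEL_MATCH_EMPTY) != 0;
}

/* Mirrors the tail of the changed condition: the end of the subject has
   been reached, and either a non-empty string was inspected or an empty
   partial match is allowed for this pattern. */
static bool partial_at_end(const model_pattern *re, size_t ptr,
                           size_t start_used, size_t end_subject)
{
  return ptr >= end_subject &&
         (ptr > start_used || allow_empty_partial(re));
}
```

In this model, a pattern with neither property yields no partial match when nothing has been inspected (`ptr == start_used`), while setting either `max_lookbehind` or the match-empty flag allows one.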


@ -854,7 +854,7 @@ typedef struct match_block {
uint32_t match_call_count; /* Number of times a new frame is created */
BOOL hitend; /* Hit the end of the subject at some point */
BOOL hasthen; /* Pattern contains (*THEN) */
BOOL haslookbehind; /* Pattern contains sigificant lookbehind */
BOOL allowemptypartial; /* Allow empty hard partial */
const uint8_t *lcc; /* Points to lower casing table */
const uint8_t *fcc; /* Points to case-flipping table */
const uint8_t *ctypes; /* Points to table of type maps */
@ -910,7 +910,7 @@ typedef struct dfa_match_block {
uint32_t poptions; /* Pattern options */
uint32_t nltype; /* Newline type */
uint32_t nllen; /* Newline string length */
BOOL haslookbehind; /* Pattern contains significant lookbehind */
BOOL allowemptypartial; /* Allow empty hard partial */
PCRE2_UCHAR nl[4]; /* Newline string when fixed */
uint16_t bsr_convention; /* \R interpretation */
pcre2_callout_block *cb; /* Points to a callout block */


@ -508,7 +508,8 @@ A partial match is returned only if no complete match can be found. */
}
#define SCHECK_PARTIAL()\
if (mb->partial != 0 && (Feptr > mb->start_used_ptr || mb->haslookbehind)) \
if (mb->partial != 0 && \
(Feptr > mb->start_used_ptr || mb->allowemptypartial)) \
{ \
mb->hitend = TRUE; \
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
@ -6451,7 +6452,8 @@ mb->start_subject = subject;
mb->start_offset = start_offset;
mb->end_subject = end_subject;
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->allowemptypartial = (re->max_lookbehind > 0) ||
(re->flags & PCRE2_MATCH_EMPTY) != 0;
mb->poptions = re->overall_options; /* Pattern options */
mb->ignore_skip_arg = 0;
mb->mark = mb->nomatch_mark = NULL; /* In case never set */
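
The visible part of the SCHECK_PARTIAL() change can be sketched as an ordinary function. This is an illustrative model only: `model_block`, the `MODEL_PARTIAL_*` constants and `model_scheck_partial()` are hypothetical names, positions are offsets rather than pointers, and the reading that a `partial` value greater than 1 means a hard partial match is an assumption taken from the `mb->partial > 1` test shown in the hunk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed encoding of the partial-matching mode; only "!= 0" and "> 1"
   are visible in the macro itself. */
enum { MODEL_NO_PARTIAL = 0, MODEL_PARTIAL_SOFT = 1, MODEL_PARTIAL_HARD = 2 };

/* Hypothetical, cut-down match block. */
typedef struct {
  int partial;            /* one of the MODEL_* modes above */
  bool hitend;            /* end of subject was hit at some point */
  bool allowemptypartial; /* lookbehind present, or pattern may match empty */
  size_t start_used;      /* earliest inspected offset */
} model_block;

/* Returns true where the macro would return PCRE2_ERROR_PARTIAL.
   Otherwise it records that the end was hit (if the condition holds)
   and lets matching continue. */
static bool model_scheck_partial(model_block *mb, size_t eptr)
{
  if (mb->partial != 0 &&
      (eptr > mb->start_used || mb->allowemptypartial))
    {
    mb->hitend = true;
    if (mb->partial > 1) return true;
    }
  return false;
}
```

With `allowemptypartial` false and nothing inspected, the model neither sets `hitend` nor returns. With it true, the hard mode returns immediately and the soft mode only records `hitend`.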

testdata/testinput2 vendored

@ -5719,4 +5719,10 @@ a)"xI
abc\n\=ph,no_jit
abc\n\=ps
/(?![ab]).*/
ab\=ph,no_jit
/c*+/
ab\=ph,offset=2,no_jit
# End of testinput2

testdata/testinput6 vendored

@ -5020,4 +5020,10 @@
bxyz
xyz
/(?![ab]).*/
ab\=ph
/c*+/
ab\=ph,offset=2
# End of testinput6


@ -17231,6 +17231,14 @@ Partial match: \x0a
abc\n\=ps
0:
/(?![ab]).*/
ab\=ph,no_jit
Partial match:
/c*+/
ab\=ph,offset=2,no_jit
Partial match:
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data


@ -7887,4 +7887,12 @@ Partial match:
xyz
0:
/(?![ab]).*/
ab\=ph
Partial match:
/c*+/
ab\=ph,offset=2
Partial match:
# End of testinput6