More partial match tweaks.

Philip.Hazel 2019-07-22 16:30:44 +00:00
parent f7e21162fa
commit 3572634086
13 changed files with 501 additions and 457 deletions

@ -104,7 +104,11 @@ within it, the nested lookbehind was not correctly processed. For example, if
is another situation where adding characters to the current subject can
lead to a full match. Example: /c*+(?<=[bc])/ with subject "ab".
(b) An empty string partial hard match can be returned for \z and \Z as it
(b) Similarly, if a pattern could match an empty string, an empty partial
match may be given. Example: /(?![ab]).*/ with subject "ab". This case
applies only to PCRE2_PARTIAL_HARD.
(c) An empty string partial hard match can be returned for \z and \Z as it
is documented that they shouldn't match.

@ -2057,7 +2057,7 @@ the following negative numbers:
PCRE2_ERROR_BADOPTION the value of <i>what</i> was invalid
PCRE2_ERROR_UNSET the requested field is not set
</pre>
The "magic number" is placed at the start of each compiled pattern as an simple
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. Here is a typical call of
<b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
<pre>
@ -2114,7 +2114,7 @@ options returned for PCRE2_INFO_ALLOPTIONS.
PCRE2_INFO_BACKREFMAX
</pre>
Return the number of the highest backreference in the pattern. The third
argument should point to an <b>uint32_t</b> variable. Named capture groups
argument should point to a <b>uint32_t</b> variable. Named capture groups
acquire numbers as well as names, and these count towards the highest
backreference. Backreferences such as \4 or \g{12} match the captured
characters of the given group, but in addition, the check that a capture
@ -2132,7 +2132,7 @@ that \R matches only CR, LF, or CRLF.
</pre>
Return the highest capture group number in the pattern. In patterns where (?|
is not used, this is also the total number of capture groups. The third
argument should point to an <b>uint32_t</b> variable.
argument should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_DEPTHLIMIT
</pre>
@ -2157,7 +2157,7 @@ returned. Otherwise NULL is returned. The third argument should point to a
PCRE2_INFO_FIRSTCODETYPE
</pre>
Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to an <b>uint32_t</b>
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
variable. If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
@ -2169,7 +2169,7 @@ is returned.
</pre>
Return the value of the first code unit of any matched string for a pattern
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to an <b>uint32_t</b> variable. In the 8-bit library, the
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
@ -2185,12 +2185,12 @@ pattern. Each additional capture group adds two PCRE2_SIZE variables.
PCRE2_INFO_HASBACKSLASHC
</pre>
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
argument should point to an <b>uint32_t</b> variable.
argument should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_HASCRORLF
</pre>
Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to an <b>uint32_t</b> variable. An
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
explicit match is either a literal CR or LF character, or \r or \n or one of
the equivalent hexadecimal or octal escape sequences.
<pre>
@ -2206,7 +2206,7 @@ defaulted by the caller of the match function.
PCRE2_INFO_JCHANGED
</pre>
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
0. The third argument should point to an <b>uint32_t</b> variable. (?J) and
0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
(?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
<pre>
PCRE2_INFO_JITSIZE
@ -2218,7 +2218,7 @@ return zero. The third argument should point to a <b>size_t</b> variable.
PCRE2_INFO_LASTCODETYPE
</pre>
Returns 1 if there is a rightmost literal code unit that must exist in any
matched string, other than at its start. The third argument should point to an
matched string, other than at its start. The third argument should point to a
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
returned, the code unit value itself can be retrieved using
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
@ -2231,12 +2231,12 @@ PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
Return the value of the rightmost literal code unit that must exist in any
matched string, other than at its start, for a pattern where
PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument
should point to an <b>uint32_t</b> variable.
should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_MATCHEMPTY
</pre>
Return 1 if the pattern might match an empty string, otherwise 0. The third
argument should point to an <b>uint32_t</b> variable. When a pattern contains
argument should point to a <b>uint32_t</b> variable. When a pattern contains
recursive subroutine calls it is not always possible to determine whether or
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
in such cases.
@ -2279,7 +2279,7 @@ If a minimum length for matching subject strings was computed, its value is
returned. Otherwise the returned value is 0. This value is not computed when
PCRE2_NO_START_OPTIMIZE is set. The value is a number of characters, which in
UTF mode may be different from the number of code units. The third argument
should point to an <b>uint32_t</b> variable. The value is a lower bound to the
should point to a <b>uint32_t</b> variable. The value is a lower bound to the
length of any matching string. There may not be any strings of that length that
do actually match, but every string that does match is at least that long.
<pre>
@ -2726,7 +2726,8 @@ Your program may crash or loop indefinitely or give wrong results.
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. In addition, either at least one
character must have been inspected or the pattern must contain a lookbehind.
character must have been inspected or the pattern must contain a lookbehind, or
the pattern must be one that could match an empty string.
</P>
<P>
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
@ -3850,7 +3851,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 July 2019
Last updated: 22 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

@ -80,15 +80,17 @@ is also disabled for partial matching.
A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
subject string is reached successfully, but matching cannot continue because
more characters are needed, and in addition, either at least one character in
the subject has been inspected or the pattern contains a lookbehind. An
the subject has been inspected or the pattern contains a lookbehind, or (when
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
inspected character need not form part of the final matched string; lookbehind
assertions and the \K escape sequence provide ways of inspecting characters
before the start of a matched string.
</P>
<P>
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
The three additional requirements define the cases where adding more characters
to the existing subject may complete the same match that would occur if they
had all been present in the first place. Without these conditions there would
be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
</P>
@ -449,7 +451,7 @@ Cambridge, England.
</P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 July 2019
Last updated: 22 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

@ -2013,8 +2013,8 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_ERROR_BADOPTION the value of what was invalid
PCRE2_ERROR_UNSET the requested field is not set
The "magic number" is placed at the start of each compiled pattern as
an simple check against passing an arbitrary memory pointer. Here is a
The "magic number" is placed at the start of each compiled pattern as a
simple check against passing an arbitrary memory pointer. Here is a
typical call of pcre2_pattern_info(), to obtain the length of the com-
piled pattern:
@ -2073,7 +2073,7 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_BACKREFMAX
Return the number of the highest backreference in the pattern. The
third argument should point to an uint32_t variable. Named capture
third argument should point to a uint32_t variable. Named capture
groups acquire numbers as well as names, and these count towards the
highest backreference. Backreferences such as \4 or \g{12} match the
captured characters of the given group, but in addition, the check that
@ -2091,7 +2091,7 @@ INFORMATION ABOUT A COMPILED PATTERN
Return the highest capture group number in the pattern. In patterns
where (?| is not used, this is also the total number of capture groups.
The third argument should point to an uint32_t variable.
The third argument should point to a uint32_t variable.
PCRE2_INFO_DEPTHLIMIT
@ -2117,7 +2117,7 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_FIRSTCODETYPE
Return information about the first code unit of any matched string, for
a non-anchored pattern. The third argument should point to an uint32_t
a non-anchored pattern. The third argument should point to a uint32_t
variable. If there is a fixed first value, for example, the letter "c"
from a pattern such as (cat|cow|coyote), 1 is returned, and the value
can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
@ -2129,7 +2129,7 @@ INFORMATION ABOUT A COMPILED PATTERN
Return the value of the first code unit of any matched string for a
pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
The third argument should point to an uint32_t variable. In the 8-bit
The third argument should point to a uint32_t variable. In the 8-bit
library, the value is always less than 256. In the 16-bit library the
value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
@ -2147,12 +2147,12 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_HASBACKSLASHC
Return 1 if the pattern contains any instances of \C, otherwise 0. The
third argument should point to an uint32_t variable.
third argument should point to a uint32_t variable.
PCRE2_INFO_HASCRORLF
Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The third argument should point to an uint32_t
characters, otherwise 0. The third argument should point to a uint32_t
variable. An explicit match is either a literal CR or LF character, or
\r or \n or one of the equivalent hexadecimal or octal escape se-
quences.
@ -2169,7 +2169,7 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_JCHANGED
Return 1 if the (?J) or (?-J) option setting is used in the pattern,
otherwise 0. The third argument should point to an uint32_t variable.
otherwise 0. The third argument should point to a uint32_t variable.
(?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
tively.
@ -2183,28 +2183,28 @@ INFORMATION ABOUT A COMPILED PATTERN
Returns 1 if there is a rightmost literal code unit that must exist in
any matched string, other than at its start. The third argument should
point to an uint32_t variable. If there is no such value, 0 is re-
turned. When 1 is returned, the code unit value itself can be retrieved
using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal
value is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is 1 (with "z"
returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned
value is 0.
point to a uint32_t variable. If there is no such value, 0 is returned.
When 1 is returned, the code unit value itself can be retrieved using
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
recorded only if it follows something of variable length. For example,
for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned
from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
0.
PCRE2_INFO_LASTCODEUNIT
Return the value of the rightmost literal code unit that must exist in
any matched string, other than at its start, for a pattern where
PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
ment should point to an uint32_t variable.
ment should point to a uint32_t variable.
PCRE2_INFO_MATCHEMPTY
Return 1 if the pattern might match an empty string, otherwise 0. The
third argument should point to an uint32_t variable. When a pattern
contains recursive subroutine calls it is not always possible to deter-
mine whether or not it can match an empty string. PCRE2 takes a cau-
tious approach and returns 1 in such cases.
third argument should point to a uint32_t variable. When a pattern con-
tains recursive subroutine calls it is not always possible to determine
whether or not it can match an empty string. PCRE2 takes a cautious ap-
proach and returns 1 in such cases.
PCRE2_INFO_MATCHLIMIT
@ -2244,7 +2244,7 @@ INFORMATION ABOUT A COMPILED PATTERN
value is returned. Otherwise the returned value is 0. This value is not
computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number of
characters, which in UTF mode may be different from the number of code
units. The third argument should point to an uint32_t variable. The
units. The third argument should point to a uint32_t variable. The
value is a lower bound to the length of any matching string. There may
not be any strings of that length that do actually match, but every
string that does match is at least that long.
@ -2663,7 +2663,8 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
curs if the end of the subject string is reached successfully, but
there are not enough subject characters to complete the match. In addi-
tion, either at least one character must have been inspected or the
pattern must contain a lookbehind.
pattern must contain a lookbehind, or the pattern must be one that
could match an empty string.
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR-
TIAL_HARD) is set, matching continues by testing any remaining alterna-
@ -3705,7 +3706,7 @@ AUTHOR
REVISION
Last updated: 20 July 2019
Last updated: 22 July 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
@ -5703,13 +5704,15 @@ PARTIAL MATCHING USING pcre2_match()
the subject string is reached successfully, but matching cannot con-
tinue because more characters are needed, and in addition, either at
least one character in the subject has been inspected or the pattern
contains a lookbehind. An inspected character need not form part of the
final matched string; lookbehind assertions and the \K escape sequence
provide ways of inspecting characters before the start of a matched
string.
contains a lookbehind, or (when PCRE2_PARTIAL_HARD is set) the pattern
could match an empty string. An inspected character need not form part
of the final matched string; lookbehind assertions and the \K escape
sequence provide ways of inspecting characters before the start of a
matched string.
The two additional requirements define the cases where adding more
characters to the existing subject may complete the match. Without
The three additional requirements define the cases where adding more
characters to the existing subject may complete the same match that
would occur if they had all been present in the first place. Without
these conditions there would be a partial match of an empty string at
the end of the subject for all unanchored patterns (and also for an-
chored patterns if the subject itself is empty).
@ -6068,7 +6071,7 @@ AUTHOR
REVISION
Last updated: 21 July 2019
Last updated: 22 July 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------

@ -2720,7 +2720,8 @@ Your program may crash or loop indefinitely or give wrong results.
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. In addition, either at least one
character must have been inspected or the pattern must contain a lookbehind.
character must have been inspected or the pattern must contain a lookbehind, or
the pattern must be one that could match an empty string.
.P
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
is set, matching continues by testing any remaining alternatives. Only if no

@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "21 July 2019" "PCRE2 10.34"
.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2"
@ -56,14 +56,16 @@ is also disabled for partial matching.
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
subject string is reached successfully, but matching cannot continue because
more characters are needed, and in addition, either at least one character in
the subject has been inspected or the pattern contains a lookbehind. An
the subject has been inspected or the pattern contains a lookbehind, or (when
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
inspected character need not form part of the final matched string; lookbehind
assertions and the \eK escape sequence provide ways of inspecting characters
before the start of a matched string.
.P
The two additional requirements define the cases where adding more characters
to the existing subject may complete the match. Without these conditions there
would be a partial match of an empty string at the end of the subject for all
The three additional requirements define the cases where adding more characters
to the existing subject may complete the same match that would occur if they
had all been present in the first place. Without these conditions there would
be a partial match of an empty string at the end of the subject for all
unanchored patterns (and also for anchored patterns if the subject itself is
empty).
.P
@ -422,6 +424,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 21 July 2019
Last updated: 22 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -3185,8 +3185,8 @@ for (;;)
ptr >= end_subject && /* End of subject and */
( /* either */
ptr > mb->start_used_ptr || /* Inspected non-empty string */
mb->haslookbehind /* or pattern has lookbehind */
)
mb->allowemptypartial /* or pattern has lookbehind */
) /* or could match empty */
)
))
match_count = PCRE2_ERROR_PARTIAL;
@ -3417,7 +3417,8 @@ mb->tables = re->tables;
mb->start_subject = subject;
mb->end_subject = end_subject;
mb->start_offset = start_offset;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->allowemptypartial = (re->max_lookbehind > 0) ||
(re->flags & PCRE2_MATCH_EMPTY) != 0;
mb->moptions = options;
mb->poptions = re->overall_options;
mb->match_call_count = 0;

@ -854,7 +854,7 @@ typedef struct match_block {
uint32_t match_call_count; /* Number of times a new frame is created */
BOOL hitend; /* Hit the end of the subject at some point */
BOOL hasthen; /* Pattern contains (*THEN) */
BOOL haslookbehind; /* Pattern contains sigificant lookbehind */
BOOL allowemptypartial; /* Allow empty hard partial */
const uint8_t *lcc; /* Points to lower casing table */
const uint8_t *fcc; /* Points to case-flipping table */
const uint8_t *ctypes; /* Points to table of type maps */
@ -910,7 +910,7 @@ typedef struct dfa_match_block {
uint32_t poptions; /* Pattern options */
uint32_t nltype; /* Newline type */
uint32_t nllen; /* Newline string length */
BOOL haslookbehind; /* Pattern contains significant lookbehind */
BOOL allowemptypartial; /* Allow empty hard partial */
PCRE2_UCHAR nl[4]; /* Newline string when fixed */
uint16_t bsr_convention; /* \R interpretation */
pcre2_callout_block *cb; /* Points to a callout block */

@ -508,7 +508,8 @@ A partial match is returned only if no complete match can be found. */
}
#define SCHECK_PARTIAL()\
if (mb->partial != 0 && (Feptr > mb->start_used_ptr || mb->haslookbehind)) \
if (mb->partial != 0 && \
(Feptr > mb->start_used_ptr || mb->allowemptypartial)) \
{ \
mb->hitend = TRUE; \
if (mb->partial > 1) return PCRE2_ERROR_PARTIAL; \
@ -6451,7 +6452,8 @@ mb->start_subject = subject;
mb->start_offset = start_offset;
mb->end_subject = end_subject;
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
mb->haslookbehind = (re->max_lookbehind > 0);
mb->allowemptypartial = (re->max_lookbehind > 0) ||
(re->flags & PCRE2_MATCH_EMPTY) != 0;
mb->poptions = re->overall_options; /* Pattern options */
mb->ignore_skip_arg = 0;
mb->mark = mb->nomatch_mark = NULL; /* In case never set */
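
As a rough illustration of the condition this commit changes, the sketch below models it outside PCRE2. The `sketch_block` type and both `sketch_` functions are hypothetical names invented here; only the field names `start_used_ptr`, `end_subject`, and `allowemptypartial` and the shape of the two tests are taken from the hunks above. The sketch leaves out the partial-option checks and everything else the real matchers do, so it should be read as a model of the logic, not as the library's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the fields touched by the diff.
   The real match_block and dfa_match_block have many more members. */
typedef struct {
  const char *start_used_ptr;  /* earliest subject position inspected */
  const char *end_subject;     /* one past the last subject code unit */
  bool allowemptypartial;      /* lookbehind present, or may match empty */
} sketch_block;

/* Models the new assignment: the flag is set when the pattern has a
   lookbehind (max_lookbehind > 0) or is flagged as able to match an
   empty string (PCRE2_MATCH_EMPTY in the real code). */
static bool sketch_allow_empty_partial(unsigned max_lookbehind,
                                       bool may_match_empty)
{
  return max_lookbehind > 0 || may_match_empty;
}

/* Models the end-of-subject test: report a partial match only if the
   end of the subject was reached and either at least one character was
   inspected or an empty partial match is allowed. */
static bool sketch_is_partial(const sketch_block *mb, const char *ptr)
{
  assert(mb != NULL && ptr != NULL);
  return ptr >= mb->end_subject &&
         (ptr > mb->start_used_ptr || mb->allowemptypartial);
}
```

In this model, a pattern with no lookbehind that is flagged as able to match an empty string yields a partial match at the end of the subject even when nothing has been inspected. That is the behaviour the new test cases expect under hard partial matching; before the change, only the lookbehind term could set the flag.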

testdata/testinput2

@ -5719,4 +5719,10 @@ a)"xI
abc\n\=ph,no_jit
abc\n\=ps
/(?![ab]).*/
ab\=ph,no_jit
/c*+/
ab\=ph,offset=2,no_jit
# End of testinput2

testdata/testinput6

@ -5020,4 +5020,10 @@
bxyz
xyz
/(?![ab]).*/
ab\=ph
/c*+/
ab\=ph,offset=2
# End of testinput6

@ -17231,6 +17231,14 @@ Partial match: \x0a
abc\n\=ps
0:
/(?![ab]).*/
ab\=ph,no_jit
Partial match:
/c*+/
ab\=ph,offset=2,no_jit
Partial match:
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data

@ -7887,4 +7887,12 @@ Partial match:
xyz
0:
/(?![ab]).*/
ab\=ph
Partial match:
/c*+/
ab\=ph,offset=2
Partial match:
# End of testinput6