Allow (*ACCEPT) to be quantified.

Philip.Hazel 2019-06-10 16:41:22 +00:00
parent cc51779d88
commit 306f2b9c57
7 changed files with 264 additions and 168 deletions

@@ -25,6 +25,9 @@ PCRE2_MATCH_INVALID_UTF compile-time option.
 7. Adjust the limit for "must have" code unit searching, in particular,
 increase it substantially for non-anchored patterns.
+8. Allow (*ACCEPT) to be quantified, because an ungreedy quantifier with a zero
+minimum is potentially useful.
 Version 10.33 16-April-2019
 ---------------------------

@@ -3224,8 +3224,8 @@ The doubling is removed before the string is passed to the callout function.
 There are a number of special "Backtracking Control Verbs" (to use Perl's
 terminology) that modify the behaviour of backtracking during matching. They
 are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
-possibly behaving differently depending on whether or not a name is present.
-The names are not required to be unique within the pattern.
+and may behave differently depending on whether or not a name argument is
+present. The names are not required to be unique within the pattern.
 </P>
 <P>
 By default, for compatibility with Perl, a name is any sequence of characters
@@ -3253,7 +3253,8 @@ PCRE2_ALT_VERBNAMES is also set.
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
 parenthesis immediately follows the colon, the effect is as if the colon were
-not there. Any number of these verbs may occur in a pattern.
+not there. Any number of these verbs may occur in a pattern. Except for
+(*ACCEPT), they may not be quantified.
 </P>
 <P>
 Since these verbs are specifically related to backtracking, most of them can be
@@ -3316,6 +3317,18 @@ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
 the outer parentheses.
 </P>
 <P>
+(*ACCEPT) is the only backtracking verb that is allowed to be quantified
+because an ungreedy quantification with a minimum of zero acts only when a
+backtrack happens. Consider, for example,
+<pre>
+A(*ACCEPT)??BC
+</pre>
+where A, B, and C may be complex expressions. After matching "A", the matcher
+processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
+the match succeeds. Whereas (*COMMIT) (see below) means "fail on backtrack", a
+repeated (*ACCEPT) of this type means "succeed on backtrack".
+</P>
+<P>
 <b>Warning:</b> (*ACCEPT) should not be used within a script run group, because
 it causes an immediate exit from the group, bypassing the script run checking.
 <pre>
@@ -3333,8 +3346,9 @@ A match with the string "aaaa" always fails, but the callout is taken before
 each backtrack happens (in this example, 10 times).
 </P>
 <P>
-(*ACCEPT:NAME) and (*FAIL:NAME) are treated as (*MARK:NAME)(*ACCEPT) and
-(*MARK:NAME)(*FAIL), respectively.
+(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
+(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
+the verb acts.
 </P>
 <br><b>
 Recording which path was taken
@@ -3728,7 +3742,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC31" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 23 May 2019
+Last updated: 10 June 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>

@@ -8947,8 +8947,8 @@ BACKTRACKING CONTROL
 There are a number of special "Backtracking Control Verbs" (to use
 Perl's terminology) that modify the behaviour of backtracking during
 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
-verbs take either form, possibly behaving differently depending on
-whether or not a name is present. The names are not required to be
+verbs take either form, and may behave differently depending on whether
+or not a name argument is present. The names are not required to be
 unique within the pattern.
 By default, for compatibility with Perl, a name is any sequence of
@@ -8975,7 +8975,7 @@ BACKTRACKING CONTROL
 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
 closing parenthesis immediately follows the colon, the effect is as if
 the colon were not there. Any number of these verbs may occur in a pat-
-tern.
+tern. Except for (*ACCEPT), they may not be quantified.
 Since these verbs are specifically related to backtracking, most of
 them can be used only when the pattern is to be matched using the tra-
@@ -9025,6 +9025,18 @@ BACKTRACKING CONTROL
 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
 tured by the outer parentheses.
+(*ACCEPT) is the only backtracking verb that is allowed to be quanti-
+fied because an ungreedy quantification with a minimum of zero acts
+only when a backtrack happens. Consider, for example,
+A(*ACCEPT)??BC
+where A, B, and C may be complex expressions. After matching "A", the
+matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
+is triggered and the match succeeds. Whereas (*COMMIT) (see below)
+means "fail on backtrack", a repeated (*ACCEPT) of this type means
+"succeed on backtrack".
 Warning: (*ACCEPT) should not be used within a script run group,
 because it causes an immediate exit from the group, bypassing the
 script run checking.
@@ -9043,8 +9055,9 @@ BACKTRACKING CONTROL
 A match with the string "aaaa" always fails, but the callout is taken
 before each backtrack happens (in this example, 10 times).
-(*ACCEPT:NAME) and (*FAIL:NAME) are treated as (*MARK:NAME)(*ACCEPT)
-and (*MARK:NAME)(*FAIL), respectively.
+(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as
+(*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a
+(*MARK) is recorded just before the verb acts.
 Recording which path was taken
@@ -9414,7 +9427,7 @@ AUTHOR
 REVISION
-Last updated: 23 May 2019
+Last updated: 10 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------

@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "23 May 2019" "PCRE2 10.34"
+.TH PCRE2PATTERN 3 "10 June 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -3262,8 +3262,8 @@ The doubling is removed before the string is passed to the callout function.
 There are a number of special "Backtracking Control Verbs" (to use Perl's
 terminology) that modify the behaviour of backtracking during matching. They
 are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
-possibly behaving differently depending on whether or not a name is present.
-The names are not required to be unique within the pattern.
+and may behave differently depending on whether or not a name argument is
+present. The names are not required to be unique within the pattern.
 .P
 By default, for compatibility with Perl, a name is any sequence of characters
 that does not include a closing parenthesis. The name is not processed in
@@ -3287,7 +3287,8 @@ PCRE2_ALT_VERBNAMES is also set.
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
 parenthesis immediately follows the colon, the effect is as if the colon were
-not there. Any number of these verbs may occur in a pattern.
+not there. Any number of these verbs may occur in a pattern. Except for
+(*ACCEPT), they may not be quantified.
 .P
 Since these verbs are specifically related to backtracking, most of them can be
 used only when the pattern is to be matched using the traditional matching
@@ -3361,6 +3362,17 @@ example:
 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
 the outer parentheses.
 .P
+(*ACCEPT) is the only backtracking verb that is allowed to be quantified
+because an ungreedy quantification with a minimum of zero acts only when a
+backtrack happens. Consider, for example,
+.sp
+A(*ACCEPT)??BC
+.sp
+where A, B, and C may be complex expressions. After matching "A", the matcher
+processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
+the match succeeds. Whereas (*COMMIT) (see below) means "fail on backtrack", a
+repeated (*ACCEPT) of this type means "succeed on backtrack".
+.P
 \fBWarning:\fP (*ACCEPT) should not be used within a script run group, because
 it causes an immediate exit from the group, bypassing the script run checking.
 .sp
@@ -3377,8 +3389,9 @@ nearest equivalent is the callout feature, as for example in this pattern:
 A match with the string "aaaa" always fails, but the callout is taken before
 each backtrack happens (in this example, 10 times).
 .P
-(*ACCEPT:NAME) and (*FAIL:NAME) are treated as (*MARK:NAME)(*ACCEPT) and
-(*MARK:NAME)(*FAIL), respectively.
+(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
+(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
+the verb acts.
 .
 .
 .SS "Recording which path was taken"
@@ -3764,6 +3777,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 23 May 2019
+Last updated: 10 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
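
The "succeed on backtrack" reading in the documentation changes above can be modelled without PCRE2 at all. The following is a minimal sketch, assuming A and BC are plain literal strings and the match is anchored at the start of the subject. `match_accept_lazy` is a hypothetical helper written for illustration; it is not part of the PCRE2 API and does not reflect how the matcher is implemented.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of A(*ACCEPT)??BC for literal A and BC, anchored at the start of
   the subject. Returns the length of the match, or -1 if there is no match.
   This is an illustrative assumption, not PCRE2 code. */
static long match_accept_lazy(const char *subject, const char *a, const char *bc)
{
  size_t alen = strlen(a);
  size_t bclen = strlen(bc);

  /* A must match first; if it does not, nothing can rescue the match. */
  if (strncmp(subject, a, alen) != 0) return -1;

  /* The lazy quantifier tries zero repetitions first, so BC is attempted. */
  if (strncmp(subject + alen, bc, bclen) == 0) return (long)(alen + bclen);

  /* BC failed: backtracking reaches (*ACCEPT)??, which now takes its one
     repetition and ends the match successfully right after A. */
  return (long)alen;
}
```

With `a = "a"` and `bc = "bc"`, the subject "abc" yields 3 and "axy" yields 1, which corresponds to the "abc" and "a" results recorded for /a(*ACCEPT)??bc/ in testdata/testoutput2.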

@@ -1419,9 +1419,6 @@ the result is "not a repeat quantifier". */
 EXIT:
 if (yield || *errorcodeptr != 0) *ptrptr = p;
 return yield;
 }
@@ -2450,8 +2447,9 @@ must be last. */
 enum { RANGE_NO, RANGE_STARTED, RANGE_OK_ESCAPED, RANGE_OK_LITERAL };
-/* Only in 32-bit mode can there be literals > META_END. A macros encapsulates
-the storing of literal values in the parsed pattern. */
+/* Only in 32-bit mode can there be literals > META_END. A macro encapsulates
+the storing of literal values in the main parsed pattern, where they can always
+be quantified. */
 #if PCRE2_CODE_UNIT_WIDTH == 32
 #define PARSED_LITERAL(c, p) \
@@ -2474,6 +2472,7 @@ uint32_t delimiter;
 uint32_t namelen;
 uint32_t class_range_state;
 uint32_t *verblengthptr = NULL; /* Value avoids compiler warning */
+uint32_t *verbstartptr = NULL;
 uint32_t *previous_callout = NULL;
 uint32_t *parsed_pattern = cb->parsed_pattern;
 uint32_t *parsed_pattern_end = cb->parsed_pattern_end;
@@ -2640,13 +2639,15 @@ while (ptr < ptrend)
 switch(c)
 {
-default:
-PARSED_LITERAL(c, parsed_pattern);
+default: /* Don't use PARSED_LITERAL() because it */
+#if PCRE2_CODE_UNIT_WIDTH == 32 /* sets okquantifier. */
+if (c >= META_END) *parsed_pattern++ = META_BIGVALUE;
+#endif
+*parsed_pattern++ = c;
 break;
 case CHAR_RIGHT_PARENTHESIS:
 inverbname = FALSE;
+okquantifier = FALSE; /* Was probably set by literals */
 /* This is the length in characters */
 verbnamelength = (PCRE2_SIZE)(parsed_pattern - verblengthptr - 1);
 /* But the limit on the length is in code units */
@@ -3135,6 +3136,21 @@ while (ptr < ptrend)
 goto FAILED_BACK;
 }
+/* Most (*VERB)s are not allowed to be quantified, but an ungreedy
+quantifier can be useful for (*ACCEPT) - meaning "succeed on backtrack", a
+sort of negated (*COMMIT). We therefore allow (*ACCEPT) to be quantified by
+wrapping it in non-capturing brackets, but we have to allow for a preceding
+(*MARK) for when (*ACCEPT) has an argument. */
+if (parsed_pattern[-1] == META_ACCEPT)
+  {
+  uint32_t *p;
+  for (p = parsed_pattern - 1; p >= verbstartptr; p--) p[1] = p[0];
+  *verbstartptr = META_NOCAPTURE;
+  parsed_pattern[1] = META_KET;
+  parsed_pattern += 2;
+  }
 /* Now we can put the quantifier into the parsed pattern vector. At this
 stage, we have only the basic quantifier. The check for a following + or ?
 modifier happens at the top of the loop, after any intervening comments
@@ -3775,6 +3791,12 @@ while (ptr < ptrend)
 goto FAILED;
 }
+/* Remember where this verb, possibly with a preceding (*MARK), starts,
+for handling quantified (*ACCEPT). */
+verbstartptr = parsed_pattern;
+okquantifier = (verbs[i].meta == META_ACCEPT);
 /* It appears that Perl allows any characters whatsoever, other than a
 closing parenthesis, to appear in arguments ("names"), so we no longer
 insist on letters, digits, and underscores. Perl does not, however, do
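
The wrapping step in the hunk at -3135 (shift the verb up one slot, put a non-capturing open bracket in front, append a closing bracket) can be tried in isolation. The sketch below is a simplified stand-in, not the PCRE2 code: the `TOK_*` values and `wrap_verb` are hypothetical names invented for illustration, and a mark is modelled as a single token, whereas the real parsed pattern also stores the mark's name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical token codes for illustration only. The real META_* values are
   defined in the PCRE2 sources and are not reproduced here. */
enum { TOK_MARK = 1, TOK_ACCEPT, TOK_NOCAPTURE, TOK_KET };

/* Wrap the tokens in [verbstart, end) as NOCAPTURE <tokens> KET and return the
   new end pointer. The buffer must have room for two more tokens past end. */
static uint32_t *wrap_verb(uint32_t *verbstart, uint32_t *end)
{
  assert(verbstart != NULL && end > verbstart);

  /* Shift the verb (and any preceding mark token) up by one slot. The commit
     does this with an explicit backwards loop; memmove is equivalent here. */
  memmove(verbstart + 1, verbstart,
          (size_t)(end - verbstart) * sizeof *verbstart);

  *verbstart = TOK_NOCAPTURE;   /* opening non-capturing bracket */
  end[1] = TOK_KET;             /* closing bracket, after the shifted tokens */
  return end + 2;
}
```

Given a buffer holding `'a', TOK_MARK, TOK_ACCEPT` with `verbstart` pointing at the mark, the call leaves `'a', TOK_NOCAPTURE, TOK_MARK, TOK_ACCEPT, TOK_KET`, so a following quantifier applies to the whole bracketed group. This is the same shape as the explicit `/a(?:(*ACCEPT))??bc/` pattern in the new tests.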

testdata/testinput2

@@ -5591,4 +5591,16 @@ a)"xI
 /\[()]{65535}(?<A>)/expand
+/a(?:(*ACCEPT))??bc/
+    abc
+    axy
+/a(*ACCEPT)??bc/
+    abc
+    axy
+/a(*ACCEPT:XX)??bc/mark
+    abc
+    axy
 # End of testinput2

testdata/testoutput2

@@ -16940,6 +16940,25 @@ Failed: error 197 at offset 131071: too many capturing groups (maximum 65535)
 /\[()]{65535}(?<A>)/expand
 Failed: error 197 at offset 131075: too many capturing groups (maximum 65535)
+/a(?:(*ACCEPT))??bc/
+    abc
+ 0: abc
+    axy
+ 0: a
+/a(*ACCEPT)??bc/
+    abc
+ 0: abc
+    axy
+ 0: a
+/a(*ACCEPT:XX)??bc/mark
+    abc
+ 0: abc
+    axy
+ 0: a
+MK: XX
 # End of testinput2
 Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data