Allow (*ACCEPT) to be quantified.
parent cc51779d88
commit 306f2b9c57
|
@ -25,6 +25,9 @@ PCRE2_MATCH_INVALID_UTF compile-time option.
|
||||||
7. Adjust the limit for "must have" code unit searching, in particular,
|
7. Adjust the limit for "must have" code unit searching, in particular,
|
||||||
increase it substantially for non-anchored patterns.
|
increase it substantially for non-anchored patterns.
|
||||||
|
|
||||||
|
8. Allow (*ACCEPT) to be quantified, because an ungreedy quantifier with a zero
|
||||||
|
minimum is potentially useful.
|
||||||
|
|
||||||
|
|
||||||
Version 10.33 16-April-2019
|
Version 10.33 16-April-2019
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
|
@ -3224,8 +3224,8 @@ The doubling is removed before the string is passed to the callout function.
|
||||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||||
terminology) that modify the behaviour of backtracking during matching. They
|
terminology) that modify the behaviour of backtracking during matching. They
|
||||||
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||||
possibly behaving differently depending on whether or not a name is present.
|
and may behave differently depending on whether or not a name argument is
|
||||||
The names are not required to be unique within the pattern.
|
present. The names are not required to be unique within the pattern.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
By default, for compatibility with Perl, a name is any sequence of characters
|
By default, for compatibility with Perl, a name is any sequence of characters
|
||||||
|
@ -3253,7 +3253,8 @@ PCRE2_ALT_VERBNAMES is also set.
|
||||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||||
parenthesis immediately follows the colon, the effect is as if the colon were
|
parenthesis immediately follows the colon, the effect is as if the colon were
|
||||||
not there. Any number of these verbs may occur in a pattern.
|
not there. Any number of these verbs may occur in a pattern. Except for
|
||||||
|
(*ACCEPT), they may not be quantified.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Since these verbs are specifically related to backtracking, most of them can be
|
Since these verbs are specifically related to backtracking, most of them can be
|
||||||
|
@ -3316,6 +3317,18 @@ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
||||||
the outer parentheses.
|
the outer parentheses.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
(*ACCEPT) is the only backtracking verb that is allowed to be quantified
|
||||||
|
because an ungreedy quantification with a minimum of zero acts only when a
|
||||||
|
backtrack happens. Consider, for example,
|
||||||
|
<pre>
|
||||||
|
A(*ACCEPT)??BC
|
||||||
|
</pre>
|
||||||
|
where A, B, and C may be complex expressions. After matching "A", the matcher
|
||||||
|
processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
|
||||||
|
the match succeeds. Whereas (*COMMIT) (see below) means "fail on backtrack", a
|
||||||
|
repeated (*ACCEPT) of this type means "succeed on backtrack".
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
<b>Warning:</b> (*ACCEPT) should not be used within a script run group, because
|
<b>Warning:</b> (*ACCEPT) should not be used within a script run group, because
|
||||||
it causes an immediate exit from the group, bypassing the script run checking.
|
it causes an immediate exit from the group, bypassing the script run checking.
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -3333,8 +3346,9 @@ A match with the string "aaaa" always fails, but the callout is taken before
|
||||||
each backtrack happens (in this example, 10 times).
|
each backtrack happens (in this example, 10 times).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
(*ACCEPT:NAME) and (*FAIL:NAME) are treated as (*MARK:NAME)(*ACCEPT) and
|
(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
|
||||||
(*MARK:NAME)(*FAIL), respectively.
|
(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
|
||||||
|
the verb acts.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Recording which path was taken
|
Recording which path was taken
|
||||||
|
@ -3728,7 +3742,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 May 2019
|
Last updated: 10 June 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -8947,8 +8947,8 @@ BACKTRACKING CONTROL
|
||||||
There are a number of special "Backtracking Control Verbs" (to use
|
There are a number of special "Backtracking Control Verbs" (to use
|
||||||
Perl's terminology) that modify the behaviour of backtracking during
|
Perl's terminology) that modify the behaviour of backtracking during
|
||||||
matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
|
matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
|
||||||
verbs take either form, possibly behaving differently depending on
|
verbs take either form, and may behave differently depending on whether
|
||||||
whether or not a name is present. The names are not required to be
|
or not a name argument is present. The names are not required to be
|
||||||
unique within the pattern.
|
unique within the pattern.
|
||||||
|
|
||||||
By default, for compatibility with Perl, a name is any sequence of
|
By default, for compatibility with Perl, a name is any sequence of
|
||||||
|
@ -8975,7 +8975,7 @@ BACKTRACKING CONTROL
|
||||||
the 16-bit and 32-bit libraries. If the name is empty, that is, if the
|
the 16-bit and 32-bit libraries. If the name is empty, that is, if the
|
||||||
closing parenthesis immediately follows the colon, the effect is as if
|
closing parenthesis immediately follows the colon, the effect is as if
|
||||||
the colon were not there. Any number of these verbs may occur in a pat-
|
the colon were not there. Any number of these verbs may occur in a pat-
|
||||||
tern.
|
tern. Except for (*ACCEPT), they may not be quantified.
|
||||||
|
|
||||||
Since these verbs are specifically related to backtracking, most of
|
Since these verbs are specifically related to backtracking, most of
|
||||||
them can be used only when the pattern is to be matched using the tra-
|
them can be used only when the pattern is to be matched using the tra-
|
||||||
|
@ -9025,6 +9025,18 @@ BACKTRACKING CONTROL
|
||||||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
|
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
|
||||||
tured by the outer parentheses.
|
tured by the outer parentheses.
|
||||||
|
|
||||||
|
(*ACCEPT) is the only backtracking verb that is allowed to be quanti-
|
||||||
|
fied because an ungreedy quantification with a minimum of zero acts
|
||||||
|
only when a backtrack happens. Consider, for example,
|
||||||
|
|
||||||
|
A(*ACCEPT)??BC
|
||||||
|
|
||||||
|
where A, B, and C may be complex expressions. After matching "A", the
|
||||||
|
matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
|
||||||
|
is triggered and the match succeeds. Whereas (*COMMIT) (see below)
|
||||||
|
means "fail on backtrack", a repeated (*ACCEPT) of this type means
|
||||||
|
"succeed on backtrack".
|
||||||
|
|
||||||
Warning: (*ACCEPT) should not be used within a script run group,
|
Warning: (*ACCEPT) should not be used within a script run group,
|
||||||
because it causes an immediate exit from the group, bypassing the
|
because it causes an immediate exit from the group, bypassing the
|
||||||
script run checking.
|
script run checking.
|
||||||
|
@ -9043,8 +9055,9 @@ BACKTRACKING CONTROL
|
||||||
A match with the string "aaaa" always fails, but the callout is taken
|
A match with the string "aaaa" always fails, but the callout is taken
|
||||||
before each backtrack happens (in this example, 10 times).
|
before each backtrack happens (in this example, 10 times).
|
||||||
|
|
||||||
(*ACCEPT:NAME) and (*FAIL:NAME) are treated as (*MARK:NAME)(*ACCEPT)
|
(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as
|
||||||
and (*MARK:NAME)(*FAIL), respectively.
|
(*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a
|
||||||
|
(*MARK) is recorded just before the verb acts.
|
||||||
|
|
||||||
Recording which path was taken
|
Recording which path was taken
|
||||||
|
|
||||||
|
@ -9414,7 +9427,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 23 May 2019
|
Last updated: 10 June 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "23 May 2019" "PCRE2 10.34"
|
.TH PCRE2PATTERN 3 "10 June 2019" "PCRE2 10.34"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -3262,8 +3262,8 @@ The doubling is removed before the string is passed to the callout function.
|
||||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||||
terminology) that modify the behaviour of backtracking during matching. They
|
terminology) that modify the behaviour of backtracking during matching. They
|
||||||
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||||
possibly behaving differently depending on whether or not a name is present.
|
and may behave differently depending on whether or not a name argument is
|
||||||
The names are not required to be unique within the pattern.
|
present. The names are not required to be unique within the pattern.
|
||||||
.P
|
.P
|
||||||
By default, for compatibility with Perl, a name is any sequence of characters
|
By default, for compatibility with Perl, a name is any sequence of characters
|
||||||
that does not include a closing parenthesis. The name is not processed in
|
that does not include a closing parenthesis. The name is not processed in
|
||||||
|
@ -3287,7 +3287,8 @@ PCRE2_ALT_VERBNAMES is also set.
|
||||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||||
parenthesis immediately follows the colon, the effect is as if the colon were
|
parenthesis immediately follows the colon, the effect is as if the colon were
|
||||||
not there. Any number of these verbs may occur in a pattern.
|
not there. Any number of these verbs may occur in a pattern. Except for
|
||||||
|
(*ACCEPT), they may not be quantified.
|
||||||
.P
|
.P
|
||||||
Since these verbs are specifically related to backtracking, most of them can be
|
Since these verbs are specifically related to backtracking, most of them can be
|
||||||
used only when the pattern is to be matched using the traditional matching
|
used only when the pattern is to be matched using the traditional matching
|
||||||
|
@ -3361,6 +3362,17 @@ example:
|
||||||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
||||||
the outer parentheses.
|
the outer parentheses.
|
||||||
.P
|
.P
|
||||||
|
(*ACCEPT) is the only backtracking verb that is allowed to be quantified
|
||||||
|
because an ungreedy quantification with a minimum of zero acts only when a
|
||||||
|
backtrack happens. Consider, for example,
|
||||||
|
.sp
|
||||||
|
A(*ACCEPT)??BC
|
||||||
|
.sp
|
||||||
|
where A, B, and C may be complex expressions. After matching "A", the matcher
|
||||||
|
processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
|
||||||
|
the match succeeds. Whereas (*COMMIT) (see below) means "fail on backtrack", a
|
||||||
|
repeated (*ACCEPT) of this type means "succeed on backtrack".
|
||||||
|
.P
|
||||||
\fBWarning:\fP (*ACCEPT) should not be used within a script run group, because
|
\fBWarning:\fP (*ACCEPT) should not be used within a script run group, because
|
||||||
it causes an immediate exit from the group, bypassing the script run checking.
|
it causes an immediate exit from the group, bypassing the script run checking.
|
||||||
.sp
|
.sp
|
||||||
|
@ -3377,8 +3389,9 @@ nearest equivalent is the callout feature, as for example in this pattern:
|
||||||
A match with the string "aaaa" always fails, but the callout is taken before
|
A match with the string "aaaa" always fails, but the callout is taken before
|
||||||
each backtrack happens (in this example, 10 times).
|
each backtrack happens (in this example, 10 times).
|
||||||
.P
|
.P
|
||||||
(*ACCEPT:NAME) and (*FAIL:NAME) are treated as (*MARK:NAME)(*ACCEPT) and
|
(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
|
||||||
(*MARK:NAME)(*FAIL), respectively.
|
(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
|
||||||
|
the verb acts.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Recording which path was taken"
|
.SS "Recording which path was taken"
|
||||||
|
@ -3764,6 +3777,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 23 May 2019
|
Last updated: 10 June 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1419,9 +1419,6 @@ the result is "not a repeat quantifier". */
|
||||||
EXIT:
|
EXIT:
|
||||||
if (yield || *errorcodeptr != 0) *ptrptr = p;
|
if (yield || *errorcodeptr != 0) *ptrptr = p;
|
||||||
return yield;
|
return yield;
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@ -2450,8 +2447,9 @@ must be last. */
|
||||||
|
|
||||||
enum { RANGE_NO, RANGE_STARTED, RANGE_OK_ESCAPED, RANGE_OK_LITERAL };
|
enum { RANGE_NO, RANGE_STARTED, RANGE_OK_ESCAPED, RANGE_OK_LITERAL };
|
||||||
|
|
||||||
/* Only in 32-bit mode can there be literals > META_END. A macros encapsulates
|
/* Only in 32-bit mode can there be literals > META_END. A macro encapsulates
|
||||||
the storing of literal values in the parsed pattern. */
|
the storing of literal values in the main parsed pattern, where they can always
|
||||||
|
be quantified. */
|
||||||
|
|
||||||
#if PCRE2_CODE_UNIT_WIDTH == 32
|
#if PCRE2_CODE_UNIT_WIDTH == 32
|
||||||
#define PARSED_LITERAL(c, p) \
|
#define PARSED_LITERAL(c, p) \
|
||||||
|
@ -2474,6 +2472,7 @@ uint32_t delimiter;
|
||||||
uint32_t namelen;
|
uint32_t namelen;
|
||||||
uint32_t class_range_state;
|
uint32_t class_range_state;
|
||||||
uint32_t *verblengthptr = NULL; /* Value avoids compiler warning */
|
uint32_t *verblengthptr = NULL; /* Value avoids compiler warning */
|
||||||
|
uint32_t *verbstartptr = NULL;
|
||||||
uint32_t *previous_callout = NULL;
|
uint32_t *previous_callout = NULL;
|
||||||
uint32_t *parsed_pattern = cb->parsed_pattern;
|
uint32_t *parsed_pattern = cb->parsed_pattern;
|
||||||
uint32_t *parsed_pattern_end = cb->parsed_pattern_end;
|
uint32_t *parsed_pattern_end = cb->parsed_pattern_end;
|
||||||
|
@ -2640,13 +2639,15 @@ while (ptr < ptrend)
|
||||||
|
|
||||||
switch(c)
|
switch(c)
|
||||||
{
|
{
|
||||||
default:
|
default: /* Don't use PARSED_LITERAL() because it */
|
||||||
PARSED_LITERAL(c, parsed_pattern);
|
#if PCRE2_CODE_UNIT_WIDTH == 32 /* sets okquantifier. */
|
||||||
|
if (c >= META_END) *parsed_pattern++ = META_BIGVALUE;
|
||||||
|
#endif
|
||||||
|
*parsed_pattern++ = c;
|
||||||
break;
|
break;
|
||||||
|
|
||||||
case CHAR_RIGHT_PARENTHESIS:
|
case CHAR_RIGHT_PARENTHESIS:
|
||||||
inverbname = FALSE;
|
inverbname = FALSE;
|
||||||
okquantifier = FALSE; /* Was probably set by literals */
|
|
||||||
/* This is the length in characters */
|
/* This is the length in characters */
|
||||||
verbnamelength = (PCRE2_SIZE)(parsed_pattern - verblengthptr - 1);
|
verbnamelength = (PCRE2_SIZE)(parsed_pattern - verblengthptr - 1);
|
||||||
/* But the limit on the length is in code units */
|
/* But the limit on the length is in code units */
|
||||||
|
@ -3135,6 +3136,21 @@ while (ptr < ptrend)
|
||||||
goto FAILED_BACK;
|
goto FAILED_BACK;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* Most (*VERB)s are not allowed to be quantified, but an ungreedy
|
||||||
|
quantifier can be useful for (*ACCEPT) - meaning "succeed on backtrack", a
|
||||||
|
sort of negated (*COMMIT). We therefore allow (*ACCEPT) to be quantified by
|
||||||
|
wrapping it in non-capturing brackets, but we have to allow for a preceding
|
||||||
|
(*MARK) for when (*ACCEPT) has an argument. */
|
||||||
|
|
||||||
|
if (parsed_pattern[-1] == META_ACCEPT)
|
||||||
|
{
|
||||||
|
uint32_t *p;
|
||||||
|
for (p = parsed_pattern - 1; p >= verbstartptr; p--) p[1] = p[0];
|
||||||
|
*verbstartptr = META_NOCAPTURE;
|
||||||
|
parsed_pattern[1] = META_KET;
|
||||||
|
parsed_pattern += 2;
|
||||||
|
}
|
||||||
|
|
||||||
/* Now we can put the quantifier into the parsed pattern vector. At this
|
/* Now we can put the quantifier into the parsed pattern vector. At this
|
||||||
stage, we have only the basic quantifier. The check for a following + or ?
|
stage, we have only the basic quantifier. The check for a following + or ?
|
||||||
modifier happens at the top of the loop, after any intervening comments
|
modifier happens at the top of the loop, after any intervening comments
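The in-place wrapping performed by the new block in this hunk can be tried in isolation. The following is a minimal sketch, not the PCRE2 source: the `TOK_*` values and the `wrap_verb` helper are hypothetical names invented for illustration, standing in for `META_NOCAPTURE`, `META_KET`, `META_ACCEPT`, and the `verbstartptr`/`parsed_pattern` pointers. It assumes the buffer has room for two extra tokens, as the parsed pattern vector presumably does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical token values, chosen only for this sketch; the real META_*
   codes in PCRE2 are different. */
enum { TOK_MARK = 1, TOK_ACCEPT = 2, TOK_NOCAPTURE = 3, TOK_KET = 4 };

/* Wrap the tokens in [verb_start, end) in a non-capturing group, in place.
   The caller must guarantee room for two more tokens after 'end'.
   Returns the new end pointer (two tokens further on). */
static uint32_t *wrap_verb(uint32_t *verb_start, uint32_t *end)
{
  uint32_t *p;

  assert(verb_start < end);

  /* Shift every token of the verb (including any preceding mark) up by one
     slot, working from the back so nothing is overwritten. */
  for (p = end; p > verb_start; p--) p[0] = p[-1];

  *verb_start = TOK_NOCAPTURE;  /* opening of the wrapper group */
  end[1] = TOK_KET;             /* closing of the wrapper group */
  return end + 2;
}
```

Remembering the start of the verb, rather than only its last token, is what lets a verb with an argument such as (*ACCEPT:XX) carry its mark tokens inside the wrapper, so the quantifier applies to the mark and the verb together.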
|
||||||
|
@ -3775,6 +3791,12 @@ while (ptr < ptrend)
|
||||||
goto FAILED;
|
goto FAILED;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* Remember where this verb, possibly with a preceding (*MARK), starts,
|
||||||
|
for handling quantified (*ACCEPT). */
|
||||||
|
|
||||||
|
verbstartptr = parsed_pattern;
|
||||||
|
okquantifier = (verbs[i].meta == META_ACCEPT);
|
||||||
|
|
||||||
/* It appears that Perl allows any characters whatsoever, other than a
|
/* It appears that Perl allows any characters whatsoever, other than a
|
||||||
closing parenthesis, to appear in arguments ("names"), so we no longer
|
closing parenthesis, to appear in arguments ("names"), so we no longer
|
||||||
insist on letters, digits, and underscores. Perl does not, however, do
|
insist on letters, digits, and underscores. Perl does not, however, do
|
||||||
|
|
|
@ -5591,4 +5591,16 @@ a)"xI
|
||||||
|
|
||||||
/\[()]{65535}(?<A>)/expand
|
/\[()]{65535}(?<A>)/expand
|
||||||
|
|
||||||
|
/a(?:(*ACCEPT))??bc/
|
||||||
|
abc
|
||||||
|
axy
|
||||||
|
|
||||||
|
/a(*ACCEPT)??bc/
|
||||||
|
abc
|
||||||
|
axy
|
||||||
|
|
||||||
|
/a(*ACCEPT:XX)??bc/mark
|
||||||
|
abc
|
||||||
|
axy
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
|
|
|
@ -16940,6 +16940,25 @@ Failed: error 197 at offset 131071: too many capturing groups (maximum 65535)
|
||||||
/\[()]{65535}(?<A>)/expand
|
/\[()]{65535}(?<A>)/expand
|
||||||
Failed: error 197 at offset 131075: too many capturing groups (maximum 65535)
|
Failed: error 197 at offset 131075: too many capturing groups (maximum 65535)
|
||||||
|
|
||||||
|
/a(?:(*ACCEPT))??bc/
|
||||||
|
abc
|
||||||
|
0: abc
|
||||||
|
axy
|
||||||
|
0: a
|
||||||
|
|
||||||
|
/a(*ACCEPT)??bc/
|
||||||
|
abc
|
||||||
|
0: abc
|
||||||
|
axy
|
||||||
|
0: a
|
||||||
|
|
||||||
|
/a(*ACCEPT:XX)??bc/mark
|
||||||
|
abc
|
||||||
|
0: abc
|
||||||
|
axy
|
||||||
|
0: a
|
||||||
|
MK: XX
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
||||||
Error -62: bad serialized data
|
Error -62: bad serialized data
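The expected results above for /a(*ACCEPT)??bc/ ("abc" gives "abc", "axy" gives "a") follow from the "succeed on backtrack" reading in the documentation hunks. The following is a rough model of that reading only, assuming literal strings for A and BC and a match attempt at the start of the subject; `match_accept_on_backtrack` is a hypothetical helper and does not call or reproduce the PCRE2 matcher.

```c
#include <assert.h>
#include <string.h>

/* Model of  A(*ACCEPT)??BC  for literal A and BC, tried only at the start
   of 'subject'. Returns the matched length, or -1 if A does not match. */
static long match_accept_on_backtrack(const char *a, const char *bc,
                                      const char *subject)
{
  size_t alen, bclen;

  assert(a != NULL && bc != NULL && subject != NULL);
  alen = strlen(a);
  bclen = strlen(bc);

  /* A must match first; if it does not, there is nothing to accept. */
  if (strncmp(subject, a, alen) != 0) return -1;

  /* The ungreedy ?? skips the verb on the first pass, so BC is tried. */
  if (strncmp(subject + alen, bc, bclen) == 0) return (long)(alen + bclen);

  /* BC failed: backtracking reaches the optional (*ACCEPT), which ends the
     match successfully with only A consumed. */
  return (long)alen;
}
```

With a = "a" and bc = "bc", the subject "abc" yields length 3 and "axy" yields length 1, in line with the "0: abc" and "0: a" lines in the test output.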
|
||||||
|
|