Allow real repetition of assertions.

Philip.Hazel 2020-01-01 12:07:02 +00:00
parent eaf4572ff8
commit 5ba5230b82
8 changed files with 114 additions and 81 deletions

@@ -32,6 +32,13 @@ now correctly backtracked, so this unnecessary restriction has been removed.
 regex engine. The Perl regex folks are aware of this usage and have made a note
 about it.
+9. When an assertion is repeated, PCRE2 used to limit the maximum repetition to
+1, believing that repeating an assertion is pointless. However, if a positive
+assertion contains capturing groups, repetition can be useful. In any case, an
+assertion could always be wrapped in a repeated group. The only restriction
+that is now imposed is that an unlimited maximum is changed to one more than
+the minimum.
 Version 10.34 21-November-2019
 ------------------------------

@@ -1901,8 +1901,8 @@ are permitted for groups with the same number, for example:
 (?|(?<AA>aa)|(?<AA>bb))
 </pre>
 The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
 option at compile time, or by the use of (?J) within the pattern, as described
 in the section entitled
 <a href="#internaloptions">"Internal Option Setting"</a>
 above.
 </P>
@@ -1968,7 +1968,7 @@ items:
 an escape such as \d or \pL that matches a single character
 a character class
 a backreference
-a parenthesized group (including most assertions)
+a parenthesized group (including lookaround assertions)
 a subroutine call (recursive or otherwise)
 </pre>
 The general repetition quantifier specifies a minimum and maximum number of
@@ -2359,7 +2359,7 @@ of zero.
 For versions of PCRE2 less than 10.25, backreferences of this type used to
 cause the group that they reference to be treated as an
 <a href="#atomicgroup">atomic group.</a>
 This restriction no longer applies, and backtracking into such groups can occur
 as normal.
 <a name="bigassertions"></a></P>
 <br><a name="SEC20" href="#TOC1">ASSERTIONS</a><br>
@@ -2420,26 +2420,13 @@ control passes to the previous backtracking point, thus discarding any captured
 strings within the assertion.
 </P>
 <P>
-For compatibility with Perl, most assertion groups may be repeated; though it
-makes no sense to assert the same thing several times, the side effect of
-capturing may occasionally be useful. However, an assertion that forms the
-condition for a conditional group may not be quantified. In practice, for
-other assertions, there only three cases:
-<br>
-<br>
-(1) If the quantifier is {0}, the assertion is never obeyed during matching.
-However, it may contain internal capture groups that are called from elsewhere
-via the
-<a href="#groupsassubroutines">subroutine mechanism.</a>
-<br>
-<br>
-(2) If quantifier is {0,n} where n is greater than zero, it is treated as if it
-were {0,1}. At run time, the rest of the pattern match is tried with and
-without the assertion, the order depending on the greediness of the quantifier.
-<br>
-<br>
-(3) If the minimum repetition is greater than zero, the quantifier is ignored.
-The assertion is obeyed just once when encountered during matching.
+Most assertion groups may be repeated; though it makes no sense to assert the
+same thing several times, the side effect of capturing in positive assertions
+may occasionally be useful. However, an assertion that forms the condition for
+a conditional group may not be quantified. PCRE2 used to restrict the
+repetition of assertions, but from release 10.35 the only restriction is that
+an unlimited maximum repetition is changed to be one more than the minimum. For
+example, {3,} is treated as {3,4}.
 </P>
 <br><b>
 Alphabetic assertion names
@@ -3840,9 +3827,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC32" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 29 December 2019
+Last updated: 01 January 2020
 <br>
-Copyright &copy; 1997-2019 University of Cambridge.
+Copyright &copy; 1997-2020 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.

@@ -7729,7 +7729,7 @@ REPETITION
 an escape such as \d or \pL that matches a single character
 a character class
 a backreference
-a parenthesized group (including most assertions)
+a parenthesized group (including lookaround assertions)
 a subroutine call (recursive or otherwise)
 The general repetition quantifier specifies a minimum and maximum num-
@@ -8162,24 +8162,14 @@ ASSERTIONS
 passes to the previous backtracking point, thus discarding any captured
 strings within the assertion.
-For compatibility with Perl, most assertion groups may be repeated;
-though it makes no sense to assert the same thing several times, the
-side effect of capturing may occasionally be useful. However, an asser-
-tion that forms the condition for a conditional group may not be quan-
-tified. In practice, for other assertions, there only three cases:
-(1) If the quantifier is {0}, the assertion is never obeyed during
-matching. However, it may contain internal capture groups that are
-called from elsewhere via the subroutine mechanism.
-(2) If quantifier is {0,n} where n is greater than zero, it is treated
-as if it were {0,1}. At run time, the rest of the pattern match is
-tried with and without the assertion, the order depending on the greed-
-iness of the quantifier.
-(3) If the minimum repetition is greater than zero, the quantifier is
-ignored. The assertion is obeyed just once when encountered during
-matching.
+Most assertion groups may be repeated; though it makes no sense to as-
+sert the same thing several times, the side effect of capturing in pos-
+itive assertions may occasionally be useful. However, an assertion that
+forms the condition for a conditional group may not be quantified.
+PCRE2 used to restrict the repetition of assertions, but from release
+10.35 the only restriction is that an unlimited maximum repetition is
+changed to be one more than the minimum. For example, {3,} is treated
+as {3,4}.
 Alphabetic assertion names
@@ -9490,8 +9480,8 @@ AUTHOR
 REVISION
-Last updated: 29 December 2019
-Copyright (c) 1997-2019 University of Cambridge.
+Last updated: 01 January 2020
+Copyright (c) 1997-2020 University of Cambridge.
 ------------------------------------------------------------------------------

@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "29 December 2019" "PCRE2 10.35"
+.TH PCRE2PATTERN 3 "01 January 2020" "PCRE2 10.35"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -1902,8 +1902,8 @@ are permitted for groups with the same number, for example:
 (?|(?<AA>aa)|(?<AA>bb))
 .sp
 The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
 option at compile time, or by the use of (?J) within the pattern, as described
 in the section entitled
 .\" HTML <a href="#internaloptions">
 .\" </a>
 "Internal Option Setting"
@@ -1975,7 +1975,7 @@ items:
 an escape such as \ed or \epL that matches a single character
 a character class
 a backreference
-a parenthesized group (including most assertions)
+a parenthesized group (including lookaround assertions)
 a subroutine call (recursive or otherwise)
 .sp
 The general repetition quantifier specifies a minimum and maximum number of
@@ -2362,7 +2362,7 @@ cause the group that they reference to be treated as an
 .\" </a>
 atomic group.
 .\"
 This restriction no longer applies, and backtracking into such groups can occur
 as normal.
 .
 .
@@ -2431,26 +2431,13 @@ the "no" branch of the condition. For other failing negative assertions,
 control passes to the previous backtracking point, thus discarding any captured
 strings within the assertion.
 .P
-For compatibility with Perl, most assertion groups may be repeated; though it
-makes no sense to assert the same thing several times, the side effect of
-capturing may occasionally be useful. However, an assertion that forms the
-condition for a conditional group may not be quantified. In practice, for
-other assertions, there only three cases:
-.sp
-(1) If the quantifier is {0}, the assertion is never obeyed during matching.
-However, it may contain internal capture groups that are called from elsewhere
-via the
-.\" HTML <a href="#groupsassubroutines">
-.\" </a>
-subroutine mechanism.
-.\"
-.sp
-(2) If quantifier is {0,n} where n is greater than zero, it is treated as if it
-were {0,1}. At run time, the rest of the pattern match is tried with and
-without the assertion, the order depending on the greediness of the quantifier.
-.sp
-(3) If the minimum repetition is greater than zero, the quantifier is ignored.
-The assertion is obeyed just once when encountered during matching.
+Most assertion groups may be repeated; though it makes no sense to assert the
+same thing several times, the side effect of capturing in positive assertions
+may occasionally be useful. However, an assertion that forms the condition for
+a conditional group may not be quantified. PCRE2 used to restrict the
+repetition of assertions, but from release 10.35 the only restriction is that
+an unlimited maximum repetition is changed to be one more than the minimum. For
+example, {3,} is treated as {3,4}.
 .
 .
 .SS "Alphabetic assertion names"
@@ -3884,6 +3871,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 29 December 2019
-Copyright (c) 1997-2019 University of Cambridge.
+Last updated: 01 January 2020
+Copyright (c) 1997-2020 University of Cambridge.
 .fi

@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
 Written by Philip Hazel
 Original API code Copyright (c) 1997-2012 University of Cambridge
-New API code Copyright (c) 2016-2019 University of Cambridge
+New API code Copyright (c) 2016-2020 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -7074,15 +7074,18 @@ for (;; pptr++)
 previous[GET(previous, 1)] != OP_ALT)
 goto END_REPEAT;
-/* There is no sense in actually repeating assertions. The only
-potential use of repetition is in cases when the assertion is optional.
-Therefore, if the minimum is greater than zero, just ignore the repeat.
-If the maximum is not zero or one, set it to 1. */
+/* Perl allows all assertions to be quantified, and when they contain
+capturing parentheses and/or are optional there are potential uses for
+this feature. PCRE2 used to force the maximum quantifier to 1 on the
+invalid grounds that further repetition was never useful. This was
+always a bit pointless, since an assertion could be wrapped with a
+repeated group to achieve the effect. General repetition is now
+permitted, but if the maximum is unlimited it is set to one more than
+the minimum. */
 if (op_previous < OP_ONCE) /* Assertion */
 {
-if (repeat_min > 0) goto END_REPEAT;
-if (repeat_max > 1) repeat_max = 1;
+if (repeat_max == REPEAT_UNLIMITED) repeat_max = repeat_min + 1;
 }
 /* The case of a zero minimum is special because of the need to stick

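The effect of the change to the assertion branch can be modelled outside the compiler. What follows is a minimal sketch, not PCRE2 code: the `quantifier` struct, the function names `clamp_old`, `clamp_new` and `check_examples`, and the value chosen for `REPEAT_UNLIMITED` are assumptions made for illustration. Only the two clamping rules themselves are taken from the removed and added lines of the hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "no upper bound". The real definition
   lives in the PCRE2 sources and may have a different value. */
#define REPEAT_UNLIMITED UINT32_MAX

/* Illustrative stand-in for the compiler's repeat_min/repeat_max pair. */
typedef struct {
  uint32_t min;
  uint32_t max;
  int ignored;   /* nonzero when the quantifier is dropped entirely */
} quantifier;

/* Old rule (removed lines): a non-zero minimum skips the repeat, so the
   assertion runs once; otherwise the maximum is capped at 1. */
static quantifier clamp_old(quantifier q)
{
  if (q.min > 0) { q.ignored = 1; return q; }
  if (q.max > 1) q.max = 1;
  return q;
}

/* New rule (added line): only an unlimited maximum is rewritten, to one
   more than the minimum. Everything else passes through unchanged. */
static quantifier clamp_new(quantifier q)
{
  if (q.max == REPEAT_UNLIMITED) q.max = q.min + 1;
  return q;
}

/* A few spot checks of both rules. */
void check_examples(void)
{
  quantifier exact26 = { 26, 26, 0 };               /* {26}  */
  quantifier three_up = { 3, REPEAT_UNLIMITED, 0 }; /* {3,}  */
  quantifier star = { 0, REPEAT_UNLIMITED, 0 };     /* *     */

  /* {26} on an assertion: formerly ignored, now kept as written. */
  assert(clamp_old(exact26).ignored == 1);
  assert(clamp_new(exact26).min == 26 && clamp_new(exact26).max == 26);

  /* {3,} becomes {3,4}, matching the example in the documentation. */
  assert(clamp_new(three_up).min == 3 && clamp_new(three_up).max == 4);

  /* An unlimited repeat with zero minimum ends up as {0,1} either way. */
  assert(clamp_old(star).max == 1);
  assert(clamp_new(star).max == 1);
}
```

Under this model, the `{26}` quantifier used in the new test pattern is the case whose behaviour changes most visibly: the old rule discarded it, while the new rule leaves it intact.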
testdata/testinput1

@@ -6393,4 +6393,13 @@ ef) x/x,mark
 /^((\1+)|\d)+133X$/
 111133X
+/^(?=.*(?=(([A-Z]).*(?(1)\1)))(?!.+\2)){26}/i
+The quick brown fox jumps over the lazy dog.
+Jackdaws love my big sphinx of quartz.
+Pack my box with five dozen liquor jugs.
+\= Expect no match
+The quick brown fox jumps over the lazy cat.
+Hackdaws love my big sphinx of quartz.
+Pack my fox with five dozen liquor jugs.
 # End of testinput1

testdata/testoutput1

@@ -10126,4 +10126,25 @@ No match
 1: 11
 2: 11
+/^(?=.*(?=(([A-Z]).*(?(1)\1)))(?!.+\2)){26}/i
+The quick brown fox jumps over the lazy dog.
+0:
+1: quick brown fox jumps over the lazy dog.
+2: q
+Jackdaws love my big sphinx of quartz.
+0:
+1: Jackdaws love my big sphinx of quartz.
+2: J
+Pack my box with five dozen liquor jugs.
+0:
+1: Pack my box with five dozen liquor jugs.
+2: P
+\= Expect no match
+The quick brown fox jumps over the lazy cat.
+No match
+Hackdaws love my big sphinx of quartz.
+No match
+Pack my fox with five dozen liquor jugs.
+No match
 # End of testinput1

testdata/testoutput2

@@ -10962,6 +10962,12 @@ Matched, but too many substrings
 Assert
 abc
 Ket
+Assert
+abc
+Ket
+Assert
+abc
+Ket
 abc
 Ket
 End
@@ -10973,6 +10979,10 @@ Matched, but too many substrings
 Assert
 abc
 Ket
+Brazero
+Assert
+abc
+Ket
 abc
 Ket
 End
@@ -10981,9 +10991,15 @@ Matched, but too many substrings
 /(?=abc)++abc/B
 ------------------------------------------------------------------
 Bra
+Once
 Assert
 abc
 Ket
+Brazero
+Assert
+abc
+Ket
+Ket
 abc
 Ket
 End
@@ -16610,6 +16626,19 @@ No match
 Assert
 Any
 Ket
+Assert
+Any
+Ket
+Assert
+Any
+Ket
+Assert
+Any
+Ket
+Brazero
+Assert
+Any
+Ket
 x
 Ket
 Ket