From 5ba5230b82e2d3e73a83f9fb165cb3778c7e95eb Mon Sep 17 00:00:00 2001
From: "Philip.Hazel"
Date: Wed, 1 Jan 2020 12:07:02 +0000
Subject: [PATCH] Allow real repetition of assertions.
---
ChangeLog | 7 +++++++
doc/html/pcre2pattern.html | 39 ++++++++++++------------------------
doc/pcre2.txt | 32 ++++++++++-------------------
doc/pcre2pattern.3 | 41 +++++++++++++-------------------------
src/pcre2_compile.c | 17 +++++++++-------
testdata/testinput1 | 9 +++++++++
testdata/testoutput1 | 21 +++++++++++++++++++
testdata/testoutput2 | 29 +++++++++++++++++++++++++++
8 files changed, 114 insertions(+), 81 deletions(-)
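
Reviewer note (ignored by git am/apply): the quantifier handling changed in
the src/pcre2_compile.c hunk can be sketched in isolation as below. This is
only an illustration, not the compiler's code. The struct, the function names
and SKETCH_REPEAT_UNLIMITED are hypothetical stand-ins; the real code works on
the local variables repeat_min and repeat_max and compares against
REPEAT_UNLIMITED, as shown in the hunk.

```c
#include <stdint.h>

/* Hypothetical marker for "no upper limit"; stands in for REPEAT_UNLIMITED. */
#define SKETCH_REPEAT_UNLIMITED UINT32_MAX

/* Hypothetical container for a {min,max} quantifier on an assertion. */
typedef struct {
  uint32_t min;
  uint32_t max;
} sketch_quant;

/* Old behaviour: a minimum above zero meant the quantifier was ignored
   (the assertion was obeyed once); otherwise the maximum was capped at 1. */
sketch_quant sketch_old_rule(sketch_quant q)
{
  if (q.min > 0) {
    q.min = 1;
    q.max = 1;
  } else if (q.max > 1) {
    q.max = 1;
  }
  return q;
}

/* New behaviour: general repetition is kept; only an unlimited maximum is
   replaced by one more than the minimum. */
sketch_quant sketch_new_rule(sketch_quant q)
{
  if (q.max == SKETCH_REPEAT_UNLIMITED) q.max = q.min + 1;
  return q;
}
```

Under this sketch, {3,} becomes {3,4} as the updated documentation states,
while a bounded quantifier such as the {26} in the new testinput1 pattern is
left unchanged.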
diff --git a/ChangeLog b/ChangeLog
index 84d7e44..78bcc0f 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -32,6 +32,13 @@ now correctly backtracked, so this unnecessary restriction has been removed.
regex engine. The Perl regex folks are aware of this usage and have made a note
about it.
+9. When an assertion is repeated, PCRE2 used to limit the maximum repetition to
+1, believing that repeating an assertion is pointless. However, if a positive
+assertion contains capturing groups, repetition can be useful. In any case, an
+assertion could always be wrapped in a repeated group. The only restriction
+that is now imposed is that an unlimited maximum is changed to one more than
+the minimum.
+
Version 10.34 21-November-2019
------------------------------
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 42d8515..36178b3 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -1901,8 +1901,8 @@ are permitted for groups with the same number, for example:
(?|(?<AA>aa)|(?<AA>bb))
The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
-option at compile time, or by the use of (?J) within the pattern, as described
-in the section entitled
+option at compile time, or by the use of (?J) within the pattern, as described
+in the section entitled
"Internal Option Setting"
above.
@@ -1968,7 +1968,7 @@ items:
an escape such as \d or \pL that matches a single character
a character class
a backreference
- a parenthesized group (including most assertions)
+ a parenthesized group (including lookaround assertions)
a subroutine call (recursive or otherwise)
The general repetition quantifier specifies a minimum and maximum number of
@@ -2359,7 +2359,7 @@ of zero.
For versions of PCRE2 less than 10.25, backreferences of this type used to
cause the group that they reference to be treated as an
atomic group.
-This restriction no longer applies, and backtracking into such groups can occur
+This restriction no longer applies, and backtracking into such groups can occur
as normal.
ASSERTIONS
@@ -2420,26 +2420,13 @@ control passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
-For compatibility with Perl, most assertion groups may be repeated; though it
-makes no sense to assert the same thing several times, the side effect of
-capturing may occasionally be useful. However, an assertion that forms the
-condition for a conditional group may not be quantified. In practice, for
-other assertions, there only three cases:
-
-
-(1) If the quantifier is {0}, the assertion is never obeyed during matching.
-However, it may contain internal capture groups that are called from elsewhere
-via the
-subroutine mechanism.
-
-
-(2) If quantifier is {0,n} where n is greater than zero, it is treated as if it
-were {0,1}. At run time, the rest of the pattern match is tried with and
-without the assertion, the order depending on the greediness of the quantifier.
-
-
-(3) If the minimum repetition is greater than zero, the quantifier is ignored.
-The assertion is obeyed just once when encountered during matching.
+Most assertion groups may be repeated; though it makes no sense to assert the
+same thing several times, the side effect of capturing in positive assertions
+may occasionally be useful. However, an assertion that forms the condition for
+a conditional group may not be quantified. PCRE2 used to restrict the
+repetition of assertions, but from release 10.35 the only restriction is that
+an unlimited maximum repetition is changed to be one more than the minimum. For
+example, {3,} is treated as {3,4}.
Alphabetic assertion names
@@ -3840,9 +3827,9 @@ Cambridge, England.
REVISION
-Last updated: 29 December 2019
+Last updated: 01 January 2020
-Copyright © 1997-2019 University of Cambridge.
+Copyright © 1997-2020 University of Cambridge.
Return to the PCRE2 index page.
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 974fafa..127e6ab 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -7729,7 +7729,7 @@ REPETITION
an escape such as \d or \pL that matches a single character
a character class
a backreference
- a parenthesized group (including most assertions)
+ a parenthesized group (including lookaround assertions)
a subroutine call (recursive or otherwise)
The general repetition quantifier specifies a minimum and maximum num-
@@ -8162,24 +8162,14 @@ ASSERTIONS
passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
- For compatibility with Perl, most assertion groups may be repeated;
- though it makes no sense to assert the same thing several times, the
- side effect of capturing may occasionally be useful. However, an asser-
- tion that forms the condition for a conditional group may not be quan-
- tified. In practice, for other assertions, there only three cases:
-
- (1) If the quantifier is {0}, the assertion is never obeyed during
- matching. However, it may contain internal capture groups that are
- called from elsewhere via the subroutine mechanism.
-
- (2) If quantifier is {0,n} where n is greater than zero, it is treated
- as if it were {0,1}. At run time, the rest of the pattern match is
- tried with and without the assertion, the order depending on the greed-
- iness of the quantifier.
-
- (3) If the minimum repetition is greater than zero, the quantifier is
- ignored. The assertion is obeyed just once when encountered during
- matching.
+ Most assertion groups may be repeated; though it makes no sense to as-
+ sert the same thing several times, the side effect of capturing in pos-
+ itive assertions may occasionally be useful. However, an assertion that
+ forms the condition for a conditional group may not be quantified.
+ PCRE2 used to restrict the repetition of assertions, but from release
+ 10.35 the only restriction is that an unlimited maximum repetition is
+ changed to be one more than the minimum. For example, {3,} is treated
+ as {3,4}.
Alphabetic assertion names
@@ -9490,8 +9480,8 @@ AUTHOR
REVISION
- Last updated: 29 December 2019
- Copyright (c) 1997-2019 University of Cambridge.
+ Last updated: 01 January 2020
+ Copyright (c) 1997-2020 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index 9015679..c613878 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "29 December 2019" "PCRE2 10.35"
+.TH PCRE2PATTERN 3 "01 January 2020" "PCRE2 10.35"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -1902,8 +1902,8 @@ are permitted for groups with the same number, for example:
(?|(?<AA>aa)|(?<AA>bb))
.sp
The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
-option at compile time, or by the use of (?J) within the pattern, as described
-in the section entitled
+option at compile time, or by the use of (?J) within the pattern, as described
+in the section entitled
.\" HTML
.\"
"Internal Option Setting"
@@ -1975,7 +1975,7 @@ items:
an escape such as \ed or \epL that matches a single character
a character class
a backreference
- a parenthesized group (including most assertions)
+ a parenthesized group (including lookaround assertions)
a subroutine call (recursive or otherwise)
.sp
The general repetition quantifier specifies a minimum and maximum number of
@@ -2362,7 +2362,7 @@ cause the group that they reference to be treated as an
.\"
atomic group.
.\"
-This restriction no longer applies, and backtracking into such groups can occur
+This restriction no longer applies, and backtracking into such groups can occur
as normal.
.
.
@@ -2431,26 +2431,13 @@ the "no" branch of the condition. For other failing negative assertions,
control passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
.P
-For compatibility with Perl, most assertion groups may be repeated; though it
-makes no sense to assert the same thing several times, the side effect of
-capturing may occasionally be useful. However, an assertion that forms the
-condition for a conditional group may not be quantified. In practice, for
-other assertions, there only three cases:
-.sp
-(1) If the quantifier is {0}, the assertion is never obeyed during matching.
-However, it may contain internal capture groups that are called from elsewhere
-via the
-.\" HTML
-.\"
-subroutine mechanism.
-.\"
-.sp
-(2) If quantifier is {0,n} where n is greater than zero, it is treated as if it
-were {0,1}. At run time, the rest of the pattern match is tried with and
-without the assertion, the order depending on the greediness of the quantifier.
-.sp
-(3) If the minimum repetition is greater than zero, the quantifier is ignored.
-The assertion is obeyed just once when encountered during matching.
+Most assertion groups may be repeated; though it makes no sense to assert the
+same thing several times, the side effect of capturing in positive assertions
+may occasionally be useful. However, an assertion that forms the condition for
+a conditional group may not be quantified. PCRE2 used to restrict the
+repetition of assertions, but from release 10.35 the only restriction is that
+an unlimited maximum repetition is changed to be one more than the minimum. For
+example, {3,} is treated as {3,4}.
.
.
.SS "Alphabetic assertion names"
@@ -3884,6 +3871,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 29 December 2019
-Copyright (c) 1997-2019 University of Cambridge.
+Last updated: 01 January 2020
+Copyright (c) 1997-2020 University of Cambridge.
.fi
diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c
index ed4fc74..0350328 100644
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
- New API code Copyright (c) 2016-2019 University of Cambridge
+ New API code Copyright (c) 2016-2020 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -7074,15 +7074,18 @@ for (;; pptr++)
previous[GET(previous, 1)] != OP_ALT)
goto END_REPEAT;
- /* There is no sense in actually repeating assertions. The only
- potential use of repetition is in cases when the assertion is optional.
- Therefore, if the minimum is greater than zero, just ignore the repeat.
- If the maximum is not zero or one, set it to 1. */
+ /* Perl allows all assertions to be quantified, and when they contain
+ capturing parentheses and/or are optional there are potential uses for
+ this feature. PCRE2 used to force the maximum quantifier to 1 on the
+ invalid grounds that further repetition was never useful. This was
+ always a bit pointless, since an assertion could be wrapped with a
+ repeated group to achieve the effect. General repetition is now
+ permitted, but if the maximum is unlimited it is set to one more than
+ the minimum. */
if (op_previous < OP_ONCE) /* Assertion */
{
- if (repeat_min > 0) goto END_REPEAT;
- if (repeat_max > 1) repeat_max = 1;
+ if (repeat_max == REPEAT_UNLIMITED) repeat_max = repeat_min + 1;
}
/* The case of a zero minimum is special because of the need to stick
diff --git a/testdata/testinput1 b/testdata/testinput1
index 109de29..9d7821d 100644
--- a/testdata/testinput1
+++ b/testdata/testinput1
@@ -6393,4 +6393,13 @@ ef) x/x,mark
/^((\1+)|\d)+133X$/
111133X
+/^(?=.*(?=(([A-Z]).*(?(1)\1)))(?!.+\2)){26}/i
+ The quick brown fox jumps over the lazy dog.
+ Jackdaws love my big sphinx of quartz.
+ Pack my box with five dozen liquor jugs.
+\= Expect no match
+ The quick brown fox jumps over the lazy cat.
+ Hackdaws love my big sphinx of quartz.
+ Pack my fox with five dozen liquor jugs.
+
# End of testinput1
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index c425ed4..79acf04 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -10126,4 +10126,25 @@ No match
1: 11
2: 11
+/^(?=.*(?=(([A-Z]).*(?(1)\1)))(?!.+\2)){26}/i
+ The quick brown fox jumps over the lazy dog.
+ 0:
+ 1: quick brown fox jumps over the lazy dog.
+ 2: q
+ Jackdaws love my big sphinx of quartz.
+ 0:
+ 1: Jackdaws love my big sphinx of quartz.
+ 2: J
+ Pack my box with five dozen liquor jugs.
+ 0:
+ 1: Pack my box with five dozen liquor jugs.
+ 2: P
+\= Expect no match
+ The quick brown fox jumps over the lazy cat.
+No match
+ Hackdaws love my big sphinx of quartz.
+No match
+ Pack my fox with five dozen liquor jugs.
+No match
+
# End of testinput1
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index 438aefe..3a46a0a 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -10962,6 +10962,12 @@ Matched, but too many substrings
Assert
abc
Ket
+ Assert
+ abc
+ Ket
+ Assert
+ abc
+ Ket
abc
Ket
End
@@ -10973,6 +10979,10 @@ Matched, but too many substrings
Assert
abc
Ket
+ Brazero
+ Assert
+ abc
+ Ket
abc
Ket
End
@@ -10981,9 +10991,15 @@ Matched, but too many substrings
/(?=abc)++abc/B
------------------------------------------------------------------
Bra
+ Once
Assert
abc
Ket
+ Brazero
+ Assert
+ abc
+ Ket
+ Ket
abc
Ket
End
@@ -16610,6 +16626,19 @@ No match
Assert
Any
Ket
+ Assert
+ Any
+ Ket
+ Assert
+ Any
+ Ket
+ Assert
+ Any
+ Ket
+ Brazero
+ Assert
+ Any
+ Ket
x
Ket
Ket