From 5ba5230b82e2d3e73a83f9fb165cb3778c7e95eb Mon Sep 17 00:00:00 2001 From: "Philip.Hazel" Date: Wed, 1 Jan 2020 12:07:02 +0000 Subject: [PATCH] Allow real repetition of assertions. --- ChangeLog | 7 +++++++ doc/html/pcre2pattern.html | 39 ++++++++++++------------------------ doc/pcre2.txt | 32 ++++++++++------------------- doc/pcre2pattern.3 | 41 +++++++++++++------------------------- src/pcre2_compile.c | 17 +++++++++------- testdata/testinput1 | 9 +++++++++ testdata/testoutput1 | 21 +++++++++++++++++++ testdata/testoutput2 | 29 +++++++++++++++++++++++++++ 8 files changed, 114 insertions(+), 81 deletions(-) diff --git a/ChangeLog b/ChangeLog index 84d7e44..78bcc0f 100644 --- a/ChangeLog +++ b/ChangeLog @@ -32,6 +32,13 @@ now correctly backtracked, so this unnecessary restriction has been removed. regex engine. The Perl regex folks are aware of this usage and have made a note about it. +9. When an assertion is repeated, PCRE2 used to limit the maximum repetition to +1, believing that repeating an assertion is pointless. However, if a positive +assertion contains capturing groups, repetition can be useful. In any case, an +assertion could always be wrapped in a repeated group. The only restriction +that is now imposed is that an unlimited maximum is changed to one more than +the minimum. + Version 10.34 21-November-2019 ------------------------------ diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index 42d8515..36178b3 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -1901,8 +1901,8 @@ are permitted for groups with the same number, for example: (?|(?<AA>aa)|(?<AA>bb)) The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES -option at compile time, or by the use of (?J) within the pattern, as described -in the section entitled +option at compile time, or by the use of (?J) within the pattern, as described +in the section entitled "Internal Option Setting" above.

@@ -1968,7 +1968,7 @@ items: an escape such as \d or \pL that matches a single character a character class a backreference - a parenthesized group (including most assertions) + a parenthesized group (including lookaround assertions) a subroutine call (recursive or otherwise) The general repetition quantifier specifies a minimum and maximum number of @@ -2359,7 +2359,7 @@ of zero. For versions of PCRE2 less than 10.25, backreferences of this type used to cause the group that they reference to be treated as an atomic group. -This restriction no longer applies, and backtracking into such groups can occur +This restriction no longer applies, and backtracking into such groups can occur as normal.


ASSERTIONS
@@ -2420,26 +2420,13 @@ control passes to the previous backtracking point, thus discarding any captured strings within the assertion.

-For compatibility with Perl, most assertion groups may be repeated; though it -makes no sense to assert the same thing several times, the side effect of -capturing may occasionally be useful. However, an assertion that forms the -condition for a conditional group may not be quantified. In practice, for -other assertions, there only three cases: -
-
-(1) If the quantifier is {0}, the assertion is never obeyed during matching. -However, it may contain internal capture groups that are called from elsewhere -via the -subroutine mechanism. -
-
-(2) If quantifier is {0,n} where n is greater than zero, it is treated as if it -were {0,1}. At run time, the rest of the pattern match is tried with and -without the assertion, the order depending on the greediness of the quantifier. -
-
-(3) If the minimum repetition is greater than zero, the quantifier is ignored. -The assertion is obeyed just once when encountered during matching. +Most assertion groups may be repeated; though it makes no sense to assert the +same thing several times, the side effect of capturing in positive assertions +may occasionally be useful. However, an assertion that forms the condition for +a conditional group may not be quantified. PCRE2 used to restrict the +repetition of assertions, but from release 10.35 the only restriction is that +an unlimited maximum repetition is changed to be one more than the minimum. For +example, {3,} is treated as {3,4}.


Alphabetic assertion names @@ -3840,9 +3827,9 @@ Cambridge, England.


REVISION

-Last updated: 29 December 2019 +Last updated: 01 January 2020
-Copyright © 1997-2019 University of Cambridge. +Copyright © 1997-2020 University of Cambridge.

Return to the PCRE2 index page. diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 974fafa..127e6ab 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -7729,7 +7729,7 @@ REPETITION an escape such as \d or \pL that matches a single character a character class a backreference - a parenthesized group (including most assertions) + a parenthesized group (including lookaround assertions) a subroutine call (recursive or otherwise) The general repetition quantifier specifies a minimum and maximum num- @@ -8162,24 +8162,14 @@ ASSERTIONS passes to the previous backtracking point, thus discarding any captured strings within the assertion. - For compatibility with Perl, most assertion groups may be repeated; - though it makes no sense to assert the same thing several times, the - side effect of capturing may occasionally be useful. However, an asser- - tion that forms the condition for a conditional group may not be quan- - tified. In practice, for other assertions, there only three cases: - - (1) If the quantifier is {0}, the assertion is never obeyed during - matching. However, it may contain internal capture groups that are - called from elsewhere via the subroutine mechanism. - - (2) If quantifier is {0,n} where n is greater than zero, it is treated - as if it were {0,1}. At run time, the rest of the pattern match is - tried with and without the assertion, the order depending on the greed- - iness of the quantifier. - - (3) If the minimum repetition is greater than zero, the quantifier is - ignored. The assertion is obeyed just once when encountered during - matching. + Most assertion groups may be repeated; though it makes no sense to as- + sert the same thing several times, the side effect of capturing in pos- + itive assertions may occasionally be useful. However, an assertion that + forms the condition for a conditional group may not be quantified. 
+ PCRE2 used to restrict the repetition of assertions, but from release + 10.35 the only restriction is that an unlimited maximum repetition is + changed to be one more than the minimum. For example, {3,} is treated + as {3,4}. Alphabetic assertion names @@ -9490,8 +9480,8 @@ AUTHOR REVISION - Last updated: 29 December 2019 - Copyright (c) 1997-2019 University of Cambridge. + Last updated: 01 January 2020 + Copyright (c) 1997-2020 University of Cambridge. ------------------------------------------------------------------------------ diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 9015679..c613878 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "29 December 2019" "PCRE2 10.35" +.TH PCRE2PATTERN 3 "01 January 2020" "PCRE2 10.35" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -1902,8 +1902,8 @@ are permitted for groups with the same number, for example: (?|(?aa)|(?bb)) .sp The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES -option at compile time, or by the use of (?J) within the pattern, as described -in the section entitled +option at compile time, or by the use of (?J) within the pattern, as described +in the section entitled .\" HTML .\" "Internal Option Setting" @@ -1975,7 +1975,7 @@ items: an escape such as \ed or \epL that matches a single character a character class a backreference - a parenthesized group (including most assertions) + a parenthesized group (including lookaround assertions) a subroutine call (recursive or otherwise) .sp The general repetition quantifier specifies a minimum and maximum number of @@ -2362,7 +2362,7 @@ cause the group that they reference to be treated as an .\" atomic group. .\" -This restriction no longer applies, and backtracking into such groups can occur +This restriction no longer applies, and backtracking into such groups can occur as normal. . . 
@@ -2431,26 +2431,13 @@ the "no" branch of the condition. For other failing negative assertions, control passes to the previous backtracking point, thus discarding any captured strings within the assertion. .P -For compatibility with Perl, most assertion groups may be repeated; though it -makes no sense to assert the same thing several times, the side effect of -capturing may occasionally be useful. However, an assertion that forms the -condition for a conditional group may not be quantified. In practice, for -other assertions, there only three cases: -.sp -(1) If the quantifier is {0}, the assertion is never obeyed during matching. -However, it may contain internal capture groups that are called from elsewhere -via the -.\" HTML -.\" -subroutine mechanism. -.\" -.sp -(2) If quantifier is {0,n} where n is greater than zero, it is treated as if it -were {0,1}. At run time, the rest of the pattern match is tried with and -without the assertion, the order depending on the greediness of the quantifier. -.sp -(3) If the minimum repetition is greater than zero, the quantifier is ignored. -The assertion is obeyed just once when encountered during matching. +Most assertion groups may be repeated; though it makes no sense to assert the +same thing several times, the side effect of capturing in positive assertions +may occasionally be useful. However, an assertion that forms the condition for +a conditional group may not be quantified. PCRE2 used to restrict the +repetition of assertions, but from release 10.35 the only restriction is that +an unlimited maximum repetition is changed to be one more than the minimum. For +example, {3,} is treated as {3,4}. . . .SS "Alphabetic assertion names" @@ -3884,6 +3871,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 29 December 2019 -Copyright (c) 1997-2019 University of Cambridge. +Last updated: 01 January 2020 +Copyright (c) 1997-2020 University of Cambridge. 
.fi diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c index ed4fc74..0350328 100644 --- a/src/pcre2_compile.c +++ b/src/pcre2_compile.c @@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language. Written by Philip Hazel Original API code Copyright (c) 1997-2012 University of Cambridge - New API code Copyright (c) 2016-2019 University of Cambridge + New API code Copyright (c) 2016-2020 University of Cambridge ----------------------------------------------------------------------------- Redistribution and use in source and binary forms, with or without @@ -7074,15 +7074,18 @@ for (;; pptr++) previous[GET(previous, 1)] != OP_ALT) goto END_REPEAT; - /* There is no sense in actually repeating assertions. The only - potential use of repetition is in cases when the assertion is optional. - Therefore, if the minimum is greater than zero, just ignore the repeat. - If the maximum is not zero or one, set it to 1. */ + /* Perl allows all assertions to be quantified, and when they contain + capturing parentheses and/or are optional there are potential uses for + this feature. PCRE2 used to force the maximum quantifier to 1 on the + invalid grounds that further repetition was never useful. This was + always a bit pointless, since an assertion could be wrapped with a + repeated group to achieve the effect. General repetition is now + permitted, but if the maximum is unlimited it is set to one more than + the minimum. 
*/ if (op_previous < OP_ONCE) /* Assertion */ { - if (repeat_min > 0) goto END_REPEAT; - if (repeat_max > 1) repeat_max = 1; + if (repeat_max == REPEAT_UNLIMITED) repeat_max = repeat_min + 1; } /* The case of a zero minimum is special because of the need to stick diff --git a/testdata/testinput1 b/testdata/testinput1 index 109de29..9d7821d 100644 --- a/testdata/testinput1 +++ b/testdata/testinput1 @@ -6393,4 +6393,13 @@ ef) x/x,mark /^((\1+)|\d)+133X$/ 111133X +/^(?=.*(?=(([A-Z]).*(?(1)\1)))(?!.+\2)){26}/i + The quick brown fox jumps over the lazy dog. + Jackdaws love my big sphinx of quartz. + Pack my box with five dozen liquor jugs. +\= Expect no match + The quick brown fox jumps over the lazy cat. + Hackdaws love my big sphinx of quartz. + Pack my fox with five dozen liquor jugs. + # End of testinput1 diff --git a/testdata/testoutput1 b/testdata/testoutput1 index c425ed4..79acf04 100644 --- a/testdata/testoutput1 +++ b/testdata/testoutput1 @@ -10126,4 +10126,25 @@ No match 1: 11 2: 11 +/^(?=.*(?=(([A-Z]).*(?(1)\1)))(?!.+\2)){26}/i + The quick brown fox jumps over the lazy dog. + 0: + 1: quick brown fox jumps over the lazy dog. + 2: q + Jackdaws love my big sphinx of quartz. + 0: + 1: Jackdaws love my big sphinx of quartz. + 2: J + Pack my box with five dozen liquor jugs. + 0: + 1: Pack my box with five dozen liquor jugs. + 2: P +\= Expect no match + The quick brown fox jumps over the lazy cat. +No match + Hackdaws love my big sphinx of quartz. +No match + Pack my fox with five dozen liquor jugs. 
+No match + # End of testinput1 diff --git a/testdata/testoutput2 b/testdata/testoutput2 index 438aefe..3a46a0a 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -10962,6 +10962,12 @@ Matched, but too many substrings Assert abc Ket + Assert + abc + Ket + Assert + abc + Ket abc Ket End @@ -10973,6 +10979,10 @@ Matched, but too many substrings Assert abc Ket + Brazero + Assert + abc + Ket abc Ket End @@ -10981,9 +10991,15 @@ Matched, but too many substrings /(?=abc)++abc/B ------------------------------------------------------------------ Bra + Once Assert abc Ket + Brazero + Assert + abc + Ket + Ket abc Ket End @@ -16610,6 +16626,19 @@ No match Assert Any Ket + Assert + Any + Ket + Assert + Any + Ket + Assert + Any + Ket + Brazero + Assert + Any + Ket x Ket Ket
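
Reviewer's sketch, not part of the patch: the new rule in the src/pcre2_compile.c hunk can be read as a small pure function of the quantifier's minimum and maximum. The names sketch_assertion_max and SKETCH_REPEAT_UNLIMITED are hypothetical and exist only for illustration. The sentinel value is an assumption too; PCRE2's real REPEAT_UNLIMITED is defined inside the library and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PCRE2's internal REPEAT_UNLIMITED sentinel.
   The value chosen here is an assumption made for this sketch only. */
#define SKETCH_REPEAT_UNLIMITED UINT32_MAX

/* Returns the maximum repeat count that would apply to a quantified
   assertion under the rule introduced by this patch: a bounded maximum is
   kept as written, and an unlimited maximum becomes one more than the
   minimum. */
uint32_t sketch_assertion_max(uint32_t repeat_min, uint32_t repeat_max)
{
  /* A bounded quantifier is expected to satisfy min <= max. */
  assert(repeat_max == SKETCH_REPEAT_UNLIMITED || repeat_min <= repeat_max);

  if (repeat_max == SKETCH_REPEAT_UNLIMITED)
    return repeat_min + 1;   /* e.g. {3,} is treated as {3,4} */

  return repeat_max;         /* e.g. {26} stays at 26 */
}
```

Under this reading, a possessive (?=abc)++ is handled as {1,2}, which is consistent with the testoutput2 hunk above showing one mandatory Assert followed by a single Brazero Assert inside Once.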