Remove atomic restriction on capture groups containing recursive back
references, as since 10.30 it has been unnecessary.
Philip.Hazel 2019-12-18 16:16:12 +00:00
parent 880aac5dda
commit 0a2033f0f7
14 changed files with 300 additions and 336 deletions


@ -11,6 +11,17 @@ Version 10.35
3. A JIT bug is fixed which allowed to read the fields of the compiled
pattern before its existence is checked.
4. Back in the PCRE1 day, capturing groups that contained recursive back
references to themselves were made atomic (version 8.01, change 18) because
after the end of a repeated group, the captured substrings had their values from
the final repetition, not from an earlier repetition that might be the
destination of a backtrack. This feature was documented, and was carried over
into PCRE2. However, it has now been realized that the major refactoring that
was done for 10.30 has made this atomicizing unnecessary, and it is confusing
when users are unaware of it, making some patterns appear not to be working as
expected. Capture values of recursive back references in repeated groups are
now correctly backtracked, so this unnecessary restriction has been removed.
Version 10.34 21-November-2019
------------------------------
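
The behaviour change described in item 4 can be illustrated without PCRE2 itself. What follows is a minimal sketch in C, assuming a hand-written backtracking matcher for the single test pattern ^(a\1?){4}$. The names capture, match_from and match_a_backref_x4 are hypothetical and are not part of PCRE2. With atomic set, a repetition that succeeded using the back reference is never re-entered, which models the old restriction. With it clear, the matcher may fall back to the shorter alternative, which models the new behaviour.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: where the current value of group 1 lies in the subject. */
typedef struct {
    size_t start;
    size_t length;
} capture;

/* Try to match repetitions iter..3 of (a\1?) starting at pos, then end of
   subject. prev is the value group 1 had after the previous repetition. */
static bool match_from(const char *s, size_t len, size_t pos, int iter,
                       bool have_prev, capture prev, bool atomic, capture *out)
{
    if (iter == 4) {
        if (pos != len) return false;      /* the $ assertion */
        *out = prev;                       /* final value of group 1 */
        return true;
    }
    if (pos >= len || s[pos] != 'a') return false;

    /* Greedy choice for \1?: take the back reference if it is set and matches. */
    if (have_prev && pos + 1 + prev.length <= len &&
        memcmp(s + pos + 1, s + prev.start, prev.length) == 0) {
        capture cur = { pos, 1 + prev.length };
        if (match_from(s, len, pos + 1 + prev.length, iter + 1,
                       true, cur, atomic, out))
            return true;
        /* An atomic group is committed to the first way it matched. */
        if (atomic) return false;
    }

    /* Otherwise \1? matches the empty string. */
    capture cur = { pos, 1 };
    return match_from(s, len, pos + 1, iter + 1, true, cur, atomic, out);
}

/* Entry point: anchored match of ^(a\1?){4}$ against s. */
bool match_a_backref_x4(const char *s, bool atomic, capture *out)
{
    capture none = { 0, 0 };
    return match_from(s, strlen(s), 0, 0, false, none, atomic, out);
}
```

Under these assumptions the non-atomic mode accepts aaaa and aaaaaa, leaving group 1 as a and aa respectively, in line with the new testoutput1 entries, while the atomic mode rejects both, as the removed testinput2 expectations did.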


@ -9,9 +9,9 @@ dnl The PCRE2_PRERELEASE feature is for identifying release candidates. It might
dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre2_major, [10])
m4_define(pcre2_minor, [34])
m4_define(pcre2_prerelease, [])
m4_define(pcre2_date, [2019-11-21])
m4_define(pcre2_minor, [35])
m4_define(pcre2_prerelease, [-RC1])
m4_define(pcre2_date, [2019-11-27])
# NOTE: The CMakeLists.txt file searches for the above variables in the first
# 50 lines of this file. Please update that if the variables above are moved.


@ -2349,11 +2349,11 @@ using alternation, as in the example above, or by a quantifier with a minimum
of zero.
</P>
<P>
Backreferences of this type cause the group that they reference to be treated
as an
For versions of PCRE2 less than 10.25, backreferences of this type used to
cause the group that they reference to be treated as an
<a href="#atomicgroup">atomic group.</a>
Once the whole group has been matched, a subsequent matching failure cannot
cause backtracking into the middle of the group.
This restriction no longer applies, and backtracking into such groups can occur
as normal.
<a name="bigassertions"></a></P>
<br><a name="SEC20" href="#TOC1">ASSERTIONS</a><br>
<P>
@ -3833,7 +3833,7 @@ Cambridge, England.
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 29 July 2019
Last updated: 18 December 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -8078,10 +8078,10 @@ BACKREFERENCES
the backreference. This can be done using alternation, as in the exam-
ple above, or by a quantifier with a minimum of zero.
Backreferences of this type cause the group that they reference to be
treated as an atomic group. Once the whole group has been matched, a
subsequent matching failure cannot cause backtracking into the middle
of the group.
For versions of PCRE2 less than 10.25, backreferences of this type used
to cause the group that they reference to be treated as an atomic
group. This restriction no longer applies, and backtracking into such
groups can occur as normal.
ASSERTIONS
@ -9463,7 +9463,7 @@ AUTHOR
REVISION
Last updated: 29 July 2019
Last updated: 18 December 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "29 July 2019" "PCRE2 10.34"
.TH PCRE2PATTERN 3 "18 December 2019" "PCRE2 10.35"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2346,14 +2346,14 @@ the first iteration does not need to match the backreference. This can be done
using alternation, as in the example above, or by a quantifier with a minimum
of zero.
.P
Backreferences of this type cause the group that they reference to be treated
as an
For versions of PCRE2 less than 10.25, backreferences of this type used to
cause the group that they reference to be treated as an
.\" HTML <a href="#atomicgroup">
.\" </a>
atomic group.
.\"
Once the whole group has been matched, a subsequent matching failure cannot
cause backtracking into the middle of the group.
This restriction no longer applies, and backtracking into such groups can occur
as normal.
.
.
.\" HTML <a name="bigassertions"></a>
@ -3874,6 +3874,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 29 July 2019
Last updated: 18 December 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -6671,23 +6671,11 @@ for (;; pptr++)
}
/* For a back reference, update the back reference map and the
maximum back reference. Then, for each group, we must check to
see if it is recursive, that is, it is inside the group that it
references. A flag is set so that the group can be made atomic.
*/
maximum back reference. */
cb->backref_map |= (groupnumber < 32)? (1u << groupnumber) : 1;
if (groupnumber > cb->top_backref)
cb->top_backref = groupnumber;
for (oc = cb->open_caps; oc != NULL; oc = oc->next)
{
if (oc->number == groupnumber)
{
oc->flag = TRUE;
break;
}
}
}
}
@ -7682,19 +7670,6 @@ for (;; pptr++)
cb->backref_map |= (meta_arg < 32)? (1u << meta_arg) : 1;
if (meta_arg > cb->top_backref) cb->top_backref = meta_arg;
/* Check to see if this back reference is recursive, that it, it
is inside the group that it references. A flag is set so that the
group can be made atomic. */
for (oc = cb->open_caps; oc != NULL; oc = oc->next)
{
if (oc->number == meta_arg)
{
oc->flag = TRUE;
break;
}
}
break;
@ -8035,7 +8010,6 @@ and skip over the pattern offset. */
lookbehind = *code == OP_ASSERTBACK ||
*code == OP_ASSERTBACK_NOT ||
*code == OP_ASSERTBACK_NA;
if (lookbehind)
{
lookbehindlength = META_DATA(pptr[-1]);
@ -8053,7 +8027,6 @@ if (*code == OP_CBRA)
capnumber = GET2(code, 1 + LINK_SIZE);
capitem.number = capnumber;
capitem.next = cb->open_caps;
capitem.flag = FALSE;
capitem.assert_depth = cb->assert_depth;
cb->open_caps = &capitem;
}
@ -8182,26 +8155,9 @@ for (;;)
PUT(code, 1, (int)(code - start_bracket));
code += 1 + LINK_SIZE;
/* If it was a capturing subpattern, check to see if it contained any
recursive back references. If so, we must wrap it in atomic brackets. In
any event, remove the block from the chain. */
/* If it was a capturing subpattern, remove the block from the chain. */
if (capnumber > 0)
{
if (cb->open_caps->flag)
{
(void)memmove(start_bracket + 1 + LINK_SIZE, start_bracket,
CU2BYTES(code - start_bracket));
*start_bracket = OP_ONCE;
code += 1 + LINK_SIZE;
PUT(start_bracket, 1, (int)(code - start_bracket));
*code = OP_KET;
PUT(code, 1, (int)(code - start_bracket));
code += 1 + LINK_SIZE;
length += 2 + 2*LINK_SIZE;
}
cb->open_caps = cb->open_caps->next;
}
if (capnumber > 0) cb->open_caps = cb->open_caps->next;
/* Set values to pass back */
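
After this change, the only bookkeeping left for a back reference in these hunks is the map and maximum update. A minimal standalone sketch of that update follows, assuming a hypothetical backref_info struct that stands in for the two relevant fields of the compile block. Only the expression itself is taken from the diff.

```c
#include <stdint.h>

/* Hypothetical stand-in for the backref_map and top_backref fields of cb. */
typedef struct {
    uint32_t backref_map;   /* bit n set for a reference to group n (n < 32) */
    uint32_t top_backref;   /* highest group number referenced so far */
} backref_info;

/* Record a back reference to groupnumber, mirroring the retained code:
   groups 32 and above all share bit 0 of the map. */
void note_backref(backref_info *info, uint32_t groupnumber)
{
    info->backref_map |= (groupnumber < 32) ? (1u << groupnumber) : 1;
    if (groupnumber > info->top_backref)
        info->top_backref = groupnumber;
}
```

Starting from a zeroed struct, a reference to group 3 sets bit 3, and a later reference to group 40 sets bit 0 and raises the maximum to 40.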


@ -1759,13 +1759,11 @@ typedef struct pcre2_memctl {
/* Structure for building a chain of open capturing subpatterns during
compiling, so that instructions to close them can be compiled when (*ACCEPT) is
encountered. This is also used to identify subpatterns that contain recursive
back references to themselves, so that they can be made atomic. */
encountered. */
typedef struct open_capitem {
struct open_capitem *next; /* Chain link */
uint16_t number; /* Capture number */
uint16_t flag; /* Set TRUE if recursive back ref */
uint16_t assert_depth; /* Assertion depth when opened */
} open_capitem;
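
With the flag field gone, the structure serves only as a chain of currently open capture groups. The sketch below shows one way such a chain can be pushed, walked and popped, assuming stack-allocated items as in the compile hunks above. The helpers list_open_caps and demo_nested are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The structure as it stands after this commit. */
typedef struct open_capitem {
    struct open_capitem *next;   /* Chain link */
    uint16_t number;             /* Capture number */
    uint16_t assert_depth;       /* Assertion depth when opened */
} open_capitem;

/* Hypothetical helper: copy the numbers of the open groups, innermost
   first, into out (at most max entries) and return how many were copied. */
size_t list_open_caps(const open_capitem *head, uint16_t *out, size_t max)
{
    size_t n = 0;
    for (const open_capitem *oc = head; oc != NULL && n < max; oc = oc->next)
        out[n++] = oc->number;
    return n;
}

/* Hypothetical demo: simulate being inside group 2 nested in group 1. */
size_t demo_nested(uint16_t *out, size_t max)
{
    open_capitem *open_caps = NULL;

    open_capitem outer = { open_caps, 1, 0 };   /* open group 1 */
    open_caps = &outer;
    open_capitem inner = { open_caps, 2, 0 };   /* open group 2 */
    open_caps = &inner;

    size_t n = list_open_caps(open_caps, out, max);

    open_caps = open_caps->next;                /* close group 2 */
    open_caps = open_caps->next;                /* close group 1 */
    (void)open_caps;
    return n;
}
```

Walking the chain at the innermost point yields group 2 and then group 1, which is the order in which closing instructions would be needed at that position.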

7
testdata/testinput1 vendored

@ -6386,4 +6386,11 @@ ef) x/x,mark
/^(?<A>a)(?(<A>)b)((?<=b).*)$/
abc
/^(a\1?){4}$/
aaaa
aaaaaa
/^((\1+)|\d)+133X$/
111133X
# End of testinput1

20
testdata/testinput2 vendored

@ -324,16 +324,7 @@
\= Expect no match
fooabar
# This one is here because Perl behaves differently; see also the following.
/^(a\1?){4}$/I
\= Expect no match
aaaa
aaaaaa
# Perl does not fail these two for the final subjects. Neither did PCRE until
# release 8.01. The problem is in backtracking into a subpattern that contains
# a recursive reference to itself. PCRE has now made these into atomic patterns.
# Perl does not fail these two for the final subjects.
/^(xa|=?\1a){2}$/
xa=xaa
@ -5772,4 +5763,13 @@ a)"xI
/(a)?a/I
manm
/^(?|(\*)(*napla:\S*_(\2?+.+))|(\w)(?=\S*_(\2?+\1)))+_\2$/
*abc_12345abc
/^(?|(\*)(*napla:\S*_(\3?+.+))|(\w)(?=\S*_((\2?+\1))))+_\2$/
*abc_12345abc
/^((\1+)(?C)|\d)+133X$/
111133X\=callout_capture
# End of testinput2

14
testdata/testoutput1 vendored

@ -10112,4 +10112,18 @@ No match
1: a
2: c
/^(a\1?){4}$/
aaaa
0: aaaa
1: a
aaaaaa
0: aaaaaa
1: aa
/^((\1+)|\d)+133X$/
111133X
0: 111133X
1: 11
2: 11
# End of testinput1

78
testdata/testoutput2 vendored

@ -809,24 +809,7 @@ Subject length lower bound = 3
fooabar
No match
# This one is here because Perl behaves differently; see also the following.
/^(a\1?){4}$/I
Capture group count = 1
Max back reference = 1
Compile options: <none>
Overall options: anchored
First code unit = 'a'
Subject length lower bound = 4
\= Expect no match
aaaa
No match
aaaaaa
No match
# Perl does not fail these two for the final subjects. Neither did PCRE until
# release 8.01. The problem is in backtracking into a subpattern that contains
# a recursive reference to itself. PCRE has now made these into atomic patterns.
# Perl does not fail these two for the final subjects.
/^(xa|=?\1a){2}$/
xa=xaa
@ -10060,7 +10043,6 @@ No match
------------------------------------------------------------------
Bra
^
Once
CBra 1
ab
CBra 2
@ -10071,8 +10053,6 @@ No match
Alt
x
Ket
Ket
Once
CBra 1
ab
CBra 2
@ -10083,7 +10063,6 @@ No match
Alt
x
Ket
Ket
$
Ket
End
@ -10479,27 +10458,23 @@ Failed: error 168 at offset 3: \c must be followed by a printable ASCII characte
/(?P<abn>(?P=abn)xxx)/B
------------------------------------------------------------------
Bra
Once
CBra 1
\1
xxx
Ket
Ket
Ket
End
------------------------------------------------------------------
/(a\1z)/B
------------------------------------------------------------------
Bra
Once
CBra 1
a
\1
z
Ket
Ket
Ket
End
------------------------------------------------------------------
@ -11299,27 +11274,23 @@ No match
/(?P<abn>(?P=abn)xxx)/B
------------------------------------------------------------------
Bra
Once
CBra 1
\1
xxx
Ket
Ket
Ket
End
------------------------------------------------------------------
/(a\1z)/B
------------------------------------------------------------------
Bra
Once
CBra 1
a
\1
z
Ket
Ket
Ket
End
------------------------------------------------------------------
@ -13319,7 +13290,6 @@ Failed: error 144 at offset 5: subpattern name must start with a non-digit
Bra
Brazero
SCBra 1
Once
CBra 2
CBra 3
a
@ -13331,7 +13301,6 @@ Failed: error 144 at offset 5: subpattern name must start with a non-digit
Ket
Recurse
Ket
Ket
KetRmax
a?+
Ket
@ -13999,7 +13968,6 @@ Matched, but too many substrings
/((?+1)(\1))/B
------------------------------------------------------------------
Bra
Once
CBra 1
Recurse
CBra 2
@ -14007,7 +13975,6 @@ Matched, but too many substrings
Ket
Ket
Ket
Ket
End
------------------------------------------------------------------
@ -14425,7 +14392,6 @@ Subject length lower bound = 1
------------------------------------------------------------------
Bra
Any
Once
CBra 1
Recurse
Recurse
@ -14434,7 +14400,6 @@ Subject length lower bound = 1
Alt
$
Ket
Ket
CBra 2
Ket
Ket
@ -14445,7 +14410,6 @@ Subject length lower bound = 1
------------------------------------------------------------------
Bra
Any
Once
CBra 1
Recurse
Recurse
@ -14457,7 +14421,6 @@ Subject length lower bound = 1
Alt
$
Ket
Ket
CBra 3
Ket
Ket
@ -17435,6 +17398,45 @@ Subject length lower bound = 1
manm
0: a
/^(?|(\*)(*napla:\S*_(\2?+.+))|(\w)(?=\S*_(\2?+\1)))+_\2$/
*abc_12345abc
0: *abc_12345abc
1: c
2: 12345abc
/^(?|(\*)(*napla:\S*_(\3?+.+))|(\w)(?=\S*_((\2?+\1))))+_\2$/
*abc_12345abc
0: *abc_12345abc
1: c
2: 12345abc
3: 12345abc
/^((\1+)(?C)|\d)+133X$/
111133X\=callout_capture
Callout 0: last capture = 2
1: 1
2: 111
--->111133X
^ ^ |
Callout 0: last capture = 2
1: 3
2: 3
--->111133X
^ ^ |
Callout 0: last capture = 2
1: 1
2: 11
--->111133X
^ ^ |
Callout 0: last capture = 2
1: 3
2: 3
--->111133X
^ ^ |
0: 111133X
1: 11
2: 11
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data


@ -720,41 +720,37 @@ Memory allocation (code space): 14
/(((a\2)|(a*)\g<-1>))*a?/
------------------------------------------------------------------
0 39 Bra
0 35 Bra
2 Brazero
3 32 SCBra 1
6 27 Once
8 12 CBra 2
11 7 CBra 3
14 a
16 \2
18 7 Ket
20 11 Alt
22 5 CBra 4
25 a*
27 5 Ket
29 22 Recurse
31 23 Ket
33 27 Ket
35 32 KetRmax
37 a?+
39 39 Ket
41 End
3 28 SCBra 1
6 12 CBra 2
9 7 CBra 3
12 a
14 \2
16 7 Ket
18 11 Alt
20 5 CBra 4
23 a*
25 5 Ket
27 20 Recurse
29 23 Ket
31 28 KetRmax
33 a?+
35 35 Ket
37 End
------------------------------------------------------------------
/((?+1)(\1))/
------------------------------------------------------------------
0 20 Bra
2 16 Once
4 12 CBra 1
7 9 Recurse
9 5 CBra 2
12 \1
14 5 Ket
16 12 Ket
18 16 Ket
20 20 Ket
22 End
0 16 Bra
2 12 CBra 1
5 7 Recurse
7 5 CBra 2
10 \1
12 5 Ket
14 12 Ket
16 16 Ket
18 End
------------------------------------------------------------------
"(?1)(?#?'){2}(a)"
@ -771,45 +767,41 @@ Memory allocation (code space): 14
/.((?2)(?R)|\1|$)()/
------------------------------------------------------------------
0 28 Bra
0 24 Bra
2 Any
3 18 Once
5 7 CBra 1
8 23 Recurse
10 0 Recurse
12 4 Alt
14 \1
16 3 Alt
18 $
19 14 Ket
21 18 Ket
23 3 CBra 2
26 3 Ket
28 28 Ket
30 End
3 7 CBra 1
6 19 Recurse
8 0 Recurse
10 4 Alt
12 \1
14 3 Alt
16 $
17 14 Ket
19 3 CBra 2
22 3 Ket
24 24 Ket
26 End
------------------------------------------------------------------
/.((?3)(?R)()(?2)|\1|$)()/
------------------------------------------------------------------
0 35 Bra
0 31 Bra
2 Any
3 25 Once
5 14 CBra 1
8 30 Recurse
10 0 Recurse
12 3 CBra 2
15 3 Ket
17 12 Recurse
19 4 Alt
21 \1
23 3 Alt
25 $
26 21 Ket
28 25 Ket
30 3 CBra 3
33 3 Ket
35 35 Ket
37 End
3 14 CBra 1
6 26 Recurse
8 0 Recurse
10 3 CBra 2
13 3 Ket
15 10 Recurse
17 4 Alt
19 \1
21 3 Alt
23 $
24 21 Ket
26 3 CBra 3
29 3 Ket
31 31 Ket
33 End
------------------------------------------------------------------
/(?1)()((((((\1++))\x85)+)|))/


@ -720,41 +720,37 @@ Memory allocation (code space): 28
/(((a\2)|(a*)\g<-1>))*a?/
------------------------------------------------------------------
0 39 Bra
0 35 Bra
2 Brazero
3 32 SCBra 1
6 27 Once
8 12 CBra 2
11 7 CBra 3
14 a
16 \2
18 7 Ket
20 11 Alt
22 5 CBra 4
25 a*
27 5 Ket
29 22 Recurse
31 23 Ket
33 27 Ket
35 32 KetRmax
37 a?+
39 39 Ket
41 End
3 28 SCBra 1
6 12 CBra 2
9 7 CBra 3
12 a
14 \2
16 7 Ket
18 11 Alt
20 5 CBra 4
23 a*
25 5 Ket
27 20 Recurse
29 23 Ket
31 28 KetRmax
33 a?+
35 35 Ket
37 End
------------------------------------------------------------------
/((?+1)(\1))/
------------------------------------------------------------------
0 20 Bra
2 16 Once
4 12 CBra 1
7 9 Recurse
9 5 CBra 2
12 \1
14 5 Ket
16 12 Ket
18 16 Ket
20 20 Ket
22 End
0 16 Bra
2 12 CBra 1
5 7 Recurse
7 5 CBra 2
10 \1
12 5 Ket
14 12 Ket
16 16 Ket
18 End
------------------------------------------------------------------
"(?1)(?#?'){2}(a)"
@ -771,45 +767,41 @@ Memory allocation (code space): 28
/.((?2)(?R)|\1|$)()/
------------------------------------------------------------------
0 28 Bra
0 24 Bra
2 Any
3 18 Once
5 7 CBra 1
8 23 Recurse
10 0 Recurse
12 4 Alt
14 \1
16 3 Alt
18 $
19 14 Ket
21 18 Ket
23 3 CBra 2
26 3 Ket
28 28 Ket
30 End
3 7 CBra 1
6 19 Recurse
8 0 Recurse
10 4 Alt
12 \1
14 3 Alt
16 $
17 14 Ket
19 3 CBra 2
22 3 Ket
24 24 Ket
26 End
------------------------------------------------------------------
/.((?3)(?R)()(?2)|\1|$)()/
------------------------------------------------------------------
0 35 Bra
0 31 Bra
2 Any
3 25 Once
5 14 CBra 1
8 30 Recurse
10 0 Recurse
12 3 CBra 2
15 3 Ket
17 12 Recurse
19 4 Alt
21 \1
23 3 Alt
25 $
26 21 Ket
28 25 Ket
30 3 CBra 3
33 3 Ket
35 35 Ket
37 End
3 14 CBra 1
6 26 Recurse
8 0 Recurse
10 3 CBra 2
13 3 Ket
15 10 Recurse
17 4 Alt
19 \1
21 3 Alt
23 $
24 21 Ket
26 3 CBra 3
29 3 Ket
31 31 Ket
33 End
------------------------------------------------------------------
/(?1)()((((((\1++))\x85)+)|))/


@ -720,41 +720,37 @@ Memory allocation (code space): 10
/(((a\2)|(a*)\g<-1>))*a?/
------------------------------------------------------------------
0 57 Bra
0 51 Bra
3 Brazero
4 48 SCBra 1
9 40 Once
12 18 CBra 2
17 10 CBra 3
22 a
24 \2
27 10 Ket
30 16 Alt
33 7 CBra 4
38 a*
40 7 Ket
43 33 Recurse
46 34 Ket
49 40 Ket
52 48 KetRmax
55 a?+
57 57 Ket
60 End
4 42 SCBra 1
9 18 CBra 2
14 10 CBra 3
19 a
21 \2
24 10 Ket
27 16 Alt
30 7 CBra 4
35 a*
37 7 Ket
40 30 Recurse
43 34 Ket
46 42 KetRmax
49 a?+
51 51 Ket
54 End
------------------------------------------------------------------
/((?+1)(\1))/
------------------------------------------------------------------
0 31 Bra
3 25 Once
6 19 CBra 1
11 14 Recurse
14 8 CBra 2
19 \1
22 8 Ket
25 19 Ket
28 25 Ket
31 31 Ket
34 End
0 25 Bra
3 19 CBra 1
8 11 Recurse
11 8 CBra 2
16 \1
19 8 Ket
22 19 Ket
25 25 Ket
28 End
------------------------------------------------------------------
"(?1)(?#?'){2}(a)"
@ -771,45 +767,41 @@ Memory allocation (code space): 10
/.((?2)(?R)|\1|$)()/
------------------------------------------------------------------
0 42 Bra
0 36 Bra
3 Any
4 27 Once
7 11 CBra 1
12 34 Recurse
15 0 Recurse
18 6 Alt
21 \1
24 4 Alt
27 $
28 21 Ket
31 27 Ket
34 5 CBra 2
39 5 Ket
42 42 Ket
45 End
4 11 CBra 1
9 28 Recurse
12 0 Recurse
15 6 Alt
18 \1
21 4 Alt
24 $
25 21 Ket
28 5 CBra 2
33 5 Ket
36 36 Ket
39 End
------------------------------------------------------------------
/.((?3)(?R)()(?2)|\1|$)()/
------------------------------------------------------------------
0 53 Bra
0 47 Bra
3 Any
4 38 Once
7 22 CBra 1
12 45 Recurse
15 0 Recurse
18 5 CBra 2
23 5 Ket
26 18 Recurse
29 6 Alt
32 \1
35 4 Alt
38 $
39 32 Ket
42 38 Ket
45 5 CBra 3
50 5 Ket
53 53 Ket
56 End
4 22 CBra 1
9 39 Recurse
12 0 Recurse
15 5 CBra 2
20 5 Ket
23 15 Recurse
26 6 Alt
29 \1
32 4 Alt
35 $
36 32 Ket
39 5 CBra 3
44 5 Ket
47 47 Ket
50 End
------------------------------------------------------------------
/(?1)()((((((\1++))\x85)+)|))/