Add (?* and (?<* synonyms for non-atomic lookarounds.

This commit is contained in:
Philip.Hazel 2019-12-28 13:53:59 +00:00
parent d170829b26
commit ac4ab7186d
9 changed files with 81 additions and 42 deletions

View File

@ -28,6 +28,10 @@ now correctly backtracked, so this unnecessary restriction has been removed.
7. Added PCRE2_SUBSTITUTE_MATCHED. 7. Added PCRE2_SUBSTITUTE_MATCHED.
8. Added (?* and (?<* as synonms for (*napla: and (*naplb: to match another
regex engine. The Perl regex folks are aware of this usage and have made a note
about it.
Version 10.34 21-November-2019 Version 10.34 21-November-2019
------------------------------ ------------------------------

View File

@ -2624,8 +2624,8 @@ backtracking into the assertion. However, there are some cases where non-atomic
positive assertions can be useful. PCRE2 provides these using the following positive assertions can be useful. PCRE2 provides these using the following
syntax: syntax:
<pre> <pre>
(*non_atomic_positive_lookahead: or (*napla: (*non_atomic_positive_lookahead: or (*napla: or (?*
(*non_atomic_positive_lookbehind: or (*naplb: (*non_atomic_positive_lookbehind: or (*naplb: or (?&#60;*
</pre> </pre>
Consider the problem of finding the right-most word in a string that also Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total. appears earlier in the string, that is, it must appear at least twice in total.
@ -3833,7 +3833,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 18 December 2019 Last updated: 28 December 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

View File

@ -553,11 +553,13 @@ Each top-level branch of a lookbehind must be of a fixed length.
<P> <P>
These assertions are specific to PCRE2 and are not Perl-compatible. These assertions are specific to PCRE2 and are not Perl-compatible.
<pre> <pre>
(*napla:...) (?*...) )
(*non_atomic_positive_lookahead:...) (*napla:...) ) synonyms
(*non_atomic_positive_lookahead:...) )
(*naplb:...) (?&#60;*...) )
(*non_atomic_positive_lookbehind:...) (*naplb:...) ) synonyms
(*non_atomic_positive_lookbehind:...) )
</PRE> </PRE>
</P> </P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br> <br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
@ -683,7 +685,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC29" href="#TOC1">REVISION</a><br> <br><a name="SEC29" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 29 July 2019 Last updated: 28 December 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

View File

@ -8354,8 +8354,8 @@ NON-ATOMIC ASSERTIONS
some cases where non-atomic positive assertions can be useful. PCRE2 some cases where non-atomic positive assertions can be useful. PCRE2
provides these using the following syntax: provides these using the following syntax:
(*non_atomic_positive_lookahead: or (*napla: (*non_atomic_positive_lookahead: or (*napla: or (?*
(*non_atomic_positive_lookbehind: or (*naplb: (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
Consider the problem of finding the right-most word in a string that Consider the problem of finding the right-most word in a string that
also appears earlier in the string, that is, it must appear at least also appears earlier in the string, that is, it must appear at least
@ -9487,7 +9487,7 @@ AUTHOR
REVISION REVISION
Last updated: 18 December 2019 Last updated: 28 December 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -10716,11 +10716,13 @@ NON-ATOMIC LOOKAROUND ASSERTIONS
These assertions are specific to PCRE2 and are not Perl-compatible. These assertions are specific to PCRE2 and are not Perl-compatible.
(*napla:...) (?*...) )
(*non_atomic_positive_lookahead:...) (*napla:...) ) synonyms
(*non_atomic_positive_lookahead:...) )
(*naplb:...) (?<*...) )
(*non_atomic_positive_lookbehind:...) (*naplb:...) ) synonyms
(*non_atomic_positive_lookbehind:...) )
SCRIPT RUNS SCRIPT RUNS
@ -10844,7 +10846,7 @@ AUTHOR
REVISION REVISION
Last updated: 29 July 2019 Last updated: 28 December 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

View File

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "18 December 2019" "PCRE2 10.35" .TH PCRE2PATTERN 3 "28 December 2019" "PCRE2 10.35"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2637,8 +2637,8 @@ backtracking into the assertion. However, there are some cases where non-atomic
positive assertions can be useful. PCRE2 provides these using the following positive assertions can be useful. PCRE2 provides these using the following
syntax: syntax:
.sp .sp
(*non_atomic_positive_lookahead: or (*napla: (*non_atomic_positive_lookahead: or (*napla: or (?*
(*non_atomic_positive_lookbehind: or (*naplb: (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
.sp .sp
Consider the problem of finding the right-most word in a string that also Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total. appears earlier in the string, that is, it must appear at least twice in total.
@ -3874,6 +3874,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 18 December 2019 Last updated: 28 December 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "29 July 2019" "PCRE2 10.34" .TH PCRE2SYNTAX 3 "28 December 2019" "PCRE2 10.35"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -531,11 +531,13 @@ Each top-level branch of a lookbehind must be of a fixed length.
.sp .sp
These assertions are specific to PCRE2 and are not Perl-compatible. These assertions are specific to PCRE2 and are not Perl-compatible.
.sp .sp
(*napla:...) (?*...) )
(*non_atomic_positive_lookahead:...) (*napla:...) ) synonyms
.sp (*non_atomic_positive_lookahead:...) )
(*naplb:...) .sp
(*non_atomic_positive_lookbehind:...) (?<*...) )
(*naplb:...) ) synonyms
(*non_atomic_positive_lookbehind:...) )
. .
. .
.SH "SCRIPT RUNS" .SH "SCRIPT RUNS"
@ -670,6 +672,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 29 July 2019 Last updated: 28 December 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

View File

@ -3653,7 +3653,7 @@ while (ptr < ptrend)
if (ptr >= ptrend) goto UNCLOSED_PARENTHESIS; if (ptr >= ptrend) goto UNCLOSED_PARENTHESIS;
/* If ( is not followed by ? it is either a capture or a special verb or an /* If ( is not followed by ? it is either a capture or a special verb or an
alpha assertion. */ alpha assertion or a positive non-atomic lookahead. */
if (*ptr != CHAR_QUESTION_MARK) if (*ptr != CHAR_QUESTION_MARK)
{ {
@ -3685,10 +3685,10 @@ while (ptr < ptrend)
break; break;
/* Handle "alpha assertions" such as (*pla:...). Most of these are /* Handle "alpha assertions" such as (*pla:...). Most of these are
synonyms for the historical symbolic assertions, but the script run ones synonyms for the historical symbolic assertions, but the script run and
are new. They are distinguished by starting with a lower case letter. non-atomic lookaround ones are new. They are distinguished by starting
Checking both ends of the alphabet makes this work in all character with a lower case letter. Checking both ends of the alphabet makes this
codes. */ work in all character codes. */
else if (CHMAX_255(c) && (cb->ctypes[c] & ctype_lcletter) != 0) else if (CHMAX_255(c) && (cb->ctypes[c] & ctype_lcletter) != 0)
{ {
@ -3747,9 +3747,7 @@ while (ptr < ptrend)
goto POSITIVE_LOOK_AHEAD; goto POSITIVE_LOOK_AHEAD;
case META_LOOKAHEAD_NA: case META_LOOKAHEAD_NA:
*parsed_pattern++ = meta; goto POSITIVE_NONATOMIC_LOOK_AHEAD;
ptr++;
goto POST_ASSERTION;
case META_LOOKAHEADNOT: case META_LOOKAHEADNOT:
goto NEGATIVE_LOOK_AHEAD; goto NEGATIVE_LOOK_AHEAD;
@ -4438,6 +4436,12 @@ while (ptr < ptrend)
ptr++; ptr++;
goto POST_ASSERTION; goto POST_ASSERTION;
case CHAR_ASTERISK:
POSITIVE_NONATOMIC_LOOK_AHEAD: /* Come from (?* */
*parsed_pattern++ = META_LOOKAHEAD_NA;
ptr++;
goto POST_ASSERTION;
case CHAR_EXCLAMATION_MARK: case CHAR_EXCLAMATION_MARK:
NEGATIVE_LOOK_AHEAD: /* Come from (*nla: */ NEGATIVE_LOOK_AHEAD: /* Come from (*nla: */
*parsed_pattern++ = META_LOOKAHEADNOT; *parsed_pattern++ = META_LOOKAHEADNOT;
@ -4447,20 +4451,23 @@ while (ptr < ptrend)
/* ---- Lookbehind assertions ---- */ /* ---- Lookbehind assertions ---- */
/* (?< followed by = or ! is a lookbehind assertion. Otherwise (?< is the /* (?< followed by = or ! or * is a lookbehind assertion. Otherwise (?<
start of the name of a capturing group. */ is the start of the name of a capturing group. */
case CHAR_LESS_THAN_SIGN: case CHAR_LESS_THAN_SIGN:
if (ptrend - ptr <= 1 || if (ptrend - ptr <= 1 ||
(ptr[1] != CHAR_EQUALS_SIGN && ptr[1] != CHAR_EXCLAMATION_MARK)) (ptr[1] != CHAR_EQUALS_SIGN &&
ptr[1] != CHAR_EXCLAMATION_MARK &&
ptr[1] != CHAR_ASTERISK))
{ {
terminator = CHAR_GREATER_THAN_SIGN; terminator = CHAR_GREATER_THAN_SIGN;
goto DEFINE_NAME; goto DEFINE_NAME;
} }
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)? *parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
META_LOOKBEHIND : META_LOOKBEHINDNOT; META_LOOKBEHIND : (ptr[1] == CHAR_EXCLAMATION_MARK)?
META_LOOKBEHINDNOT : META_LOOKBEHIND_NA;
POST_LOOKBEHIND: /* Come from (*plb: (*naplb: and (*nlb: */ POST_LOOKBEHIND: /* Come from (*plb: (*naplb: and (*nlb: */
*has_lookbehind = TRUE; *has_lookbehind = TRUE;
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2); offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
PUTOFFSET(offset, parsed_pattern); PUTOFFSET(offset, parsed_pattern);
@ -4633,8 +4640,6 @@ while (ptr < ptrend)
*parsed_pattern++ = META_KET; *parsed_pattern++ = META_KET;
} }
if (top_nest == (nest_save *)(cb->start_workspace)) top_nest = NULL; if (top_nest == (nest_save *)(cb->start_workspace)) top_nest = NULL;
else top_nest--; else top_nest--;
} }

7
testdata/testinput2 vendored
View File

@ -5670,6 +5670,9 @@ a)"xI
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/ /\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4 word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
/\A(?*.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
/(*plb:(.)..|(.)...)(\1|\2)/ /(*plb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4 abcdb\=offset=4
abcda\=offset=4 abcda\=offset=4
@ -5678,6 +5681,10 @@ a)"xI
abcdb\=offset=4 abcdb\=offset=4
abcda\=offset=4 abcda\=offset=4
/(?<*(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
abcda\=offset=4
/(*non_atomic_positive_lookahead:ab)/B /(*non_atomic_positive_lookahead:ab)/B
/(*non_atomic_positive_lookbehind:ab)/B /(*non_atomic_positive_lookbehind:ab)/B

17
testdata/testoutput2 vendored
View File

@ -17088,6 +17088,11 @@ No match
0: word1 word3 word1 word2 word3 word2 word2 word1 word3 0: word1 word3 word1 word2 word3 word2 word2 word1 word3
1: word3 1: word3
/\A(?*.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
0: word1 word3 word1 word2 word3 word2 word2 word1 word3
1: word3
/(*plb:(.)..|(.)...)(\1|\2)/ /(*plb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4 abcdb\=offset=4
0: b 0: b
@ -17109,6 +17114,18 @@ No match
2: a 2: a
3: a 3: a
/(?<*(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
0: b
1: b
2: <unset>
3: b
abcda\=offset=4
0: a
1: <unset>
2: a
3: a
/(*non_atomic_positive_lookahead:ab)/B /(*non_atomic_positive_lookahead:ab)/B
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra