Implement non-atomic positive assertions.

Philip.Hazel 2019-07-13 11:12:03 +00:00
parent 691aca7a86
commit 620f3a1307
21 changed files with 1134 additions and 683 deletions


@ -88,6 +88,8 @@ otherwise), an atomic group, or a recursion.
17. Check for integer overflow when computing lookbehind lengths. Fixes
Clusterfuzz issue 15636.
18. Implement non-atomic positive lookaround assertions.
Version 10.33 16-April-2019
---------------------------

HACKING

@ -195,6 +195,7 @@ META_END End of pattern (this value is 0x80000000)
META_FAIL (*FAIL)
META_KET ) closing parenthesis
META_LOOKAHEAD (?= start of lookahead
META_LOOKAHEAD_NA (*napla: start of non-atomic lookahead
META_LOOKAHEADNOT (?! start of negative lookahead
META_NOCAPTURE (?: no capture parens
META_PLUS +
@ -286,8 +287,9 @@ The following are also followed just by an offset, but also the lower 16 bits
of the main word contain the length of the first branch of the lookbehind
group; this is used when generating OP_REVERSE for that branch.
META_LOOKBEHIND (?<=
META_LOOKBEHINDNOT (?<!
META_LOOKBEHIND (?<= start of lookbehind
META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
META_LOOKBEHINDNOT (?<! start of negative lookbehind
The following are followed by two elements, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
@ -715,13 +717,15 @@ Assertions
----------
Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
is OP_REVERSE, followed by a count of the number of characters to move back the
pointer in the subject string. In ASCII or UTF-32 mode, the count is also the
number of code units, but in UTF-8/16 mode each character may occupy more than
one code unit. A separate count is present in each alternative of a lookbehind
assertion, allowing them to have different (but fixed) lengths.
of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK,
OP_ASSERTBACK_NA, and OP_ASSERTBACK_NOT, and the first opcode inside the
assertion is OP_REVERSE, followed by a count of the number of characters to
move back the pointer in the subject string. In ASCII or UTF-32 mode, the count
is also the number of code units, but in UTF-8/16 mode each character may
occupy more than one code unit. A separate count is present in each alternative
of a lookbehind assertion, allowing each branch to have a different (but fixed)
length.
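
The difference between a character count and a code unit count can be illustrated with a short C sketch. This is not PCRE2's implementation: the function name move_back_utf8 is hypothetical, and the code assumes the subject is valid UTF-8. It steps back a given number of characters by skipping continuation bytes, which is the kind of work a character count after OP_REVERSE implies in UTF-8 mode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: step `count` characters back from
   `current` in a valid UTF-8 subject that begins at `start`. Returns the new
   position, or NULL if fewer than `count` characters precede `current`. */
const unsigned char *
move_back_utf8(const unsigned char *start, const unsigned char *current,
               size_t count)
{
  assert(start != NULL && current != NULL && current >= start);
  while (count-- > 0)
    {
    if (current <= start) return NULL;   /* Not enough characters */
    current--;
    /* Continuation bytes have the bit pattern 10xxxxxx; skip over them to
       reach the first byte of the character. */
    while (current > start && (*current & 0xc0u) == 0x80u) current--;
    }
  return current;
}
```

In a subject holding only single-byte characters, the loop moves back exactly one code unit per character, which matches the ASCII case described above.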
Conditional subpatterns
@ -754,11 +758,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
or OP_FALSE.
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
must start with a parenthesized assertion, whose opcode normally immediately
follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
callout is inserted immediately before the assertion. It is also possible to
insert a manual callout at this point. Only assertion conditions may have
callouts preceding the condition.
must start with a parenthesized atomic assertion, whose opcode normally
immediately follows OP_COND or OP_SCOND. However, if automatic callouts are
enabled, a callout is inserted immediately before the assertion. It is also
possible to insert a manual callout at this point. Only assertion conditions
may have callouts preceding the condition.
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
parts of the pattern, so this is another opcode that may appear as a condition.
@ -823,4 +827,4 @@ not a real opcode, but is used to check at compile time that tables indexed by
opcode are the correct length, in order to catch updating errors.
Philip Hazel
20 July 2018
12 July 2019


@ -205,6 +205,11 @@ different way and is not Perl-compatible.
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within
the pattern.
<br>
<br>
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
extension to the lookaround facilities. The default, Perl-compatible
lookarounds are atomic.
</P>
<P>
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
@ -234,7 +239,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 12 February 2019
Last updated: 13 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -33,17 +33,18 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL GROUPS</a>
<li><a name="TOC23" href="#SEC23">COMMENTS</a>
<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a>
<li><a name="TOC25" href="#SEC25">GROUPS AS SUBROUTINES</a>
<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a>
<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
<li><a name="TOC30" href="#SEC30">AUTHOR</a>
<li><a name="TOC31" href="#SEC31">REVISION</a>
<li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
<li><a name="TOC23" href="#SEC23">CONDITIONAL GROUPS</a>
<li><a name="TOC24" href="#SEC24">COMMENTS</a>
<li><a name="TOC25" href="#SEC25">RECURSIVE PATTERNS</a>
<li><a name="TOC26" href="#SEC26">GROUPS AS SUBROUTINES</a>
<li><a name="TOC27" href="#SEC27">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC28" href="#SEC28">CALLOUTS</a>
<li><a name="TOC29" href="#SEC29">BACKTRACKING CONTROL</a>
<li><a name="TOC30" href="#SEC30">SEE ALSO</a>
<li><a name="TOC31" href="#SEC31">AUTHOR</a>
<li><a name="TOC32" href="#SEC32">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
<P>
@ -2364,19 +2365,23 @@ those that look behind it, and in each case an assertion may be positive (must
match for the assertion to be true) or negative (must not match for the
assertion to be true). An assertion group is matched in the normal way,
and if it is true, matching continues after it, but with the matching position
in the subject string is was it was before the assertion was processed.
in the subject string reset to what it was before the assertion was processed.
</P>
<P>
A lookaround assertion may also appear as the condition in a
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic assertions can be
useful. PCRE2 has some support for these, described in the section entitled
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
below, but they are not Perl-compatible.
</P>
<P>
A lookaround assertion may appear as the condition in a
<a href="#conditions">conditional group</a>
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
</P>
<P>
Lookaround assertions are atomic. If an assertion is true, but there is a
subsequent matching failure, there is no backtracking into the assertion.
</P>
<P>
Assertion groups are not capture groups. If an assertion contains capture
groups within it, these are counted for the purposes of numbering the capture
groups in the whole pattern. Within each branch of an assertion, locally
@ -2429,11 +2434,11 @@ The assertion is obeyed just once when encountered during matching.
Alphabetic assertion names
</b><br>
<P>
Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to
specify lookaround assertions. Perl 5.28 introduced some experimental
alphabetic alternatives which might be easier to remember. They all start with
(* instead of (? and must be written using lower case letters. PCRE2 supports
the following synonyms:
<pre>
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
@ -2606,8 +2611,63 @@ preceded by "foo", while
</pre>
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
<a name="nonatomicassertions"></a></P>
<br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
<P>
The traditional Perl-compatible lookaround assertions are atomic. That is, if
an assertion is true, but there is a subsequent matching failure, there is no
backtracking into the assertion. However, there are some cases where non-atomic
positive assertions can be useful. PCRE2 provides these using the following
syntax:
<pre>
(*non_atomic_positive_lookahead: or (*napla:
(*non_atomic_positive_lookbehind: or (*naplb:
</pre>
Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total.
This pattern returns the required result as captured substring 1:
<pre>
^(?x)(*napla: .* \b(\w++)) (?&#62; .*? \b\1\b ){2}
</pre>
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
"x" option, which causes white space (introduced for readability) to be
ignored. Inside the assertion, the greedy .* at first consumes the entire
string, but then has to backtrack until the rest of the assertion can match a
word, which is captured by group 1. In other words, when the assertion first
succeeds, it captures the right-most word in the string.
</P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
<P>
The current matching point is then reset to the start of the subject, and the
rest of the pattern match checks for two occurrences of the captured word,
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
assertion could not be re-entered, and the whole match would fail. The pattern
would succeed only if the very last word in the subject was found twice.
</P>
<P>
Using a non-atomic lookahead, however, means that when the last word does not
occur twice in the string, the lookahead can backtrack and find the second-last
word, and so on, until either the match succeeds, or all words have been
tested.
</P>
<P>
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack into the
assertion, and there must be a backreference to a changed group later in the
pattern. If this is not the case, the rest of the pattern match fails exactly
as before because nothing has changed, so using a non-atomic assertion just
wastes resources.
</P>
<P>
Non-atomic assertions are not supported by the alternative matching function
<b>pcre2_dfa_match()</b>. They are also not supported by JIT (but may be in
future). Note that assertions that appear as conditions for
<a href="#conditions">conditional groups</a>
(see below) must be atomic.
</P>
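
To make the search order of the example pattern concrete, here is a plain C sketch that computes the same result without using PCRE2 at all. It is only an illustration under stated assumptions: the function names are hypothetical, words are taken to be runs of ASCII letters, digits and underscores, and the subject is a single line. Candidates are tried from the right, as the backtracking greedy .* does inside the assertion, and the first candidate that occurs at least twice as a whole word is accepted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers for illustration; none of these are PCRE2 functions. */

static int is_word_char(char c)
{
  return isalnum((unsigned char)c) || c == '_';
}

/* Count how many whole words in `subject` equal word[0..len). */
static size_t count_whole_word(const char *subject, const char *word,
                               size_t len)
{
  size_t count = 0;
  size_t n = strlen(subject);
  size_t i = 0;
  while (i < n)
    {
    size_t start;
    if (!is_word_char(subject[i])) { i++; continue; }
    start = i;
    while (i < n && is_word_char(subject[i])) i++;
    if (i - start == len && memcmp(subject + start, word, len) == 0) count++;
    }
  return count;
}

/* Return a pointer to the right-most word that occurs at least twice, and
   store its length in *len. Return NULL if there is no such word. */
const char *rightmost_repeated_word(const char *subject, size_t *len)
{
  size_t end;
  assert(subject != NULL && len != NULL);
  end = strlen(subject);
  while (end > 0)
    {
    size_t start;
    /* Move left past non-word characters to the end of the next candidate. */
    while (end > 0 && !is_word_char(subject[end - 1])) end--;
    if (end == 0) break;
    start = end;
    while (start > 0 && is_word_char(subject[start - 1])) start--;
    /* This check plays the part of the two backreference matches. */
    if (count_whole_word(subject, subject + start, end - start) >= 2)
      {
      *len = end - start;
      return subject + start;
      }
    end = start;   /* "Backtrack": try the next word to the left */
    }
  return NULL;
}
```

Moving on to the next word to the left corresponds to backtracking into the non-atomic assertion. An atomic assertion would correspond to giving up after the first candidate fails.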
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
<P>
In concept, a script run is a sequence of characters that are all from the same
Unicode script such as Latin or Greek. However, because some scripts are
@ -2669,7 +2729,7 @@ parentheses.
should not be used within a script run group, because it causes an immediate
exit from the group, bypassing the script run checking.
<a name="conditions"></a></P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL GROUPS</a><br>
<br><a name="SEC23" href="#TOC1">CONDITIONAL GROUPS</a><br>
<P>
It is possible to cause the matching process to obey a pattern fragment
conditionally or to choose between two alternative fragments, depending on
@ -2845,8 +2905,13 @@ Assertion conditions
<P>
If the condition is not in any of the above formats, it must be a parenthesized
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant white space,
and with the two alternatives on the second line:
assertion. However, it must be a traditional atomic assertion, not one of the
PCRE2-specific
<a href="#nonatomicassertions">non-atomic assertions.</a>
</P>
<P>
Consider this pattern, again containing non-significant white space, and with
the two alternatives on the second line:
<pre>
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
@ -2865,7 +2930,7 @@ positive and negative assertions, because matching always continues after the
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
for which captures are retained only for positive assertions that succeed.)
<a name="comments"></a></P>
<br><a name="SEC23" href="#TOC1">COMMENTS</a><br>
<br><a name="SEC24" href="#TOC1">COMMENTS</a><br>
<P>
There are two ways of including comments in patterns that are processed by
PCRE2. In both cases, the start of the comment must not be in a character
@ -2895,7 +2960,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so.
<a name="recursion"></a></P>
<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br>
<br><a name="SEC25" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
@ -3083,7 +3148,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
<a name="groupsassubroutines"></a></P>
<br><a name="SEC25" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
<br><a name="SEC26" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive group call (either by number or by name) is used
outside the parentheses to which it refers, it operates a bit like a subroutine
@ -3131,7 +3196,7 @@ in groups when called as subroutines is described in the section entitled
<a href="#btsub">"Backtracking verbs in subroutines"</a>
below.
<a name="onigurumasubroutines"></a></P>
<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<br><a name="SEC27" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
@ -3149,7 +3214,7 @@ plus or a minus sign it is taken as a relative reference. For example:
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a subroutine call.
</P>
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
<br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
<P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
@ -3225,7 +3290,7 @@ example:
</pre>
The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br>
<br><a name="SEC29" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
@ -3739,12 +3804,12 @@ enclosing group that has alternatives (its normal behaviour). However, if there
is no such group within the subroutine's group, the subroutine match fails and
there is a backtrack at the outer level.
</P>
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC30" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC31" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -3753,9 +3818,9 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 22 June 2019
Last updated: 13 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -32,15 +32,16 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a>
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a>
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>
<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
<li><a name="TOC27" href="#SEC27">AUTHOR</a>
<li><a name="TOC28" href="#SEC28">REVISION</a>
<li><a name="TOC20" href="#SEC20">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
<li><a name="TOC22" href="#SEC22">BACKREFERENCES</a>
<li><a name="TOC23" href="#SEC23">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC24" href="#SEC24">CONDITIONAL PATTERNS</a>
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
<li><a name="TOC27" href="#SEC27">SEE ALSO</a>
<li><a name="TOC28" href="#SEC28">AUTHOR</a>
<li><a name="TOC29" href="#SEC29">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
@ -544,7 +545,18 @@ setting with a similar syntax.
</pre>
Each top-level branch of a lookbehind must be of a fixed length.
</P>
<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br>
<br><a name="SEC20" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P>
These assertions are specific to PCRE2 and are not Perl-compatible.
<pre>
(*napla:...)
(*non_atomic_positive_lookahead:...)
(*naplb:...)
(*non_atomic_positive_lookbehind:...)
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
<P>
<pre>
(*script_run:...) ) script run, can be backtracked into
@ -554,7 +566,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
(*asr:...) )
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
<br><a name="SEC22" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
@ -571,7 +583,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
(?P=name) reference by name (Python)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<br><a name="SEC23" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
@ -590,7 +602,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
\g'-n' call subroutine by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<br><a name="SEC24" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
@ -613,7 +625,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@ -640,7 +652,7 @@ pattern is not anchored.
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
</P>
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
(?C) callout (assumed number 0)
@ -651,12 +663,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC28" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -665,9 +677,9 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<br><a name="SEC29" href="#TOC1">REVISION</a><br>
<P>
Last updated: 11 February 2019
Last updated: 12 July 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

File diff suppressed because it is too large.


@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "12 February 2019" "PCRE2 10.33"
.TH PCRE2COMPAT 3 "13 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -170,6 +170,10 @@ different way and is not Perl-compatible.
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within
the pattern.
.sp
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
extension to the lookaround facilities. The default, Perl-compatible
lookarounds are atomic.
.P
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
modifier restricts /i case-insensitive matching to pure ascii, ignoring Unicode
@ -199,6 +203,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 12 February 2019
Last updated: 13 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "22 June 2019" "PCRE2 10.34"
.TH PCRE2PATTERN 3 "13 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2370,9 +2370,19 @@ those that look behind it, and in each case an assertion may be positive (must
match for the assertion to be true) or negative (must not match for the
assertion to be true). An assertion group is matched in the normal way,
and if it is true, matching continues after it, but with the matching position
in the subject string is was it was before the assertion was processed.
in the subject string reset to what it was before the assertion was processed.
.P
A lookaround assertion may also appear as the condition in a
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic assertions can be
useful. PCRE2 has some support for these, described in the section entitled
.\" HTML <a href="#nonatomicassertions">
.\" </a>
"Non-atomic assertions"
.\"
below, but they are not Perl-compatible.
.P
A lookaround assertion may appear as the condition in a
.\" HTML <a href="#conditions">
.\" </a>
conditional group
@ -2380,9 +2390,6 @@ conditional group
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
.P
Lookaround assertions are atomic. If an assertion is true, but there is a
subsequent matching failure, there is no backtracking into the assertion.
.P
Assertion groups are not capture groups. If an assertion contains capture
groups within it, these are counted for the purposes of numbering the capture
groups in the whole pattern. Within each branch of an assertion, locally
@ -2435,11 +2442,11 @@ The assertion is obeyed just once when encountered during matching.
.SS "Alphabetic assertion names"
.rs
.sp
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
Traditionally, symbolic sequences such as (?= and (?<= have been used to
specify lookaround assertions. Perl 5.28 introduced some experimental
alphabetic alternatives which might be easier to remember. They all start with
(* instead of (? and must be written using lower case letters. PCRE2 supports
the following synonyms:
.sp
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
@ -2616,6 +2623,63 @@ is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
.
.
.\" HTML <a name="nonatomicassertions"></a>
.SH "NON-ATOMIC ASSERTIONS"
.rs
.sp
The traditional Perl-compatible lookaround assertions are atomic. That is, if
an assertion is true, but there is a subsequent matching failure, there is no
backtracking into the assertion. However, there are some cases where non-atomic
positive assertions can be useful. PCRE2 provides these using the following
syntax:
.sp
(*non_atomic_positive_lookahead: or (*napla:
(*non_atomic_positive_lookbehind: or (*naplb:
.sp
Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total.
This pattern returns the required result as captured substring 1:
.sp
^(?x)(*napla: .* \eb(\ew++)) (?> .*? \eb\e1\eb ){2}
.sp
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
"x" option, which causes white space (introduced for readability) to be
ignored. Inside the assertion, the greedy .* at first consumes the entire
string, but then has to backtrack until the rest of the assertion can match a
word, which is captured by group 1. In other words, when the assertion first
succeeds, it captures the right-most word in the string.
.P
The current matching point is then reset to the start of the subject, and the
rest of the pattern match checks for two occurrences of the captured word,
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
assertion could not be re-entered, and the whole match would fail. The pattern
would succeed only if the very last word in the subject was found twice.
.P
Using a non-atomic lookahead, however, means that when the last word does not
occur twice in the string, the lookahead can backtrack and find the second-last
word, and so on, until either the match succeeds, or all words have been
tested.
.P
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack into the
assertion, and there must be a backreference to a changed group later in the
pattern. If this is not the case, the rest of the pattern match fails exactly
as before because nothing has changed, so using a non-atomic assertion just
wastes resources.
.P
Non-atomic assertions are not supported by the alternative matching function
\fBpcre2_dfa_match()\fP. They are also not supported by JIT (but may be in
future). Note that assertions that appear as conditions for
.\" HTML <a href="#conditions">
.\" </a>
conditional groups
.\"
(see below) must be atomic.
.
.
.SH "SCRIPT RUNS"
.rs
.sp
@ -2867,8 +2931,15 @@ than two digits.
.sp
If the condition is not in any of the above formats, it must be a parenthesized
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant white space,
and with the two alternatives on the second line:
assertion. However, it must be a traditional atomic assertion, not one of the
PCRE2-specific
.\" HTML <a href="#nonatomicassertions">
.\" </a>
non-atomic assertions.
.\"
.P
Consider this pattern, again containing non-significant white space, and with
the two alternatives on the second line:
.sp
(?(?=[^a-z]*[a-z])
\ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
@ -3788,6 +3859,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 22 June 2019
Last updated: 13 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "11 February 2019" "PCRE2 10.33"
.TH PCRE2SYNTAX 3 "12 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -522,6 +522,18 @@ setting with a similar syntax.
Each top-level branch of a lookbehind must be of a fixed length.
.
.
.SH "NON-ATOMIC LOOKAROUND ASSERTIONS"
.rs
.sp
These assertions are specific to PCRE2 and are not Perl-compatible.
.sp
(*napla:...)
(*non_atomic_positive_lookahead:...)
.sp
(*naplb:...)
(*non_atomic_positive_lookbehind:...)
.
.
.SH "SCRIPT RUNS"
.rs
.sp
@ -654,6 +666,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 11 February 2019
Last updated: 12 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -307,6 +307,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197
#define PCRE2_ERROR_CONDITION_ATOMIC_ASSERTION_EXPECTED 198
/* "Expected" matching error codes: no match and partial match. */


@ -624,6 +624,13 @@ for(;;)
case OP_ASSERTBACK_NOT:
case OP_ONCE:
return !entered_a_group;
/* Non-atomic assertions - don't possessify last iterator. This needs
more thought. */
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
return FALSE;
}
/* Skip over the bracket and inspect what comes next. */


@ -250,36 +250,41 @@ is present where expected in a conditional group. */
#define META_LOOKBEHIND 0x80250000u /* (?<= */
#define META_LOOKBEHINDNOT 0x80260000u /* (?<! */
/* These cannot be conditions */
#define META_LOOKAHEAD_NA 0x80270000u /* (*napla: */
#define META_LOOKBEHIND_NA 0x80280000u /* (*naplb: */
/* These must be kept in this order, with consecutive values, and the _ARG
versions of COMMIT, PRUNE, SKIP, and THEN immediately after their non-argument
versions. */
#define META_MARK 0x80270000u /* (*MARK) */
#define META_ACCEPT 0x80280000u /* (*ACCEPT) */
#define META_FAIL 0x80290000u /* (*FAIL) */
#define META_COMMIT 0x802a0000u /* These */
#define META_COMMIT_ARG 0x802b0000u /* pairs */
#define META_PRUNE 0x802c0000u /* must */
#define META_PRUNE_ARG 0x802d0000u /* be */
#define META_SKIP 0x802e0000u /* kept */
#define META_SKIP_ARG 0x802f0000u /* in */
#define META_THEN 0x80300000u /* this */
#define META_THEN_ARG 0x80310000u /* order */
#define META_MARK 0x80290000u /* (*MARK) */
#define META_ACCEPT 0x802a0000u /* (*ACCEPT) */
#define META_FAIL 0x802b0000u /* (*FAIL) */
#define META_COMMIT 0x802c0000u /* These */
#define META_COMMIT_ARG 0x802d0000u /* pairs */
#define META_PRUNE 0x802e0000u /* must */
#define META_PRUNE_ARG 0x802f0000u /* be */
#define META_SKIP 0x80300000u /* kept */
#define META_SKIP_ARG 0x80310000u /* in */
#define META_THEN 0x80320000u /* this */
#define META_THEN_ARG 0x80330000u /* order */
/* These must be kept in groups of adjacent 3 values, and all together. */
#define META_ASTERISK 0x80320000u /* * */
#define META_ASTERISK_PLUS 0x80330000u /* *+ */
#define META_ASTERISK_QUERY 0x80340000u /* *? */
#define META_PLUS 0x80350000u /* + */
#define META_PLUS_PLUS 0x80360000u /* ++ */
#define META_PLUS_QUERY 0x80370000u /* +? */
#define META_QUERY 0x80380000u /* ? */
#define META_QUERY_PLUS 0x80390000u /* ?+ */
#define META_QUERY_QUERY 0x803a0000u /* ?? */
#define META_MINMAX 0x803b0000u /* {n,m} repeat */
#define META_MINMAX_PLUS 0x803c0000u /* {n,m}+ repeat */
#define META_MINMAX_QUERY 0x803d0000u /* {n,m}? repeat */
#define META_ASTERISK 0x80340000u /* * */
#define META_ASTERISK_PLUS 0x80350000u /* *+ */
#define META_ASTERISK_QUERY 0x80360000u /* *? */
#define META_PLUS 0x80370000u /* + */
#define META_PLUS_PLUS 0x80380000u /* ++ */
#define META_PLUS_QUERY 0x80390000u /* +? */
#define META_QUERY 0x803a0000u /* ? */
#define META_QUERY_PLUS 0x803b0000u /* ?+ */
#define META_QUERY_QUERY 0x803c0000u /* ?? */
#define META_MINMAX 0x803d0000u /* {n,m} repeat */
#define META_MINMAX_PLUS 0x803e0000u /* {n,m}+ repeat */
#define META_MINMAX_QUERY 0x803f0000u /* {n,m}? repeat */
#define META_FIRST_QUANTIFIER META_ASTERISK
#define META_LAST_QUANTIFIER META_MINMAX_QUERY
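The renumbered quantifier codes above still follow the "groups of adjacent 3 values" layout that the comment demands. The following is a minimal sketch of why that layout is convenient, assuming the greedy, possessive and lazy forms are told apart by arithmetic on the upper 16 bits. The helper names `is_quantifier`, `quantifier_variant` and `quantifier_base` are hypothetical and do not appear in pcre2_compile.c.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the new definitions in this hunk. */
#define META_ASTERISK         0x80340000u  /* *  */
#define META_PLUS             0x80370000u  /* +  */
#define META_PLUS_QUERY       0x80390000u  /* +? */
#define META_MINMAX           0x803d0000u  /* {n,m} repeat  */
#define META_MINMAX_PLUS      0x803e0000u  /* {n,m}+ repeat */
#define META_MINMAX_QUERY     0x803f0000u  /* {n,m}? repeat */

#define META_FIRST_QUANTIFIER META_ASTERISK
#define META_LAST_QUANTIFIER  META_MINMAX_QUERY

/* Hypothetical helper: a range test works because the codes are contiguous. */
static int is_quantifier(uint32_t meta)
{
return meta >= META_FIRST_QUANTIFIER && meta <= META_LAST_QUANTIFIER;
}

/* Hypothetical helper: 0 = greedy, 1 = possessive, 2 = lazy. This relies on
each quantifier occupying three adjacent codes in that order. */
static unsigned int quantifier_variant(uint32_t meta)
{
assert(is_quantifier(meta));
return ((meta - META_FIRST_QUANTIFIER) >> 16) % 3u;
}

/* Hypothetical helper: map any variant back to its greedy base code. */
static uint32_t quantifier_base(uint32_t meta)
{
return meta - ((uint32_t)quantifier_variant(meta) << 16);
}
```

Inserting the two new lookaround codes below META_MARK shifts every later value by 0x00020000, but range and modulo tests of this kind keep working because the relative order is unchanged.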
@ -335,6 +340,8 @@ static unsigned char meta_extra_lengths[] = {
0, /* META_LOOKAHEADNOT */
SIZEOFFSET, /* META_LOOKBEHIND */
SIZEOFFSET, /* META_LOOKBEHINDNOT */
0, /* META_LOOKAHEAD_NA */
SIZEOFFSET, /* META_LOOKBEHIND_NA */
1, /* META_MARK - plus the string length */
0, /* META_ACCEPT */
0, /* META_FAIL */
@ -637,10 +644,14 @@ typedef struct alasitem {
static const char alasnames[] =
STRING_pla0
STRING_plb0
STRING_napla0
STRING_naplb0
STRING_nla0
STRING_nlb0
STRING_positive_lookahead0
STRING_positive_lookbehind0
STRING_non_atomic_positive_lookahead0
STRING_non_atomic_positive_lookbehind0
STRING_negative_lookahead0
STRING_negative_lookbehind0
STRING_atomic0
@ -652,10 +663,14 @@ static const char alasnames[] =
static const alasitem alasmeta[] = {
{ 3, META_LOOKAHEAD },
{ 3, META_LOOKBEHIND },
{ 5, META_LOOKAHEAD_NA },
{ 5, META_LOOKBEHIND_NA },
{ 3, META_LOOKAHEADNOT },
{ 3, META_LOOKBEHINDNOT },
{ 18, META_LOOKAHEAD },
{ 19, META_LOOKBEHIND },
{ 29, META_LOOKAHEAD_NA },
{ 30, META_LOOKBEHIND_NA },
{ 18, META_LOOKAHEADNOT },
{ 19, META_LOOKBEHINDNOT },
{ 6, META_ATOMIC },
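The two tables above work as a pair: `alasnames` is one concatenated run of NUL-terminated names, and each `alasmeta` entry records the length of the name at the same position. As a rough sketch of how such a pair can be searched (not the actual PCRE2 lookup code), the example below walks both in step. The `KIND_*` constants and the function `lookup_alas` are hypothetical stand-ins, and only the short names are included.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the META_ codes. */
enum { KIND_PLA, KIND_PLB, KIND_NAPLA, KIND_NAPLB, KIND_NLA, KIND_NLB };

typedef struct alasitem {
  unsigned int len;   /* length of the name, excluding the NUL */
  int kind;           /* what the name maps to */
} alasitem;

/* Names in the same order as the items; adjacent literals are concatenated. */
static const char alasnames[] =
  "pla\0" "plb\0" "napla\0" "naplb\0" "nla\0" "nlb\0";

static const alasitem alasmeta[] = {
  { 3, KIND_PLA },   { 3, KIND_PLB },
  { 5, KIND_NAPLA }, { 5, KIND_NAPLB },
  { 3, KIND_NLA },   { 3, KIND_NLB },
};

/* Returns the kind for a name of the given length, or -1 if unknown. */
static int lookup_alas(const char *name, size_t namelen)
{
const char *p = alasnames;
size_t i;
assert(name != NULL);
for (i = 0; i < sizeof(alasmeta)/sizeof(alasmeta[0]); i++)
  {
  if (namelen == alasmeta[i].len && memcmp(name, p, namelen) == 0)
    return alasmeta[i].kind;
  p += alasmeta[i].len + 1;   /* step over the name and its NUL */
  }
return -1;
}
```

Because the length is compared first, "nap" is not mistaken for a prefix match of "napla". The sketch also shows why the order of the entries in the two tables must agree, which is why the commit adds the new names at the same position in both.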
@ -784,7 +799,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97 };
ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97, ERR98 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -1015,6 +1030,7 @@ for (;;)
case META_NOCAPTURE: fprintf(stderr, "META (?:"); break;
case META_LOOKAHEAD: fprintf(stderr, "META (?="); break;
case META_LOOKAHEADNOT: fprintf(stderr, "META (?!"); break;
case META_LOOKAHEAD_NA: fprintf(stderr, "META (*napla:"); break;
case META_SCRIPT_RUN: fprintf(stderr, "META (*sr:"); break;
case META_KET: fprintf(stderr, "META )"); break;
case META_ALT: fprintf(stderr, "META | %d", meta_arg); break;
@ -1046,6 +1062,12 @@ for (;;)
fprintf(stderr, "%zd", offset);
break;
case META_LOOKBEHIND_NA:
fprintf(stderr, "META (*naplb: %d offset=", meta_arg);
GETOFFSET(offset, pptr);
fprintf(stderr, "%zd", offset);
break;
case META_LOOKBEHINDNOT:
fprintf(stderr, "META (?<! %d offset=", meta_arg);
GETOFFSET(offset, pptr);
@ -3695,19 +3717,20 @@ while (ptr < ptrend)
goto FAILED;
}
/* Check for expecting an assertion condition. If so, only lookaround
assertions are valid. */
/* Check for expecting an assertion condition. If so, only atomic
lookaround assertions are valid. */
meta = alasmeta[i].meta;
if (prev_expect_cond_assert > 0 &&
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
{
errorcode = ERR28; /* Assertion expected */
errorcode = (meta == META_LOOKAHEAD_NA || meta == META_LOOKBEHIND_NA)?
ERR98 : ERR28; /* (Atomic) assertion expected */
goto FAILED;
}
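The check above can be read in isolation as a small decision function. This is a sketch only: `cond_assert_error` is a hypothetical name, the values of META_LOOKAHEAD, META_LOOKAHEADNOT and META_ATOMIC are assumptions (they are not shown in this diff), and ERR28 = 128 is inferred from ERR98 being reported as error 198 in the test output.

```c
#include <assert.h>
#include <stdint.h>

#define META_ATOMIC         0x80020000u  /* assumed value, for illustration */
#define META_LOOKAHEAD      0x80230000u  /* assumed value */
#define META_LOOKAHEADNOT   0x80240000u  /* assumed value */
#define META_LOOKBEHIND     0x80250000u  /* from this diff */
#define META_LOOKBEHINDNOT  0x80260000u  /* from this diff */
#define META_LOOKAHEAD_NA   0x80270000u  /* from this diff */
#define META_LOOKBEHIND_NA  0x80280000u  /* from this diff */

enum { ERR28 = 128, ERR98 = 198 };

/* Hypothetical helper: returns 0 when meta is acceptable as the condition
of a conditional group, otherwise the error code to report. */
static int cond_assert_error(uint32_t meta)
{
/* The four atomic lookarounds have consecutive codes, so a range test
identifies them. */
if (meta >= META_LOOKAHEAD && meta <= META_LOOKBEHINDNOT) return 0;

/* Non-atomic assertions get their own, more specific message. */
return (meta == META_LOOKAHEAD_NA || meta == META_LOOKBEHIND_NA)?
  ERR98 : ERR28;
}
```

Placing the two new codes immediately after META_LOOKBEHINDNOT keeps the existing range test valid, which matches the "These cannot be conditions" comment on their definitions.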
/* The lookaround alphabetic synonyms can be almost entirely handled by
jumping to the code that handles the traditional symbolic forms. */
/* The lookaround alphabetic synonyms can mostly be handled by jumping
to the code that handles the traditional symbolic forms. */
switch(meta)
{
@ -3721,11 +3744,17 @@ while (ptr < ptrend)
case META_LOOKAHEAD:
goto POSITIVE_LOOK_AHEAD;
case META_LOOKAHEAD_NA:
*parsed_pattern++ = meta;
ptr++;
goto POST_ASSERTION;
case META_LOOKAHEADNOT:
goto NEGATIVE_LOOK_AHEAD;
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
*parsed_pattern++ = meta;
ptr--;
goto POST_LOOKBEHIND;
@ -4429,7 +4458,7 @@ while (ptr < ptrend)
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
META_LOOKBEHIND : META_LOOKBEHINDNOT;
POST_LOOKBEHIND: /* Come from (*plb: and (*nlb: */
POST_LOOKBEHIND: /* Come from (*plb: (*naplb: and (*nlb: */
*has_lookbehind = TRUE;
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
PUTOFFSET(offset, parsed_pattern);
@ -6300,6 +6329,11 @@ for (;; pptr++)
cb->assert_depth += 1;
goto GROUP_PROCESS;
case META_LOOKAHEAD_NA:
bravalue = OP_ASSERT_NA;
cb->assert_depth += 1;
goto GROUP_PROCESS;
/* Optimize (?!) to (*FAIL) unless it is quantified - which is a weird
thing to do, but Perl allows all assertions to be quantified, and when
they contain capturing parentheses there may be a potential use for
@ -6331,6 +6365,11 @@ for (;; pptr++)
cb->assert_depth += 1;
goto GROUP_PROCESS;
case META_LOOKBEHIND_NA:
bravalue = OP_ASSERTBACK_NA;
cb->assert_depth += 1;
goto GROUP_PROCESS;
case META_ATOMIC:
bravalue = OP_ONCE;
goto GROUP_PROCESS_NOTE_EMPTY;
@ -7931,7 +7970,10 @@ length = 2 + 2*LINK_SIZE + skipunits;
/* Remember if this is a lookbehind assertion, and if it is, save its length
and skip over the pattern offset. */
lookbehind = *code == OP_ASSERTBACK || *code == OP_ASSERTBACK_NOT;
lookbehind = *code == OP_ASSERTBACK ||
*code == OP_ASSERTBACK_NOT ||
*code == OP_ASSERTBACK_NA;
if (lookbehind)
{
lookbehindlength = META_DATA(pptr[-1]);
@ -8802,8 +8844,10 @@ for (;; pptr++)
case META_COND_VERSION:
case META_LOOKAHEAD:
case META_LOOKAHEADNOT:
case META_LOOKAHEAD_NA:
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
case META_NOCAPTURE:
case META_SCRIPT_RUN:
nestlevel++;
@ -9064,6 +9108,7 @@ for (;; pptr++)
case META_LOOKAHEAD:
case META_LOOKAHEADNOT:
case META_LOOKAHEAD_NA:
pptr = parsed_skip(pptr + 1, PSKIP_KET);
if (pptr == NULL) goto PARSED_SKIP_FAILED;
@ -9102,6 +9147,7 @@ for (;; pptr++)
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
return -1;
if (max - branchlength > extra) extra = max - branchlength;
@ -9453,6 +9499,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
case META_KET:
case META_LOOKAHEAD:
case META_LOOKAHEADNOT:
case META_LOOKAHEAD_NA:
case META_NOCAPTURE:
case META_PLUS:
case META_PLUS_PLUS:
@ -9514,6 +9561,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
return errorcode;
break;


@ -173,6 +173,8 @@ static const uint8_t coptable[] = {
0, /* Assert not */
0, /* Assert behind */
0, /* Assert behind not */
0, /* NA assert */
0, /* NA assert behind */
0, /* ONCE */
0, /* SCRIPT_RUN */
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
@ -248,6 +250,8 @@ static const uint8_t poptable[] = {
0, /* Assert not */
0, /* Assert behind */
0, /* Assert behind not */
0, /* NA assert */
0, /* NA assert behind */
0, /* ONCE */
0, /* SCRIPT_RUN */
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */


@ -185,6 +185,7 @@ static const unsigned char compile_error_texts[] =
"(*alpha_assertion) not recognized\0"
"script runs require Unicode support, which this version of PCRE2 does not have\0"
"too many capturing groups (maximum 65535)\0"
"atomic assertion expected after (?( or (?(?C)\0"
;
/* Match-time and UTF error texts are in the same format. */


@ -883,12 +883,16 @@ a positive value. */
#define STRING_atomic0 "atomic\0"
#define STRING_pla0 "pla\0"
#define STRING_plb0 "plb\0"
#define STRING_napla0 "napla\0"
#define STRING_naplb0 "naplb\0"
#define STRING_nla0 "nla\0"
#define STRING_nlb0 "nlb\0"
#define STRING_sr0 "sr\0"
#define STRING_asr0 "asr\0"
#define STRING_positive_lookahead0 "positive_lookahead\0"
#define STRING_positive_lookbehind0 "positive_lookbehind\0"
#define STRING_non_atomic_positive_lookahead0 "non_atomic_positive_lookahead\0"
#define STRING_non_atomic_positive_lookbehind0 "non_atomic_positive_lookbehind\0"
#define STRING_negative_lookahead0 "negative_lookahead\0"
#define STRING_negative_lookbehind0 "negative_lookbehind\0"
#define STRING_script_run0 "script_run\0"
@ -1173,12 +1177,16 @@ only. */
#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
#define STRING_pla0 STR_p STR_l STR_a "\0"
#define STRING_plb0 STR_p STR_l STR_b "\0"
#define STRING_napla0 STR_n STR_a STR_p STR_l STR_a "\0"
#define STRING_naplb0 STR_n STR_a STR_p STR_l STR_b "\0"
#define STRING_nla0 STR_n STR_l STR_a "\0"
#define STRING_nlb0 STR_n STR_l STR_b "\0"
#define STRING_sr0 STR_s STR_r "\0"
#define STRING_asr0 STR_a STR_s STR_r "\0"
#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_non_atomic_positive_lookahead0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_non_atomic_positive_lookbehind0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
@ -1303,7 +1311,7 @@ enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
Starting from 1 (i.e. after OP_END), the values up to OP_EOD must correspond in
order to the list of escapes immediately above. Furthermore, values up to
OP_DOLLM must not be changed without adjusting the table called autoposstab in
pcre2_auto_possess.c
pcre2_auto_possess.c.
Whenever this list is updated, the two macro definitions that follow must be
updated to match. The possessification table called "opcode_possessify" in
@ -1501,80 +1509,81 @@ enum {
OP_KETRMIN, /* 123 order. They are for groups that repeat for ever. */
OP_KETRPOS, /* 124 Possessive unlimited repeat. */
/* The assertions must come before BRA, CBRA, ONCE, and COND, and the four
asserts must remain in order. */
/* The assertions must come before BRA, CBRA, ONCE, and COND. */
OP_REVERSE, /* 125 Move pointer back - used in lookbehind assertions */
OP_ASSERT, /* 126 Positive lookahead */
OP_ASSERT_NOT, /* 127 Negative lookahead */
OP_ASSERTBACK, /* 128 Positive lookbehind */
OP_ASSERTBACK_NOT, /* 129 Negative lookbehind */
OP_ASSERT_NA, /* 130 Positive non-atomic lookahead */
OP_ASSERTBACK_NA, /* 131 Positive non-atomic lookbehind */
/* ONCE, SCRIPT_RUN, BRA, BRAPOS, CBRA, CBRAPOS, and COND must come
immediately after the assertions, with ONCE first, as there's a test for >=
ONCE for a subpattern that isn't an assertion. The POS versions must
immediately follow the non-POS versions in each case. */
OP_ONCE, /* 130 Atomic group, contains captures */
OP_SCRIPT_RUN, /* 131 Non-capture, but check characters' scripts */
OP_BRA, /* 132 Start of non-capturing bracket */
OP_BRAPOS, /* 133 Ditto, with unlimited, possessive repeat */
OP_CBRA, /* 134 Start of capturing bracket */
OP_CBRAPOS, /* 135 Ditto, with unlimited, possessive repeat */
OP_COND, /* 136 Conditional group */
OP_ONCE, /* 132 Atomic group, contains captures */
OP_SCRIPT_RUN, /* 133 Non-capture, but check characters' scripts */
OP_BRA, /* 134 Start of non-capturing bracket */
OP_BRAPOS, /* 135 Ditto, with unlimited, possessive repeat */
OP_CBRA, /* 136 Start of capturing bracket */
OP_CBRAPOS, /* 137 Ditto, with unlimited, possessive repeat */
OP_COND, /* 138 Conditional group */
/* These five must follow the previous five, in the same order. There's a
check for >= SBRA to distinguish the two sets. */
OP_SBRA, /* 137 Start of non-capturing bracket, check empty */
OP_SBRAPOS, /* 138 Ditto, with unlimited, possessive repeat */
OP_SCBRA, /* 139 Start of capturing bracket, check empty */
OP_SCBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */
OP_SCOND, /* 141 Conditional group, check empty */
OP_SBRA, /* 139 Start of non-capturing bracket, check empty */
OP_SBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */
OP_SCBRA, /* 141 Start of capturing bracket, check empty */
OP_SCBRAPOS, /* 142 Ditto, with unlimited, possessive repeat */
OP_SCOND, /* 143 Conditional group, check empty */
/* The next two pairs must (respectively) be kept together. */
OP_CREF, /* 142 Used to hold a capture number as condition */
OP_DNCREF, /* 143 Used to point to duplicate names as a condition */
OP_RREF, /* 144 Used to hold a recursion number as condition */
OP_DNRREF, /* 145 Used to point to duplicate names as a condition */
OP_FALSE, /* 146 Always false (used by DEFINE and VERSION) */
OP_TRUE, /* 147 Always true (used by VERSION) */
OP_CREF, /* 144 Used to hold a capture number as condition */
OP_DNCREF, /* 145 Used to point to duplicate names as a condition */
OP_RREF, /* 146 Used to hold a recursion number as condition */
OP_DNRREF, /* 147 Used to point to duplicate names as a condition */
OP_FALSE, /* 148 Always false (used by DEFINE and VERSION) */
OP_TRUE, /* 149 Always true (used by VERSION) */
OP_BRAZERO, /* 148 These two must remain together and in this */
OP_BRAMINZERO, /* 149 order. */
OP_BRAPOSZERO, /* 150 */
OP_BRAZERO, /* 150 These two must remain together and in this */
OP_BRAMINZERO, /* 151 order. */
OP_BRAPOSZERO, /* 152 */
/* These are backtracking control verbs */
OP_MARK, /* 151 always has an argument */
OP_PRUNE, /* 152 */
OP_PRUNE_ARG, /* 153 same, but with argument */
OP_SKIP, /* 154 */
OP_SKIP_ARG, /* 155 same, but with argument */
OP_THEN, /* 156 */
OP_THEN_ARG, /* 157 same, but with argument */
OP_COMMIT, /* 158 */
OP_COMMIT_ARG, /* 159 same, but with argument */
OP_MARK, /* 153 always has an argument */
OP_PRUNE, /* 154 */
OP_PRUNE_ARG, /* 155 same, but with argument */
OP_SKIP, /* 156 */
OP_SKIP_ARG, /* 157 same, but with argument */
OP_THEN, /* 158 */
OP_THEN_ARG, /* 159 same, but with argument */
OP_COMMIT, /* 160 */
OP_COMMIT_ARG, /* 161 same, but with argument */
/* These are forced failure and success verbs. FAIL and ACCEPT do accept an
argument, but these cases can be compiled as, for example, (*MARK:X)(*FAIL)
without the need for a special opcode. */
OP_FAIL, /* 160 */
OP_ACCEPT, /* 161 */
OP_ASSERT_ACCEPT, /* 162 Used inside assertions */
OP_CLOSE, /* 163 Used before OP_ACCEPT to close open captures */
OP_FAIL, /* 162 */
OP_ACCEPT, /* 163 */
OP_ASSERT_ACCEPT, /* 164 Used inside assertions */
OP_CLOSE, /* 165 Used before OP_ACCEPT to close open captures */
/* This is used to skip a subpattern with a {0} quantifier */
OP_SKIPZERO, /* 164 */
OP_SKIPZERO, /* 166 */
/* This is used to identify a DEFINE group during compilation so that it can
be checked for having only one branch. It is changed to OP_FALSE before
compilation finishes. */
OP_DEFINE, /* 165 */
OP_DEFINE, /* 167 */
/* This is not an opcode, but is used to check that tables indexed by opcode
are the correct length, in order to catch updating errors - there have been
@ -1587,7 +1596,7 @@ enum {
/* *** NOTE NOTE NOTE *** Whenever the list above is updated, the two macro
definitions that follow must also be updated to match. There are also tables
called "opcode_possessify" in pcre2_compile.c and "coptable" and "poptable" in
pcre2_dfa_exec.c that must be updated. */
pcre2_dfa_match.c that must be updated. */
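The note above is the reason this commit touches several parallel tables at once. One way to catch a forgotten table entry at compile time is sketched below with a miniature opcode list. The `MINI_` names and both helpers are hypothetical, and C11 `static_assert` is an assumption of this sketch, not necessarily the mechanism PCRE2 itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Miniature opcode list in the same relative order as the real one. */
enum {
  MINI_REVERSE,
  MINI_ASSERT,
  MINI_ASSERT_NOT,
  MINI_ASSERTBACK,
  MINI_ASSERTBACK_NOT,
  MINI_ASSERT_NA,       /* new in this commit */
  MINI_ASSERTBACK_NA,   /* new in this commit */
  MINI_ONCE,
  MINI_TABLE_LENGTH     /* not an opcode: number of entries above */
};

/* A table indexed by opcode, like the name and length tables in the diff. */
static const char *const mini_names[] = {
  "Reverse", "Assert", "Assert not",
  "Assert back", "Assert back not",
  "Non-atomic assert", "Non-atomic assert back",
  "Once"
};

/* Compilation fails if an opcode is added without a matching table entry. */
static_assert(sizeof(mini_names)/sizeof(mini_names[0]) == MINI_TABLE_LENGTH,
  "mini_names must have one entry per opcode");

static const char *mini_name(int op)
{
assert(op >= 0 && op < MINI_TABLE_LENGTH);
return mini_names[op];
}

/* Assertions sit between REVERSE and ONCE, so a range test identifies them. */
static int mini_is_assertion(int op)
{
return op >= MINI_ASSERT && op < MINI_ONCE;
}
```

The range test in `mini_is_assertion` only stays correct if new assertion opcodes are inserted before ONCE, which is where OP_ASSERT_NA and OP_ASSERTBACK_NA are placed.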
/* This macro defines textual names for all the opcodes. These are used only
@ -1620,7 +1629,9 @@ some cases doesn't actually use these names at all). */
"class", "nclass", "xclass", "Ref", "Refi", "DnRef", "DnRefi", \
"Recurse", "Callout", "CalloutStr", \
"Alt", "Ket", "KetRmax", "KetRmin", "KetRpos", \
"Reverse", "Assert", "Assert not", "AssertB", "AssertB not", \
"Reverse", "Assert", "Assert not", \
"Assert back", "Assert back not", \
"Non-atomic assert", "Non-atomic assert back", \
"Once", \
"Script run", \
"Bra", "BraPos", "CBra", "CBraPos", \
@ -1705,6 +1716,8 @@ in UTF-8 mode. The code that uses this table must know about such things. */
1+LINK_SIZE, /* Assert not */ \
1+LINK_SIZE, /* Assert behind */ \
1+LINK_SIZE, /* Assert behind not */ \
1+LINK_SIZE, /* NA Assert */ \
1+LINK_SIZE, /* NA Assert behind */ \
1+LINK_SIZE, /* ONCE */ \
1+LINK_SIZE, /* SCRIPT_RUN */ \
1+LINK_SIZE, /* BRA */ \


@ -5127,6 +5127,8 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_ASSERT:
case OP_ASSERTBACK:
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
Lframe_type = GF_NOCAPTURE | Fop;
for (;;)
{
@ -5497,10 +5499,20 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_SCOND:
break;
/* Positive assertions are like OP_ONCE, except that in addition the
/* Non-atomic positive assertions are like OP_BRA, except that the
subject pointer must be put back to where it was at the start of the
assertion. */
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
Feptr = P->eptr;
break;
/* Atomic positive assertions are like OP_ONCE, except that in addition
the subject pointer must be put back to where it was at the start of the
assertion. */
case OP_ASSERT:
case OP_ASSERTBACK:
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;


@ -392,6 +392,8 @@ for(;;)
case OP_ASSERT_NOT:
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
case OP_ONCE:
case OP_SCRIPT_RUN:
case OP_COND:


@ -240,6 +240,8 @@ for (;;)
case OP_ASSERT_NOT:
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
do cc += GET(cc, 1); while (*cc == OP_ALT);
/* Fall through */
@ -1089,6 +1091,7 @@ do
case OP_ONCE:
case OP_SCRIPT_RUN:
case OP_ASSERT:
case OP_ASSERT_NA:
rc = set_start_bits(re, tcode, utf);
if (rc == SSB_FAIL || rc == SSB_UNKNOWN) return rc;
if (rc == SSB_DONE) try_next = FALSE; else
@ -1131,6 +1134,7 @@ do
case OP_ASSERT_NOT:
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
case OP_ASSERTBACK_NA:
do tcode += GET(tcode, 1); while (*tcode == OP_ALT);
tcode += 1 + LINK_SIZE;
break;

29
testdata/testinput2

@ -5653,4 +5653,33 @@ a)"xI
# Multiplication overflow
/(X{65535})(?<=\1{32770})/
# ---- Non-atomic assertion tests ----
# Expect error: not allowed as a condition
/(?(*napla:xx)bc)/
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
/(*plb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
abcda\=offset=4
/(*naplb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
abcda\=offset=4
/(*non_atomic_positive_lookahead:ab)/B
/(*non_atomic_positive_lookbehind:ab)/B
/(*pla:ab+)/B
/(*napla:ab+)/B
# ----
# End of testinput2

89
testdata/testoutput2

@ -11117,7 +11117,7 @@ Matched, but too many substrings
------------------------------------------------------------------
Bra
Brazero
AssertB
Assert back
Reverse
CBra 1
abc
@ -13346,7 +13346,7 @@ Failed: error 144 at offset 5: subpattern name must start with a non-digit
Ket
red
\b
AssertB
Assert back
Reverse
\w
Ket
@ -13403,7 +13403,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
Once
\s*+
Ket
AssertB
Assert back
Reverse
\w
Ket
@ -16619,7 +16619,7 @@ No match
/(?<=(?=.){4,5}x)/B
------------------------------------------------------------------
Bra
AssertB
Assert back
Reverse
Assert
Any
@ -17086,6 +17086,87 @@ Failed: error 187 at offset 15: lookbehind assertion is too long
/(X{65535})(?<=\1{32770})/
Failed: error 187 at offset 10: lookbehind assertion is too long
# ---- Non-atomic assertion tests ----
# Expect error: not allowed as a condition
/(?(*napla:xx)bc)/
Failed: error 198 at offset 9: atomic assertion expected after (?( or (?(?C)
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
No match
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
0: word1 word3 word1 word2 word3 word2 word2 word1 word3
1: word3
/(*plb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
0: b
1: b
2: <unset>
3: b
abcda\=offset=4
No match
/(*naplb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
0: b
1: b
2: <unset>
3: b
abcda\=offset=4
0: a
1: <unset>
2: a
3: a
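The difference between the (*plb: and (*naplb: results above comes down to whether the second alternative of the lookbehind is still available after the rest of the pattern fails. The following hand-written simulation covers just this one pattern at a single start position. It is an illustrative sketch and not the PCRE2 matching engine: `match_lookbehind` is a hypothetical function, and it ignores details such as dot not matching a newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simulates /(*plb:(.)..|(.)...)(\1|\2)/ when atomic is non-zero and
/(*naplb:(.)..|(.)...)(\1|\2)/ when atomic is zero, starting at s[pos].
Returns the character matched by (\1|\2), or 0 for "no match". */
static char match_lookbehind(const char *s, size_t pos, int atomic)
{
/* Fixed lengths of the two lookbehind branches: "(.).." and "(.)...". */
static const size_t branch_len[2] = { 3, 4 };
size_t len = strlen(s);
int branch;

assert(pos <= len);
if (pos >= len) return 0;   /* nothing left for (\1|\2) to consume */

for (branch = 0; branch < 2; branch++)
  {
  char captured;
  if (pos < branch_len[branch]) continue;   /* cannot move back that far */

  /* The capturing (.) is the first character after moving back. */
  captured = s[pos - branch_len[branch]];

  /* Only the group set by this branch is defined, so (\1|\2) reduces to
  comparing the next subject character with that capture. */
  if (s[pos] == captured) return captured;

  /* The continuation failed. An atomic assertion has already discarded its
  remaining alternatives, so the attempt ends here. A non-atomic assertion
  can be backtracked into, so the loop goes on to the next branch. */
  if (atomic) return 0;
  }
return 0;
}
```

With subject "abcda" and offset 4, the first branch captures "b", the back reference fails against "a", and only the non-atomic form goes on to try the four-character branch, which captures "a" and succeeds. This agrees with the expected output shown above.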
/(*non_atomic_positive_lookahead:ab)/B
------------------------------------------------------------------
Bra
Non-atomic assert
ab
Ket
Ket
End
------------------------------------------------------------------
/(*non_atomic_positive_lookbehind:ab)/B
------------------------------------------------------------------
Bra
Non-atomic assert back
Reverse
ab
Ket
Ket
End
------------------------------------------------------------------
/(*pla:ab+)/B
------------------------------------------------------------------
Bra
Assert
a
b++
Ket
Ket
End
------------------------------------------------------------------
/(*napla:ab+)/B
------------------------------------------------------------------
Bra
Non-atomic assert
a
b+
Ket
Ket
End
------------------------------------------------------------------
# ----
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data


@ -4017,7 +4017,7 @@ MK: a\x{12345}b\x{09}(d)c
------------------------------------------------------------------
Bra
\b
AssertB
Assert back
Reverse
prop Xwd
Ket
@ -4196,7 +4196,7 @@ Failed: error 125 at offset 2: lookbehind assertion is not fixed length
------------------------------------------------------------------
Bra
^
AssertB not
Assert back not
Assert
\x{10385c}
Ket
@ -4828,7 +4828,7 @@ MK: ABC
/(?<!)(*sr:)/B
------------------------------------------------------------------
Bra
AssertB not
Assert back not
Ket
Script run
Ket
@ -4839,7 +4839,7 @@ MK: ABC
/(?<=abc(?=X(*sr:BXY)CCC)XBXYCCC)./B
------------------------------------------------------------------
Bra
AssertB
Assert back
Reverse
abc
Assert