Implement non-atomic positive assertions.

Philip.Hazel 2019-07-13 11:12:03 +00:00
parent 691aca7a86
commit 620f3a1307
21 changed files with 1134 additions and 683 deletions


@ -88,6 +88,8 @@ otherwise), an atomic group, or a recursion.
17. Check for integer overflow when computing lookbehind lengths. Fixes 17. Check for integer overflow when computing lookbehind lengths. Fixes
Clusterfuzz issue 15636. Clusterfuzz issue 15636.
18. Implement non-atomic positive lookaround assertions.
Version 10.33 16-April-2019 Version 10.33 16-April-2019
--------------------------- ---------------------------

HACKING

@ -195,6 +195,7 @@ META_END End of pattern (this value is 0x80000000)
META_FAIL (*FAIL) META_FAIL (*FAIL)
META_KET ) closing parenthesis META_KET ) closing parenthesis
META_LOOKAHEAD (?= start of lookahead META_LOOKAHEAD (?= start of lookahead
META_LOOKAHEAD_NA (*napla: start of non-atomic lookahead
META_LOOKAHEADNOT (?! start of negative lookahead META_LOOKAHEADNOT (?! start of negative lookahead
META_NOCAPTURE (?: no capture parens META_NOCAPTURE (?: no capture parens
META_PLUS + META_PLUS +
@ -286,8 +287,9 @@ The following are also followed just by an offset, but also the lower 16 bits
of the main word contain the length of the first branch of the lookbehind of the main word contain the length of the first branch of the lookbehind
group; this is used when generating OP_REVERSE for that branch. group; this is used when generating OP_REVERSE for that branch.
META_LOOKBEHIND (?<= META_LOOKBEHIND (?<= start of lookbehind
META_LOOKBEHINDNOT (?<! META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
META_LOOKBEHINDNOT (?<! start of negative lookbehind
The following are followed by two elements, the minimum and maximum. Repeat The following are followed by two elements, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
@ -715,13 +717,15 @@ Assertions
---------- ----------
Forward assertions are also just like other subpatterns, but starting with one Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK,
is OP_REVERSE, followed by a count of the number of characters to move back the OP_ASSERTBACK_NA, and OP_ASSERTBACK_NOT, and the first opcode inside the
pointer in the subject string. In ASCII or UTF-32 mode, the count is also the assertion is OP_REVERSE, followed by a count of the number of characters to
number of code units, but in UTF-8/16 mode each character may occupy more than move back the pointer in the subject string. In ASCII or UTF-32 mode, the count
one code unit. A separate count is present in each alternative of a lookbehind is also the number of code units, but in UTF-8/16 mode each character may
assertion, allowing them to have different (but fixed) lengths. occupy more than one code unit. A separate count is present in each alternative
of a lookbehind assertion, allowing each branch to have a different (but fixed)
length.
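
The distinction between a character count and a code-unit count can be sketched outside PCRE2. The Python snippet below is only an illustration, not PCRE2 code: the function name is hypothetical, and it assumes that Python's UTF-8, UTF-16 and UTF-32 codecs yield the same code-unit lengths as the corresponding PCRE2 library widths.

```python
def lookbehind_distance(text):
    """Return the length of `text` in characters and in code units.

    Hypothetical helper: it shows why a count of characters is not
    always the number of code units to step back over.
    """
    return {
        "characters": len(text),
        "utf8_units": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf32_units": len(text.encode("utf-32-le")) // 4,
    }


# Pure ASCII: every count is the same.
print(lookbehind_distance("abc"))

# U+00E9, U+20AC, U+1F600: three characters, but more code units
# in UTF-8 and UTF-16.
print(lookbehind_distance("\u00e9\u20ac\U0001F600"))
```

For the second string the character and UTF-32 counts agree, while the UTF-8 and UTF-16 counts are larger. This matches the passage above, which says the count is in characters and equals the number of code units only in ASCII or UTF-32 mode.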
Conditional subpatterns Conditional subpatterns
@ -754,11 +758,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
or OP_FALSE. or OP_FALSE.
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
must start with a parenthesized assertion, whose opcode normally immediately must start with a parenthesized atomic assertion, whose opcode normally
follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a immediately follows OP_COND or OP_SCOND. However, if automatic callouts are
callout is inserted immediately before the assertion. It is also possible to enabled, a callout is inserted immediately before the assertion. It is also
insert a manual callout at this point. Only assertion conditions may have possible to insert a manual callout at this point. Only assertion conditions
callouts preceding the condition. may have callouts preceding the condition.
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
parts of the pattern, so this is another opcode that may appear as a condition. parts of the pattern, so this is another opcode that may appear as a condition.
@ -823,4 +827,4 @@ not a real opcode, but is used to check at compile time that tables indexed by
opcode are the correct length, in order to catch updating errors. opcode are the correct length, in order to catch updating errors.
Philip Hazel Philip Hazel
20 July 2018 12 July 2019


@ -205,6 +205,11 @@ different way and is not Perl-compatible.
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at (l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within the start of a pattern that set overall options that cannot be changed within
the pattern. the pattern.
<br>
<br>
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
extension to the lookaround facilities. The default, Perl-compatible
lookarounds are atomic.
</P> </P>
<P> <P>
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa 18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
@ -234,7 +239,7 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 12 February 2019 Last updated: 13 July 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -33,17 +33,18 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> <li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a> <li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a> <li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a> <li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL GROUPS</a> <li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
<li><a name="TOC23" href="#SEC23">COMMENTS</a> <li><a name="TOC23" href="#SEC23">CONDITIONAL GROUPS</a>
<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a> <li><a name="TOC24" href="#SEC24">COMMENTS</a>
<li><a name="TOC25" href="#SEC25">GROUPS AS SUBROUTINES</a> <li><a name="TOC25" href="#SEC25">RECURSIVE PATTERNS</a>
<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a> <li><a name="TOC26" href="#SEC26">GROUPS AS SUBROUTINES</a>
<li><a name="TOC27" href="#SEC27">CALLOUTS</a> <li><a name="TOC27" href="#SEC27">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a> <li><a name="TOC28" href="#SEC28">CALLOUTS</a>
<li><a name="TOC29" href="#SEC29">SEE ALSO</a> <li><a name="TOC29" href="#SEC29">BACKTRACKING CONTROL</a>
<li><a name="TOC30" href="#SEC30">AUTHOR</a> <li><a name="TOC30" href="#SEC30">SEE ALSO</a>
<li><a name="TOC31" href="#SEC31">REVISION</a> <li><a name="TOC31" href="#SEC31">AUTHOR</a>
<li><a name="TOC32" href="#SEC32">REVISION</a>
</ul> </ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br> <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
<P> <P>
@ -2364,19 +2365,23 @@ those that look behind it, and in each case an assertion may be positive (must
match for the assertion to be true) or negative (must not match for the match for the assertion to be true) or negative (must not match for the
assertion to be true). An assertion group is matched in the normal way, assertion to be true). An assertion group is matched in the normal way,
and if it is true, matching continues after it, but with the matching position and if it is true, matching continues after it, but with the matching position
in the subject string is was it was before the assertion was processed. in the subject string reset to what it was before the assertion was processed.
</P> </P>
<P> <P>
A lookaround assertion may also appear as the condition in a The Perl-compatible lookaround assertions are atomic. If an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic assertions can be
useful. PCRE2 has some support for these, described in the section entitled
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
below, but they are not Perl-compatible.
</P>
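
The effect of atomicity can be seen with a small pattern. The sketch below uses Python's `re` module rather than PCRE2, on the assumption that its lookahead is atomic in the same way as the Perl-compatible assertions described above. The pattern and the helper name are illustrative only.

```python
import re

# The lookahead captures a run of digits, \w+ then consumes characters,
# and \1 requires the captured digits to appear again.
PATTERN = re.compile(r"(?=(\d+))\w+\1")


def match_at_start(subject):
    """Return the text matched at the start of `subject`, or None."""
    m = PATTERN.match(subject)
    return m.group(0) if m else None


# The lookahead captures "123". No later "123" exists, and the engine
# does not go back into the assertion to try "12", so there is no match.
print(match_at_start("123x12"))    # None

# Here the first capture, "123", does occur again, so the match succeeds.
print(match_at_start("123x123"))   # 123x123
```

If the assertion could be re-entered, the first subject would match with group 1 set to "12".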
<P>
A lookaround assertion may appear as the condition in a
<a href="#conditions">conditional group</a> <a href="#conditions">conditional group</a>
(see below). In this case, the result of matching the assertion determines (see below). In this case, the result of matching the assertion determines
which branch of the condition is followed. which branch of the condition is followed.
</P> </P>
<P> <P>
Lookaround assertions are atomic. If an assertion is true, but there is a
subsequent matching failure, there is no backtracking into the assertion.
</P>
<P>
Assertion groups are not capture groups. If an assertion contains capture Assertion groups are not capture groups. If an assertion contains capture
groups within it, these are counted for the purposes of numbering the capture groups within it, these are counted for the purposes of numbering the capture
groups in the whole pattern. Within each branch of an assertion, locally groups in the whole pattern. Within each branch of an assertion, locally
@ -2429,11 +2434,11 @@ The assertion is obeyed just once when encountered during matching.
Alphabetic assertion names Alphabetic assertion names
</b><br> </b><br>
<P> <P>
Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to specify Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to
lookaround assertions. Perl 5.28 introduced some experimental alphabetic specify lookaround assertions. Perl 5.28 introduced some experimental
alternatives which might be easier to remember. They all start with (* instead alphabetic alternatives which might be easier to remember. They all start with
of (? and must be written using lower case letters. PCRE2 supports the (* instead of (? and must be written using lower case letters. PCRE2 supports
following synonyms: the following synonyms:
<pre> <pre>
(*positive_lookahead: or (*pla: is the same as (?= (*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?! (*negative_lookahead: or (*nla: is the same as (?!
@ -2606,8 +2611,63 @@ preceded by "foo", while
</pre> </pre>
is another pattern that matches "foo" preceded by three digits and any three is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999". characters that are not "999".
<a name="nonatomicassertions"></a></P>
<br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
<P>
The traditional Perl-compatible lookaround assertions are atomic. That is, if
an assertion is true, but there is a subsequent matching failure, there is no
backtracking into the assertion. However, there are some cases where non-atomic
positive assertions can be useful. PCRE2 provides these using the following
syntax:
<pre>
(*non_atomic_positive_lookahead: or (*napla:
(*non_atomic_positive_lookbehind: or (*naplb:
</pre>
Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total.
This pattern returns the required result as captured substring 1:
<pre>
^(?x)(*napla: .* \b(\w++)) (?&#62; .*? \b\1\b ){2}
</pre>
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
"x" option, which causes white space (introduced for readability) to be
ignored. Inside the assertion, the greedy .* at first consumes the entire
string, but then has to backtrack until the rest of the assertion can match a
word, which is captured by group 1. In other words, when the assertion first
succeeds, it captures the right-most word in the string.
</P> </P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br> <P>
The current matching point is then reset to the start of the subject, and the
rest of the pattern match checks for two occurrences of the captured word,
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
assertion could not be re-entered, and the whole match would fail. The pattern
would succeed only if the very last word in the subject was found twice.
</P>
<P>
Using a non-atomic lookahead, however, means that when the last word does not
occur twice in the string, the lookahead can backtrack and find the second-last
word, and so on, until either the match succeeds, or all words have been
tested.
</P>
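
The behaviour described above can be approximated without PCRE2. The Python sketch below makes two assumptions: that the atomic lookahead in Python's `re` module behaves like (?= here, and that `re.VERBOSE` plays the role of the (?x) option. The function names are hypothetical. The second function does not use a non-atomic assertion at all. It imitates one by trying candidate words from right to left, which is the order in which backtracking into the lookahead would produce them.

```python
import re

# Atomic version of the pattern, using Python's lookahead.
ATOMIC = re.compile(r"^(?= .* \b(\w+) ) (?: .*? \b\1\b ){2}", re.VERBOSE)


def atomic_attempt(subject):
    """Return group 1 if the atomic pattern matches, otherwise None."""
    m = ATOMIC.match(subject)
    return m.group(1) if m else None


def rightmost_repeated_word(subject):
    """Imitate the non-atomic lookahead with an explicit loop.

    Candidates are tried from the last word backwards; the first one
    that occurs at least twice is returned.
    """
    words = re.findall(r"\w+", subject)
    for candidate in reversed(words):
        if words.count(candidate) >= 2:
            return candidate
    return None


subject = "word1 word2 word3 word2 word3 word4"
print(atomic_attempt(subject))            # None: only "word4" is ever tried
print(rightmost_repeated_word(subject))   # word3
print(atomic_attempt("a b a"))            # a: the last word occurs twice
```

The explicit loop is only a stand-in for what the matching engine does when it re-enters the assertion. It shows the expected result, not how PCRE2 implements it.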
<P>
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack into the
assertion, and there must be a backreference to a changed group later in the
pattern. If this is not the case, the rest of the pattern match fails exactly
as before because nothing has changed, so using a non-atomic assertion just
wastes resources.
</P>
<P>
Non-atomic assertions are not supported by the alternative matching function
<b>pcre2_dfa_match()</b>. They are also not supported by JIT (but may be in
future). Note that assertions that appear as conditions for
<a href="#conditions">conditional groups</a>
(see below) must be atomic.
</P>
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
<P> <P>
In concept, a script run is a sequence of characters that are all from the same In concept, a script run is a sequence of characters that are all from the same
Unicode script such as Latin or Greek. However, because some scripts are Unicode script such as Latin or Greek. However, because some scripts are
@ -2669,7 +2729,7 @@ parentheses.
should not be used within a script run group, because it causes an immediate should not be used within a script run group, because it causes an immediate
exit from the group, bypassing the script run checking. exit from the group, bypassing the script run checking.
<a name="conditions"></a></P> <a name="conditions"></a></P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL GROUPS</a><br> <br><a name="SEC23" href="#TOC1">CONDITIONAL GROUPS</a><br>
<P> <P>
It is possible to cause the matching process to obey a pattern fragment It is possible to cause the matching process to obey a pattern fragment
conditionally or to choose between two alternative fragments, depending on conditionally or to choose between two alternative fragments, depending on
@ -2845,8 +2905,13 @@ Assertion conditions
<P> <P>
If the condition is not in any of the above formats, it must be a parenthesized If the condition is not in any of the above formats, it must be a parenthesized
assertion. This may be a positive or negative lookahead or lookbehind assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant white space, assertion. However, it must be a traditional atomic assertion, not one of the
and with the two alternatives on the second line: PCRE2-specific
<a href="#nonatomicassertions">non-atomic assertions.</a>
</P>
<P>
Consider this pattern, again containing non-significant white space, and with
the two alternatives on the second line:
<pre> <pre>
(?(?=[^a-z]*[a-z]) (?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
@ -2865,7 +2930,7 @@ positive and negative assertions, because matching always continues after the
assertion, whether it succeeds or fails. (Compare non-conditional assertions, assertion, whether it succeeds or fails. (Compare non-conditional assertions,
for which captures are retained only for positive assertions that succeed.) for which captures are retained only for positive assertions that succeed.)
<a name="comments"></a></P> <a name="comments"></a></P>
<br><a name="SEC23" href="#TOC1">COMMENTS</a><br> <br><a name="SEC24" href="#TOC1">COMMENTS</a><br>
<P> <P>
There are two ways of including comments in patterns that are processed by There are two ways of including comments in patterns that are processed by
PCRE2. In both cases, the start of the comment must not be in a character PCRE2. In both cases, the start of the comment must not be in a character
@ -2895,7 +2960,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so. 0x0a (the default newline) does so.
<a name="recursion"></a></P> <a name="recursion"></a></P>
<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br> <br><a name="SEC25" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P> <P>
Consider the problem of matching a string in parentheses, allowing for Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can unlimited nested parentheses. Without the use of recursion, the best that can
@ -3083,7 +3148,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in "b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works. later versions (I tried 5.024) it now works.
<a name="groupsassubroutines"></a></P> <a name="groupsassubroutines"></a></P>
<br><a name="SEC25" href="#TOC1">GROUPS AS SUBROUTINES</a><br> <br><a name="SEC26" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
<P> <P>
If the syntax for a recursive group call (either by number or by name) is used If the syntax for a recursive group call (either by number or by name) is used
outside the parentheses to which it refers, it operates a bit like a subroutine outside the parentheses to which it refers, it operates a bit like a subroutine
@ -3131,7 +3196,7 @@ in groups when called as subroutines is described in the section entitled
<a href="#btsub">"Backtracking verbs in subroutines"</a> <a href="#btsub">"Backtracking verbs in subroutines"</a>
below. below.
<a name="onigurumasubroutines"></a></P> <a name="onigurumasubroutines"></a></P>
<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br> <br><a name="SEC27" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<P> <P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative a number enclosed either in angle brackets or single quotes, is an alternative
@ -3149,7 +3214,7 @@ plus or a minus sign it is taken as a relative reference. For example:
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i> Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a subroutine call. synonymous. The former is a backreference; the latter is a subroutine call.
</P> </P>
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br> <br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
<P> <P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it code to be obeyed in the middle of matching a regular expression. This makes it
@ -3225,7 +3290,7 @@ example:
</pre> </pre>
The doubling is removed before the string is passed to the callout function. The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P> <a name="backtrackcontrol"></a></P>
<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br> <br><a name="SEC29" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P> <P>
There are a number of special "Backtracking Control Verbs" (to use Perl's There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They terminology) that modify the behaviour of backtracking during matching. They
@ -3739,12 +3804,12 @@ enclosing group that has alternatives (its normal behaviour). However, if there
is no such group within the subroutine's group, the subroutine match fails and is no such group within the subroutine's group, the subroutine match fails and
there is a backtrack at the outer level. there is a backtrack at the outer level.
</P> </P>
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC30" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3). <b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P> </P>
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br> <br><a name="SEC31" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
@ -3753,9 +3818,9 @@ University Computing Service
Cambridge, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC31" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 22 June 2019 Last updated: 13 July 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -32,15 +32,16 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a> <li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a> <li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a> <li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a> <li><a name="TOC20" href="#SEC20">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a> <li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a> <li><a name="TOC22" href="#SEC22">BACKREFERENCES</a>
<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a> <li><a name="TOC23" href="#SEC23">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a> <li><a name="TOC24" href="#SEC24">CONDITIONAL PATTERNS</a>
<li><a name="TOC25" href="#SEC25">CALLOUTS</a> <li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
<li><a name="TOC26" href="#SEC26">SEE ALSO</a> <li><a name="TOC26" href="#SEC26">CALLOUTS</a>
<li><a name="TOC27" href="#SEC27">AUTHOR</a> <li><a name="TOC27" href="#SEC27">SEE ALSO</a>
<li><a name="TOC28" href="#SEC28">REVISION</a> <li><a name="TOC28" href="#SEC28">AUTHOR</a>
<li><a name="TOC29" href="#SEC29">REVISION</a>
</ul> </ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br> <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P> <P>
@ -544,7 +545,18 @@ setting with a similar syntax.
</pre> </pre>
Each top-level branch of a lookbehind must be of a fixed length. Each top-level branch of a lookbehind must be of a fixed length.
</P> </P>
<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br> <br><a name="SEC20" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P>
These assertions are specific to PCRE2 and are not Perl-compatible.
<pre>
(*napla:...)
(*non_atomic_positive_lookahead:...)
(*naplb:...)
(*non_atomic_positive_lookbehind:...)
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
<P> <P>
<pre> <pre>
(*script_run:...) ) script run, can be backtracked into (*script_run:...) ) script run, can be backtracked into
@ -554,7 +566,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
(*asr:...) ) (*asr:...) )
</PRE> </PRE>
</P> </P>
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br> <br><a name="SEC22" href="#TOC1">BACKREFERENCES</a><br>
<P> <P>
<pre> <pre>
\n reference by number (can be ambiguous) \n reference by number (can be ambiguous)
@ -571,7 +583,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
(?P=name) reference by name (Python) (?P=name) reference by name (Python)
</PRE> </PRE>
</P> </P>
<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br> <br><a name="SEC23" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P> <P>
<pre> <pre>
(?R) recurse whole pattern (?R) recurse whole pattern
@ -590,7 +602,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
\g'-n' call subroutine by relative number (PCRE2 extension) \g'-n' call subroutine by relative number (PCRE2 extension)
</PRE> </PRE>
</P> </P>
<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br> <br><a name="SEC24" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P> <P>
<pre> <pre>
(?(condition)yes-pattern) (?(condition)yes-pattern)
@ -613,7 +625,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists. condition if the relevant named group exists.
</P> </P>
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br> <br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P> <P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@ -640,7 +652,7 @@ pattern is not anchored.
The effect of one of these verbs in a group called as a subroutine is confined The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call. to the subroutine call.
</P> </P>
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br> <br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
<P> <P>
<pre> <pre>
(?C) callout (assumed number 0) (?C) callout (assumed number 0)
@ -651,12 +663,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it. delimiter }. To encode the ending delimiter within the string, double it.
</P> </P>
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3). <b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P> </P>
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br> <br><a name="SEC28" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
@ -665,9 +677,9 @@ University Computing Service
Cambridge, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br> <br><a name="SEC29" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 11 February 2019 Last updated: 12 July 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -4887,6 +4887,10 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
at the start of a pattern that set overall options that cannot be at the start of a pattern that set overall options that cannot be
changed within the pattern. changed within the pattern.
(m) PCRE2 supports non-atomic positive lookaround assertions. This is
an extension to the lookaround facilities. The default, Perl-compatible
lookarounds are atomic.
18. The Perl /a modifier restricts /d numbers to pure ascii, and the 18. The Perl /a modifier restricts /d numbers to pure ascii, and the
/aa modifier restricts /i case-insensitive matching to pure ascii, ig- /aa modifier restricts /i case-insensitive matching to pure ascii, ig-
noring Unicode rules. This separation cannot be represented with noring Unicode rules. This separation cannot be represented with
@ -4909,7 +4913,7 @@ AUTHOR
REVISION REVISION
Last updated: 12 February 2019 Last updated: 13 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -8140,16 +8144,19 @@ ASSERTIONS
sertion may be positive (must match for the assertion to be true) or sertion may be positive (must match for the assertion to be true) or
negative (must not match for the assertion to be true). An assertion negative (must not match for the assertion to be true). An assertion
group is matched in the normal way, and if it is true, matching contin- group is matched in the normal way, and if it is true, matching contin-
ues after it, but with the matching position in the subject string is ues after it, but with the matching position in the subject string re-
was it was before the assertion was processed. set to what it was before the assertion was processed.
A lookaround assertion may also appear as the condition in a condi- The Perl-compatible lookaround assertions are atomic. If an assertion
tional group (see below). In this case, the result of matching the as- is true, but there is a subsequent matching failure, there is no back-
sertion determines which branch of the condition is followed. tracking into the assertion. However, there are some cases where non-
atomic assertions can be useful. PCRE2 has some support for these, de-
scribed in the section entitled "Non-atomic assertions" below, but they
are not Perl-compatible.
Lookaround assertions are atomic. If an assertion is true, but there is A lookaround assertion may appear as the condition in a conditional
a subsequent matching failure, there is no backtracking into the asser- group (see below). In this case, the result of matching the assertion
tion. determines which branch of the condition is followed.
Assertion groups are not capture groups. If an assertion contains cap- Assertion groups are not capture groups. If an assertion contains cap-
ture groups within it, these are counted for the purposes of numbering ture groups within it, these are counted for the purposes of numbering
@ -8362,6 +8369,60 @@ ASSERTIONS
three characters that are not "999". three characters that are not "999".
NON-ATOMIC ASSERTIONS
The traditional Perl-compatible lookaround assertions are atomic. That
is, if an assertion is true, but there is a subsequent matching fail-
ure, there is no backtracking into the assertion. However, there are
some cases where non-atomic positive assertions can be useful. PCRE2
provides these using the following syntax:
(*non_atomic_positive_lookahead: or (*napla:
(*non_atomic_positive_lookbehind: or (*naplb:
Consider the problem of finding the right-most word in a string that
also appears earlier in the string, that is, it must appear at least
twice in total. This pattern returns the required result as captured
substring 1:
^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
For a subject such as "word1 word2 word3 word2 word3 word4" the result
is "word3". How does it work? At the start, ^(?x) anchors the pattern
and sets the "x" option, which causes white space (introduced for read-
ability) to be ignored. Inside the assertion, the greedy .* at first
consumes the entire string, but then has to backtrack until the rest of
the assertion can match a word, which is captured by group 1. In other
words, when the assertion first succeeds, it captures the right-most
word in the string.
The current matching point is then reset to the start of the subject,
and the rest of the pattern match checks for two occurrences of the
captured word, using an ungreedy .*? to scan from the left. If this
succeeds, we are done, but if the last word in the string does not oc-
cur twice, this part of the pattern fails. If a traditional atomic
lookahead (?= or (*pla: had been used, the assertion could not be re-en-
tered, and the whole match would fail. The pattern would succeed only
if the very last word in the subject was found twice.
Using a non-atomic lookahead, however, means that when the last word
does not occur twice in the string, the lookahead can backtrack and
find the second-last word, and so on, until either the match succeeds,
or all words have been tested.
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack
into the assertion, and there must be a backreference to a changed
group later in the pattern. If this is not the case, the rest of the
pattern match fails exactly as before because nothing has changed, so
using a non-atomic assertion just wastes resources.
Non-atomic assertions are not supported by the alternative matching
function pcre2_dfa_match(). They are also not supported by JIT (but may
be in future). Note that assertions that appear as conditions for con-
ditional groups (see below) must be atomic.
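
The result that the right-most-word pattern above is meant to return can be
sketched without a regex engine. The C code below is only an illustration
of the intended outcome, not of how PCRE2 matches: it assumes a single-line
ASCII subject, treats letters, digits and underscore as word characters,
and the function names are hypothetical. Each step to the left in the outer
loop corresponds to one backtrack into the non-atomic lookahead, and the
count corresponds to the two backreference matches that follow it.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A word character in the sense of \w, for ASCII subjects only. */
static int is_word_char(unsigned char c)
{
return isalnum(c) || c == '_';
}

/* Count the whole-word occurrences of word[0..len) in subject. */
static size_t count_word(const char *subject, const char *word, size_t len)
{
size_t count = 0;
size_t i = 0;
size_t n = strlen(subject);

while (i < n)
  {
  size_t start;
  if (!is_word_char((unsigned char)subject[i])) { i++; continue; }
  start = i;
  while (i < n && is_word_char((unsigned char)subject[i])) i++;
  if (i - start == len && memcmp(subject + start, word, len) == 0) count++;
  }
return count;
}

/* Find the right-most word that occurs at least twice in subject. Returns a
pointer to its right-most occurrence and sets *len, or returns NULL if no
word is repeated. */
const char *rightmost_repeated_word(const char *subject, size_t *len)
{
size_t end = strlen(subject);

while (end > 0)
  {
  size_t start;

  /* Skip non-word characters to the left of the current position. */
  while (end > 0 && !is_word_char((unsigned char)subject[end - 1])) end--;
  if (end == 0) break;

  /* Find the start of the word that ends here. */
  start = end;
  while (start > 0 && is_word_char((unsigned char)subject[start - 1])) start--;

  /* This is the candidate that group 1 would hold; accept it only if it
  appears at least twice, otherwise try the next word to the left. */
  if (count_word(subject, subject + start, end - start) >= 2)
    {
    *len = end - start;
    return subject + start;
    }
  end = start;
  }
return NULL;
}
```

For the subject "word1 word2 word3 word2 word3 word4" this rejects "word4",
which occurs once, and then accepts "word3". An atomic lookahead corresponds
to stopping after the first candidate has been rejected.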
SCRIPT RUNS SCRIPT RUNS
In concept, a script run is a sequence of characters that are all from In concept, a script run is a sequence of characters that are all from
@ -8578,9 +8639,11 @@ CONDITIONAL GROUPS
If the condition is not in any of the above formats, it must be a If the condition is not in any of the above formats, it must be a
parenthesized assertion. This may be a positive or negative lookahead parenthesized assertion. This may be a positive or negative lookahead
or lookbehind assertion. Consider this pattern, again containing non- or lookbehind assertion. However, it must be a traditional atomic as-
significant white space, and with the two alternatives on the second sertion, not one of the PCRE2-specific non-atomic assertions.
line:
Consider this pattern, again containing non-significant white space,
and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z]) (?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
@ -9439,7 +9502,7 @@ AUTHOR
REVISION REVISION
Last updated: 22 June 2019 Last updated: 13 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -10663,6 +10726,17 @@ LOOKAHEAD AND LOOKBEHIND ASSERTIONS
Each top-level branch of a lookbehind must be of a fixed length. Each top-level branch of a lookbehind must be of a fixed length.
NON-ATOMIC LOOKAROUND ASSERTIONS
These assertions are specific to PCRE2 and are not Perl-compatible.
(*napla:...)
(*non_atomic_positive_lookahead:...)
(*naplb:...)
(*non_atomic_positive_lookbehind:...)
SCRIPT RUNS SCRIPT RUNS
(*script_run:...) ) script run, can be backtracked into (*script_run:...) ) script run, can be backtracked into
@ -10784,7 +10858,7 @@ AUTHOR
REVISION REVISION
Last updated: 11 February 2019 Last updated: 12 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

View File

@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "12 February 2019" "PCRE2 10.33" .TH PCRE2COMPAT 3 "13 July 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL" .SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -170,6 +170,10 @@ different way and is not Perl-compatible.
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at (l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within the start of a pattern that set overall options that cannot be changed within
the pattern. the pattern.
.sp
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
extension to the lookaround facilities. The default, Perl-compatible
lookarounds are atomic.
.P .P
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa 18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
modifier restricts /i case-insensitive matching to pure ascii, ignoring Unicode modifier restricts /i case-insensitive matching to pure ascii, ignoring Unicode
@ -199,6 +203,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 12 February 2019 Last updated: 13 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

View File

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "22 June 2019" "PCRE2 10.34" .TH PCRE2PATTERN 3 "13 July 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2370,9 +2370,19 @@ those that look behind it, and in each case an assertion may be positive (must
match for the assertion to be true) or negative (must not match for the match for the assertion to be true) or negative (must not match for the
assertion to be true). An assertion group is matched in the normal way, assertion to be true). An assertion group is matched in the normal way,
and if it is true, matching continues after it, but with the matching position and if it is true, matching continues after it, but with the matching position
in the subject string is was it was before the assertion was processed. in the subject string reset to what it was before the assertion was processed.
.P .P
A lookaround assertion may also appear as the condition in a The Perl-compatible lookaround assertions are atomic. If an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic assertions can be
useful. PCRE2 has some support for these, described in the section entitled
.\" HTML <a href="#nonatomicassertions">
.\" </a>
"Non-atomic assertions"
.\"
below, but they are not Perl-compatible.
.P
A lookaround assertion may appear as the condition in a
.\" HTML <a href="#conditions"> .\" HTML <a href="#conditions">
.\" </a> .\" </a>
conditional group conditional group
@ -2380,9 +2390,6 @@ conditional group
(see below). In this case, the result of matching the assertion determines (see below). In this case, the result of matching the assertion determines
which branch of the condition is followed. which branch of the condition is followed.
.P .P
Lookaround assertions are atomic. If an assertion is true, but there is a
subsequent matching failure, there is no backtracking into the assertion.
.P
Assertion groups are not capture groups. If an assertion contains capture Assertion groups are not capture groups. If an assertion contains capture
groups within it, these are counted for the purposes of numbering the capture groups within it, these are counted for the purposes of numbering the capture
groups in the whole pattern. Within each branch of an assertion, locally groups in the whole pattern. Within each branch of an assertion, locally
@ -2435,11 +2442,11 @@ The assertion is obeyed just once when encountered during matching.
.SS "Alphabetic assertion names" .SS "Alphabetic assertion names"
.rs .rs
.sp .sp
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify Traditionally, symbolic sequences such as (?= and (?<= have been used to
lookaround assertions. Perl 5.28 introduced some experimental alphabetic specify lookaround assertions. Perl 5.28 introduced some experimental
alternatives which might be easier to remember. They all start with (* instead alphabetic alternatives which might be easier to remember. They all start with
of (? and must be written using lower case letters. PCRE2 supports the (* instead of (? and must be written using lower case letters. PCRE2 supports
following synonyms: the following synonyms:
.sp .sp
(*positive_lookahead: or (*pla: is the same as (?= (*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?! (*negative_lookahead: or (*nla: is the same as (?!
@ -2616,6 +2623,63 @@ is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999". characters that are not "999".
. .
. .
.\" HTML <a name="nonatomicassertions"></a>
.SH "NON-ATOMIC ASSERTIONS"
.rs
.sp
The traditional Perl-compatible lookaround assertions are atomic. That is, if
an assertion is true, but there is a subsequent matching failure, there is no
backtracking into the assertion. However, there are some cases where non-atomic
positive assertions can be useful. PCRE2 provides these using the following
syntax:
.sp
(*non_atomic_positive_lookahead: or (*napla:
(*non_atomic_positive_lookbehind: or (*naplb:
.sp
Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total.
This pattern returns the required result as captured substring 1:
.sp
^(?x)(*napla: .* \eb(\ew++)) (?> .*? \eb\e1\eb ){2}
.sp
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
"x" option, which causes white space (introduced for readability) to be
ignored. Inside the assertion, the greedy .* at first consumes the entire
string, but then has to backtrack until the rest of the assertion can match a
word, which is captured by group 1. In other words, when the assertion first
succeeds, it captures the right-most word in the string.
.P
The current matching point is then reset to the start of the subject, and the
rest of the pattern match checks for two occurrences of the captured word,
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
assertion could not be re-entered, and the whole match would fail. The pattern
would succeed only if the very last word in the subject was found twice.
.P
Using a non-atomic lookahead, however, means that when the last word does not
occur twice in the string, the lookahead can backtrack and find the second-last
word, and so on, until either the match succeeds, or all words have been
tested.
.P
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack into the
assertion, and there must be a backreference to a changed group later in the
pattern. If this is not the case, the rest of the pattern match fails exactly
as before because nothing has changed, so using a non-atomic assertion just
wastes resources.
.P
Non-atomic assertions are not supported by the alternative matching function
\fBpcre2_dfa_match()\fP. They are also not supported by JIT (but may be in
future). Note that assertions that appear as conditions for
.\" HTML <a href="#conditions">
.\" </a>
conditional groups
.\"
(see below) must be atomic.
.
.
.SH "SCRIPT RUNS" .SH "SCRIPT RUNS"
.rs .rs
.sp .sp
@ -2867,8 +2931,15 @@ than two digits.
.sp .sp
If the condition is not in any of the above formats, it must be a parenthesized If the condition is not in any of the above formats, it must be a parenthesized
assertion. This may be a positive or negative lookahead or lookbehind assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant white space, assertion. However, it must be a traditional atomic assertion, not one of the
and with the two alternatives on the second line: PCRE2-specific
.\" HTML <a href="#nonatomicassertions">
.\" </a>
non-atomic assertions.
.\"
.P
Consider this pattern, again containing non-significant white space, and with
the two alternatives on the second line:
.sp .sp
(?(?=[^a-z]*[a-z]) (?(?=[^a-z]*[a-z])
\ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} ) \ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
@ -3788,6 +3859,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 22 June 2019 Last updated: 13 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "11 February 2019" "PCRE2 10.33" .TH PCRE2SYNTAX 3 "12 July 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -522,6 +522,18 @@ setting with a similar syntax.
Each top-level branch of a lookbehind must be of a fixed length. Each top-level branch of a lookbehind must be of a fixed length.
. .
. .
.SH "NON-ATOMIC LOOKAROUND ASSERTIONS"
.rs
.sp
These assertions are specific to PCRE2 and are not Perl-compatible.
.sp
(*napla:...)
(*non_atomic_positive_lookahead:...)
.sp
(*naplb:...)
(*non_atomic_positive_lookbehind:...)
.
.
.SH "SCRIPT RUNS" .SH "SCRIPT RUNS"
.rs .rs
.sp .sp
@ -654,6 +666,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 11 February 2019 Last updated: 12 July 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

View File

@ -307,6 +307,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195 #define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196 #define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197 #define PCRE2_ERROR_TOO_MANY_CAPTURES 197
#define PCRE2_ERROR_CONDITION_ATOMIC_ASSERTION_EXPECTED 198
/* "Expected" matching error codes: no match and partial match. */ /* "Expected" matching error codes: no match and partial match. */

View File

@ -624,6 +624,13 @@ for(;;)
case OP_ASSERTBACK_NOT: case OP_ASSERTBACK_NOT:
case OP_ONCE: case OP_ONCE:
return !entered_a_group; return !entered_a_group;
/* Non-atomic assertions - don't possessify last iterator. This needs
more thought. */
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
return FALSE;
} }
/* Skip over the bracket and inspect what comes next. */ /* Skip over the bracket and inspect what comes next. */

View File

@ -250,36 +250,41 @@ is present where expected in a conditional group. */
#define META_LOOKBEHIND 0x80250000u /* (?<= */ #define META_LOOKBEHIND 0x80250000u /* (?<= */
#define META_LOOKBEHINDNOT 0x80260000u /* (?<! */ #define META_LOOKBEHINDNOT 0x80260000u /* (?<! */
/* These cannot be conditions */
#define META_LOOKAHEAD_NA 0x80270000u /* (*napla: */
#define META_LOOKBEHIND_NA 0x80280000u /* (*naplb: */
/* These must be kept in this order, with consecutive values, and the _ARG /* These must be kept in this order, with consecutive values, and the _ARG
versions of COMMIT, PRUNE, SKIP, and THEN immediately after their non-argument versions of COMMIT, PRUNE, SKIP, and THEN immediately after their non-argument
versions. */ versions. */
#define META_MARK 0x80270000u /* (*MARK) */ #define META_MARK 0x80290000u /* (*MARK) */
#define META_ACCEPT 0x80280000u /* (*ACCEPT) */ #define META_ACCEPT 0x802a0000u /* (*ACCEPT) */
#define META_FAIL 0x80290000u /* (*FAIL) */ #define META_FAIL 0x802b0000u /* (*FAIL) */
#define META_COMMIT 0x802a0000u /* These */ #define META_COMMIT 0x802c0000u /* These */
#define META_COMMIT_ARG 0x802b0000u /* pairs */ #define META_COMMIT_ARG 0x802d0000u /* pairs */
#define META_PRUNE 0x802c0000u /* must */ #define META_PRUNE 0x802e0000u /* must */
#define META_PRUNE_ARG 0x802d0000u /* be */ #define META_PRUNE_ARG 0x802f0000u /* be */
#define META_SKIP 0x802e0000u /* kept */ #define META_SKIP 0x80300000u /* kept */
#define META_SKIP_ARG 0x802f0000u /* in */ #define META_SKIP_ARG 0x80310000u /* in */
#define META_THEN 0x80300000u /* this */ #define META_THEN 0x80320000u /* this */
#define META_THEN_ARG 0x80310000u /* order */ #define META_THEN_ARG 0x80330000u /* order */
/* These must be kept in groups of adjacent 3 values, and all together. */ /* These must be kept in groups of adjacent 3 values, and all together. */
#define META_ASTERISK 0x80320000u /* * */ #define META_ASTERISK 0x80340000u /* * */
#define META_ASTERISK_PLUS 0x80330000u /* *+ */ #define META_ASTERISK_PLUS 0x80350000u /* *+ */
#define META_ASTERISK_QUERY 0x80340000u /* *? */ #define META_ASTERISK_QUERY 0x80360000u /* *? */
#define META_PLUS 0x80350000u /* + */ #define META_PLUS 0x80370000u /* + */
#define META_PLUS_PLUS 0x80360000u /* ++ */ #define META_PLUS_PLUS 0x80380000u /* ++ */
#define META_PLUS_QUERY 0x80370000u /* +? */ #define META_PLUS_QUERY 0x80390000u /* +? */
#define META_QUERY 0x80380000u /* ? */ #define META_QUERY 0x803a0000u /* ? */
#define META_QUERY_PLUS 0x80390000u /* ?+ */ #define META_QUERY_PLUS 0x803b0000u /* ?+ */
#define META_QUERY_QUERY 0x803a0000u /* ?? */ #define META_QUERY_QUERY 0x803c0000u /* ?? */
#define META_MINMAX 0x803b0000u /* {n,m} repeat */ #define META_MINMAX 0x803d0000u /* {n,m} repeat */
#define META_MINMAX_PLUS 0x803c0000u /* {n,m}+ repeat */ #define META_MINMAX_PLUS 0x803e0000u /* {n,m}+ repeat */
#define META_MINMAX_QUERY 0x803d0000u /* {n,m}? repeat */ #define META_MINMAX_QUERY 0x803f0000u /* {n,m}? repeat */
#define META_FIRST_QUANTIFIER META_ASTERISK #define META_FIRST_QUANTIFIER META_ASTERISK
#define META_LAST_QUANTIFIER META_MINMAX_QUERY #define META_LAST_QUANTIFIER META_MINMAX_QUERY
@ -335,6 +340,8 @@ static unsigned char meta_extra_lengths[] = {
0, /* META_LOOKAHEADNOT */ 0, /* META_LOOKAHEADNOT */
SIZEOFFSET, /* META_LOOKBEHIND */ SIZEOFFSET, /* META_LOOKBEHIND */
SIZEOFFSET, /* META_LOOKBEHINDNOT */ SIZEOFFSET, /* META_LOOKBEHINDNOT */
0, /* META_LOOKAHEAD_NA */
SIZEOFFSET, /* META_LOOKBEHIND_NA */
1, /* META_MARK - plus the string length */ 1, /* META_MARK - plus the string length */
0, /* META_ACCEPT */ 0, /* META_ACCEPT */
0, /* META_FAIL */ 0, /* META_FAIL */
@ -637,10 +644,14 @@ typedef struct alasitem {
static const char alasnames[] = static const char alasnames[] =
STRING_pla0 STRING_pla0
STRING_plb0 STRING_plb0
STRING_napla0
STRING_naplb0
STRING_nla0 STRING_nla0
STRING_nlb0 STRING_nlb0
STRING_positive_lookahead0 STRING_positive_lookahead0
STRING_positive_lookbehind0 STRING_positive_lookbehind0
STRING_non_atomic_positive_lookahead0
STRING_non_atomic_positive_lookbehind0
STRING_negative_lookahead0 STRING_negative_lookahead0
STRING_negative_lookbehind0 STRING_negative_lookbehind0
STRING_atomic0 STRING_atomic0
@ -652,10 +663,14 @@ static const char alasnames[] =
static const alasitem alasmeta[] = { static const alasitem alasmeta[] = {
{ 3, META_LOOKAHEAD }, { 3, META_LOOKAHEAD },
{ 3, META_LOOKBEHIND }, { 3, META_LOOKBEHIND },
{ 5, META_LOOKAHEAD_NA },
{ 5, META_LOOKBEHIND_NA },
{ 3, META_LOOKAHEADNOT }, { 3, META_LOOKAHEADNOT },
{ 3, META_LOOKBEHINDNOT }, { 3, META_LOOKBEHINDNOT },
{ 18, META_LOOKAHEAD }, { 18, META_LOOKAHEAD },
{ 19, META_LOOKBEHIND }, { 19, META_LOOKBEHIND },
{ 29, META_LOOKAHEAD_NA },
{ 30, META_LOOKBEHIND_NA },
{ 18, META_LOOKAHEADNOT }, { 18, META_LOOKAHEADNOT },
{ 19, META_LOOKBEHINDNOT }, { 19, META_LOOKBEHINDNOT },
{ 6, META_ATOMIC }, { 6, META_ATOMIC },
@ -784,7 +799,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90, ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97 }; ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97, ERR98 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such /* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -1015,6 +1030,7 @@ for (;;)
case META_NOCAPTURE: fprintf(stderr, "META (?:"); break; case META_NOCAPTURE: fprintf(stderr, "META (?:"); break;
case META_LOOKAHEAD: fprintf(stderr, "META (?="); break; case META_LOOKAHEAD: fprintf(stderr, "META (?="); break;
case META_LOOKAHEADNOT: fprintf(stderr, "META (?!"); break; case META_LOOKAHEADNOT: fprintf(stderr, "META (?!"); break;
case META_LOOKAHEAD_NA: fprintf(stderr, "META (*napla:"); break;
case META_SCRIPT_RUN: fprintf(stderr, "META (*sr:"); break; case META_SCRIPT_RUN: fprintf(stderr, "META (*sr:"); break;
case META_KET: fprintf(stderr, "META )"); break; case META_KET: fprintf(stderr, "META )"); break;
case META_ALT: fprintf(stderr, "META | %d", meta_arg); break; case META_ALT: fprintf(stderr, "META | %d", meta_arg); break;
@ -1046,6 +1062,12 @@ for (;;)
fprintf(stderr, "%zd", offset); fprintf(stderr, "%zd", offset);
break; break;
case META_LOOKBEHIND_NA:
fprintf(stderr, "META (*naplb: %d offset=", meta_arg);
GETOFFSET(offset, pptr);
fprintf(stderr, "%zd", offset);
break;
case META_LOOKBEHINDNOT: case META_LOOKBEHINDNOT:
fprintf(stderr, "META (?<! %d offset=", meta_arg); fprintf(stderr, "META (?<! %d offset=", meta_arg);
GETOFFSET(offset, pptr); GETOFFSET(offset, pptr);
@ -3695,19 +3717,20 @@ while (ptr < ptrend)
goto FAILED; goto FAILED;
} }
/* Check for expecting an assertion condition. If so, only lookaround /* Check for expecting an assertion condition. If so, only atomic
assertions are valid. */ lookaround assertions are valid. */
meta = alasmeta[i].meta; meta = alasmeta[i].meta;
if (prev_expect_cond_assert > 0 && if (prev_expect_cond_assert > 0 &&
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT)) (meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
{ {
errorcode = ERR28; /* Assertion expected */ errorcode = (meta == META_LOOKAHEAD_NA || meta == META_LOOKBEHIND_NA)?
ERR98 : ERR28; /* (Atomic) assertion expected */
goto FAILED; goto FAILED;
} }
/* The lookaround alphabetic synonyms can be almost entirely handled by /* The lookaround alphabetic synonyms can mostly be handled by jumping
jumping to the code that handles the traditional symbolic forms. */ to the code that handles the traditional symbolic forms. */
switch(meta) switch(meta)
{ {
@ -3721,11 +3744,17 @@ while (ptr < ptrend)
case META_LOOKAHEAD: case META_LOOKAHEAD:
goto POSITIVE_LOOK_AHEAD; goto POSITIVE_LOOK_AHEAD;
case META_LOOKAHEAD_NA:
*parsed_pattern++ = meta;
ptr++;
goto POST_ASSERTION;
case META_LOOKAHEADNOT: case META_LOOKAHEADNOT:
goto NEGATIVE_LOOK_AHEAD; goto NEGATIVE_LOOK_AHEAD;
case META_LOOKBEHIND: case META_LOOKBEHIND:
case META_LOOKBEHINDNOT: case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
*parsed_pattern++ = meta; *parsed_pattern++ = meta;
ptr--; ptr--;
goto POST_LOOKBEHIND; goto POST_LOOKBEHIND;
@ -4429,7 +4458,7 @@ while (ptr < ptrend)
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)? *parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
META_LOOKBEHIND : META_LOOKBEHINDNOT; META_LOOKBEHIND : META_LOOKBEHINDNOT;
POST_LOOKBEHIND: /* Come from (*plb: and (*nlb: */ POST_LOOKBEHIND: /* Come from (*plb: (*naplb: and (*nlb: */
*has_lookbehind = TRUE; *has_lookbehind = TRUE;
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2); offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
PUTOFFSET(offset, parsed_pattern); PUTOFFSET(offset, parsed_pattern);
@ -6300,6 +6329,11 @@ for (;; pptr++)
cb->assert_depth += 1; cb->assert_depth += 1;
goto GROUP_PROCESS; goto GROUP_PROCESS;
case META_LOOKAHEAD_NA:
bravalue = OP_ASSERT_NA;
cb->assert_depth += 1;
goto GROUP_PROCESS;
/* Optimize (?!) to (*FAIL) unless it is quantified - which is a weird /* Optimize (?!) to (*FAIL) unless it is quantified - which is a weird
thing to do, but Perl allows all assertions to be quantified, and when thing to do, but Perl allows all assertions to be quantified, and when
they contain capturing parentheses there may be a potential use for they contain capturing parentheses there may be a potential use for
@ -6331,6 +6365,11 @@ for (;; pptr++)
cb->assert_depth += 1; cb->assert_depth += 1;
goto GROUP_PROCESS; goto GROUP_PROCESS;
case META_LOOKBEHIND_NA:
bravalue = OP_ASSERTBACK_NA;
cb->assert_depth += 1;
goto GROUP_PROCESS;
case META_ATOMIC: case META_ATOMIC:
bravalue = OP_ONCE; bravalue = OP_ONCE;
goto GROUP_PROCESS_NOTE_EMPTY; goto GROUP_PROCESS_NOTE_EMPTY;
@ -7931,7 +7970,10 @@ length = 2 + 2*LINK_SIZE + skipunits;
/* Remember if this is a lookbehind assertion, and if it is, save its length /* Remember if this is a lookbehind assertion, and if it is, save its length
and skip over the pattern offset. */ and skip over the pattern offset. */
lookbehind = *code == OP_ASSERTBACK || *code == OP_ASSERTBACK_NOT; lookbehind = *code == OP_ASSERTBACK ||
*code == OP_ASSERTBACK_NOT ||
*code == OP_ASSERTBACK_NA;
if (lookbehind) if (lookbehind)
{ {
lookbehindlength = META_DATA(pptr[-1]); lookbehindlength = META_DATA(pptr[-1]);
@ -8802,8 +8844,10 @@ for (;; pptr++)
case META_COND_VERSION: case META_COND_VERSION:
case META_LOOKAHEAD: case META_LOOKAHEAD:
case META_LOOKAHEADNOT: case META_LOOKAHEADNOT:
case META_LOOKAHEAD_NA:
case META_LOOKBEHIND: case META_LOOKBEHIND:
case META_LOOKBEHINDNOT: case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
case META_NOCAPTURE: case META_NOCAPTURE:
case META_SCRIPT_RUN: case META_SCRIPT_RUN:
nestlevel++; nestlevel++;
@ -9064,6 +9108,7 @@ for (;; pptr++)
case META_LOOKAHEAD: case META_LOOKAHEAD:
case META_LOOKAHEADNOT: case META_LOOKAHEADNOT:
case META_LOOKAHEAD_NA:
pptr = parsed_skip(pptr + 1, PSKIP_KET); pptr = parsed_skip(pptr + 1, PSKIP_KET);
if (pptr == NULL) goto PARSED_SKIP_FAILED; if (pptr == NULL) goto PARSED_SKIP_FAILED;
@ -9102,6 +9147,7 @@ for (;; pptr++)
case META_LOOKBEHIND: case META_LOOKBEHIND:
case META_LOOKBEHINDNOT: case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb)) if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
return -1; return -1;
if (max - branchlength > extra) extra = max - branchlength; if (max - branchlength > extra) extra = max - branchlength;
@ -9453,6 +9499,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
case META_KET: case META_KET:
case META_LOOKAHEAD: case META_LOOKAHEAD:
case META_LOOKAHEADNOT: case META_LOOKAHEADNOT:
case META_LOOKAHEAD_NA:
case META_NOCAPTURE: case META_NOCAPTURE:
case META_PLUS: case META_PLUS:
case META_PLUS_PLUS: case META_PLUS_PLUS:
@ -9514,6 +9561,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
case META_LOOKBEHIND: case META_LOOKBEHIND:
case META_LOOKBEHINDNOT: case META_LOOKBEHINDNOT:
case META_LOOKBEHIND_NA:
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb)) if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
return errorcode; return errorcode;
break; break;

View File

@ -173,6 +173,8 @@ static const uint8_t coptable[] = {
0, /* Assert not */ 0, /* Assert not */
0, /* Assert behind */ 0, /* Assert behind */
0, /* Assert behind not */ 0, /* Assert behind not */
0, /* NA assert */
0, /* NA assert behind */
0, /* ONCE */ 0, /* ONCE */
0, /* SCRIPT_RUN */ 0, /* SCRIPT_RUN */
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */ 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
@ -248,6 +250,8 @@ static const uint8_t poptable[] = {
0, /* Assert not */ 0, /* Assert not */
0, /* Assert behind */ 0, /* Assert behind */
0, /* Assert behind not */ 0, /* Assert behind not */
0, /* NA assert */
0, /* NA assert behind */
0, /* ONCE */ 0, /* ONCE */
0, /* SCRIPT_RUN */ 0, /* SCRIPT_RUN */
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */ 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */

View File

@ -185,6 +185,7 @@ static const unsigned char compile_error_texts[] =
"(*alpha_assertion) not recognized\0" "(*alpha_assertion) not recognized\0"
"script runs require Unicode support, which this version of PCRE2 does not have\0" "script runs require Unicode support, which this version of PCRE2 does not have\0"
"too many capturing groups (maximum 65535)\0" "too many capturing groups (maximum 65535)\0"
"atomic assertion expected after (?( or (?(?C)\0"
; ;
/* Match-time and UTF error texts are in the same format. */ /* Match-time and UTF error texts are in the same format. */

View File

@ -883,12 +883,16 @@ a positive value. */
#define STRING_atomic0 "atomic\0" #define STRING_atomic0 "atomic\0"
#define STRING_pla0 "pla\0" #define STRING_pla0 "pla\0"
#define STRING_plb0 "plb\0" #define STRING_plb0 "plb\0"
#define STRING_napla0 "napla\0"
#define STRING_naplb0 "naplb\0"
#define STRING_nla0 "nla\0" #define STRING_nla0 "nla\0"
#define STRING_nlb0 "nlb\0" #define STRING_nlb0 "nlb\0"
#define STRING_sr0 "sr\0" #define STRING_sr0 "sr\0"
#define STRING_asr0 "asr\0" #define STRING_asr0 "asr\0"
#define STRING_positive_lookahead0 "positive_lookahead\0" #define STRING_positive_lookahead0 "positive_lookahead\0"
#define STRING_positive_lookbehind0 "positive_lookbehind\0" #define STRING_positive_lookbehind0 "positive_lookbehind\0"
#define STRING_non_atomic_positive_lookahead0 "non_atomic_positive_lookahead\0"
#define STRING_non_atomic_positive_lookbehind0 "non_atomic_positive_lookbehind\0"
#define STRING_negative_lookahead0 "negative_lookahead\0" #define STRING_negative_lookahead0 "negative_lookahead\0"
#define STRING_negative_lookbehind0 "negative_lookbehind\0" #define STRING_negative_lookbehind0 "negative_lookbehind\0"
#define STRING_script_run0 "script_run\0" #define STRING_script_run0 "script_run\0"
@ -1173,12 +1177,16 @@ only. */
#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0" #define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
#define STRING_pla0 STR_p STR_l STR_a "\0" #define STRING_pla0 STR_p STR_l STR_a "\0"
#define STRING_plb0 STR_p STR_l STR_b "\0" #define STRING_plb0 STR_p STR_l STR_b "\0"
#define STRING_napla0 STR_n STR_a STR_p STR_l STR_a "\0"
#define STRING_naplb0 STR_n STR_a STR_p STR_l STR_b "\0"
#define STRING_nla0 STR_n STR_l STR_a "\0" #define STRING_nla0 STR_n STR_l STR_a "\0"
#define STRING_nlb0 STR_n STR_l STR_b "\0" #define STRING_nlb0 STR_n STR_l STR_b "\0"
#define STRING_sr0 STR_s STR_r "\0" #define STRING_sr0 STR_s STR_r "\0"
#define STRING_asr0 STR_a STR_s STR_r "\0" #define STRING_asr0 STR_a STR_s STR_r "\0"
#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0" #define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0" #define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_non_atomic_positive_lookahead0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_non_atomic_positive_lookbehind0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0" #define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0" #define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0" #define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
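The two groups of STRING_ definitions above spell the same names in two ways: as a plain literal, and as a concatenation of per-character STR_ macros. A minimal sketch of how the second form assembles "napla", assuming ASCII code points written as octal escapes. The four STR_ definitions and both helper functions here are illustrative stand-ins, not copied from the header.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-ins: each STR_x is a one-character string literal that
   gives the code point explicitly (ASCII values are assumed here). */
#define STR_a "\141"
#define STR_l "\154"
#define STR_n "\156"
#define STR_p "\160"

/* Adjacent string literals are concatenated by the compiler, so this is the
   same array as "napla\0". */
#define STRING_napla0 STR_n STR_a STR_p STR_l STR_a "\0"

/* Hypothetical helper: returns 1 if the assembled name spells "napla". */
static int napla_is_spelled_correctly(void)
{
  return strcmp(STRING_napla0, "napla") == 0;
}

/* Hypothetical helper: size of the literal, which is five letters, the
   explicit "\0", and the implicit terminator of every string literal. */
static size_t napla_literal_size(void)
{
  return sizeof(STRING_napla0);
}
```

The explicit "\0" lets several names be packed back to back in one table, each ending in its own terminator.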
@ -1303,7 +1311,7 @@ enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
Starting from 1 (i.e. after OP_END), the values up to OP_EOD must correspond in Starting from 1 (i.e. after OP_END), the values up to OP_EOD must correspond in
order to the list of escapes immediately above. Furthermore, values up to order to the list of escapes immediately above. Furthermore, values up to
OP_DOLLM must not be changed without adjusting the table called autoposstab in OP_DOLLM must not be changed without adjusting the table called autoposstab in
pcre2_auto_possess.c pcre2_auto_possess.c.
Whenever this list is updated, the two macro definitions that follow must be Whenever this list is updated, the two macro definitions that follow must be
updated to match. The possessification table called "opcode_possessify" in updated to match. The possessification table called "opcode_possessify" in
@ -1501,80 +1509,81 @@ enum {
OP_KETRMIN, /* 123 order. They are for groups that repeat for ever. */ OP_KETRMIN, /* 123 order. They are for groups that repeat for ever. */
OP_KETRPOS, /* 124 Possessive unlimited repeat. */ OP_KETRPOS, /* 124 Possessive unlimited repeat. */
/* The assertions must come before BRA, CBRA, ONCE, and COND, and the four /* The assertions must come before BRA, CBRA, ONCE, and COND. */
asserts must remain in order. */
OP_REVERSE, /* 125 Move pointer back - used in lookbehind assertions */ OP_REVERSE, /* 125 Move pointer back - used in lookbehind assertions */
OP_ASSERT, /* 126 Positive lookahead */ OP_ASSERT, /* 126 Positive lookahead */
OP_ASSERT_NOT, /* 127 Negative lookahead */ OP_ASSERT_NOT, /* 127 Negative lookahead */
OP_ASSERTBACK, /* 128 Positive lookbehind */ OP_ASSERTBACK, /* 128 Positive lookbehind */
OP_ASSERTBACK_NOT, /* 129 Negative lookbehind */ OP_ASSERTBACK_NOT, /* 129 Negative lookbehind */
OP_ASSERT_NA, /* 130 Positive non-atomic lookahead */
OP_ASSERTBACK_NA, /* 131 Positive non-atomic lookbehind */
/* ONCE, SCRIPT_RUN, BRA, BRAPOS, CBRA, CBRAPOS, and COND must come /* ONCE, SCRIPT_RUN, BRA, BRAPOS, CBRA, CBRAPOS, and COND must come
immediately after the assertions, with ONCE first, as there's a test for >= immediately after the assertions, with ONCE first, as there's a test for >=
ONCE for a subpattern that isn't an assertion. The POS versions must ONCE for a subpattern that isn't an assertion. The POS versions must
immediately follow the non-POS versions in each case. */ immediately follow the non-POS versions in each case. */
OP_ONCE, /* 130 Atomic group, contains captures */ OP_ONCE, /* 132 Atomic group, contains captures */
OP_SCRIPT_RUN, /* 131 Non-capture, but check characters' scripts */ OP_SCRIPT_RUN, /* 133 Non-capture, but check characters' scripts */
OP_BRA, /* 132 Start of non-capturing bracket */ OP_BRA, /* 134 Start of non-capturing bracket */
OP_BRAPOS, /* 133 Ditto, with unlimited, possessive repeat */ OP_BRAPOS, /* 135 Ditto, with unlimited, possessive repeat */
OP_CBRA, /* 134 Start of capturing bracket */ OP_CBRA, /* 136 Start of capturing bracket */
OP_CBRAPOS, /* 135 Ditto, with unlimited, possessive repeat */ OP_CBRAPOS, /* 137 Ditto, with unlimited, possessive repeat */
OP_COND, /* 136 Conditional group */ OP_COND, /* 138 Conditional group */
/* These five must follow the previous five, in the same order. There's a /* These five must follow the previous five, in the same order. There's a
check for >= SBRA to distinguish the two sets. */ check for >= SBRA to distinguish the two sets. */
OP_SBRA, /* 137 Start of non-capturing bracket, check empty */ OP_SBRA, /* 139 Start of non-capturing bracket, check empty */
OP_SBRAPOS, /* 138 Ditto, with unlimited, possessive repeat */ OP_SBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */
OP_SCBRA, /* 139 Start of capturing bracket, check empty */ OP_SCBRA, /* 141 Start of capturing bracket, check empty */
OP_SCBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */ OP_SCBRAPOS, /* 142 Ditto, with unlimited, possessive repeat */
OP_SCOND, /* 141 Conditional group, check empty */ OP_SCOND, /* 143 Conditional group, check empty */
/* The next two pairs must (respectively) be kept together. */ /* The next two pairs must (respectively) be kept together. */
OP_CREF, /* 142 Used to hold a capture number as condition */ OP_CREF, /* 144 Used to hold a capture number as condition */
OP_DNCREF, /* 143 Used to point to duplicate names as a condition */ OP_DNCREF, /* 145 Used to point to duplicate names as a condition */
OP_RREF, /* 144 Used to hold a recursion number as condition */ OP_RREF, /* 146 Used to hold a recursion number as condition */
OP_DNRREF, /* 145 Used to point to duplicate names as a condition */ OP_DNRREF, /* 147 Used to point to duplicate names as a condition */
OP_FALSE, /* 146 Always false (used by DEFINE and VERSION) */ OP_FALSE, /* 148 Always false (used by DEFINE and VERSION) */
OP_TRUE, /* 147 Always true (used by VERSION) */ OP_TRUE, /* 149 Always true (used by VERSION) */
OP_BRAZERO, /* 148 These two must remain together and in this */ OP_BRAZERO, /* 150 These two must remain together and in this */
OP_BRAMINZERO, /* 149 order. */ OP_BRAMINZERO, /* 151 order. */
OP_BRAPOSZERO, /* 150 */ OP_BRAPOSZERO, /* 152 */
/* These are backtracking control verbs */ /* These are backtracking control verbs */
OP_MARK, /* 151 always has an argument */ OP_MARK, /* 153 always has an argument */
OP_PRUNE, /* 152 */ OP_PRUNE, /* 154 */
OP_PRUNE_ARG, /* 153 same, but with argument */ OP_PRUNE_ARG, /* 155 same, but with argument */
OP_SKIP, /* 154 */ OP_SKIP, /* 156 */
OP_SKIP_ARG, /* 155 same, but with argument */ OP_SKIP_ARG, /* 157 same, but with argument */
OP_THEN, /* 156 */ OP_THEN, /* 158 */
OP_THEN_ARG, /* 157 same, but with argument */ OP_THEN_ARG, /* 159 same, but with argument */
OP_COMMIT, /* 158 */ OP_COMMIT, /* 160 */
OP_COMMIT_ARG, /* 159 same, but with argument */ OP_COMMIT_ARG, /* 161 same, but with argument */
/* These are forced failure and success verbs. FAIL and ACCEPT do accept an /* These are forced failure and success verbs. FAIL and ACCEPT do accept an
argument, but these cases can be compiled as, for example, (*MARK:X)(*FAIL) argument, but these cases can be compiled as, for example, (*MARK:X)(*FAIL)
without the need for a special opcode. */ without the need for a special opcode. */
OP_FAIL, /* 160 */ OP_FAIL, /* 162 */
OP_ACCEPT, /* 161 */ OP_ACCEPT, /* 163 */
OP_ASSERT_ACCEPT, /* 162 Used inside assertions */ OP_ASSERT_ACCEPT, /* 164 Used inside assertions */
OP_CLOSE, /* 163 Used before OP_ACCEPT to close open captures */ OP_CLOSE, /* 165 Used before OP_ACCEPT to close open captures */
/* This is used to skip a subpattern with a {0} quantifier */ /* This is used to skip a subpattern with a {0} quantifier */
OP_SKIPZERO, /* 164 */ OP_SKIPZERO, /* 166 */
/* This is used to identify a DEFINE group during compilation so that it can /* This is used to identify a DEFINE group during compilation so that it can
be checked for having only one branch. It is changed to OP_FALSE before be checked for having only one branch. It is changed to OP_FALSE before
compilation finishes. */ compilation finishes. */
OP_DEFINE, /* 165 */ OP_DEFINE, /* 167 */
/* This is not an opcode, but is used to check that tables indexed by opcode /* This is not an opcode, but is used to check that tables indexed by opcode
are the correct length, in order to catch updating errors - there have been are the correct length, in order to catch updating errors - there have been
@ -1587,7 +1596,7 @@ enum {
/* *** NOTE NOTE NOTE *** Whenever the list above is updated, the two macro /* *** NOTE NOTE NOTE *** Whenever the list above is updated, the two macro
definitions that follow must also be updated to match. There are also tables definitions that follow must also be updated to match. There are also tables
called "opcode_possessify" in pcre2_compile.c and "coptable" and "poptable" in called "opcode_possessify" in pcre2_compile.c and "coptable" and "poptable" in
pcre2_dfa_exec.c that must be updated. */ pcre2_dfa_match.c that must be updated. */
/* This macro defines textual names for all the opcodes. These are used only /* This macro defines textual names for all the opcodes. These are used only
@ -1620,7 +1629,9 @@ some cases doesn't actually use these names at all). */
"class", "nclass", "xclass", "Ref", "Refi", "DnRef", "DnRefi", \ "class", "nclass", "xclass", "Ref", "Refi", "DnRef", "DnRefi", \
"Recurse", "Callout", "CalloutStr", \ "Recurse", "Callout", "CalloutStr", \
"Alt", "Ket", "KetRmax", "KetRmin", "KetRpos", \ "Alt", "Ket", "KetRmax", "KetRmin", "KetRpos", \
"Reverse", "Assert", "Assert not", "AssertB", "AssertB not", \ "Reverse", "Assert", "Assert not", \
"Assert back", "Assert back not", \
"Non-atomic assert", "Non-atomic assert back", \
"Once", \ "Once", \
"Script run", \ "Script run", \
"Bra", "BraPos", "CBra", "CBraPos", \ "Bra", "BraPos", "CBra", "CBraPos", \
@ -1705,6 +1716,8 @@ in UTF-8 mode. The code that uses this table must know about such things. */
1+LINK_SIZE, /* Assert not */ \ 1+LINK_SIZE, /* Assert not */ \
1+LINK_SIZE, /* Assert behind */ \ 1+LINK_SIZE, /* Assert behind */ \
1+LINK_SIZE, /* Assert behind not */ \ 1+LINK_SIZE, /* Assert behind not */ \
1+LINK_SIZE, /* NA Assert */ \
1+LINK_SIZE, /* NA Assert behind */ \
1+LINK_SIZE, /* ONCE */ \ 1+LINK_SIZE, /* ONCE */ \
1+LINK_SIZE, /* SCRIPT_RUN */ \ 1+LINK_SIZE, /* SCRIPT_RUN */ \
1+LINK_SIZE, /* BRA */ \ 1+LINK_SIZE, /* BRA */ \
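The opcode list, the name list, and the length table all have to gain entries in the same positions when OP_ASSERT_NA and OP_ASSERTBACK_NA are added. A minimal sketch of how a trailing length marker can catch a table that falls out of step, and of why keeping the assertions ahead of ONCE allows a single comparison to classify an opcode. Every MINI_ name below is hypothetical and covers only a handful of opcodes; it is not the real enum.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature of the opcode list: the assertion opcodes and ONCE,
   followed by a length marker that is not itself an opcode. */
enum {
  MINI_ASSERT, MINI_ASSERT_NOT, MINI_ASSERTBACK, MINI_ASSERTBACK_NOT,
  MINI_ASSERT_NA, MINI_ASSERTBACK_NA, MINI_ONCE,
  MINI_TABLE_LENGTH
};

/* Names in the same order as the enum, as in the name list macro. */
#define MINI_NAME_LIST \
  "Assert", "Assert not", "Assert back", "Assert back not", \
  "Non-atomic assert", "Non-atomic assert back", "Once"

static const char *mini_names[] = { MINI_NAME_LIST };

#define MINI_NAME_COUNT (sizeof(mini_names) / sizeof(mini_names[0]))

/* Compile-time check (C11): fails to build if an opcode is added without a
   matching name, or the other way round. */
_Static_assert(MINI_NAME_COUNT == MINI_TABLE_LENGTH,
  "name table is out of step with the opcode list");

/* Look up the printable name of an opcode, with a fallback for bad input. */
static const char *mini_opcode_name(int op)
{
  return (op >= 0 && op < MINI_TABLE_LENGTH) ? mini_names[op] : "??";
}

/* Because every assertion precedes MINI_ONCE, one comparison separates
   assertions from the other bracket types. */
static int mini_is_assertion(int op)
{
  return op >= 0 && op < MINI_ONCE;
}
```

With this layout, appending the two non-atomic opcodes after the four existing assertions keeps any range test on the older opcodes unchanged.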


@ -5127,6 +5127,8 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_ASSERT: case OP_ASSERT:
case OP_ASSERTBACK: case OP_ASSERTBACK:
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
Lframe_type = GF_NOCAPTURE | Fop; Lframe_type = GF_NOCAPTURE | Fop;
for (;;) for (;;)
{ {
@ -5497,10 +5499,20 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_SCOND: case OP_SCOND:
break; break;
/* Positive assertions are like OP_ONCE, except that in addition the /* Non-atomic positive assertions are like OP_BRA, except that the
subject pointer must be put back to where it was at the start of the subject pointer must be put back to where it was at the start of the
assertion. */ assertion. */
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
Feptr = P->eptr;
break;
/* Atomic positive assertions are like OP_ONCE, except that in addition
the subject pointer must be put back to where it was at the start of the
assertion. */
case OP_ASSERT: case OP_ASSERT:
case OP_ASSERTBACK: case OP_ASSERTBACK:
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr; if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
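The two comments above contrast the OP_BRA-like handling of non-atomic assertions with the OP_ONCE-like handling of atomic ones. The observable difference is whether a later failure can backtrack into the assertion and try another branch. A hand-written model of the pattern (*plb:(.)..|(.)...)(\1|\2) and its (*naplb: form, tried at one fixed offset, is sketched below. It is not PCRE2 code: the function name and its parameters are hypothetical, and it hard-codes this one pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Model of (*plb:(.)..|(.)...)(\1|\2) when atomic is nonzero, and of the
   (*naplb: form when atomic is zero. Returns the matched character, or 0
   for no match at this offset. */
static char model_lookbehind(const char *s, size_t len, size_t pos, int atomic)
{
  /* Branch widths: "(.).." looks back 3 characters, "(.)..." looks back 4. */
  static const size_t width[2] = { 3, 4 };
  size_t branch;

  if (pos >= len) return 0;               /* nothing left for (\1|\2) */

  for (branch = 0; branch < 2; branch++)
    {
    char captured;
    if (pos < width[branch]) continue;    /* branch cannot fit, so it fails */
    captured = s[pos - width[branch]];    /* the (.) at the branch start */

    /* The assertion has succeeded and the subject pointer is back at pos.
       Only the group set by this branch is defined, so (\1|\2) reduces to
       comparing the next character with that one capture. */
    if (s[pos] == captured) return s[pos];

    /* The continuation failed. An atomic assertion has discarded its
       backtracking points, so the remaining branch is never tried. */
    if (atomic) return 0;
    }
  return 0;
}
```

For the subject abcda at offset 4 the first branch captures b, which does not equal the next character a. The atomic form stops there, while the non-atomic form goes on to the second branch, captures a, and matches.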


@ -392,6 +392,8 @@ for(;;)
case OP_ASSERT_NOT: case OP_ASSERT_NOT:
case OP_ASSERTBACK: case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT: case OP_ASSERTBACK_NOT:
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
case OP_ONCE: case OP_ONCE:
case OP_SCRIPT_RUN: case OP_SCRIPT_RUN:
case OP_COND: case OP_COND:


@ -240,6 +240,8 @@ for (;;)
case OP_ASSERT_NOT: case OP_ASSERT_NOT:
case OP_ASSERTBACK: case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT: case OP_ASSERTBACK_NOT:
case OP_ASSERT_NA:
case OP_ASSERTBACK_NA:
do cc += GET(cc, 1); while (*cc == OP_ALT); do cc += GET(cc, 1); while (*cc == OP_ALT);
/* Fall through */ /* Fall through */
@ -1089,6 +1091,7 @@ do
case OP_ONCE: case OP_ONCE:
case OP_SCRIPT_RUN: case OP_SCRIPT_RUN:
case OP_ASSERT: case OP_ASSERT:
case OP_ASSERT_NA:
rc = set_start_bits(re, tcode, utf); rc = set_start_bits(re, tcode, utf);
if (rc == SSB_FAIL || rc == SSB_UNKNOWN) return rc; if (rc == SSB_FAIL || rc == SSB_UNKNOWN) return rc;
if (rc == SSB_DONE) try_next = FALSE; else if (rc == SSB_DONE) try_next = FALSE; else
@ -1131,6 +1134,7 @@ do
case OP_ASSERT_NOT: case OP_ASSERT_NOT:
case OP_ASSERTBACK: case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT: case OP_ASSERTBACK_NOT:
case OP_ASSERTBACK_NA:
do tcode += GET(tcode, 1); while (*tcode == OP_ALT); do tcode += GET(tcode, 1); while (*tcode == OP_ALT);
tcode += 1 + LINK_SIZE; tcode += 1 + LINK_SIZE;
break; break;
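In the hunk above, OP_ASSERTBACK_NA joins the assertions that are stepped over as a whole: the loop follows each branch link until the opcode it lands on is no longer OP_ALT, then moves past the closing bracket. A minimal sketch of that walk over a made-up compiled form. The MINI_ opcodes, the 2-byte big-endian link, and the byte layout are assumptions for illustration, not PCRE2's actual encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature encoding: a group start or MINI_ALT is followed by
   a 2-byte big-endian link giving the distance to the next MINI_ALT or
   MINI_KET. MINI_CHAR is followed by one literal byte. */
enum { MINI_END, MINI_CHAR, MINI_ALT, MINI_KET, MINI_ASSERTBACK_NA };

#define MINI_LINK_SIZE 2
#define MINI_GET(p, n) (((size_t)(p)[(n)] << 8) | (size_t)(p)[(n) + 1])

/* Step over a whole group: hop from branch to branch while the landing
   opcode is MINI_ALT, then skip the closing MINI_KET and its link. */
static const unsigned char *skip_group(const unsigned char *cc)
{
  do cc += MINI_GET(cc, 1); while (*cc == MINI_ALT);
  return cc + 1 + MINI_LINK_SIZE;
}

/* A two-branch group, roughly (*naplb:a|b), followed by MINI_END. */
static const unsigned char mini_code[] = {
  MINI_ASSERTBACK_NA, 0, 5, MINI_CHAR, 'a',   /* offsets 0..4   */
  MINI_ALT,           0, 5, MINI_CHAR, 'b',   /* offsets 5..9   */
  MINI_KET,           0, 10,                  /* offsets 10..12 */
  MINI_END                                    /* offset 13      */
};
```

Calling skip_group(mini_code) lands on the MINI_END byte at offset 13, whichever branch content the group holds.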

29
testdata/testinput2 vendored

@ -5653,4 +5653,33 @@ a)"xI
# Multiplication overflow # Multiplication overflow
/(X{65535})(?<=\1{32770})/ /(X{65535})(?<=\1{32770})/
# ---- Non-atomic assertion tests ----
# Expect error: not allowed as a condition
/(?(*napla:xx)bc)/
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
/(*plb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
abcda\=offset=4
/(*naplb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
abcda\=offset=4
/(*non_atomic_positive_lookahead:ab)/B
/(*non_atomic_positive_lookbehind:ab)/B
/(*pla:ab+)/B
/(*napla:ab+)/B
# ----
# End of testinput2 # End of testinput2

89
testdata/testoutput2 vendored

@ -11117,7 +11117,7 @@ Matched, but too many substrings
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra
Brazero Brazero
AssertB Assert back
Reverse Reverse
CBra 1 CBra 1
abc abc
@ -13346,7 +13346,7 @@ Failed: error 144 at offset 5: subpattern name must start with a non-digit
Ket Ket
red red
\b \b
AssertB Assert back
Reverse Reverse
\w \w
Ket Ket
@ -13403,7 +13403,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
Once Once
\s*+ \s*+
Ket Ket
AssertB Assert back
Reverse Reverse
\w \w
Ket Ket
@ -16619,7 +16619,7 @@ No match
/(?<=(?=.){4,5}x)/B /(?<=(?=.){4,5}x)/B
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra
AssertB Assert back
Reverse Reverse
Assert Assert
Any Any
@ -17086,6 +17086,87 @@ Failed: error 187 at offset 15: lookbehind assertion is too long
/(X{65535})(?<=\1{32770})/ /(X{65535})(?<=\1{32770})/
Failed: error 187 at offset 10: lookbehind assertion is too long Failed: error 187 at offset 10: lookbehind assertion is too long
# ---- Non-atomic assertion tests ----
# Expect error: not allowed as a condition
/(?(*napla:xx)bc)/
Failed: error 198 at offset 9: atomic assertion expected after (?( or (?(?C)
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
No match
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
0: word1 word3 word1 word2 word3 word2 word2 word1 word3
1: word3
/(*plb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
0: b
1: b
2: <unset>
3: b
abcda\=offset=4
No match
/(*naplb:(.)..|(.)...)(\1|\2)/
abcdb\=offset=4
0: b
1: b
2: <unset>
3: b
abcda\=offset=4
0: a
1: <unset>
2: a
3: a
/(*non_atomic_positive_lookahead:ab)/B
------------------------------------------------------------------
Bra
Non-atomic assert
ab
Ket
Ket
End
------------------------------------------------------------------
/(*non_atomic_positive_lookbehind:ab)/B
------------------------------------------------------------------
Bra
Non-atomic assert back
Reverse
ab
Ket
Ket
End
------------------------------------------------------------------
/(*pla:ab+)/B
------------------------------------------------------------------
Bra
Assert
a
b++
Ket
Ket
End
------------------------------------------------------------------
/(*napla:ab+)/B
------------------------------------------------------------------
Bra
Non-atomic assert
a
b+
Ket
Ket
End
------------------------------------------------------------------
# ----
# End of testinput2 # End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number) Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data Error -62: bad serialized data


@ -4017,7 +4017,7 @@ MK: a\x{12345}b\x{09}(d)c
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra
\b \b
AssertB Assert back
Reverse Reverse
prop Xwd prop Xwd
Ket Ket
@ -4196,7 +4196,7 @@ Failed: error 125 at offset 2: lookbehind assertion is not fixed length
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra
^ ^
AssertB not Assert back not
Assert Assert
\x{10385c} \x{10385c}
Ket Ket
@ -4828,7 +4828,7 @@ MK: ABC
/(?<!)(*sr:)/B /(?<!)(*sr:)/B
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra
AssertB not Assert back not
Ket Ket
Script run Script run
Ket Ket
@ -4839,7 +4839,7 @@ MK: ABC
/(?<=abc(?=X(*sr:BXY)CCC)XBXYCCC)./B /(?<=abc(?=X(*sr:BXY)CCC)XBXYCCC)./B
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra
AssertB Assert back
Reverse Reverse
abc abc
Assert Assert