Implement non-atomic positive assertions.
parent
691aca7a86
commit
620f3a1307
|
@ -88,6 +88,8 @@ otherwise), an atomic group, or a recursion.
|
|||
17. Check for integer overflow when computing lookbehind lengths. Fixes
|
||||
Clusterfuzz issue 15636.
|
||||
|
||||
18. Implement non-atomic positive lookaround assertions.
|
||||
|
||||
|
||||
Version 10.33 16-April-2019
|
||||
---------------------------
|
||||
|
|
34
HACKING
|
@ -195,6 +195,7 @@ META_END End of pattern (this value is 0x80000000)
|
|||
META_FAIL (*FAIL)
|
||||
META_KET ) closing parenthesis
|
||||
META_LOOKAHEAD (?= start of lookahead
|
||||
META_LOOKAHEAD_NA (*napla: start of non-atomic lookahead
|
||||
META_LOOKAHEADNOT (?! start of negative lookahead
|
||||
META_NOCAPTURE (?: no capture parens
|
||||
META_PLUS +
|
||||
|
@ -286,8 +287,9 @@ The following are also followed just by an offset, but also the lower 16 bits
|
|||
of the main word contain the length of the first branch of the lookbehind
|
||||
group; this is used when generating OP_REVERSE for that branch.
|
||||
|
||||
META_LOOKBEHIND (?<=
|
||||
META_LOOKBEHINDNOT (?<!
|
||||
META_LOOKBEHIND (?<= start of lookbehind
|
||||
META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
|
||||
META_LOOKBEHINDNOT (?<! start of negative lookbehind
|
||||
|
||||
The following are followed by two elements, the minimum and maximum. Repeat
|
||||
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
||||
|
@ -715,13 +717,15 @@ Assertions
|
|||
----------
|
||||
|
||||
Forward assertions are also just like other subpatterns, but starting with one
|
||||
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
|
||||
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
|
||||
is OP_REVERSE, followed by a count of the number of characters to move back the
|
||||
pointer in the subject string. In ASCII or UTF-32 mode, the count is also the
|
||||
number of code units, but in UTF-8/16 mode each character may occupy more than
|
||||
one code unit. A separate count is present in each alternative of a lookbehind
|
||||
assertion, allowing them to have different (but fixed) lengths.
|
||||
of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
|
||||
OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK,
|
||||
OP_ASSERTBACK_NA, and OP_ASSERTBACK_NOT, and the first opcode inside the
|
||||
assertion is OP_REVERSE, followed by a count of the number of characters to
|
||||
move back the pointer in the subject string. In ASCII or UTF-32 mode, the count
|
||||
is also the number of code units, but in UTF-8/16 mode each character may
|
||||
occupy more than one code unit. A separate count is present in each alternative
|
||||
of a lookbehind assertion, allowing each branch to have a different (but fixed)
|
||||
length.
|
||||
|
||||
|
||||
Conditional subpatterns
|
||||
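Note on the OP_REVERSE count described in the hunk above: the count is in characters, so in UTF-8 mode moving back by it means skipping a variable number of code units. The following is a minimal Python sketch of that idea only. It is not PCRE2's implementation, and `move_back_utf8` is a hypothetical helper name introduced for illustration.

```python
def move_back_utf8(data, pos, nchars):
    """Move `pos` back by `nchars` characters in UTF-8 encoded `data`.

    Returns the new byte offset, or None if there are not enough
    characters before `pos` (in which case a lookbehind could not match).
    This is an illustrative sketch, not code taken from PCRE2.
    """
    for _ in range(nchars):
        if pos == 0:
            return None
        pos -= 1
        # Continuation bytes have the bit pattern 10xxxxxx; keep stepping
        # back until the first byte of the character is reached.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
    return pos


# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes, "b" is 1 byte: 7 bytes total.
subject = "a\u00e9\u20acb".encode("utf-8")
end = len(subject)
offsets = [move_back_utf8(subject, end, n) for n in range(1, 6)]
print(offsets)
```

In ASCII or UTF-32 mode the same operation would be a plain subtraction, because one character is always one code unit.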
|
@ -754,11 +758,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
|
|||
or OP_FALSE.
|
||||
|
||||
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
|
||||
must start with a parenthesized assertion, whose opcode normally immediately
|
||||
follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
|
||||
callout is inserted immediately before the assertion. It is also possible to
|
||||
insert a manual callout at this point. Only assertion conditions may have
|
||||
callouts preceding the condition.
|
||||
must start with a parenthesized atomic assertion, whose opcode normally
|
||||
immediately follows OP_COND or OP_SCOND. However, if automatic callouts are
|
||||
enabled, a callout is inserted immediately before the assertion. It is also
|
||||
possible to insert a manual callout at this point. Only assertion conditions
|
||||
may have callouts preceding the condition.
|
||||
|
||||
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
|
||||
parts of the pattern, so this is another opcode that may appear as a condition.
|
||||
|
@ -823,4 +827,4 @@ not a real opcode, but is used to check at compile time that tables indexed by
|
|||
opcode are the correct length, in order to catch updating errors.
|
||||
|
||||
Philip Hazel
|
||||
20 July 2018
|
||||
12 July 2019
|
||||
|
|
|
@ -205,6 +205,11 @@ different way and is not Perl-compatible.
|
|||
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
|
||||
the start of a pattern that set overall options that cannot be changed within
|
||||
the pattern.
|
||||
<br>
|
||||
<br>
|
||||
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
|
||||
extension to the lookaround facilities. The default, Perl-compatible
|
||||
lookarounds are atomic.
|
||||
</P>
|
||||
<P>
|
||||
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
|
||||
|
@ -234,7 +239,7 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 12 February 2019
|
||||
Last updated: 13 July 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -33,17 +33,18 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
|
||||
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
||||
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
|
||||
<li><a name="TOC22" href="#SEC22">CONDITIONAL GROUPS</a>
|
||||
<li><a name="TOC23" href="#SEC23">COMMENTS</a>
|
||||
<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a>
|
||||
<li><a name="TOC25" href="#SEC25">GROUPS AS SUBROUTINES</a>
|
||||
<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
|
||||
<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
|
||||
<li><a name="TOC30" href="#SEC30">AUTHOR</a>
|
||||
<li><a name="TOC31" href="#SEC31">REVISION</a>
|
||||
<li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
|
||||
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
|
||||
<li><a name="TOC23" href="#SEC23">CONDITIONAL GROUPS</a>
|
||||
<li><a name="TOC24" href="#SEC24">COMMENTS</a>
|
||||
<li><a name="TOC25" href="#SEC25">RECURSIVE PATTERNS</a>
|
||||
<li><a name="TOC26" href="#SEC26">GROUPS AS SUBROUTINES</a>
|
||||
<li><a name="TOC27" href="#SEC27">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||
<li><a name="TOC28" href="#SEC28">CALLOUTS</a>
|
||||
<li><a name="TOC29" href="#SEC29">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC30" href="#SEC30">SEE ALSO</a>
|
||||
<li><a name="TOC31" href="#SEC31">AUTHOR</a>
|
||||
<li><a name="TOC32" href="#SEC32">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
|
||||
<P>
|
||||
|
@ -2364,19 +2365,23 @@ those that look behind it, and in each case an assertion may be positive (must
|
|||
match for the assertion to be true) or negative (must not match for the
|
||||
assertion to be true). An assertion group is matched in the normal way,
|
||||
and if it is true, matching continues after it, but with the matching position
|
||||
in the subject string is was it was before the assertion was processed.
|
||||
in the subject string reset to what it was before the assertion was processed.
|
||||
</P>
|
||||
<P>
|
||||
A lookaround assertion may also appear as the condition in a
|
||||
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
|
||||
but there is a subsequent matching failure, there is no backtracking into the
|
||||
assertion. However, there are some cases where non-atomic assertions can be
|
||||
useful. PCRE2 has some support for these, described in the section entitled
|
||||
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
|
||||
below, but they are not Perl-compatible.
|
||||
</P>
|
||||
<P>
|
||||
A lookaround assertion may appear as the condition in a
|
||||
<a href="#conditions">conditional group</a>
|
||||
(see below). In this case, the result of matching the assertion determines
|
||||
which branch of the condition is followed.
|
||||
</P>
|
||||
<P>
|
||||
Lookaround assertions are atomic. If an assertion is true, but there is a
|
||||
subsequent matching failure, there is no backtracking into the assertion.
|
||||
</P>
|
||||
<P>
|
||||
Assertion groups are not capture groups. If an assertion contains capture
|
||||
groups within it, these are counted for the purposes of numbering the capture
|
||||
groups in the whole pattern. Within each branch of an assertion, locally
|
||||
|
@ -2429,11 +2434,11 @@ The assertion is obeyed just once when encountered during matching.
|
|||
Alphabetic assertion names
|
||||
</b><br>
|
||||
<P>
|
||||
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
|
||||
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
|
||||
alternatives which might be easier to remember. They all start with (* instead
|
||||
of (? and must be written using lower case letters. PCRE2 supports the
|
||||
following synonyms:
|
||||
Traditionally, symbolic sequences such as (?= and (?<= have been used to
|
||||
specify lookaround assertions. Perl 5.28 introduced some experimental
|
||||
alphabetic alternatives which might be easier to remember. They all start with
|
||||
(* instead of (? and must be written using lower case letters. PCRE2 supports
|
||||
the following synonyms:
|
||||
<pre>
|
||||
(*positive_lookahead: or (*pla: is the same as (?=
|
||||
(*negative_lookahead: or (*nla: is the same as (?!
|
||||
|
@ -2606,8 +2611,63 @@ preceded by "foo", while
|
|||
</pre>
|
||||
is another pattern that matches "foo" preceded by three digits and any three
|
||||
characters that are not "999".
|
||||
<a name="nonatomicassertions"></a></P>
|
||||
<br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
|
||||
<P>
|
||||
The traditional Perl-compatible lookaround assertions are atomic. That is, if
|
||||
an assertion is true, but there is a subsequent matching failure, there is no
|
||||
backtracking into the assertion. However, there are some cases where non-atomic
|
||||
positive assertions can be useful. PCRE2 provides these using the following
|
||||
syntax:
|
||||
<pre>
|
||||
(*non_atomic_positive_lookahead: or (*napla:
|
||||
(*non_atomic_positive_lookbehind: or (*naplb:
|
||||
</pre>
|
||||
Consider the problem of finding the right-most word in a string that also
|
||||
appears earlier in the string, that is, it must appear at least twice in total.
|
||||
This pattern returns the required result as captured substring 1:
|
||||
<pre>
|
||||
^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
|
||||
</pre>
|
||||
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
|
||||
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
|
||||
"x" option, which causes white space (introduced for readability) to be
|
||||
ignored. Inside the assertion, the greedy .* at first consumes the entire
|
||||
string, but then has to backtrack until the rest of the assertion can match a
|
||||
word, which is captured by group 1. In other words, when the assertion first
|
||||
succeeds, it captures the right-most word in the string.
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<P>
|
||||
The current matching point is then reset to the start of the subject, and the
|
||||
rest of the pattern match checks for two occurrences of the captured word,
|
||||
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
|
||||
if the last word in the string does not occur twice, this part of the pattern
|
||||
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
|
||||
assertion could not be re-entered, and the whole match would fail. The pattern
|
||||
would succeed only if the very last word in the subject was found twice.
|
||||
</P>
|
||||
<P>
|
||||
Using a non-atomic lookahead, however, means that when the last word does not
|
||||
occur twice in the string, the lookahead can backtrack and find the second-last
|
||||
word, and so on, until either the match succeeds, or all words have been
|
||||
tested.
|
||||
</P>
|
||||
<P>
|
||||
Two conditions must be met for a non-atomic assertion to be useful: the
|
||||
contents of one or more capturing groups must change after a backtrack into the
|
||||
assertion, and there must be a backreference to a changed group later in the
|
||||
pattern. If this is not the case, the rest of the pattern match fails exactly
|
||||
as before because nothing has changed, so using a non-atomic assertion just
|
||||
wastes resources.
|
||||
</P>
|
||||
<P>
|
||||
Non-atomic assertions are not supported by the alternative matching function
|
||||
<b>pcre2_dfa_match()</b>. They are also not supported by JIT (but may be in
|
||||
future). Note that assertions that appear as conditions for
|
||||
<a href="#conditions">conditional groups</a>
|
||||
(see below) must be atomic.
|
||||
</P>
|
||||
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<P>
|
||||
In concept, a script run is a sequence of characters that are all from the same
|
||||
Unicode script such as Latin or Greek. However, because some scripts are
|
||||
|
@ -2669,7 +2729,7 @@ parentheses.
|
|||
should not be used within a script run group, because it causes an immediate
|
||||
exit from the group, bypassing the script run checking.
|
||||
<a name="conditions"></a></P>
|
||||
<br><a name="SEC22" href="#TOC1">CONDITIONAL GROUPS</a><br>
|
||||
<br><a name="SEC23" href="#TOC1">CONDITIONAL GROUPS</a><br>
|
||||
<P>
|
||||
It is possible to cause the matching process to obey a pattern fragment
|
||||
conditionally or to choose between two alternative fragments, depending on
|
||||
|
@ -2845,8 +2905,13 @@ Assertion conditions
|
|||
<P>
|
||||
If the condition is not in any of the above formats, it must be a parenthesized
|
||||
assertion. This may be a positive or negative lookahead or lookbehind
|
||||
assertion. Consider this pattern, again containing non-significant white space,
|
||||
and with the two alternatives on the second line:
|
||||
assertion. However, it must be a traditional atomic assertion, not one of the
|
||||
PCRE2-specific
|
||||
<a href="#nonatomicassertions">non-atomic assertions.</a>
|
||||
</P>
|
||||
<P>
|
||||
Consider this pattern, again containing non-significant white space, and with
|
||||
the two alternatives on the second line:
|
||||
<pre>
|
||||
(?(?=[^a-z]*[a-z])
|
||||
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
||||
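Note on the assertion condition shown at the end of the hunk above: the condition (?=[^a-z]*[a-z]) tests whether a lower case letter appears ahead, and the result selects one of the two alternatives. Python's `re` module accepts only group-reference conditions, as far as I know, so the sketch below emulates the same choice with a positive and a negative lookahead guarding the two branches. It is an assumed equivalent rewrite for illustration, not something taken from the PCRE2 sources.

```python
import re

# Emulation of (?(?=A) X | Y ) as (?: (?=A) X | (?!A) Y ).
# re.VERBOSE plays the role of the "x" option: pattern white space is ignored.
DATE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter is ahead: dd-aaa-dd
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter ahead:   dd-dd-dd
    )
    """,
    re.VERBOSE,
)

for candidate in ("12-abc-34", "12-34-56", "12-34-5a"):
    print(candidate, bool(DATE.fullmatch(candidate)))
```

The third candidate is rejected because the letter makes the condition true, which commits the match to the dd-aaa-dd form.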
|
@ -2865,7 +2930,7 @@ positive and negative assertions, because matching always continues after the
|
|||
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
|
||||
for which captures are retained only for positive assertions that succeed.)
|
||||
<a name="comments"></a></P>
|
||||
<br><a name="SEC23" href="#TOC1">COMMENTS</a><br>
|
||||
<br><a name="SEC24" href="#TOC1">COMMENTS</a><br>
|
||||
<P>
|
||||
There are two ways of including comments in patterns that are processed by
|
||||
PCRE2. In both cases, the start of the comment must not be in a character
|
||||
|
@ -2895,7 +2960,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
|
|||
it does not terminate the comment. Only an actual character with the code value
|
||||
0x0a (the default newline) does so.
|
||||
<a name="recursion"></a></P>
|
||||
<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
||||
<br><a name="SEC25" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
||||
<P>
|
||||
Consider the problem of matching a string in parentheses, allowing for
|
||||
unlimited nested parentheses. Without the use of recursion, the best that can
|
||||
|
@ -3083,7 +3148,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
|
|||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||
later versions (I tried 5.024) it now works.
|
||||
<a name="groupsassubroutines"></a></P>
|
||||
<br><a name="SEC25" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
|
||||
<br><a name="SEC26" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
|
||||
<P>
|
||||
If the syntax for a recursive group call (either by number or by name) is used
|
||||
outside the parentheses to which it refers, it operates a bit like a subroutine
|
||||
|
@ -3131,7 +3196,7 @@ in groups when called as subroutines is described in the section entitled
|
|||
<a href="#btsub">"Backtracking verbs in subroutines"</a>
|
||||
below.
|
||||
<a name="onigurumasubroutines"></a></P>
|
||||
<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
||||
<br><a name="SEC27" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
||||
<P>
|
||||
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
||||
a number enclosed either in angle brackets or single quotes, is an alternative
|
||||
|
@ -3149,7 +3214,7 @@ plus or a minus sign it is taken as a relative reference. For example:
|
|||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||
synonymous. The former is a backreference; the latter is a subroutine call.
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
|
||||
<br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
|
||||
code to be obeyed in the middle of matching a regular expression. This makes it
|
||||
|
@ -3225,7 +3290,7 @@ example:
|
|||
</pre>
|
||||
The doubling is removed before the string is passed to the callout function.
|
||||
<a name="backtrackcontrol"></a></P>
|
||||
<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<br><a name="SEC29" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||
terminology) that modify the behaviour of backtracking during matching. They
|
||||
|
@ -3739,12 +3804,12 @@ enclosing group that has alternatives (its normal behaviour). However, if there
|
|||
is no such group within the subroutine's group, the subroutine match fails and
|
||||
there is a backtrack at the outer level.
|
||||
</P>
|
||||
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC30" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
||||
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC31" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -3753,9 +3818,9 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 22 June 2019
|
||||
Last updated: 13 July 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -32,15 +32,16 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
||||
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||
<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a>
|
||||
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
|
||||
<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
|
||||
<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
|
||||
<li><a name="TOC27" href="#SEC27">AUTHOR</a>
|
||||
<li><a name="TOC28" href="#SEC28">REVISION</a>
|
||||
<li><a name="TOC20" href="#SEC20">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
|
||||
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
|
||||
<li><a name="TOC22" href="#SEC22">BACKREFERENCES</a>
|
||||
<li><a name="TOC23" href="#SEC23">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC24" href="#SEC24">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
|
||||
<li><a name="TOC27" href="#SEC27">SEE ALSO</a>
|
||||
<li><a name="TOC28" href="#SEC28">AUTHOR</a>
|
||||
<li><a name="TOC29" href="#SEC29">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||
<P>
|
||||
|
@ -544,7 +545,18 @@ setting with a similar syntax.
|
|||
</pre>
|
||||
Each top-level branch of a lookbehind must be of a fixed length.
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<br><a name="SEC20" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
|
||||
<P>
|
||||
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||
<pre>
|
||||
(*napla:...)
|
||||
(*non_atomic_positive_lookahead:...)
|
||||
|
||||
(*naplb:...)
|
||||
(*non_atomic_positive_lookbehind:...)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(*script_run:...) ) script run, can be backtracked into
|
||||
|
@ -554,7 +566,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
|||
(*asr:...) )
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<br><a name="SEC22" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\n reference by number (can be ambiguous)
|
||||
|
@ -571,7 +583,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
|||
(?P=name) reference by name (Python)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||
<br><a name="SEC23" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?R) recurse whole pattern
|
||||
|
@ -590,7 +602,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
|||
\g'-n' call subroutine by relative number (PCRE2 extension)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||
<br><a name="SEC24" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?(condition)yes-pattern)
|
||||
|
@ -613,7 +625,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
|||
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||
condition if the relevant named group exists.
|
||||
</P>
|
||||
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
||||
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
||||
|
@ -640,7 +652,7 @@ pattern is not anchored.
|
|||
The effect of one of these verbs in a group called as a subroutine is confined
|
||||
to the subroutine call.
|
||||
</P>
|
||||
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
|
||||
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?C) callout (assumed number 0)
|
||||
|
@ -651,12 +663,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
|||
start and the end), and the starting delimiter { matched with the ending
|
||||
delimiter }. To encode the ending delimiter within the string, double it.
|
||||
</P>
|
||||
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC28" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -665,9 +677,9 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC29" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 11 February 2019
|
||||
Last updated: 12 July 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
102
doc/pcre2.txt
|
@ -4887,6 +4887,10 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
|||
at the start of a pattern that set overall options that cannot be
|
||||
changed within the pattern.
|
||||
|
||||
(m) PCRE2 supports non-atomic positive lookaround assertions. This is
|
||||
an extension to the lookaround facilities. The default, Perl-compatible
|
||||
lookarounds are atomic.
|
||||
|
||||
18. The Perl /a modifier restricts /d numbers to pure ascii, and the
|
||||
/aa modifier restricts /i case-insensitive matching to pure ascii, ig-
|
||||
noring Unicode rules. This separation cannot be represented with
|
||||
|
@ -4909,7 +4913,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 12 February 2019
|
||||
Last updated: 13 July 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -8140,16 +8144,19 @@ ASSERTIONS
|
|||
sertion may be positive (must match for the assertion to be true) or
|
||||
negative (must not match for the assertion to be true). An assertion
|
||||
group is matched in the normal way, and if it is true, matching contin-
|
||||
ues after it, but with the matching position in the subject string is
|
||||
was it was before the assertion was processed.
|
||||
ues after it, but with the matching position in the subject string re-
|
||||
set to what it was before the assertion was processed.
|
||||
|
||||
A lookaround assertion may also appear as the condition in a condi-
|
||||
tional group (see below). In this case, the result of matching the as-
|
||||
sertion determines which branch of the condition is followed.
|
||||
The Perl-compatible lookaround assertions are atomic. If an assertion
|
||||
is true, but there is a subsequent matching failure, there is no back-
|
||||
tracking into the assertion. However, there are some cases where non-
|
||||
atomic assertions can be useful. PCRE2 has some support for these, de-
|
||||
scribed in the section entitled "Non-atomic assertions" below, but they
|
||||
are not Perl-compatible.
|
||||
|
||||
Lookaround assertions are atomic. If an assertion is true, but there is
|
||||
a subsequent matching failure, there is no backtracking into the asser-
|
||||
tion.
|
||||
A lookaround assertion may appear as the condition in a conditional
|
||||
group (see below). In this case, the result of matching the assertion
|
||||
determines which branch of the condition is followed.
|
||||
|
||||
Assertion groups are not capture groups. If an assertion contains cap-
|
||||
ture groups within it, these are counted for the purposes of numbering
|
||||
|
@ -8362,6 +8369,60 @@ ASSERTIONS
|
|||
three characters that are not "999".
|
||||
|
||||
|
||||
NON-ATOMIC ASSERTIONS
|
||||
|
||||
The traditional Perl-compatible lookaround assertions are atomic. That
|
||||
is, if an assertion is true, but there is a subsequent matching fail-
|
||||
ure, there is no backtracking into the assertion. However, there are
|
||||
some cases where non-atomic positive assertions can be useful. PCRE2
|
||||
provides these using the following syntax:
|
||||
|
||||
(*non_atomic_positive_lookahead: or (*napla:
|
||||
(*non_atomic_positive_lookbehind: or (*naplb:
|
||||
|
||||
Consider the problem of finding the right-most word in a string that
|
||||
also appears earlier in the string, that is, it must appear at least
|
||||
twice in total. This pattern returns the required result as captured
|
||||
substring 1:
|
||||
|
||||
^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
|
||||
|
||||
For a subject such as "word1 word2 word3 word2 word3 word4" the result
|
||||
is "word3". How does it work? At the start, ^(?x) anchors the pattern
|
||||
and sets the "x" option, which causes white space (introduced for read-
|
||||
ability) to be ignored. Inside the assertion, the greedy .* at first
|
||||
consumes the entire string, but then has to backtrack until the rest of
|
||||
the assertion can match a word, which is captured by group 1. In other
|
||||
words, when the assertion first succeeds, it captures the right-most
|
||||
word in the string.
|
||||
|
||||
The current matching point is then reset to the start of the subject,
|
||||
and the rest of the pattern match checks for two occurrences of the
|
||||
captured word, using an ungreedy .*? to scan from the left. If this
|
||||
succeeds, we are done, but if the last word in the string does not oc-
|
||||
cur twice, this part of the pattern fails. If a traditional atomic
|
||||
lookahead (?= or (*pla: had been used, the assertion could not be
|
||||
re-entered, and the whole match would fail. The pattern would succeed only
|
||||
if the very last word in the subject was found twice.
|
||||
|
||||
Using a non-atomic lookahead, however, means that when the last word
|
||||
does not occur twice in the string, the lookahead can backtrack and
|
||||
find the second-last word, and so on, until either the match succeeds,
|
||||
or all words have been tested.
|
||||
|
||||
Two conditions must be met for a non-atomic assertion to be useful: the
|
||||
contents of one or more capturing groups must change after a backtrack
|
||||
into the assertion, and there must be a backreference to a changed
|
||||
group later in the pattern. If this is not the case, the rest of the
|
||||
pattern match fails exactly as before because nothing has changed, so
|
||||
using a non-atomic assertion just wastes resources.
|
||||
|
||||
Non-atomic assertions are not supported by the alternative matching
|
||||
function pcre2_dfa_match(). They are also not supported by JIT (but may
|
||||
be in future). Note that assertions that appear as conditions for con-
|
||||
ditional groups (see below) must be atomic.
|
||||
|
||||
|
||||
SCRIPT RUNS
|
||||
|
||||
In concept, a script run is a sequence of characters that are all from
|
||||
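Note on the worked example in the NON-ATOMIC ASSERTIONS text added by the hunk above: the difference between the atomic and non-atomic forms can be sketched in Python. Python's `re` lookaheads are atomic and it has no (*napla: syntax, so the first part below shows the atomic form failing on the documented subject, and the second part emulates in plain code the search that the non-atomic assertion performs by backtracking. `rightmost_repeated_word` is a hypothetical helper written for this illustration. It is not part of PCRE2, and it assumes words are maximal runs of \w characters.

```python
import re

SUBJECT = "word1 word2 word3 word2 word3 word4"

# Atomic form: once the lookahead has captured the right-most word, the
# capture cannot change, so the match fails unless that word occurs twice.
ATOMIC = re.compile(r"^(?=.*\b(\w+))(?:.*?\b\1\b){2}")


def rightmost_repeated_word(subject):
    """Emulate the non-atomic search: try each word from the right and
    return the first one that occurs at least twice, or None."""
    words = re.findall(r"\w+", subject)
    for candidate in reversed(words):
        if words.count(candidate) >= 2:
            return candidate
    return None


print(ATOMIC.match(SUBJECT))             # the atomic form finds no match
print(rightmost_repeated_word(SUBJECT))  # the word the (*napla: form captures
```

Each step of the loop corresponds to one backtrack into the assertion: the captured group changes, and the rest of the pattern is retried with the new capture.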
|
@ -8578,9 +8639,11 @@ CONDITIONAL GROUPS
|
|||
|
||||
If the condition is not in any of the above formats, it must be a
|
||||
parenthesized assertion. This may be a positive or negative lookahead
|
||||
or lookbehind assertion. Consider this pattern, again containing non-
|
||||
significant white space, and with the two alternatives on the second
|
||||
line:
|
||||
or lookbehind assertion. However, it must be a traditional atomic as-
|
||||
sertion, not one of the PCRE2-specific non-atomic assertions.
|
||||
|
||||
Consider this pattern, again containing non-significant white space,
|
||||
and with the two alternatives on the second line:
|
||||
|
||||
(?(?=[^a-z]*[a-z])
|
||||
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
||||
|
@ -9439,7 +9502,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 22 June 2019
|
||||
Last updated: 13 July 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -10663,6 +10726,17 @@ LOOKAHEAD AND LOOKBEHIND ASSERTIONS
|
|||
Each top-level branch of a lookbehind must be of a fixed length.
|
||||
|
||||
|
||||
NON-ATOMIC LOOKAROUND ASSERTIONS
|
||||
|
||||
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||
|
||||
(*napla:...)
|
||||
(*non_atomic_positive_lookahead:...)
|
||||
|
||||
(*naplb:...)
|
||||
(*non_atomic_positive_lookbehind:...)
|
||||
|
||||
|
||||
SCRIPT RUNS
|
||||
|
||||
(*script_run:...) ) script run, can be backtracked into
|
||||
|
@ -10784,7 +10858,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 11 February 2019
|
||||
Last updated: 12 July 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2COMPAT 3 "12 February 2019" "PCRE2 10.33"
|
||||
.TH PCRE2COMPAT 3 "13 July 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
|
||||
|
@ -170,6 +170,10 @@ different way and is not Perl-compatible.
|
|||
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
|
||||
the start of a pattern that set overall options that cannot be changed within
|
||||
the pattern.
|
||||
.sp
|
||||
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
|
||||
extension to the lookaround facilities. The default, Perl-compatible
|
||||
lookarounds are atomic.
|
||||
.P
|
||||
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
|
||||
modifier restricts /i case-insensitive matching to pure ascii, ignoring Unicode
|
||||
|
@ -199,6 +203,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 12 February 2019
|
||||
Last updated: 13 July 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "22 June 2019" "PCRE2 10.34"
|
||||
.TH PCRE2PATTERN 3 "13 July 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -2370,9 +2370,19 @@ those that look behind it, and in each case an assertion may be positive (must
|
|||
match for the assertion to be true) or negative (must not match for the
|
||||
assertion to be true). An assertion group is matched in the normal way,
|
||||
and if it is true, matching continues after it, but with the matching position
|
||||
in the subject string is was it was before the assertion was processed.
|
||||
in the subject string reset to what it was before the assertion was processed.
|
||||
.P
|
||||
A lookaround assertion may also appear as the condition in a
|
||||
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
|
||||
but there is a subsequent matching failure, there is no backtracking into the
|
||||
assertion. However, there are some cases where non-atomic assertions can be
|
||||
useful. PCRE2 has some support for these, described in the section entitled
|
||||
.\" HTML <a href="#nonatomicassertions">
|
||||
.\" </a>
|
||||
"Non-atomic assertions"
|
||||
.\"
|
||||
below, but they are not Perl-compatible.
|
||||
.P
|
||||
A lookaround assertion may appear as the condition in a
|
||||
.\" HTML <a href="#conditions">
|
||||
.\" </a>
|
||||
conditional group
|
||||
|
@ -2380,9 +2390,6 @@ conditional group
|
|||
(see below). In this case, the result of matching the assertion determines
|
||||
which branch of the condition is followed.
|
||||
.P
|
||||
Lookaround assertions are atomic. If an assertion is true, but there is a
|
||||
subsequent matching failure, there is no backtracking into the assertion.
|
||||
.P
|
||||
Assertion groups are not capture groups. If an assertion contains capture
|
||||
groups within it, these are counted for the purposes of numbering the capture
|
||||
groups in the whole pattern. Within each branch of an assertion, locally
|
||||
|
@ -2435,11 +2442,11 @@ The assertion is obeyed just once when encountered during matching.
|
|||
.SS "Alphabetic assertion names"
|
||||
.rs
|
||||
.sp
|
||||
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
|
||||
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
|
||||
alternatives which might be easier to remember. They all start with (* instead
|
||||
of (? and must be written using lower case letters. PCRE2 supports the
|
||||
following synonyms:
|
||||
Traditionally, symbolic sequences such as (?= and (?<= have been used to
|
||||
specify lookaround assertions. Perl 5.28 introduced some experimental
|
||||
alphabetic alternatives which might be easier to remember. They all start with
|
||||
(* instead of (? and must be written using lower case letters. PCRE2 supports
|
||||
the following synonyms:
|
||||
.sp
|
||||
(*positive_lookahead: or (*pla: is the same as (?=
|
||||
(*negative_lookahead: or (*nla: is the same as (?!
|
||||
|
@ -2616,6 +2623,63 @@ is another pattern that matches "foo" preceded by three digits and any three
|
|||
characters that are not "999".
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="nonatomicassertions"></a>
|
||||
.SH "NON-ATOMIC ASSERTIONS"
|
||||
.rs
|
||||
.sp
|
||||
The traditional Perl-compatible lookaround assertions are atomic. That is, if
|
||||
an assertion is true, but there is a subsequent matching failure, there is no
|
||||
backtracking into the assertion. However, there are some cases where non-atomic
|
||||
positive assertions can be useful. PCRE2 provides these using the following
|
||||
syntax:
|
||||
.sp
|
||||
(*non_atomic_positive_lookahead: or (*napla:
|
||||
(*non_atomic_positive_lookbehind: or (*naplb:
|
||||
.sp
|
||||
Consider the problem of finding the right-most word in a string that also
|
||||
appears earlier in the string, that is, it must appear at least twice in total.
|
||||
This pattern returns the required result as captured substring 1:
|
||||
.sp
|
||||
^(?x)(*napla: .* \eb(\ew++)) (?> .*? \eb\e1\eb ){2}
|
||||
.sp
|
||||
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
|
||||
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
|
||||
"x" option, which causes white space (introduced for readability) to be
|
||||
ignored. Inside the assertion, the greedy .* at first consumes the entire
|
||||
string, but then has to backtrack until the rest of the assertion can match a
|
||||
word, which is captured by group 1. In other words, when the assertion first
|
||||
succeeds, it captures the right-most word in the string.
|
||||
.P
|
||||
The current matching point is then reset to the start of the subject, and the
|
||||
rest of the pattern match checks for two occurrences of the captured word,
|
||||
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
|
||||
if the last word in the string does not occur twice, this part of the pattern
|
||||
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
|
||||
assertion could not be re-entered, and the whole match would fail. The pattern
|
||||
would succeed only if the very last word in the subject was found twice.
|
||||
.P
|
||||
Using a non-atomic lookahead, however, means that when the last word does not
|
||||
occur twice in the string, the lookahead can backtrack and find the second-last
|
||||
word, and so on, until either the match succeeds, or all words have been
|
||||
tested.
|
||||
.P
|
||||
Two conditions must be met for a non-atomic assertion to be useful: the
|
||||
contents of one or more capturing groups must change after a backtrack into the
|
||||
assertion, and there must be a backreference to a changed group later in the
|
||||
pattern. If this is not the case, the rest of the pattern match fails exactly
|
||||
as before because nothing has changed, so using a non-atomic assertion just
|
||||
wastes resources.
|
||||
.P
|
||||
Non-atomic assertions are not supported by the alternative matching function
|
||||
\fBpcre2_dfa_match()\fP. They are also not supported by JIT (but may be in
|
||||
future). Note that assertions that appear as conditions for
|
||||
.\" HTML <a href="#conditions">
|
||||
.\" </a>
|
||||
conditional groups
|
||||
.\"
|
||||
(see below) must be atomic.
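
The search that the (*napla: example performs can be restated in plain C. The sketch below is not how PCRE2 matches the pattern; it only reproduces the result described above (the right-most word that occurs at least twice), assuming a single-line subject and ASCII word characters. The function names are hypothetical and are not part of PCRE2.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A word character in the sense of \w with the default ASCII tables. */
static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Count matches of \b word \b in s, where word consists only of word
   characters and has length len. */
static int count_whole_word(const char *s, const char *word, size_t len)
{
    int count = 0;
    size_t n = strlen(s);
    for (size_t i = 0; i + len <= n; i++) {
        int left_ok = (i == 0) || !is_word_char(s[i - 1]);
        int right_ok = (i + len == n) || !is_word_char(s[i + len]);
        if (left_ok && right_ok && memcmp(s + i, word, len) == 0) count++;
    }
    return count;
}

/* Hypothetical helper: copy the right-most word that occurs at least twice
   in s into out. Returns 1 on success, 0 if there is no such word or if out
   is too small. Words are tried from right to left, which mirrors the order
   in which backtracking into the assertion offers candidates. */
int rightmost_repeated_word(const char *s, char *out, size_t outsize)
{
    size_t end = strlen(s);
    while (end > 0) {
        /* Step back over non-word characters to the end of a word. */
        while (end > 0 && !is_word_char(s[end - 1])) end--;
        if (end == 0) break;
        /* Step back to the start of that word. */
        size_t start = end;
        while (start > 0 && is_word_char(s[start - 1])) start--;
        size_t len = end - start;
        if (count_whole_word(s, s + start, len) >= 2) {
            if (len + 1 > outsize) return 0;
            memcpy(out, s + start, len);
            out[len] = '\0';
            return 1;
        }
        end = start;   /* this word is not repeated: try the one before it */
    }
    return 0;
}
```

Each pass of the outer loop corresponds to one backtrack into the non-atomic lookahead: a new candidate word is captured and the rest of the pattern is tried again with it.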
|
||||
.
|
||||
.
|
||||
.SH "SCRIPT RUNS"
|
||||
.rs
|
||||
.sp
|
||||
|
@ -2867,8 +2931,15 @@ than two digits.
|
|||
.sp
|
||||
If the condition is not in any of the above formats, it must be a parenthesized
|
||||
assertion. This may be a positive or negative lookahead or lookbehind
|
||||
assertion. Consider this pattern, again containing non-significant white space,
|
||||
and with the two alternatives on the second line:
|
||||
assertion. However, it must be a traditional atomic assertion, not one of the
|
||||
PCRE2-specific
|
||||
.\" HTML <a href="#nonatomicassertions">
|
||||
.\" </a>
|
||||
non-atomic assertions.
|
||||
.\"
|
||||
.P
|
||||
Consider this pattern, again containing non-significant white space, and with
|
||||
the two alternatives on the second line:
|
||||
.sp
|
||||
(?(?=[^a-z]*[a-z])
|
||||
\ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
|
||||
|
@ -3788,6 +3859,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 22 June 2019
|
||||
Last updated: 13 July 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "11 February 2019" "PCRE2 10.33"
|
||||
.TH PCRE2SYNTAX 3 "12 July 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -522,6 +522,18 @@ setting with a similar syntax.
|
|||
Each top-level branch of a lookbehind must be of a fixed length.
|
||||
.
|
||||
.
|
||||
.SH "NON-ATOMIC LOOKAROUND ASSERTIONS"
|
||||
.rs
|
||||
.sp
|
||||
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||
.sp
|
||||
(*napla:...)
|
||||
(*non_atomic_positive_lookahead:...)
|
||||
.sp
|
||||
(*naplb:...)
|
||||
(*non_atomic_positive_lookbehind:...)
|
||||
.
|
||||
.
|
||||
.SH "SCRIPT RUNS"
|
||||
.rs
|
||||
.sp
|
||||
|
@ -654,6 +666,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 11 February 2019
|
||||
Last updated: 12 July 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -307,6 +307,7 @@ pcre2_pattern_convert(). */
|
|||
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
|
||||
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
|
||||
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197
|
||||
#define PCRE2_ERROR_CONDITION_ATOMIC_ASSERTION_EXPECTED 198
|
||||
|
||||
|
||||
/* "Expected" matching error codes: no match and partial match. */
|
||||
|
|
|
@ -624,6 +624,13 @@ for(;;)
|
|||
case OP_ASSERTBACK_NOT:
|
||||
case OP_ONCE:
|
||||
return !entered_a_group;
|
||||
|
||||
/* Non-atomic assertions - don't possessify last iterator. This needs
|
||||
more thought. */
|
||||
|
||||
case OP_ASSERT_NA:
|
||||
case OP_ASSERTBACK_NA:
|
||||
return FALSE;
|
||||
}
|
||||
|
||||
/* Skip over the bracket and inspect what comes next. */
|
||||
|
|
|
@ -250,36 +250,41 @@ is present where expected in a conditional group. */
|
|||
#define META_LOOKBEHIND 0x80250000u /* (?<= */
|
||||
#define META_LOOKBEHINDNOT 0x80260000u /* (?<! */
|
||||
|
||||
/* These cannot be conditions */
|
||||
|
||||
#define META_LOOKAHEAD_NA 0x80270000u /* (*napla: */
|
||||
#define META_LOOKBEHIND_NA 0x80280000u /* (*naplb: */
|
||||
|
||||
/* These must be kept in this order, with consecutive values, and the _ARG
|
||||
versions of COMMIT, PRUNE, SKIP, and THEN immediately after their non-argument
|
||||
versions. */
|
||||
|
||||
#define META_MARK 0x80270000u /* (*MARK) */
|
||||
#define META_ACCEPT 0x80280000u /* (*ACCEPT) */
|
||||
#define META_FAIL 0x80290000u /* (*FAIL) */
|
||||
#define META_COMMIT 0x802a0000u /* These */
|
||||
#define META_COMMIT_ARG 0x802b0000u /* pairs */
|
||||
#define META_PRUNE 0x802c0000u /* must */
|
||||
#define META_PRUNE_ARG 0x802d0000u /* be */
|
||||
#define META_SKIP 0x802e0000u /* kept */
|
||||
#define META_SKIP_ARG 0x802f0000u /* in */
|
||||
#define META_THEN 0x80300000u /* this */
|
||||
#define META_THEN_ARG 0x80310000u /* order */
|
||||
#define META_MARK 0x80290000u /* (*MARK) */
|
||||
#define META_ACCEPT 0x802a0000u /* (*ACCEPT) */
|
||||
#define META_FAIL 0x802b0000u /* (*FAIL) */
|
||||
#define META_COMMIT 0x802c0000u /* These */
|
||||
#define META_COMMIT_ARG 0x802d0000u /* pairs */
|
||||
#define META_PRUNE 0x802e0000u /* must */
|
||||
#define META_PRUNE_ARG 0x802f0000u /* be */
|
||||
#define META_SKIP 0x80300000u /* kept */
|
||||
#define META_SKIP_ARG 0x80310000u /* in */
|
||||
#define META_THEN 0x80320000u /* this */
|
||||
#define META_THEN_ARG 0x80330000u /* order */
|
||||
|
||||
/* These must be kept in groups of adjacent 3 values, and all together. */
|
||||
|
||||
#define META_ASTERISK 0x80320000u /* * */
|
||||
#define META_ASTERISK_PLUS 0x80330000u /* *+ */
|
||||
#define META_ASTERISK_QUERY 0x80340000u /* *? */
|
||||
#define META_PLUS 0x80350000u /* + */
|
||||
#define META_PLUS_PLUS 0x80360000u /* ++ */
|
||||
#define META_PLUS_QUERY 0x80370000u /* +? */
|
||||
#define META_QUERY 0x80380000u /* ? */
|
||||
#define META_QUERY_PLUS 0x80390000u /* ?+ */
|
||||
#define META_QUERY_QUERY 0x803a0000u /* ?? */
|
||||
#define META_MINMAX 0x803b0000u /* {n,m} repeat */
|
||||
#define META_MINMAX_PLUS 0x803c0000u /* {n,m}+ repeat */
|
||||
#define META_MINMAX_QUERY 0x803d0000u /* {n,m}? repeat */
|
||||
#define META_ASTERISK 0x80340000u /* * */
|
||||
#define META_ASTERISK_PLUS 0x80350000u /* *+ */
|
||||
#define META_ASTERISK_QUERY 0x80360000u /* *? */
|
||||
#define META_PLUS 0x80370000u /* + */
|
||||
#define META_PLUS_PLUS 0x80380000u /* ++ */
|
||||
#define META_PLUS_QUERY 0x80390000u /* +? */
|
||||
#define META_QUERY 0x803a0000u /* ? */
|
||||
#define META_QUERY_PLUS 0x803b0000u /* ?+ */
|
||||
#define META_QUERY_QUERY 0x803c0000u /* ?? */
|
||||
#define META_MINMAX 0x803d0000u /* {n,m} repeat */
|
||||
#define META_MINMAX_PLUS 0x803e0000u /* {n,m}+ repeat */
|
||||
#define META_MINMAX_QUERY 0x803f0000u /* {n,m}? repeat */
|
||||
|
||||
#define META_FIRST_QUANTIFIER META_ASTERISK
|
||||
#define META_LAST_QUANTIFIER META_MINMAX_QUERY
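
The ordering comments above matter because related META codes can be reached by simple arithmetic on the upper 16 bits. The sketch below illustrates that idea using the renumbered values from this hunk. META_STEP, with_arg() and with_suffix() are hypothetical names introduced for illustration, and the arithmetic is an assumption about how such adjacency can be exploited, not a copy of the PCRE2 parser.

```c
#include <stdint.h>

/* Values copied from the updated definitions in this hunk. */
#define META_COMMIT        0x802c0000u
#define META_COMMIT_ARG    0x802d0000u
#define META_THEN          0x80320000u
#define META_THEN_ARG      0x80330000u
#define META_ASTERISK      0x80340000u
#define META_PLUS          0x80370000u
#define META_PLUS_PLUS     0x80380000u
#define META_PLUS_QUERY    0x80390000u
#define META_MINMAX        0x803d0000u
#define META_MINMAX_PLUS   0x803e0000u
#define META_MINMAX_QUERY  0x803f0000u

/* Hypothetical name: the distance between two consecutive META codes. */
#define META_STEP          0x00010000u

/* The _ARG form of a verb is the next code after the plain form. */
uint32_t with_arg(uint32_t verb)
{
    return verb + META_STEP;
}

/* Quantifiers come in groups of three: greedy, possessive (+), lazy (?). */
uint32_t with_suffix(uint32_t quantifier, char suffix)
{
    if (suffix == '+') return quantifier + META_STEP;
    if (suffix == '?') return quantifier + 2 * META_STEP;
    return quantifier;
}
```

Inserting the two new lookaround codes shifts every later value by 0x00020000, which keeps these relative distances intact.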
|
||||
|
@ -335,6 +340,8 @@ static unsigned char meta_extra_lengths[] = {
|
|||
0, /* META_LOOKAHEADNOT */
|
||||
SIZEOFFSET, /* META_LOOKBEHIND */
|
||||
SIZEOFFSET, /* META_LOOKBEHINDNOT */
|
||||
0, /* META_LOOKAHEAD_NA */
|
||||
SIZEOFFSET, /* META_LOOKBEHIND_NA */
|
||||
1, /* META_MARK - plus the string length */
|
||||
0, /* META_ACCEPT */
|
||||
0, /* META_FAIL */
|
||||
|
@ -637,10 +644,14 @@ typedef struct alasitem {
|
|||
static const char alasnames[] =
|
||||
STRING_pla0
|
||||
STRING_plb0
|
||||
STRING_napla0
|
||||
STRING_naplb0
|
||||
STRING_nla0
|
||||
STRING_nlb0
|
||||
STRING_positive_lookahead0
|
||||
STRING_positive_lookbehind0
|
||||
STRING_non_atomic_positive_lookahead0
|
||||
STRING_non_atomic_positive_lookbehind0
|
||||
STRING_negative_lookahead0
|
||||
STRING_negative_lookbehind0
|
||||
STRING_atomic0
|
||||
|
@ -652,10 +663,14 @@ static const char alasnames[] =
|
|||
static const alasitem alasmeta[] = {
|
||||
{ 3, META_LOOKAHEAD },
|
||||
{ 3, META_LOOKBEHIND },
|
||||
{ 5, META_LOOKAHEAD_NA },
|
||||
{ 5, META_LOOKBEHIND_NA },
|
||||
{ 3, META_LOOKAHEADNOT },
|
||||
{ 3, META_LOOKBEHINDNOT },
|
||||
{ 18, META_LOOKAHEAD },
|
||||
{ 19, META_LOOKBEHIND },
|
||||
{ 29, META_LOOKAHEAD_NA },
|
||||
{ 30, META_LOOKBEHIND_NA },
|
||||
{ 18, META_LOOKAHEADNOT },
|
||||
{ 19, META_LOOKBEHINDNOT },
|
||||
{ 6, META_ATOMIC },
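
The name list and the length table have to stay in step, because a name is located by walking both in parallel. The sketch below shows one way such a lookup can work, using only the entries visible in the two hunks above. The enum, the struct and lookup_alas() are hypothetical stand-ins, not the PCRE2 definitions, and the real tables contain further entries that are not shown in this diff.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the META codes in the table. */
enum alas_kind {
    LOOKAHEAD, LOOKBEHIND, LOOKAHEAD_NA, LOOKBEHIND_NA,
    LOOKAHEADNOT, LOOKBEHINDNOT, ATOMIC, ALAS_UNKNOWN
};

/* NUL-separated names, in the same order as the list in the diff. */
static const char names[] =
    "pla\0" "plb\0" "napla\0" "naplb\0" "nla\0" "nlb\0"
    "positive_lookahead\0" "positive_lookbehind\0"
    "non_atomic_positive_lookahead\0" "non_atomic_positive_lookbehind\0"
    "negative_lookahead\0" "negative_lookbehind\0"
    "atomic\0";

struct item { size_t len; enum alas_kind kind; };

/* Lengths and codes, parallel to the name list. */
static const struct item items[] = {
    {  3, LOOKAHEAD },    {  3, LOOKBEHIND },
    {  5, LOOKAHEAD_NA }, {  5, LOOKBEHIND_NA },
    {  3, LOOKAHEADNOT }, {  3, LOOKBEHINDNOT },
    { 18, LOOKAHEAD },    { 19, LOOKBEHIND },
    { 29, LOOKAHEAD_NA }, { 30, LOOKBEHIND_NA },
    { 18, LOOKAHEADNOT }, { 19, LOOKBEHINDNOT },
    {  6, ATOMIC }
};

/* Look up a name of the given length; name need not be NUL-terminated. */
enum alas_kind lookup_alas(const char *name, size_t len)
{
    const char *p = names;
    for (size_t i = 0; i < sizeof items / sizeof items[0]; i++) {
        if (len == items[i].len && memcmp(name, p, len) == 0)
            return items[i].kind;
        p += items[i].len + 1;   /* skip this name and its terminating NUL */
    }
    return ALAS_UNKNOWN;
}
```

A wrong length in the table would misalign every later name, which is why the new entries carry 5, 5, 29 and 30, the exact lengths of the four new names.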
|
||||
|
@ -784,7 +799,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
|
|||
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
||||
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
||||
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
||||
ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97 };
|
||||
ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97, ERR98 };
|
||||
|
||||
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
||||
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
||||
|
@ -1015,6 +1030,7 @@ for (;;)
|
|||
case META_NOCAPTURE: fprintf(stderr, "META (?:"); break;
|
||||
case META_LOOKAHEAD: fprintf(stderr, "META (?="); break;
|
||||
case META_LOOKAHEADNOT: fprintf(stderr, "META (?!"); break;
|
||||
case META_LOOKAHEAD_NA: fprintf(stderr, "META (*napla:"); break;
|
||||
case META_SCRIPT_RUN: fprintf(stderr, "META (*sr:"); break;
|
||||
case META_KET: fprintf(stderr, "META )"); break;
|
||||
case META_ALT: fprintf(stderr, "META | %d", meta_arg); break;
|
||||
|
@ -1046,6 +1062,12 @@ for (;;)
|
|||
fprintf(stderr, "%zd", offset);
|
||||
break;
|
||||
|
||||
case META_LOOKBEHIND_NA:
|
||||
fprintf(stderr, "META (*naplb: %d offset=", meta_arg);
|
||||
GETOFFSET(offset, pptr);
|
||||
fprintf(stderr, "%zd", offset);
|
||||
break;
|
||||
|
||||
case META_LOOKBEHINDNOT:
|
||||
fprintf(stderr, "META (?<! %d offset=", meta_arg);
|
||||
GETOFFSET(offset, pptr);
|
||||
|
@ -3695,19 +3717,20 @@ while (ptr < ptrend)
|
|||
goto FAILED;
|
||||
}
|
||||
|
||||
/* Check for expecting an assertion condition. If so, only lookaround
|
||||
assertions are valid. */
|
||||
/* Check for expecting an assertion condition. If so, only atomic
|
||||
lookaround assertions are valid. */
|
||||
|
||||
meta = alasmeta[i].meta;
|
||||
if (prev_expect_cond_assert > 0 &&
|
||||
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
|
||||
{
|
||||
errorcode = ERR28; /* Assertion expected */
|
||||
errorcode = (meta == META_LOOKAHEAD_NA || meta == META_LOOKBEHIND_NA)?
|
||||
ERR98 : ERR28; /* (Atomic) assertion expected */
|
||||
goto FAILED;
|
||||
}
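
The range test above rejects the non-atomic codes because they are numbered just past META_LOOKBEHINDNOT. The sketch below isolates that decision as a standalone function. The values of META_LOOKAHEAD and META_LOOKAHEADNOT are assumptions (they are not shown in this diff), the numeric error values are illustrative, and condition_error() is a hypothetical helper, not a PCRE2 function.

```c
#include <stdint.h>

#define META_LOOKAHEAD      0x80230000u  /* assumed value, not in this diff */
#define META_LOOKAHEADNOT   0x80240000u  /* assumed value, not in this diff */
#define META_LOOKBEHIND     0x80250000u
#define META_LOOKBEHINDNOT  0x80260000u
#define META_LOOKAHEAD_NA   0x80270000u
#define META_LOOKBEHIND_NA  0x80280000u

/* Illustrative error values for this sketch only. */
enum { NO_ERROR = 0, ERR28 = 128, ERR98 = 198 };

/* Return the error to raise when a condition is expected and meta is what
   was found, or NO_ERROR when meta is acceptable or no condition is pending. */
int condition_error(uint32_t meta, int expect_cond_assert)
{
    if (expect_cond_assert > 0 &&
        (meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
    {
        /* Non-atomic assertions get the more specific message. */
        return (meta == META_LOOKAHEAD_NA || meta == META_LOOKBEHIND_NA) ?
            ERR98 : ERR28;
    }
    return NO_ERROR;
}
```

Only the four traditional lookaround codes fall inside the accepted range, so a pattern such as (?(*napla:...)...) is expected to fail with the "atomic assertion expected" error rather than the generic one.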
|
||||
|
||||
/* The lookaround alphabetic synonyms can be almost entirely handled by
|
||||
jumping to the code that handles the traditional symbolic forms. */
|
||||
/* The lookaround alphabetic synonyms can mostly be handled by jumping
|
||||
to the code that handles the traditional symbolic forms. */
|
||||
|
||||
switch(meta)
|
||||
{
|
||||
|
@ -3721,11 +3744,17 @@ while (ptr < ptrend)
|
|||
case META_LOOKAHEAD:
|
||||
goto POSITIVE_LOOK_AHEAD;
|
||||
|
||||
case META_LOOKAHEAD_NA:
|
||||
*parsed_pattern++ = meta;
|
||||
ptr++;
|
||||
goto POST_ASSERTION;
|
||||
|
||||
case META_LOOKAHEADNOT:
|
||||
goto NEGATIVE_LOOK_AHEAD;
|
||||
|
||||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
*parsed_pattern++ = meta;
|
||||
ptr--;
|
||||
goto POST_LOOKBEHIND;
|
||||
|
@ -4429,7 +4458,7 @@ while (ptr < ptrend)
|
|||
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
|
||||
META_LOOKBEHIND : META_LOOKBEHINDNOT;
|
||||
|
||||
POST_LOOKBEHIND: /* Come from (*plb: and (*nlb: */
|
||||
POST_LOOKBEHIND: /* Come from (*plb: (*naplb: and (*nlb: */
|
||||
*has_lookbehind = TRUE;
|
||||
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
|
||||
PUTOFFSET(offset, parsed_pattern);
|
||||
|
@ -6300,6 +6329,11 @@ for (;; pptr++)
|
|||
cb->assert_depth += 1;
|
||||
goto GROUP_PROCESS;
|
||||
|
||||
case META_LOOKAHEAD_NA:
|
||||
bravalue = OP_ASSERT_NA;
|
||||
cb->assert_depth += 1;
|
||||
goto GROUP_PROCESS;
|
||||
|
||||
/* Optimize (?!) to (*FAIL) unless it is quantified - which is a weird
|
||||
thing to do, but Perl allows all assertions to be quantified, and when
|
||||
they contain capturing parentheses there may be a potential use for
|
||||
|
@ -6331,6 +6365,11 @@ for (;; pptr++)
|
|||
cb->assert_depth += 1;
|
||||
goto GROUP_PROCESS;
|
||||
|
||||
case META_LOOKBEHIND_NA:
|
||||
bravalue = OP_ASSERTBACK_NA;
|
||||
cb->assert_depth += 1;
|
||||
goto GROUP_PROCESS;
|
||||
|
||||
case META_ATOMIC:
|
||||
bravalue = OP_ONCE;
|
||||
goto GROUP_PROCESS_NOTE_EMPTY;
|
||||
|
@ -7931,7 +7970,10 @@ length = 2 + 2*LINK_SIZE + skipunits;
|
|||
/* Remember if this is a lookbehind assertion, and if it is, save its length
|
||||
and skip over the pattern offset. */
|
||||
|
||||
lookbehind = *code == OP_ASSERTBACK || *code == OP_ASSERTBACK_NOT;
|
||||
lookbehind = *code == OP_ASSERTBACK ||
|
||||
*code == OP_ASSERTBACK_NOT ||
|
||||
*code == OP_ASSERTBACK_NA;
|
||||
|
||||
if (lookbehind)
|
||||
{
|
||||
lookbehindlength = META_DATA(pptr[-1]);
|
||||
|
@ -8802,8 +8844,10 @@ for (;; pptr++)
|
|||
case META_COND_VERSION:
|
||||
case META_LOOKAHEAD:
|
||||
case META_LOOKAHEADNOT:
|
||||
case META_LOOKAHEAD_NA:
|
||||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
case META_NOCAPTURE:
|
||||
case META_SCRIPT_RUN:
|
||||
nestlevel++;
|
||||
|
@ -9064,6 +9108,7 @@ for (;; pptr++)
|
|||
|
||||
case META_LOOKAHEAD:
|
||||
case META_LOOKAHEADNOT:
|
||||
case META_LOOKAHEAD_NA:
|
||||
pptr = parsed_skip(pptr + 1, PSKIP_KET);
|
||||
if (pptr == NULL) goto PARSED_SKIP_FAILED;
|
||||
|
||||
|
@ -9102,6 +9147,7 @@ for (;; pptr++)
|
|||
|
||||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
|
||||
return -1;
|
||||
if (max - branchlength > extra) extra = max - branchlength;
|
||||
|
@ -9453,6 +9499,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
|
|||
case META_KET:
|
||||
case META_LOOKAHEAD:
|
||||
case META_LOOKAHEADNOT:
|
||||
case META_LOOKAHEAD_NA:
|
||||
case META_NOCAPTURE:
|
||||
case META_PLUS:
|
||||
case META_PLUS_PLUS:
|
||||
|
@ -9514,6 +9561,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
|
|||
|
||||
case META_LOOKBEHIND:
|
||||
case META_LOOKBEHINDNOT:
|
||||
case META_LOOKBEHIND_NA:
|
||||
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
|
||||
return errorcode;
|
||||
break;
|
||||
|
|
|
@ -173,6 +173,8 @@ static const uint8_t coptable[] = {
|
|||
0, /* Assert not */
|
||||
0, /* Assert behind */
|
||||
0, /* Assert behind not */
|
||||
0, /* NA assert */
|
||||
0, /* NA assert behind */
|
||||
0, /* ONCE */
|
||||
0, /* SCRIPT_RUN */
|
||||
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
|
||||
|
@ -248,6 +250,8 @@ static const uint8_t poptable[] = {
|
|||
0, /* Assert not */
|
||||
0, /* Assert behind */
|
||||
0, /* Assert behind not */
|
||||
0, /* NA assert */
|
||||
0, /* NA assert behind */
|
||||
0, /* ONCE */
|
||||
0, /* SCRIPT_RUN */
|
||||
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
|
||||
|
|
|
@ -185,6 +185,7 @@ static const unsigned char compile_error_texts[] =
|
|||
"(*alpha_assertion) not recognized\0"
|
||||
"script runs require Unicode support, which this version of PCRE2 does not have\0"
|
||||
"too many capturing groups (maximum 65535)\0"
|
||||
"atomic assertion expected after (?( or (?(?C)\0"
|
||||
;
|
||||
|
||||
/* Match-time and UTF error texts are in the same format. */
|
||||
|
|
|
@ -883,12 +883,16 @@ a positive value. */
|
|||
#define STRING_atomic0 "atomic\0"
|
||||
#define STRING_pla0 "pla\0"
|
||||
#define STRING_plb0 "plb\0"
|
||||
#define STRING_napla0 "napla\0"
|
||||
#define STRING_naplb0 "naplb\0"
|
||||
#define STRING_nla0 "nla\0"
|
||||
#define STRING_nlb0 "nlb\0"
|
||||
#define STRING_sr0 "sr\0"
|
||||
#define STRING_asr0 "asr\0"
|
||||
#define STRING_positive_lookahead0 "positive_lookahead\0"
|
||||
#define STRING_positive_lookbehind0 "positive_lookbehind\0"
|
||||
#define STRING_non_atomic_positive_lookahead0 "non_atomic_positive_lookahead\0"
|
||||
#define STRING_non_atomic_positive_lookbehind0 "non_atomic_positive_lookbehind\0"
|
||||
#define STRING_negative_lookahead0 "negative_lookahead\0"
|
||||
#define STRING_negative_lookbehind0 "negative_lookbehind\0"
|
||||
#define STRING_script_run0 "script_run\0"
|
||||
|
@ -1173,12 +1177,16 @@ only. */
|
|||
#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
|
||||
#define STRING_pla0 STR_p STR_l STR_a "\0"
|
||||
#define STRING_plb0 STR_p STR_l STR_b "\0"
|
||||
#define STRING_napla0 STR_n STR_a STR_p STR_l STR_a "\0"
|
||||
#define STRING_naplb0 STR_n STR_a STR_p STR_l STR_b "\0"
|
||||
#define STRING_nla0 STR_n STR_l STR_a "\0"
|
||||
#define STRING_nlb0 STR_n STR_l STR_b "\0"
|
||||
#define STRING_sr0 STR_s STR_r "\0"
|
||||
#define STRING_asr0 STR_a STR_s STR_r "\0"
|
||||
#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
||||
#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
||||
#define STRING_non_atomic_positive_lookahead0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
||||
#define STRING_non_atomic_positive_lookbehind0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
||||
#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
||||
#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
||||
#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
|
||||
|
@ -1303,7 +1311,7 @@ enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
|
|||
Starting from 1 (i.e. after OP_END), the values up to OP_EOD must correspond in
|
||||
order to the list of escapes immediately above. Furthermore, values up to
|
||||
OP_DOLLM must not be changed without adjusting the table called autoposstab in
|
||||
pcre2_auto_possess.c
|
||||
pcre2_auto_possess.c.
|
||||
|
||||
Whenever this list is updated, the two macro definitions that follow must be
|
||||
updated to match. The possessification table called "opcode_possessify" in
|
||||
|
@ -1501,80 +1509,81 @@ enum {
|
|||
OP_KETRMIN, /* 123 order. They are for groups that repeat for ever. */
|
||||
OP_KETRPOS, /* 124 Possessive unlimited repeat. */
|
||||
|
||||
/* The assertions must come before BRA, CBRA, ONCE, and COND, and the four
|
||||
asserts must remain in order. */
|
||||
/* The assertions must come before BRA, CBRA, ONCE, and COND. */
|
||||
|
||||
OP_REVERSE, /* 125 Move pointer back - used in lookbehind assertions */
|
||||
OP_ASSERT, /* 126 Positive lookahead */
|
||||
OP_ASSERT_NOT, /* 127 Negative lookahead */
|
||||
OP_ASSERTBACK, /* 128 Positive lookbehind */
|
||||
OP_ASSERTBACK_NOT, /* 129 Negative lookbehind */
|
||||
OP_ASSERT_NA, /* 130 Positive non-atomic lookahead */
|
||||
OP_ASSERTBACK_NA, /* 131 Positive non-atomic lookbehind */
|
||||
|
||||
/* ONCE, SCRIPT_RUN, BRA, BRAPOS, CBRA, CBRAPOS, and COND must come
|
||||
immediately after the assertions, with ONCE first, as there's a test for >=
|
||||
ONCE for a subpattern that isn't an assertion. The POS versions must
|
||||
immediately follow the non-POS versions in each case. */
|
||||
|
||||
OP_ONCE, /* 130 Atomic group, contains captures */
|
||||
OP_SCRIPT_RUN, /* 131 Non-capture, but check characters' scripts */
|
||||
OP_BRA, /* 132 Start of non-capturing bracket */
|
||||
OP_BRAPOS, /* 133 Ditto, with unlimited, possessive repeat */
|
||||
OP_CBRA, /* 134 Start of capturing bracket */
|
||||
OP_CBRAPOS, /* 135 Ditto, with unlimited, possessive repeat */
|
||||
OP_COND, /* 136 Conditional group */
|
||||
OP_ONCE, /* 132 Atomic group, contains captures */
|
||||
OP_SCRIPT_RUN, /* 133 Non-capture, but check characters' scripts */
|
||||
OP_BRA, /* 134 Start of non-capturing bracket */
|
||||
OP_BRAPOS, /* 135 Ditto, with unlimited, possessive repeat */
|
||||
OP_CBRA, /* 136 Start of capturing bracket */
|
||||
OP_CBRAPOS, /* 137 Ditto, with unlimited, possessive repeat */
|
||||
OP_COND, /* 138 Conditional group */
|
||||
|
||||
/* These five must follow the previous five, in the same order. There's a
|
||||
check for >= SBRA to distinguish the two sets. */
|
||||
|
||||
OP_SBRA, /* 137 Start of non-capturing bracket, check empty */
|
||||
OP_SBRAPOS, /* 138 Ditto, with unlimited, possessive repeat */
|
||||
OP_SCBRA, /* 139 Start of capturing bracket, check empty */
|
||||
OP_SCBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */
|
||||
OP_SCOND, /* 141 Conditional group, check empty */
|
||||
OP_SBRA, /* 139 Start of non-capturing bracket, check empty */
|
||||
OP_SBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */
|
||||
OP_SCBRA, /* 141 Start of capturing bracket, check empty */
|
||||
OP_SCBRAPOS, /* 142 Ditto, with unlimited, possessive repeat */
|
||||
OP_SCOND, /* 143 Conditional group, check empty */
|
||||
|
||||
/* The next two pairs must (respectively) be kept together. */
|
||||
|
||||
OP_CREF, /* 142 Used to hold a capture number as condition */
|
||||
OP_DNCREF, /* 143 Used to point to duplicate names as a condition */
|
||||
OP_RREF, /* 144 Used to hold a recursion number as condition */
|
||||
OP_DNRREF, /* 145 Used to point to duplicate names as a condition */
|
||||
OP_FALSE, /* 146 Always false (used by DEFINE and VERSION) */
|
||||
OP_TRUE, /* 147 Always true (used by VERSION) */
|
||||
OP_CREF, /* 144 Used to hold a capture number as condition */
|
||||
OP_DNCREF, /* 145 Used to point to duplicate names as a condition */
|
||||
OP_RREF, /* 146 Used to hold a recursion number as condition */
|
||||
OP_DNRREF, /* 147 Used to point to duplicate names as a condition */
|
||||
OP_FALSE, /* 148 Always false (used by DEFINE and VERSION) */
|
||||
OP_TRUE, /* 149 Always true (used by VERSION) */
|
||||
|
||||
OP_BRAZERO, /* 148 These two must remain together and in this */
|
||||
OP_BRAMINZERO, /* 149 order. */
|
||||
OP_BRAPOSZERO, /* 150 */
|
||||
OP_BRAZERO, /* 150 These two must remain together and in this */
|
||||
OP_BRAMINZERO, /* 151 order. */
|
||||
OP_BRAPOSZERO, /* 152 */
|
||||
|
||||
/* These are backtracking control verbs */
|
||||
|
||||
OP_MARK, /* 151 always has an argument */
|
||||
OP_PRUNE, /* 152 */
|
||||
OP_PRUNE_ARG, /* 153 same, but with argument */
|
||||
OP_SKIP, /* 154 */
|
||||
OP_SKIP_ARG, /* 155 same, but with argument */
|
||||
OP_THEN, /* 156 */
|
||||
OP_THEN_ARG, /* 157 same, but with argument */
|
||||
OP_COMMIT, /* 158 */
|
||||
OP_COMMIT_ARG, /* 159 same, but with argument */
|
||||
OP_MARK, /* 153 always has an argument */
|
||||
OP_PRUNE, /* 154 */
|
||||
OP_PRUNE_ARG, /* 155 same, but with argument */
|
||||
OP_SKIP, /* 156 */
|
||||
OP_SKIP_ARG, /* 157 same, but with argument */
|
||||
OP_THEN, /* 158 */
|
||||
OP_THEN_ARG, /* 159 same, but with argument */
|
||||
OP_COMMIT, /* 160 */
|
||||
OP_COMMIT_ARG, /* 161 same, but with argument */
|
||||
|
||||
/* These are forced failure and success verbs. FAIL and ACCEPT do accept an
|
||||
argument, but these cases can be compiled as, for example, (*MARK:X)(*FAIL)
|
||||
without the need for a special opcode. */
|
||||
|
||||
OP_FAIL, /* 160 */
|
||||
OP_ACCEPT, /* 161 */
|
||||
OP_ASSERT_ACCEPT, /* 162 Used inside assertions */
|
||||
OP_CLOSE, /* 163 Used before OP_ACCEPT to close open captures */
|
||||
OP_FAIL, /* 162 */
|
||||
OP_ACCEPT, /* 163 */
|
||||
OP_ASSERT_ACCEPT, /* 164 Used inside assertions */
|
||||
OP_CLOSE, /* 165 Used before OP_ACCEPT to close open captures */
|
||||
|
||||
/* This is used to skip a subpattern with a {0} quantifier */
|
||||
|
||||
OP_SKIPZERO, /* 164 */
|
||||
OP_SKIPZERO, /* 166 */
|
||||
|
||||
/* This is used to identify a DEFINE group during compilation so that it can
|
||||
be checked for having only one branch. It is changed to OP_FALSE before
|
||||
compilation finishes. */
|
||||
|
||||
OP_DEFINE, /* 165 */
|
||||
OP_DEFINE, /* 167 */
|
||||
|
||||
/* This is not an opcode, but is used to check that tables indexed by opcode
|
||||
are the correct length, in order to catch updating errors - there have been
|
||||
|
@ -1587,7 +1596,7 @@ enum {
|
|||
/* *** NOTE NOTE NOTE *** Whenever the list above is updated, the two macro
|
||||
definitions that follow must also be updated to match. There are also tables
|
||||
called "opcode_possessify" in pcre2_compile.c and "coptable" and "poptable" in
|
||||
pcre2_dfa_exec.c that must be updated. */
|
||||
pcre2_dfa_match.c that must be updated. */
|
||||
|
||||
|
||||
/* This macro defines textual names for all the opcodes. These are used only
|
||||
|
@ -1620,7 +1629,9 @@ some cases doesn't actually use these names at all). */
|
|||
"class", "nclass", "xclass", "Ref", "Refi", "DnRef", "DnRefi", \
|
||||
"Recurse", "Callout", "CalloutStr", \
|
||||
"Alt", "Ket", "KetRmax", "KetRmin", "KetRpos", \
|
||||
"Reverse", "Assert", "Assert not", "AssertB", "AssertB not", \
|
||||
"Reverse", "Assert", "Assert not", \
|
||||
"Assert back", "Assert back not", \
|
||||
"Non-atomic assert", "Non-atomic assert back", \
|
||||
"Once", \
|
||||
"Script run", \
|
||||
"Bra", "BraPos", "CBra", "CBraPos", \
|
||||
|
@ -1705,6 +1716,8 @@ in UTF-8 mode. The code that uses this table must know about such things. */
|
|||
1+LINK_SIZE, /* Assert not */ \
|
||||
1+LINK_SIZE, /* Assert behind */ \
|
||||
1+LINK_SIZE, /* Assert behind not */ \
|
||||
1+LINK_SIZE, /* NA Assert */ \
|
||||
1+LINK_SIZE, /* NA Assert behind */ \
|
||||
1+LINK_SIZE, /* ONCE */ \
|
||||
1+LINK_SIZE, /* SCRIPT_RUN */ \
|
||||
1+LINK_SIZE, /* BRA */ \
|
||||
|
|
|
@ -5127,6 +5127,8 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
|
||||
case OP_ASSERT:
|
||||
case OP_ASSERTBACK:
|
||||
case OP_ASSERT_NA:
|
||||
case OP_ASSERTBACK_NA:
|
||||
Lframe_type = GF_NOCAPTURE | Fop;
|
||||
for (;;)
|
||||
{
|
||||
|
@ -5497,10 +5499,20 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
case OP_SCOND:
|
||||
break;
|
||||
|
||||
/* Positive assertions are like OP_ONCE, except that in addition the
|
||||
/* Non-atomic positive assertions are like OP_BRA, except that the
|
||||
subject pointer must be put back to where it was at the start of the
|
||||
assertion. */
|
||||
|
||||
case OP_ASSERT_NA:
|
||||
case OP_ASSERTBACK_NA:
|
||||
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
|
||||
Feptr = P->eptr;
|
||||
break;
|
||||
|
||||
/* Atomic positive assertions are like OP_ONCE, except that in addition
|
||||
the subject pointer must be put back to where it was at the start of the
|
||||
assertion. */
|
||||
|
||||
case OP_ASSERT:
|
||||
case OP_ASSERTBACK:
|
||||
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
|
||||
|
|
|
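The OP_ASSERT_NA / OP_ASSERTBACK_NA case in the hunk above does two things when a non-atomic positive assertion reaches its end: it records how far into the subject the assertion looked, and it puts the subject pointer back to where the assertion started. The following is a simplified sketch of just that step. The `toy_frame` type and function name are hypothetical stand-ins for the match frame and `mb` block; this is not the real interpreter loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the matcher's state. */
typedef struct {
  const char *eptr;           /* current position in the subject */
  const char *last_used_ptr;  /* furthest subject position inspected */
} toy_frame;

/* End of a positive assertion: remember the high-water mark, then
   restore the subject pointer to the assertion's starting position. */
static void toy_end_positive_assertion(toy_frame *f, const char *start_eptr)
{
  assert(f != NULL && start_eptr != NULL);
  if (f->eptr > f->last_used_ptr) f->last_used_ptr = f->eptr;
  f->eptr = start_eptr;
}
```

What the sketch leaves out is the difference the commit introduces: per the comments in the hunk, the atomic case is handled like OP_ONCE and the non-atomic case like OP_BRA, so only the non-atomic form can be re-entered on later backtracking.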
@ -392,6 +392,8 @@ for(;;)
|
|||
case OP_ASSERT_NOT:
|
||||
case OP_ASSERTBACK:
|
||||
case OP_ASSERTBACK_NOT:
|
||||
case OP_ASSERT_NA:
|
||||
case OP_ASSERTBACK_NA:
|
||||
case OP_ONCE:
|
||||
case OP_SCRIPT_RUN:
|
||||
case OP_COND:
|
||||
|
|
|
@ -240,6 +240,8 @@ for (;;)
|
|||
case OP_ASSERT_NOT:
|
||||
case OP_ASSERTBACK:
|
||||
case OP_ASSERTBACK_NOT:
|
||||
case OP_ASSERT_NA:
|
||||
case OP_ASSERTBACK_NA:
|
||||
do cc += GET(cc, 1); while (*cc == OP_ALT);
|
||||
/* Fall through */
|
||||
|
||||
|
@ -1089,6 +1091,7 @@ do
|
|||
case OP_ONCE:
|
||||
case OP_SCRIPT_RUN:
|
||||
case OP_ASSERT:
|
||||
case OP_ASSERT_NA:
|
||||
rc = set_start_bits(re, tcode, utf);
|
||||
if (rc == SSB_FAIL || rc == SSB_UNKNOWN) return rc;
|
||||
if (rc == SSB_DONE) try_next = FALSE; else
|
||||
|
@ -1131,6 +1134,7 @@ do
|
|||
case OP_ASSERT_NOT:
|
||||
case OP_ASSERTBACK:
|
||||
case OP_ASSERTBACK_NOT:
|
||||
case OP_ASSERTBACK_NA:
|
||||
do tcode += GET(tcode, 1); while (*tcode == OP_ALT);
|
||||
tcode += 1 + LINK_SIZE;
|
||||
break;
|
||||
|
|
|
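The `do tcode += GET(tcode, 1); while (*tcode == OP_ALT);` idiom in the hunk above steps over a whole assertion group by following the link stored after each branch opcode until the closing bracket is reached. A toy illustration follows, assuming a one-byte link field and invented opcode values (`TOY_*` names and the sample bytecode are hypothetical, not PCRE2's real encoding):

```c
#include <assert.h>
#include <stddef.h>

/* Invented opcode values for the sketch. */
enum { TOY_END, TOY_ASSERTBACK_NA, TOY_ALT, TOY_KET, TOY_CHAR };

#define TOY_LINK_SIZE 1
#define TOY_GET(p, n) ((p)[n])   /* read the link that follows an opcode */

/* Sample code for a group with two branches, followed by a literal. */
static const unsigned char toy_code[] = {
  TOY_ASSERTBACK_NA, 4,  /*  0: group start, link to the ALT at 4 */
  TOY_CHAR, 'a',         /*  2: first branch                      */
  TOY_ALT, 4,            /*  4: second branch, link to the KET    */
  TOY_CHAR, 'b',         /*  6                                    */
  TOY_KET, 8,            /*  8: closing bracket plus its link     */
  TOY_CHAR, 'c',         /* 10: first item after the group        */
  TOY_END                /* 12                                    */
};

/* Skip from a group's opening opcode to the item after its KET. */
static const unsigned char *toy_skip_group(const unsigned char *cc)
{
  assert(cc != NULL);
  do cc += TOY_GET(cc, 1); while (*cc == TOY_ALT);
  return cc + 1 + TOY_LINK_SIZE;
}
```

With the sample bytecode, `toy_skip_group(toy_code)` lands on the `TOY_CHAR` at index 10, which mirrors how the study code ignores the contents of lookbehinds and negative assertions when collecting starting code units.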
@ -5653,4 +5653,33 @@ a)"xI
|
|||
# Multiplication overflow
|
||||
/(X{65535})(?<=\1{32770})/
|
||||
|
||||
# ---- Non-atomic assertion tests ----
|
||||
|
||||
# Expect error: not allowed as a condition
|
||||
/(?(*napla:xx)bc)/
|
||||
|
||||
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||
|
||||
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||
|
||||
/(*plb:(.)..|(.)...)(\1|\2)/
|
||||
abcdb\=offset=4
|
||||
abcda\=offset=4
|
||||
|
||||
/(*naplb:(.)..|(.)...)(\1|\2)/
|
||||
abcdb\=offset=4
|
||||
abcda\=offset=4
|
||||
|
||||
/(*non_atomic_positive_lookahead:ab)/B
|
||||
|
||||
/(*non_atomic_positive_lookbehind:ab)/B
|
||||
|
||||
/(*pla:ab+)/B
|
||||
|
||||
/(*napla:ab+)/B
|
||||
|
||||
# ----
|
||||
|
||||
# End of testinput2
|
||||
|
|
|
@ -11117,7 +11117,7 @@ Matched, but too many substrings
|
|||
------------------------------------------------------------------
|
||||
Bra
|
||||
Brazero
|
||||
AssertB
|
||||
Assert back
|
||||
Reverse
|
||||
CBra 1
|
||||
abc
|
||||
|
@ -13346,7 +13346,7 @@ Failed: error 144 at offset 5: subpattern name must start with a non-digit
|
|||
Ket
|
||||
red
|
||||
\b
|
||||
AssertB
|
||||
Assert back
|
||||
Reverse
|
||||
\w
|
||||
Ket
|
||||
|
@ -13403,7 +13403,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
|
|||
Once
|
||||
\s*+
|
||||
Ket
|
||||
AssertB
|
||||
Assert back
|
||||
Reverse
|
||||
\w
|
||||
Ket
|
||||
|
@ -16619,7 +16619,7 @@ No match
|
|||
/(?<=(?=.){4,5}x)/B
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
AssertB
|
||||
Assert back
|
||||
Reverse
|
||||
Assert
|
||||
Any
|
||||
|
@ -17086,6 +17086,87 @@ Failed: error 187 at offset 15: lookbehind assertion is too long
|
|||
/(X{65535})(?<=\1{32770})/
|
||||
Failed: error 187 at offset 10: lookbehind assertion is too long
|
||||
|
||||
# ---- Non-atomic assertion tests ----
|
||||
|
||||
# Expect error: not allowed as a condition
|
||||
/(?(*napla:xx)bc)/
|
||||
Failed: error 198 at offset 9: atomic assertion expected after (?( or (?(?C)
|
||||
|
||||
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||
No match
|
||||
|
||||
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||
0: word1 word3 word1 word2 word3 word2 word2 word1 word3
|
||||
1: word3
|
||||
|
||||
/(*plb:(.)..|(.)...)(\1|\2)/
|
||||
abcdb\=offset=4
|
||||
0: b
|
||||
1: b
|
||||
2: <unset>
|
||||
3: b
|
||||
abcda\=offset=4
|
||||
No match
|
||||
|
||||
/(*naplb:(.)..|(.)...)(\1|\2)/
|
||||
abcdb\=offset=4
|
||||
0: b
|
||||
1: b
|
||||
2: <unset>
|
||||
3: b
|
||||
abcda\=offset=4
|
||||
0: a
|
||||
1: <unset>
|
||||
2: a
|
||||
3: a
|
||||
|
||||
/(*non_atomic_positive_lookahead:ab)/B
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
Non-atomic assert
|
||||
ab
|
||||
Ket
|
||||
Ket
|
||||
End
|
||||
------------------------------------------------------------------
|
||||
|
||||
/(*non_atomic_positive_lookbehind:ab)/B
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
Non-atomic assert back
|
||||
Reverse
|
||||
ab
|
||||
Ket
|
||||
Ket
|
||||
End
|
||||
------------------------------------------------------------------
|
||||
|
||||
/(*pla:ab+)/B
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
Assert
|
||||
a
|
||||
b++
|
||||
Ket
|
||||
Ket
|
||||
End
|
||||
------------------------------------------------------------------
|
||||
|
||||
/(*napla:ab+)/B
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
Non-atomic assert
|
||||
a
|
||||
b+
|
||||
Ket
|
||||
Ket
|
||||
End
|
||||
------------------------------------------------------------------
|
||||
|
||||
# ----
|
||||
|
||||
# End of testinput2
|
||||
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
||||
Error -62: bad serialized data
|
||||
|
|
|
@ -4017,7 +4017,7 @@ MK: a\x{12345}b\x{09}(d)c
|
|||
------------------------------------------------------------------
|
||||
Bra
|
||||
\b
|
||||
AssertB
|
||||
Assert back
|
||||
Reverse
|
||||
prop Xwd
|
||||
Ket
|
||||
|
@ -4196,7 +4196,7 @@ Failed: error 125 at offset 2: lookbehind assertion is not fixed length
|
|||
------------------------------------------------------------------
|
||||
Bra
|
||||
^
|
||||
AssertB not
|
||||
Assert back not
|
||||
Assert
|
||||
\x{10385c}
|
||||
Ket
|
||||
|
@ -4828,7 +4828,7 @@ MK: ABC
|
|||
/(?<!)(*sr:)/B
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
AssertB not
|
||||
Assert back not
|
||||
Ket
|
||||
Script run
|
||||
Ket
|
||||
|
@ -4839,7 +4839,7 @@ MK: ABC
|
|||
/(?<=abc(?=X(*sr:BXY)CCC)XBXYCCC)./B
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
AssertB
|
||||
Assert back
|
||||
Reverse
|
||||
abc
|
||||
Assert
|
||||
|
|