Implement non-atomic positive assertions.
parent 691aca7a86
commit 620f3a1307
|
@ -88,6 +88,8 @@ otherwise), an atomic group, or a recursion.
|
||||||
17. Check for integer overflow when computing lookbehind lengths. Fixes
|
17. Check for integer overflow when computing lookbehind lengths. Fixes
|
||||||
Clusterfuzz issue 15636.
|
Clusterfuzz issue 15636.
|
||||||
|
|
||||||
|
18. Implement non-atomic positive lookaround assertions.
|
||||||
|
|
||||||
|
|
||||||
Version 10.33 16-April-2019
|
Version 10.33 16-April-2019
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
34
HACKING
|
@ -195,6 +195,7 @@ META_END End of pattern (this value is 0x80000000)
|
||||||
META_FAIL (*FAIL)
|
META_FAIL (*FAIL)
|
||||||
META_KET ) closing parenthesis
|
META_KET ) closing parenthesis
|
||||||
META_LOOKAHEAD (?= start of lookahead
|
META_LOOKAHEAD (?= start of lookahead
|
||||||
|
META_LOOKAHEAD_NA (*napla: start of non-atomic lookahead
|
||||||
META_LOOKAHEADNOT (?! start of negative lookahead
|
META_LOOKAHEADNOT (?! start of negative lookahead
|
||||||
META_NOCAPTURE (?: no capture parens
|
META_NOCAPTURE (?: no capture parens
|
||||||
META_PLUS +
|
META_PLUS +
|
||||||
|
@ -286,8 +287,9 @@ The following are also followed just by an offset, but also the lower 16 bits
|
||||||
of the main word contain the length of the first branch of the lookbehind
|
of the main word contain the length of the first branch of the lookbehind
|
||||||
group; this is used when generating OP_REVERSE for that branch.
|
group; this is used when generating OP_REVERSE for that branch.
|
||||||
|
|
||||||
META_LOOKBEHIND (?<=
|
META_LOOKBEHIND (?<= start of lookbehind
|
||||||
META_LOOKBEHINDNOT (?<!
|
META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
|
||||||
|
META_LOOKBEHINDNOT (?<! start of negative lookbehind
|
||||||
|
|
||||||
The following are followed by two elements, the minimum and maximum. Repeat
|
The following are followed by two elements, the minimum and maximum. Repeat
|
||||||
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
||||||
|
@ -715,13 +717,15 @@ Assertions
|
||||||
----------
|
----------
|
||||||
|
|
||||||
Forward assertions are also just like other subpatterns, but starting with one
|
Forward assertions are also just like other subpatterns, but starting with one
|
||||||
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
|
of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
|
||||||
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
|
OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK,
|
||||||
is OP_REVERSE, followed by a count of the number of characters to move back the
|
OP_ASSERTBACK_NA, and OP_ASSERTBACK_NOT, and the first opcode inside the
|
||||||
pointer in the subject string. In ASCII or UTF-32 mode, the count is also the
|
assertion is OP_REVERSE, followed by a count of the number of characters to
|
||||||
number of code units, but in UTF-8/16 mode each character may occupy more than
|
move back the pointer in the subject string. In ASCII or UTF-32 mode, the count
|
||||||
one code unit. A separate count is present in each alternative of a lookbehind
|
is also the number of code units, but in UTF-8/16 mode each character may
|
||||||
assertion, allowing them to have different (but fixed) lengths.
|
occupy more than one code unit. A separate count is present in each alternative
|
||||||
|
of a lookbehind assertion, allowing each branch to have a different (but fixed)
|
||||||
|
length.
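The distinction between characters and code units in the OP_REVERSE count can be sketched in Python. This is only an illustration, not PCRE2's code: the function name `move_back` and its convention of returning None when the subject is too short are assumptions made for this example. It steps a byte offset back by a number of characters in a UTF-8 subject by skipping continuation bytes (those of the form 10xxxxxx).

```python
def move_back(subject: bytes, pos: int, count: int):
    """Move `pos` back by `count` characters in UTF-8 encoded `subject`.

    Hypothetical helper: returns the new byte offset, or None if there
    are fewer than `count` characters before `pos`.
    """
    for _ in range(count):
        if pos == 0:
            return None  # not enough characters: a lookbehind would fail here
        pos -= 1
        # Skip UTF-8 continuation bytes to reach the start of the character.
        while pos > 0 and (subject[pos] & 0xC0) == 0x80:
            pos -= 1
    return pos


# "a" is 1 code unit, "é" is 2, "€" is 3: three characters, six code units.
SUBJECT = "aé€".encode("utf-8")
```

With this subject, moving back one character from the end crosses three code units, which is why a count of characters alone does not give the number of code units in UTF-8 mode.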
|
||||||
|
|
||||||
|
|
||||||
Conditional subpatterns
|
Conditional subpatterns
|
||||||
|
@ -754,11 +758,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
|
||||||
or OP_FALSE.
|
or OP_FALSE.
|
||||||
|
|
||||||
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
|
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
|
||||||
must start with a parenthesized assertion, whose opcode normally immediately
|
must start with a parenthesized atomic assertion, whose opcode normally
|
||||||
follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
|
immediately follows OP_COND or OP_SCOND. However, if automatic callouts are
|
||||||
callout is inserted immediately before the assertion. It is also possible to
|
enabled, a callout is inserted immediately before the assertion. It is also
|
||||||
insert a manual callout at this point. Only assertion conditions may have
|
possible to insert a manual callout at this point. Only assertion conditions
|
||||||
callouts preceding the condition.
|
may have callouts preceding the condition.
|
||||||
|
|
||||||
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
|
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
|
||||||
parts of the pattern, so this is another opcode that may appear as a condition.
|
parts of the pattern, so this is another opcode that may appear as a condition.
|
||||||
|
@ -823,4 +827,4 @@ not a real opcode, but is used to check at compile time that tables indexed by
|
||||||
opcode are the correct length, in order to catch updating errors.
|
opcode are the correct length, in order to catch updating errors.
|
||||||
|
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
20 July 2018
|
12 July 2019
|
||||||
|
|
|
@ -205,6 +205,11 @@ different way and is not Perl-compatible.
|
||||||
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
|
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
|
||||||
the start of a pattern that set overall options that cannot be changed within
|
the start of a pattern that set overall options that cannot be changed within
|
||||||
the pattern.
|
the pattern.
|
||||||
|
<br>
|
||||||
|
<br>
|
||||||
|
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
|
||||||
|
extension to the lookaround facilities. The default, Perl-compatible
|
||||||
|
lookarounds are atomic.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
|
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
|
||||||
|
@ -234,7 +239,7 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 12 February 2019
|
Last updated: 13 July 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -33,17 +33,18 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||||
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
|
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
|
||||||
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
||||||
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
|
<li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
|
||||||
<li><a name="TOC22" href="#SEC22">CONDITIONAL GROUPS</a>
|
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
|
||||||
<li><a name="TOC23" href="#SEC23">COMMENTS</a>
|
<li><a name="TOC23" href="#SEC23">CONDITIONAL GROUPS</a>
|
||||||
<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a>
|
<li><a name="TOC24" href="#SEC24">COMMENTS</a>
|
||||||
<li><a name="TOC25" href="#SEC25">GROUPS AS SUBROUTINES</a>
|
<li><a name="TOC25" href="#SEC25">RECURSIVE PATTERNS</a>
|
||||||
<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a>
|
<li><a name="TOC26" href="#SEC26">GROUPS AS SUBROUTINES</a>
|
||||||
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
|
<li><a name="TOC27" href="#SEC27">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||||
<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a>
|
<li><a name="TOC28" href="#SEC28">CALLOUTS</a>
|
||||||
<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
|
<li><a name="TOC29" href="#SEC29">BACKTRACKING CONTROL</a>
|
||||||
<li><a name="TOC30" href="#SEC30">AUTHOR</a>
|
<li><a name="TOC30" href="#SEC30">SEE ALSO</a>
|
||||||
<li><a name="TOC31" href="#SEC31">REVISION</a>
|
<li><a name="TOC31" href="#SEC31">AUTHOR</a>
|
||||||
|
<li><a name="TOC32" href="#SEC32">REVISION</a>
|
||||||
</ul>
|
</ul>
|
||||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
|
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2364,19 +2365,23 @@ those that look behind it, and in each case an assertion may be positive (must
|
||||||
match for the assertion to be true) or negative (must not match for the
|
match for the assertion to be true) or negative (must not match for the
|
||||||
assertion to be true). An assertion group is matched in the normal way,
|
assertion to be true). An assertion group is matched in the normal way,
|
||||||
and if it is true, matching continues after it, but with the matching position
|
and if it is true, matching continues after it, but with the matching position
|
||||||
in the subject string is was it was before the assertion was processed.
|
in the subject string reset to what it was before the assertion was processed.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A lookaround assertion may also appear as the condition in a
|
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
|
||||||
|
but there is a subsequent matching failure, there is no backtracking into the
|
||||||
|
assertion. However, there are some cases where non-atomic assertions can be
|
||||||
|
useful. PCRE2 has some support for these, described in the section entitled
|
||||||
|
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
|
||||||
|
below, but they are not Perl-compatible.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
A lookaround assertion may appear as the condition in a
|
||||||
<a href="#conditions">conditional group</a>
|
<a href="#conditions">conditional group</a>
|
||||||
(see below). In this case, the result of matching the assertion determines
|
(see below). In this case, the result of matching the assertion determines
|
||||||
which branch of the condition is followed.
|
which branch of the condition is followed.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Lookaround assertions are atomic. If an assertion is true, but there is a
|
|
||||||
subsequent matching failure, there is no backtracking into the assertion.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
Assertion groups are not capture groups. If an assertion contains capture
|
Assertion groups are not capture groups. If an assertion contains capture
|
||||||
groups within it, these are counted for the purposes of numbering the capture
|
groups within it, these are counted for the purposes of numbering the capture
|
||||||
groups in the whole pattern. Within each branch of an assertion, locally
|
groups in the whole pattern. Within each branch of an assertion, locally
|
||||||
|
@ -2429,11 +2434,11 @@ The assertion is obeyed just once when encountered during matching.
|
||||||
Alphabetic assertion names
|
Alphabetic assertion names
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
|
Traditionally, symbolic sequences such as (?= and (?<= have been used to
|
||||||
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
|
specify lookaround assertions. Perl 5.28 introduced some experimental
|
||||||
alternatives which might be easier to remember. They all start with (* instead
|
alphabetic alternatives which might be easier to remember. They all start with
|
||||||
of (? and must be written using lower case letters. PCRE2 supports the
|
(* instead of (? and must be written using lower case letters. PCRE2 supports
|
||||||
following synonyms:
|
the following synonyms:
|
||||||
<pre>
|
<pre>
|
||||||
(*positive_lookahead: or (*pla: is the same as (?=
|
(*positive_lookahead: or (*pla: is the same as (?=
|
||||||
(*negative_lookahead: or (*nla: is the same as (?!
|
(*negative_lookahead: or (*nla: is the same as (?!
|
||||||
|
@ -2606,8 +2611,63 @@ preceded by "foo", while
|
||||||
</pre>
|
</pre>
|
||||||
is another pattern that matches "foo" preceded by three digits and any three
|
is another pattern that matches "foo" preceded by three digits and any three
|
||||||
characters that are not "999".
|
characters that are not "999".
|
||||||
|
<a name="nonatomicassertions"></a></P>
|
||||||
|
<br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
|
||||||
|
<P>
|
||||||
|
The traditional Perl-compatible lookaround assertions are atomic. That is, if
|
||||||
|
an assertion is true, but there is a subsequent matching failure, there is no
|
||||||
|
backtracking into the assertion. However, there are some cases where non-atomic
|
||||||
|
positive assertions can be useful. PCRE2 provides these using the following
|
||||||
|
syntax:
|
||||||
|
<pre>
|
||||||
|
(*non_atomic_positive_lookahead: or (*napla:
|
||||||
|
(*non_atomic_positive_lookbehind: or (*naplb:
|
||||||
|
</pre>
|
||||||
|
Consider the problem of finding the right-most word in a string that also
|
||||||
|
appears earlier in the string, that is, it must appear at least twice in total.
|
||||||
|
This pattern returns the required result as captured substring 1:
|
||||||
|
<pre>
|
||||||
|
^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
|
||||||
|
</pre>
|
||||||
|
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
|
||||||
|
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
|
||||||
|
"x" option, which causes white space (introduced for readability) to be
|
||||||
|
ignored. Inside the assertion, the greedy .* at first consumes the entire
|
||||||
|
string, but then has to backtrack until the rest of the assertion can match a
|
||||||
|
word, which is captured by group 1. In other words, when the assertion first
|
||||||
|
succeeds, it captures the right-most word in the string.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
|
<P>
|
||||||
|
The current matching point is then reset to the start of the subject, and the
|
||||||
|
rest of the pattern match checks for two occurrences of the captured word,
|
||||||
|
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
|
||||||
|
if the last word in the string does not occur twice, this part of the pattern
|
||||||
|
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
|
||||||
|
assertion could not be re-entered, and the whole match would fail. The pattern
|
||||||
|
would succeed only if the very last word in the subject was found twice.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Using a non-atomic lookahead, however, means that when the last word does not
|
||||||
|
occur twice in the string, the lookahead can backtrack and find the second-last
|
||||||
|
word, and so on, until either the match succeeds, or all words have been
|
||||||
|
tested.
|
||||||
|
</P>
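The difference between the atomic and non-atomic behaviour can be sketched in Python, whose re module provides only atomic lookarounds. This is an emulation under stated assumptions, not PCRE2 usage: the names `ATOMIC` and `rightmost_repeated_word` are invented for this example, and \w+ replaces \w++ so that the pattern also works on Python versions without possessive quantifiers. The function tries candidate words from right to left, which is the order in which backtracking into the assertion would offer them.

```python
import re

SUBJECT = "word1 word2 word3 word2 word3 word4"

# Atomic version: once (?=...) has succeeded it cannot be re-entered,
# so only the very last word in the subject is ever tested.
ATOMIC = re.compile(r"^(?=.*\b(\w+))(?:.*?\b\1\b){2}")


def rightmost_repeated_word(subject):
    """Emulate the non-atomic lookahead: try each word, right-most first."""
    words = [m.group() for m in re.finditer(r"\b\w+", subject)]
    for word in reversed(words):
        # Same check as the rest of the pattern: two occurrences,
        # scanning from the start of the subject.
        if re.match(r"(?:.*?\b%s\b){2}" % re.escape(word), subject):
            return word
    return None  # every word was tested and none occurs twice
```

For the subject above, `ATOMIC.search(SUBJECT)` finds no match because "word4" occurs only once, whereas `rightmost_repeated_word(SUBJECT)` moves on to "word3" and returns it.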
|
||||||
|
<P>
|
||||||
|
Two conditions must be met for a non-atomic assertion to be useful: the
|
||||||
|
contents of one or more capturing groups must change after a backtrack into the
|
||||||
|
assertion, and there must be a backreference to a changed group later in the
|
||||||
|
pattern. If this is not the case, the rest of the pattern match fails exactly
|
||||||
|
as before because nothing has changed, so using a non-atomic assertion just
|
||||||
|
wastes resources.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Non-atomic assertions are not supported by the alternative matching function
|
||||||
|
<b>pcre2_dfa_match()</b>. They are also not supported by JIT (but may be in
|
||||||
|
future). Note that assertions that appear as conditions for
|
||||||
|
<a href="#conditions">conditional groups</a>
|
||||||
|
(see below) must be atomic.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
In concept, a script run is a sequence of characters that are all from the same
|
In concept, a script run is a sequence of characters that are all from the same
|
||||||
Unicode script such as Latin or Greek. However, because some scripts are
|
Unicode script such as Latin or Greek. However, because some scripts are
|
||||||
|
@ -2669,7 +2729,7 @@ parentheses.
|
||||||
should not be used within a script run group, because it causes an immediate
|
should not be used within a script run group, because it causes an immediate
|
||||||
exit from the group, bypassing the script run checking.
|
exit from the group, bypassing the script run checking.
|
||||||
<a name="conditions"></a></P>
|
<a name="conditions"></a></P>
|
||||||
<br><a name="SEC22" href="#TOC1">CONDITIONAL GROUPS</a><br>
|
<br><a name="SEC23" href="#TOC1">CONDITIONAL GROUPS</a><br>
|
||||||
<P>
|
<P>
|
||||||
It is possible to cause the matching process to obey a pattern fragment
|
It is possible to cause the matching process to obey a pattern fragment
|
||||||
conditionally or to choose between two alternative fragments, depending on
|
conditionally or to choose between two alternative fragments, depending on
|
||||||
|
@ -2845,8 +2905,13 @@ Assertion conditions
|
||||||
<P>
|
<P>
|
||||||
If the condition is not in any of the above formats, it must be a parenthesized
|
If the condition is not in any of the above formats, it must be a parenthesized
|
||||||
assertion. This may be a positive or negative lookahead or lookbehind
|
assertion. This may be a positive or negative lookahead or lookbehind
|
||||||
assertion. Consider this pattern, again containing non-significant white space,
|
assertion. However, it must be a traditional atomic assertion, not one of the
|
||||||
and with the two alternatives on the second line:
|
PCRE2-specific
|
||||||
|
<a href="#nonatomicassertions">non-atomic assertions.</a>
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Consider this pattern, again containing non-significant white space, and with
|
||||||
|
the two alternatives on the second line:
|
||||||
<pre>
|
<pre>
|
||||||
(?(?=[^a-z]*[a-z])
|
(?(?=[^a-z]*[a-z])
|
||||||
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
||||||
|
@ -2865,7 +2930,7 @@ positive and negative assertions, because matching always continues after the
|
||||||
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
|
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
|
||||||
for which captures are retained only for positive assertions that succeed.)
|
for which captures are retained only for positive assertions that succeed.)
|
||||||
<a name="comments"></a></P>
|
<a name="comments"></a></P>
|
||||||
<br><a name="SEC23" href="#TOC1">COMMENTS</a><br>
|
<br><a name="SEC24" href="#TOC1">COMMENTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
There are two ways of including comments in patterns that are processed by
|
There are two ways of including comments in patterns that are processed by
|
||||||
PCRE2. In both cases, the start of the comment must not be in a character
|
PCRE2. In both cases, the start of the comment must not be in a character
|
||||||
|
@ -2895,7 +2960,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
|
||||||
it does not terminate the comment. Only an actual character with the code value
|
it does not terminate the comment. Only an actual character with the code value
|
||||||
0x0a (the default newline) does so.
|
0x0a (the default newline) does so.
|
||||||
<a name="recursion"></a></P>
|
<a name="recursion"></a></P>
|
||||||
<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
<br><a name="SEC25" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
Consider the problem of matching a string in parentheses, allowing for
|
Consider the problem of matching a string in parentheses, allowing for
|
||||||
unlimited nested parentheses. Without the use of recursion, the best that can
|
unlimited nested parentheses. Without the use of recursion, the best that can
|
||||||
|
@ -3083,7 +3148,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||||
later versions (I tried 5.024) it now works.
|
later versions (I tried 5.024) it now works.
|
||||||
<a name="groupsassubroutines"></a></P>
|
<a name="groupsassubroutines"></a></P>
|
||||||
<br><a name="SEC25" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
|
<br><a name="SEC26" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
|
||||||
<P>
|
<P>
|
||||||
If the syntax for a recursive group call (either by number or by name) is used
|
If the syntax for a recursive group call (either by number or by name) is used
|
||||||
outside the parentheses to which it refers, it operates a bit like a subroutine
|
outside the parentheses to which it refers, it operates a bit like a subroutine
|
||||||
|
@ -3131,7 +3196,7 @@ in groups when called as subroutines is described in the section entitled
|
||||||
<a href="#btsub">"Backtracking verbs in subroutines"</a>
|
<a href="#btsub">"Backtracking verbs in subroutines"</a>
|
||||||
below.
|
below.
|
||||||
<a name="onigurumasubroutines"></a></P>
|
<a name="onigurumasubroutines"></a></P>
|
||||||
<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
<br><a name="SEC27" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
||||||
<P>
|
<P>
|
||||||
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
||||||
a number enclosed either in angle brackets or single quotes, is an alternative
|
a number enclosed either in angle brackets or single quotes, is an alternative
|
||||||
|
@ -3149,7 +3214,7 @@ plus or a minus sign it is taken as a relative reference. For example:
|
||||||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||||
synonymous. The former is a backreference; the latter is a subroutine call.
|
synonymous. The former is a backreference; the latter is a subroutine call.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
|
<br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
|
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
|
||||||
code to be obeyed in the middle of matching a regular expression. This makes it
|
code to be obeyed in the middle of matching a regular expression. This makes it
|
||||||
|
@ -3225,7 +3290,7 @@ example:
|
||||||
</pre>
|
</pre>
|
||||||
The doubling is removed before the string is passed to the callout function.
|
The doubling is removed before the string is passed to the callout function.
|
||||||
<a name="backtrackcontrol"></a></P>
|
<a name="backtrackcontrol"></a></P>
|
||||||
<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
<br><a name="SEC29" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
<P>
|
<P>
|
||||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||||
terminology) that modify the behaviour of backtracking during matching. They
|
terminology) that modify the behaviour of backtracking during matching. They
|
||||||
|
@ -3739,12 +3804,12 @@ enclosing group that has alternatives (its normal behaviour). However, if there
|
||||||
is no such group within the subroutine's group, the subroutine match fails and
|
is no such group within the subroutine's group, the subroutine match fails and
|
||||||
there is a backtrack at the outer level.
|
there is a backtrack at the outer level.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC30" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
||||||
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC31" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
|
@ -3753,9 +3818,9 @@ University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 22 June 2019
|
Last updated: 13 July 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -32,15 +32,16 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
||||||
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
||||||
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||||
<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a>
|
<li><a name="TOC20" href="#SEC20">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
|
||||||
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
|
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
|
||||||
<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
<li><a name="TOC22" href="#SEC22">BACKREFERENCES</a>
|
||||||
<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a>
|
<li><a name="TOC23" href="#SEC23">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||||
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>
|
<li><a name="TOC24" href="#SEC24">CONDITIONAL PATTERNS</a>
|
||||||
<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
|
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
|
||||||
<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
|
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
|
||||||
<li><a name="TOC27" href="#SEC27">AUTHOR</a>
|
<li><a name="TOC27" href="#SEC27">SEE ALSO</a>
|
||||||
<li><a name="TOC28" href="#SEC28">REVISION</a>
|
<li><a name="TOC28" href="#SEC28">AUTHOR</a>
|
||||||
|
<li><a name="TOC29" href="#SEC29">REVISION</a>
|
||||||
</ul>
|
</ul>
|
||||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -544,7 +545,18 @@ setting with a similar syntax.
|
||||||
</pre>
|
</pre>
|
||||||
Each top-level branch of a lookbehind must be of a fixed length.
|
Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br>
|
<br><a name="SEC20" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
|
||||||
|
<P>
|
||||||
|
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
|
<pre>
|
||||||
|
(*napla:...)
|
||||||
|
(*non_atomic_positive_lookahead:...)
|
||||||
|
|
||||||
|
(*naplb:...)
|
||||||
|
(*non_atomic_positive_lookbehind:...)
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(*script_run:...) ) script run, can be backtracked into
|
(*script_run:...) ) script run, can be backtracked into
|
||||||
|
@ -554,7 +566,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
(*asr:...) )
|
(*asr:...) )
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
|
<br><a name="SEC22" href="#TOC1">BACKREFERENCES</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
\n reference by number (can be ambiguous)
|
\n reference by number (can be ambiguous)
|
||||||
|
@ -571,7 +583,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
(?P=name) reference by name (Python)
|
(?P=name) reference by name (Python)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
<br><a name="SEC23" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?R) recurse whole pattern
|
(?R) recurse whole pattern
|
||||||
|
@ -590,7 +602,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
\g'-n' call subroutine by relative number (PCRE2 extension)
|
\g'-n' call subroutine by relative number (PCRE2 extension)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
<br><a name="SEC24" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?(condition)yes-pattern)
|
(?(condition)yes-pattern)
|
||||||
|
@ -613,7 +625,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||||
conditions or recursion tests. Such a condition is interpreted as a reference
|
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||||
condition if the relevant named group exists.
|
condition if the relevant named group exists.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
<P>
|
<P>
|
||||||
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
||||||
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
||||||
|
@ -640,7 +652,7 @@ pattern is not anchored.
|
||||||
The effect of one of these verbs in a group called as a subroutine is confined
|
The effect of one of these verbs in a group called as a subroutine is confined
|
||||||
to the subroutine call.
|
to the subroutine call.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
|
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?C) callout (assumed number 0)
|
(?C) callout (assumed number 0)
|
||||||
|
@ -651,12 +663,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
||||||
start and the end), and the starting delimiter { matched with the ending
|
start and the end), and the starting delimiter { matched with the ending
|
||||||
delimiter }. To encode the ending delimiter within the string, double it.
|
delimiter }. To encode the ending delimiter within the string, double it.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||||
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC28" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
|
@ -665,9 +677,9 @@ University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC29" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 11 February 2019
|
Last updated: 12 July 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
102
doc/pcre2.txt
|
@ -4887,6 +4887,10 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
||||||
at the start of a pattern that set overall options that cannot be
|
at the start of a pattern that set overall options that cannot be
|
||||||
changed within the pattern.
|
changed within the pattern.
|
||||||
|
|
||||||
|
(m) PCRE2 supports non-atomic positive lookaround assertions. This is
|
||||||
|
an extension to the lookaround facilities. The default, Perl-compatible
|
||||||
|
lookarounds are atomic.
|
||||||
|
|
||||||
18. The Perl /a modifier restricts /d numbers to pure ascii, and the
|
18. The Perl /a modifier restricts /d numbers to pure ascii, and the
|
||||||
/aa modifier restricts /i case-insensitive matching to pure ascii, ig-
|
/aa modifier restricts /i case-insensitive matching to pure ascii, ig-
|
||||||
noring Unicode rules. This separation cannot be represented with
|
noring Unicode rules. This separation cannot be represented with
|
||||||
|
@ -4909,7 +4913,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 12 February 2019
|
Last updated: 13 July 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -8140,16 +8144,19 @@ ASSERTIONS
|
||||||
sertion may be positive (must match for the assertion to be true) or
|
sertion may be positive (must match for the assertion to be true) or
|
||||||
negative (must not match for the assertion to be true). An assertion
|
negative (must not match for the assertion to be true). An assertion
|
||||||
group is matched in the normal way, and if it is true, matching contin-
|
group is matched in the normal way, and if it is true, matching contin-
|
||||||
ues after it, but with the matching position in the subject string is
|
ues after it, but with the matching position in the subject string re-
|
||||||
was it was before the assertion was processed.
|
set to what it was before the assertion was processed.
|
||||||
|
|
||||||
A lookaround assertion may also appear as the condition in a condi-
|
The Perl-compatible lookaround assertions are atomic. If an assertion
|
||||||
tional group (see below). In this case, the result of matching the as-
|
is true, but there is a subsequent matching failure, there is no back-
|
||||||
sertion determines which branch of the condition is followed.
|
tracking into the assertion. However, there are some cases where non-
|
||||||
|
atomic assertions can be useful. PCRE2 has some support for these, de-
|
||||||
|
scribed in the section entitled "Non-atomic assertions" below, but they
|
||||||
|
are not Perl-compatible.
|
||||||
|
|
||||||
Lookaround assertions are atomic. If an assertion is true, but there is
|
A lookaround assertion may appear as the condition in a conditional
|
||||||
a subsequent matching failure, there is no backtracking into the asser-
|
group (see below). In this case, the result of matching the assertion
|
||||||
tion.
|
determines which branch of the condition is followed.
|
||||||
|
|
||||||
Assertion groups are not capture groups. If an assertion contains cap-
|
Assertion groups are not capture groups. If an assertion contains cap-
|
||||||
ture groups within it, these are counted for the purposes of numbering
|
ture groups within it, these are counted for the purposes of numbering
|
||||||
|
@ -8362,6 +8369,60 @@ ASSERTIONS
|
||||||
three characters that are not "999".
|
three characters that are not "999".
|
||||||
|
|
||||||
|
|
||||||
|
NON-ATOMIC ASSERTIONS
|
||||||
|
|
||||||
|
The traditional Perl-compatible lookaround assertions are atomic. That
|
||||||
|
is, if an assertion is true, but there is a subsequent matching fail-
|
||||||
|
ure, there is no backtracking into the assertion. However, there are
|
||||||
|
some cases where non-atomic positive assertions can be useful. PCRE2
|
||||||
|
provides these using the following syntax:
|
||||||
|
|
||||||
|
(*non_atomic_positive_lookahead: or (*napla:
|
||||||
|
(*non_atomic_positive_lookbehind: or (*naplb:
|
||||||
|
|
||||||
|
Consider the problem of finding the right-most word in a string that
|
||||||
|
also appears earlier in the string, that is, it must appear at least
|
||||||
|
twice in total. This pattern returns the required result as captured
|
||||||
|
substring 1:
|
||||||
|
|
||||||
|
^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
|
||||||
|
|
||||||
|
For a subject such as "word1 word2 word3 word2 word3 word4" the result
|
||||||
|
is "word3". How does it work? At the start, ^(?x) anchors the pattern
|
||||||
|
and sets the "x" option, which causes white space (introduced for read-
|
||||||
|
ability) to be ignored. Inside the assertion, the greedy .* at first
|
||||||
|
consumes the entire string, but then has to backtrack until the rest of
|
||||||
|
the assertion can match a word, which is captured by group 1. In other
|
||||||
|
words, when the assertion first succeeds, it captures the right-most
|
||||||
|
word in the string.
|
||||||
|
|
||||||
|
The current matching point is then reset to the start of the subject,
|
||||||
|
and the rest of the pattern match checks for two occurrences of the
|
||||||
|
captured word, using an ungreedy .*? to scan from the left. If this
|
||||||
|
succeeds, we are done, but if the last word in the string does not oc-
|
||||||
|
cur twice, this part of the pattern fails. If a traditional atomic
|
||||||
|
lookahead (?= or (*pla: had been used, the assertion could not be re-en-
|
||||||
|
tered, and the whole match would fail. The pattern would succeed only
|
||||||
|
if the very last word in the subject was found twice.
|
||||||
|
|
||||||
|
Using a non-atomic lookahead, however, means that when the last word
|
||||||
|
does not occur twice in the string, the lookahead can backtrack and
|
||||||
|
find the second-last word, and so on, until either the match succeeds,
|
||||||
|
or all words have been tested.
|
||||||
|
|
||||||
|
Two conditions must be met for a non-atomic assertion to be useful: the
|
||||||
|
contents of one or more capturing groups must change after a backtrack
|
||||||
|
into the assertion, and there must be a backreference to a changed
|
||||||
|
group later in the pattern. If this is not the case, the rest of the
|
||||||
|
pattern match fails exactly as before because nothing has changed, so
|
||||||
|
using a non-atomic assertion just wastes resources.
|
||||||
|
|
||||||
|
Non-atomic assertions are not supported by the alternative matching
|
||||||
|
function pcre2_dfa_match(). They are also not supported by JIT (but may
|
||||||
|
be in future). Note that assertions that appear as conditions for con-
|
||||||
|
ditional groups (see below) must be atomic.
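The search order described above can be imitated in plain C without PCRE2. This is only a sketch of what the example pattern computes, not of how the matcher is implemented: the helper names (is_word, count_whole_word, rightmost_repeated_word) are hypothetical, and it assumes an ASCII, single-line subject. Candidate words are tried from right to left, just as backtracking into the non-atomic lookahead moves the capture leftwards, and the first candidate that occurs at least twice as a whole word is returned.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A "word" character in the sense of \w, for plain ASCII input only. */
static int is_word(char c)
{
  return isalnum((unsigned char)c) || c == '_';
}

/* Count the occurrences of the word s[start .. start+len) in s that have a
   word boundary on both sides, which is what \b\1\b requires. */
static int count_whole_word(const char *s, size_t start, size_t len)
{
  size_t n = strlen(s);
  int count = 0;
  for (size_t i = 0; i + len <= n; i++)
    {
    if (strncmp(s + i, s + start, len) != 0) continue;
    if (i > 0 && is_word(s[i - 1])) continue;
    if (i + len < n && is_word(s[i + len])) continue;
    count++;
    }
  return count;
}

/* Hypothetical helper: copy the right-most word that occurs at least twice
   into out and return its length. Returns 0 if there is no such word or if
   out is too small. */
static size_t rightmost_repeated_word(const char *s, char *out, size_t outsize)
{
  size_t end = strlen(s);
  assert(out != NULL && outsize > 0);
  out[0] = '\0';
  while (end > 0)
    {
    size_t start, len;
    /* Step left over non-word characters, then over one whole word. */
    while (end > 0 && !is_word(s[end - 1])) end--;
    start = end;
    while (start > 0 && is_word(s[start - 1])) start--;
    len = end - start;
    if (len == 0) break;              /* no more words to try */
    if (count_whole_word(s, start, len) >= 2)
      {
      if (len >= outsize) return 0;   /* caller's buffer is too small */
      memcpy(out, s + start, len);
      out[len] = '\0';
      return len;
      }
    end = start;                      /* "backtrack" to the previous word */
    }
  return 0;
}
```

Each pass of the outer loop corresponds to one re-entry into the assertion. An atomic lookahead would amount to running only the first pass.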
|
||||||
|
|
||||||
|
|
||||||
SCRIPT RUNS
|
SCRIPT RUNS
|
||||||
|
|
||||||
In concept, a script run is a sequence of characters that are all from
|
In concept, a script run is a sequence of characters that are all from
|
||||||
|
@ -8578,9 +8639,11 @@ CONDITIONAL GROUPS
|
||||||
|
|
||||||
If the condition is not in any of the above formats, it must be a
|
If the condition is not in any of the above formats, it must be a
|
||||||
parenthesized assertion. This may be a positive or negative lookahead
|
parenthesized assertion. This may be a positive or negative lookahead
|
||||||
or lookbehind assertion. Consider this pattern, again containing non-
|
or lookbehind assertion. However, it must be a traditional atomic as-
|
||||||
significant white space, and with the two alternatives on the second
|
sertion, not one of the PCRE2-specific non-atomic assertions.
|
||||||
line:
|
|
||||||
|
Consider this pattern, again containing non-significant white space,
|
||||||
|
and with the two alternatives on the second line:
|
||||||
|
|
||||||
(?(?=[^a-z]*[a-z])
|
(?(?=[^a-z]*[a-z])
|
||||||
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
||||||
|
@ -9439,7 +9502,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 22 June 2019
|
Last updated: 13 July 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -10663,6 +10726,17 @@ LOOKAHEAD AND LOOKBEHIND ASSERTIONS
|
||||||
Each top-level branch of a lookbehind must be of a fixed length.
|
Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
|
|
||||||
|
|
||||||
|
NON-ATOMIC LOOKAROUND ASSERTIONS
|
||||||
|
|
||||||
|
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
|
|
||||||
|
(*napla:...)
|
||||||
|
(*non_atomic_positive_lookahead:...)
|
||||||
|
|
||||||
|
(*naplb:...)
|
||||||
|
(*non_atomic_positive_lookbehind:...)
|
||||||
|
|
||||||
|
|
||||||
SCRIPT RUNS
|
SCRIPT RUNS
|
||||||
|
|
||||||
(*script_run:...) ) script run, can be backtracked into
|
(*script_run:...) ) script run, can be backtracked into
|
||||||
|
@ -10784,7 +10858,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 11 February 2019
|
Last updated: 12 July 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2COMPAT 3 "12 February 2019" "PCRE2 10.33"
|
.TH PCRE2COMPAT 3 "13 July 2019" "PCRE2 10.34"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
|
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
|
||||||
|
@ -170,6 +170,10 @@ different way and is not Perl-compatible.
|
||||||
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
|
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
|
||||||
the start of a pattern that set overall options that cannot be changed within
|
the start of a pattern that set overall options that cannot be changed within
|
||||||
the pattern.
|
the pattern.
|
||||||
|
.sp
|
||||||
|
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
|
||||||
|
extension to the lookaround facilities. The default, Perl-compatible
|
||||||
|
lookarounds are atomic.
|
||||||
.P
|
.P
|
||||||
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
|
18. The Perl /a modifier restricts /d numbers to pure ascii, and the /aa
|
||||||
modifier restricts /i case-insensitive matching to pure ascii, ignoring Unicode
|
modifier restricts /i case-insensitive matching to pure ascii, ignoring Unicode
|
||||||
|
@ -199,6 +203,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 12 February 2019
|
Last updated: 13 July 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "22 June 2019" "PCRE2 10.34"
|
.TH PCRE2PATTERN 3 "13 July 2019" "PCRE2 10.34"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -2370,9 +2370,19 @@ those that look behind it, and in each case an assertion may be positive (must
|
||||||
match for the assertion to be true) or negative (must not match for the
|
match for the assertion to be true) or negative (must not match for the
|
||||||
assertion to be true). An assertion group is matched in the normal way,
|
assertion to be true). An assertion group is matched in the normal way,
|
||||||
and if it is true, matching continues after it, but with the matching position
|
and if it is true, matching continues after it, but with the matching position
|
||||||
in the subject string is was it was before the assertion was processed.
|
in the subject string reset to what it was before the assertion was processed.
|
||||||
.P
|
.P
|
||||||
A lookaround assertion may also appear as the condition in a
|
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
|
||||||
|
but there is a subsequent matching failure, there is no backtracking into the
|
||||||
|
assertion. However, there are some cases where non-atomic assertions can be
|
||||||
|
useful. PCRE2 has some support for these, described in the section entitled
|
||||||
|
.\" HTML <a href="#nonatomicassertions">
|
||||||
|
.\" </a>
|
||||||
|
"Non-atomic assertions"
|
||||||
|
.\"
|
||||||
|
below, but they are not Perl-compatible.
|
||||||
|
.P
|
||||||
|
A lookaround assertion may appear as the condition in a
|
||||||
.\" HTML <a href="#conditions">
|
.\" HTML <a href="#conditions">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
conditional group
|
conditional group
|
||||||
|
@ -2380,9 +2390,6 @@ conditional group
|
||||||
(see below). In this case, the result of matching the assertion determines
|
(see below). In this case, the result of matching the assertion determines
|
||||||
which branch of the condition is followed.
|
which branch of the condition is followed.
|
||||||
.P
|
.P
|
||||||
Lookaround assertions are atomic. If an assertion is true, but there is a
|
|
||||||
subsequent matching failure, there is no backtracking into the assertion.
|
|
||||||
.P
|
|
||||||
Assertion groups are not capture groups. If an assertion contains capture
|
Assertion groups are not capture groups. If an assertion contains capture
|
||||||
groups within it, these are counted for the purposes of numbering the capture
|
groups within it, these are counted for the purposes of numbering the capture
|
||||||
groups in the whole pattern. Within each branch of an assertion, locally
|
groups in the whole pattern. Within each branch of an assertion, locally
|
||||||
|
@ -2435,11 +2442,11 @@ The assertion is obeyed just once when encountered during matching.
|
||||||
.SS "Alphabetic assertion names"
|
.SS "Alphabetic assertion names"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
|
Traditionally, symbolic sequences such as (?= and (?<= have been used to
|
||||||
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
|
specify lookaround assertions. Perl 5.28 introduced some experimental
|
||||||
alternatives which might be easier to remember. They all start with (* instead
|
alphabetic alternatives which might be easier to remember. They all start with
|
||||||
of (? and must be written using lower case letters. PCRE2 supports the
|
(* instead of (? and must be written using lower case letters. PCRE2 supports
|
||||||
following synonyms:
|
the following synonyms:
|
||||||
.sp
|
.sp
|
||||||
(*positive_lookahead: or (*pla: is the same as (?=
|
(*positive_lookahead: or (*pla: is the same as (?=
|
||||||
(*negative_lookahead: or (*nla: is the same as (?!
|
(*negative_lookahead: or (*nla: is the same as (?!
|
||||||
|
@ -2616,6 +2623,63 @@ is another pattern that matches "foo" preceded by three digits and any three
|
||||||
characters that are not "999".
|
characters that are not "999".
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.\" HTML <a name="nonatomicassertions"></a>
|
||||||
|
.SH "NON-ATOMIC ASSERTIONS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The traditional Perl-compatible lookaround assertions are atomic. That is, if
|
||||||
|
an assertion is true, but there is a subsequent matching failure, there is no
|
||||||
|
backtracking into the assertion. However, there are some cases where non-atomic
|
||||||
|
positive assertions can be useful. PCRE2 provides these using the following
|
||||||
|
syntax:
|
||||||
|
.sp
|
||||||
|
(*non_atomic_positive_lookahead: or (*napla:
|
||||||
|
(*non_atomic_positive_lookbehind: or (*naplb:
|
||||||
|
.sp
|
||||||
|
Consider the problem of finding the right-most word in a string that also
|
||||||
|
appears earlier in the string, that is, it must appear at least twice in total.
|
||||||
|
This pattern returns the required result as captured substring 1:
|
||||||
|
.sp
|
||||||
|
^(?x)(*napla: .* \eb(\ew++)) (?> .*? \eb\e1\eb ){2}
|
||||||
|
.sp
|
||||||
|
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
|
||||||
|
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
|
||||||
|
"x" option, which causes white space (introduced for readability) to be
|
||||||
|
ignored. Inside the assertion, the greedy .* at first consumes the entire
|
||||||
|
string, but then has to backtrack until the rest of the assertion can match a
|
||||||
|
word, which is captured by group 1. In other words, when the assertion first
|
||||||
|
succeeds, it captures the right-most word in the string.
|
||||||
|
.P
|
||||||
|
The current matching point is then reset to the start of the subject, and the
|
||||||
|
rest of the pattern match checks for two occurrences of the captured word,
|
||||||
|
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
|
||||||
|
if the last word in the string does not occur twice, this part of the pattern
|
||||||
|
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
|
||||||
|
assertion could not be re-entered, and the whole match would fail. The pattern
|
||||||
|
would succeed only if the very last word in the subject was found twice.
|
||||||
|
.P
|
||||||
|
Using a non-atomic lookahead, however, means that when the last word does not
|
||||||
|
occur twice in the string, the lookahead can backtrack and find the second-last
|
||||||
|
word, and so on, until either the match succeeds, or all words have been
|
||||||
|
tested.
|
||||||
|
.P
|
||||||
|
Two conditions must be met for a non-atomic assertion to be useful: the
|
||||||
|
contents of one or more capturing groups must change after a backtrack into the
|
||||||
|
assertion, and there must be a backreference to a changed group later in the
|
||||||
|
pattern. If this is not the case, the rest of the pattern match fails exactly
|
||||||
|
as before because nothing has changed, so using a non-atomic assertion just
|
||||||
|
wastes resources.
|
||||||
|
.P
|
||||||
|
Non-atomic assertions are not supported by the alternative matching function
|
||||||
|
\fBpcre2_dfa_match()\fP. They are also not supported by JIT (but may be in
|
||||||
|
future). Note that assertions that appear as conditions for
|
||||||
|
.\" HTML <a href="#conditions">
|
||||||
|
.\" </a>
|
||||||
|
conditional groups
|
||||||
|
.\"
|
||||||
|
(see below) must be atomic.
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "SCRIPT RUNS"
|
.SH "SCRIPT RUNS"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -2867,8 +2931,15 @@ than two digits.
|
||||||
.sp
|
.sp
|
||||||
If the condition is not in any of the above formats, it must be a parenthesized
|
If the condition is not in any of the above formats, it must be a parenthesized
|
||||||
assertion. This may be a positive or negative lookahead or lookbehind
|
assertion. This may be a positive or negative lookahead or lookbehind
|
||||||
assertion. Consider this pattern, again containing non-significant white space,
|
assertion. However, it must be a traditional atomic assertion, not one of the
|
||||||
and with the two alternatives on the second line:
|
PCRE2-specific
|
||||||
|
.\" HTML <a href="#nonatomicassertions">
|
||||||
|
.\" </a>
|
||||||
|
non-atomic assertions.
|
||||||
|
.\"
|
||||||
|
.P
|
||||||
|
Consider this pattern, again containing non-significant white space, and with
|
||||||
|
the two alternatives on the second line:
|
||||||
.sp
|
.sp
|
||||||
(?(?=[^a-z]*[a-z])
|
(?(?=[^a-z]*[a-z])
|
||||||
\ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
|
\ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
|
||||||
|
@ -3788,6 +3859,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 22 June 2019
|
Last updated: 13 July 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "11 February 2019" "PCRE2 10.33"
|
.TH PCRE2SYNTAX 3 "12 July 2019" "PCRE2 10.34"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -522,6 +522,18 @@ setting with a similar syntax.
|
||||||
Each top-level branch of a lookbehind must be of a fixed length.
|
Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SH "NON-ATOMIC LOOKAROUND ASSERTIONS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
These assertions are specific to PCRE2 and are not Perl-compatible.
|
||||||
|
.sp
|
||||||
|
(*napla:...)
|
||||||
|
(*non_atomic_positive_lookahead:...)
|
||||||
|
.sp
|
||||||
|
(*naplb:...)
|
||||||
|
(*non_atomic_positive_lookbehind:...)
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "SCRIPT RUNS"
|
.SH "SCRIPT RUNS"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -654,6 +666,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 11 February 2019
|
Last updated: 12 July 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -307,6 +307,7 @@ pcre2_pattern_convert(). */
|
||||||
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
|
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
|
||||||
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
|
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
|
||||||
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197
|
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197
|
||||||
|
#define PCRE2_ERROR_CONDITION_ATOMIC_ASSERTION_EXPECTED 198
|
||||||
|
|
||||||
|
|
||||||
/* "Expected" matching error codes: no match and partial match. */
|
/* "Expected" matching error codes: no match and partial match. */
|
||||||
|
|
|
@ -624,6 +624,13 @@ for(;;)
|
||||||
case OP_ASSERTBACK_NOT:
|
case OP_ASSERTBACK_NOT:
|
||||||
case OP_ONCE:
|
case OP_ONCE:
|
||||||
return !entered_a_group;
|
return !entered_a_group;
|
||||||
|
|
||||||
|
/* Non-atomic assertions - don't possessify last iterator. This needs
|
||||||
|
more thought. */
|
||||||
|
|
||||||
|
case OP_ASSERT_NA:
|
||||||
|
case OP_ASSERTBACK_NA:
|
||||||
|
return FALSE;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Skip over the bracket and inspect what comes next. */
|
/* Skip over the bracket and inspect what comes next. */
|
||||||
|
|
|
@ -250,36 +250,41 @@ is present where expected in a conditional group. */
|
||||||
#define META_LOOKBEHIND 0x80250000u /* (?<= */
|
#define META_LOOKBEHIND 0x80250000u /* (?<= */
|
||||||
#define META_LOOKBEHINDNOT 0x80260000u /* (?<! */
|
#define META_LOOKBEHINDNOT 0x80260000u /* (?<! */
|
||||||
|
|
||||||
|
/* These cannot be conditions */
|
||||||
|
|
||||||
|
#define META_LOOKAHEAD_NA 0x80270000u /* (*napla: */
|
||||||
|
#define META_LOOKBEHIND_NA 0x80280000u /* (*naplb: */
|
||||||
|
|
||||||
/* These must be kept in this order, with consecutive values, and the _ARG
|
/* These must be kept in this order, with consecutive values, and the _ARG
|
||||||
versions of COMMIT, PRUNE, SKIP, and THEN immediately after their non-argument
|
versions of COMMIT, PRUNE, SKIP, and THEN immediately after their non-argument
|
||||||
versions. */
|
versions. */
|
||||||
|
|
||||||
#define META_MARK 0x80270000u /* (*MARK) */
|
#define META_MARK 0x80290000u /* (*MARK) */
|
||||||
#define META_ACCEPT 0x80280000u /* (*ACCEPT) */
|
#define META_ACCEPT 0x802a0000u /* (*ACCEPT) */
|
||||||
#define META_FAIL 0x80290000u /* (*FAIL) */
|
#define META_FAIL 0x802b0000u /* (*FAIL) */
|
||||||
#define META_COMMIT 0x802a0000u /* These */
|
#define META_COMMIT 0x802c0000u /* These */
|
||||||
#define META_COMMIT_ARG 0x802b0000u /* pairs */
|
#define META_COMMIT_ARG 0x802d0000u /* pairs */
|
||||||
#define META_PRUNE 0x802c0000u /* must */
|
#define META_PRUNE 0x802e0000u /* must */
|
||||||
#define META_PRUNE_ARG 0x802d0000u /* be */
|
#define META_PRUNE_ARG 0x802f0000u /* be */
|
||||||
#define META_SKIP 0x802e0000u /* kept */
|
#define META_SKIP 0x80300000u /* kept */
|
||||||
#define META_SKIP_ARG 0x802f0000u /* in */
|
#define META_SKIP_ARG 0x80310000u /* in */
|
||||||
#define META_THEN 0x80300000u /* this */
|
#define META_THEN 0x80320000u /* this */
|
||||||
#define META_THEN_ARG 0x80310000u /* order */
|
#define META_THEN_ARG 0x80330000u /* order */
|
||||||
|
|
||||||
/* These must be kept in groups of adjacent 3 values, and all together. */
|
/* These must be kept in groups of adjacent 3 values, and all together. */
|
||||||
|
|
||||||
#define META_ASTERISK 0x80320000u /* * */
|
#define META_ASTERISK 0x80340000u /* * */
|
||||||
#define META_ASTERISK_PLUS 0x80330000u /* *+ */
|
#define META_ASTERISK_PLUS 0x80350000u /* *+ */
|
||||||
#define META_ASTERISK_QUERY 0x80340000u /* *? */
|
#define META_ASTERISK_QUERY 0x80360000u /* *? */
|
||||||
#define META_PLUS 0x80350000u /* + */
|
#define META_PLUS 0x80370000u /* + */
|
||||||
#define META_PLUS_PLUS 0x80360000u /* ++ */
|
#define META_PLUS_PLUS 0x80380000u /* ++ */
|
||||||
#define META_PLUS_QUERY 0x80370000u /* +? */
|
#define META_PLUS_QUERY 0x80390000u /* +? */
|
||||||
#define META_QUERY 0x80380000u /* ? */
|
#define META_QUERY 0x803a0000u /* ? */
|
||||||
#define META_QUERY_PLUS 0x80390000u /* ?+ */
|
#define META_QUERY_PLUS 0x803b0000u /* ?+ */
|
||||||
#define META_QUERY_QUERY 0x803a0000u /* ?? */
|
#define META_QUERY_QUERY 0x803c0000u /* ?? */
|
||||||
#define META_MINMAX 0x803b0000u /* {n,m} repeat */
|
#define META_MINMAX 0x803d0000u /* {n,m} repeat */
|
||||||
#define META_MINMAX_PLUS 0x803c0000u /* {n,m}+ repeat */
|
#define META_MINMAX_PLUS 0x803e0000u /* {n,m}+ repeat */
|
||||||
#define META_MINMAX_QUERY 0x803d0000u /* {n,m}? repeat */
|
#define META_MINMAX_QUERY 0x803f0000u /* {n,m}? repeat */
|
||||||
|
|
||||||
#define META_FIRST_QUANTIFIER META_ASTERISK
|
#define META_FIRST_QUANTIFIER META_ASTERISK
|
||||||
#define META_LAST_QUANTIFIER META_MINMAX_QUERY
|
#define META_LAST_QUANTIFIER META_MINMAX_QUERY
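The renumbering in this hunk can be checked mechanically. The sketch below copies a few of the new values from the diff and tests the ordering rules stated in the comments. The meta_index() helper and its shift-and-mask derivation are an assumption about how a table such as meta_extra_lengths is indexed; they are not taken from this diff.

```c
#include <assert.h>
#include <stdint.h>

/* New values, copied from the hunk above. */
#define META_LOOKBEHINDNOT  0x80260000u
#define META_LOOKAHEAD_NA   0x80270000u
#define META_LOOKBEHIND_NA  0x80280000u
#define META_MARK           0x80290000u
#define META_COMMIT         0x802c0000u
#define META_COMMIT_ARG     0x802d0000u
#define META_THEN           0x80320000u
#define META_THEN_ARG       0x80330000u
#define META_ASTERISK       0x80340000u
#define META_PLUS           0x80370000u
#define META_MINMAX_QUERY   0x803f0000u

#define META_FIRST_QUANTIFIER META_ASTERISK
#define META_LAST_QUANTIFIER  META_MINMAX_QUERY

/* Assumed helper: turn a META code into a small table index by dropping the
   top bit and the 16 data bits. */
static unsigned meta_index(uint32_t meta)
{
  assert((meta & 0x80000000u) != 0);   /* every META code has the top bit set */
  return (unsigned)((meta >> 16) & 0x7fffu);
}

/* A code is a quantifier if it lies in the closed quantifier range. */
static int is_quantifier(uint32_t meta)
{
  uint32_t code = meta & 0xffff0000u;
  return code >= META_FIRST_QUANTIFIER && code <= META_LAST_QUANTIFIER;
}

/* Returns 1 if the copied values obey the ordering rules in the comments. */
static int meta_layout_ok(void)
{
  return
    /* The two new codes slot in directly after META_LOOKBEHINDNOT. */
    meta_index(META_LOOKAHEAD_NA) == meta_index(META_LOOKBEHINDNOT) + 1 &&
    meta_index(META_LOOKBEHIND_NA) == meta_index(META_LOOKAHEAD_NA) + 1 &&
    meta_index(META_MARK) == meta_index(META_LOOKBEHIND_NA) + 1 &&
    /* Each _ARG version immediately follows its non-argument version. */
    META_COMMIT_ARG - META_COMMIT == 0x10000u &&
    META_THEN_ARG - META_THEN == 0x10000u &&
    /* Quantifiers come in adjacent groups of three. */
    (meta_index(META_PLUS) - meta_index(META_ASTERISK)) % 3 == 0 &&
    (meta_index(META_LAST_QUANTIFIER) - meta_index(META_FIRST_QUANTIFIER) + 1) % 3 == 0;
}
```

Because every later code moves up by two, any table with one entry per META code needs two new entries at the same position, which is what the meta_extra_lengths hunk below adds.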
|
||||||
|
@ -335,6 +340,8 @@ static unsigned char meta_extra_lengths[] = {
|
||||||
0, /* META_LOOKAHEADNOT */
|
0, /* META_LOOKAHEADNOT */
|
||||||
SIZEOFFSET, /* META_LOOKBEHIND */
|
SIZEOFFSET, /* META_LOOKBEHIND */
|
||||||
SIZEOFFSET, /* META_LOOKBEHINDNOT */
|
SIZEOFFSET, /* META_LOOKBEHINDNOT */
|
||||||
|
0, /* META_LOOKAHEAD_NA */
|
||||||
|
SIZEOFFSET, /* META_LOOKBEHIND_NA */
|
||||||
1, /* META_MARK - plus the string length */
|
1, /* META_MARK - plus the string length */
|
||||||
0, /* META_ACCEPT */
|
0, /* META_ACCEPT */
|
||||||
0, /* META_FAIL */
|
0, /* META_FAIL */
|
||||||
|
@ -637,10 +644,14 @@ typedef struct alasitem {
|
||||||
static const char alasnames[] =
|
static const char alasnames[] =
|
||||||
STRING_pla0
|
STRING_pla0
|
||||||
STRING_plb0
|
STRING_plb0
|
||||||
|
STRING_napla0
|
||||||
|
STRING_naplb0
|
||||||
STRING_nla0
|
STRING_nla0
|
||||||
STRING_nlb0
|
STRING_nlb0
|
||||||
STRING_positive_lookahead0
|
STRING_positive_lookahead0
|
||||||
STRING_positive_lookbehind0
|
STRING_positive_lookbehind0
|
||||||
|
STRING_non_atomic_positive_lookahead0
|
||||||
|
STRING_non_atomic_positive_lookbehind0
|
||||||
STRING_negative_lookahead0
|
STRING_negative_lookahead0
|
||||||
STRING_negative_lookbehind0
|
STRING_negative_lookbehind0
|
||||||
STRING_atomic0
|
STRING_atomic0
|
||||||
|
@ -652,10 +663,14 @@ static const char alasnames[] =
|
||||||
static const alasitem alasmeta[] = {
|
static const alasitem alasmeta[] = {
|
||||||
{ 3, META_LOOKAHEAD },
|
{ 3, META_LOOKAHEAD },
|
||||||
{ 3, META_LOOKBEHIND },
|
{ 3, META_LOOKBEHIND },
|
||||||
|
{ 5, META_LOOKAHEAD_NA },
|
||||||
|
{ 5, META_LOOKBEHIND_NA },
|
||||||
{ 3, META_LOOKAHEADNOT },
|
{ 3, META_LOOKAHEADNOT },
|
||||||
{ 3, META_LOOKBEHINDNOT },
|
{ 3, META_LOOKBEHINDNOT },
|
||||||
{ 18, META_LOOKAHEAD },
|
{ 18, META_LOOKAHEAD },
|
||||||
{ 19, META_LOOKBEHIND },
|
{ 19, META_LOOKBEHIND },
|
||||||
|
{ 29, META_LOOKAHEAD_NA },
|
||||||
|
{ 30, META_LOOKBEHIND_NA },
|
||||||
{ 18, META_LOOKAHEADNOT },
|
{ 18, META_LOOKAHEADNOT },
|
||||||
{ 19, META_LOOKBEHINDNOT },
|
{ 19, META_LOOKBEHINDNOT },
|
||||||
{ 6, META_ATOMIC },
|
{ 6, META_ATOMIC },
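The first field of each alasmeta entry has to equal the length of the name at the same position in alasnames. The following is a small stand-alone check of the entries visible in this hunk, with the names written out as ordinary C strings. The alas_pair type and both helper functions are hypothetical and only illustrate a length-then-compare lookup; they are not the code used in pcre2_compile.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pairing of each assertion name with the length recorded for
   it in the hunk above. */
typedef struct alas_pair {
  size_t len;
  const char *name;
} alas_pair;

static const alas_pair alas_pairs[] = {
  {  3, "pla" },
  {  3, "plb" },
  {  5, "napla" },
  {  5, "naplb" },
  {  3, "nla" },
  {  3, "nlb" },
  { 18, "positive_lookahead" },
  { 19, "positive_lookbehind" },
  { 29, "non_atomic_positive_lookahead" },
  { 30, "non_atomic_positive_lookbehind" },
  { 18, "negative_lookahead" },
  { 19, "negative_lookbehind" },
  {  6, "atomic" }
};

#define ALAS_COUNT (sizeof(alas_pairs) / sizeof(alas_pairs[0]))

/* Returns 1 if every recorded length matches strlen() of its name. */
static int alas_lengths_ok(void)
{
  for (size_t i = 0; i < ALAS_COUNT; i++)
    if (strlen(alas_pairs[i].name) != alas_pairs[i].len) return 0;
  return 1;
}

/* Find a name of the given length; returns its index, or -1 if unknown.
   Comparing lengths first means memcmp() only runs on plausible entries. */
static int alas_lookup(const char *name, size_t len)
{
  assert(name != NULL);
  for (size_t i = 0; i < ALAS_COUNT; i++)
    if (alas_pairs[i].len == len && memcmp(alas_pairs[i].name, name, len) == 0)
      return (int)i;
  return -1;
}
```

A wrong length in such a table would not be caught by the compiler, so a check of this kind is a cheap way to confirm the values 5, 29 and 30 used for the new names.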
|
||||||
|
@ -784,7 +799,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
|
||||||
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
||||||
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
||||||
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
||||||
ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97 };
|
ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97, ERR98 };
|
||||||
|
|
||||||
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
||||||
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
||||||
|
@ -1015,6 +1030,7 @@ for (;;)
|
||||||
case META_NOCAPTURE: fprintf(stderr, "META (?:"); break;
|
case META_NOCAPTURE: fprintf(stderr, "META (?:"); break;
|
||||||
case META_LOOKAHEAD: fprintf(stderr, "META (?="); break;
|
case META_LOOKAHEAD: fprintf(stderr, "META (?="); break;
|
||||||
case META_LOOKAHEADNOT: fprintf(stderr, "META (?!"); break;
|
case META_LOOKAHEADNOT: fprintf(stderr, "META (?!"); break;
|
||||||
|
case META_LOOKAHEAD_NA: fprintf(stderr, "META (*napla:"); break;
|
||||||
case META_SCRIPT_RUN: fprintf(stderr, "META (*sr:"); break;
|
case META_SCRIPT_RUN: fprintf(stderr, "META (*sr:"); break;
|
||||||
case META_KET: fprintf(stderr, "META )"); break;
|
case META_KET: fprintf(stderr, "META )"); break;
|
||||||
case META_ALT: fprintf(stderr, "META | %d", meta_arg); break;
|
case META_ALT: fprintf(stderr, "META | %d", meta_arg); break;
|
||||||
|
@ -1046,6 +1062,12 @@ for (;;)
|
||||||
fprintf(stderr, "%zd", offset);
|
fprintf(stderr, "%zd", offset);
|
||||||
break;
|
break;
|
||||||
|
|
||||||
|
case META_LOOKBEHIND_NA:
|
||||||
|
fprintf(stderr, "META (*naplb: %d offset=", meta_arg);
|
||||||
|
GETOFFSET(offset, pptr);
|
||||||
|
fprintf(stderr, "%zd", offset);
|
||||||
|
break;
|
||||||
|
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
fprintf(stderr, "META (?<! %d offset=", meta_arg);
|
fprintf(stderr, "META (?<! %d offset=", meta_arg);
|
||||||
GETOFFSET(offset, pptr);
|
GETOFFSET(offset, pptr);
|
||||||
|
@ -3695,19 +3717,20 @@ while (ptr < ptrend)
|
||||||
goto FAILED;
|
goto FAILED;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Check for expecting an assertion condition. If so, only lookaround
|
/* Check for expecting an assertion condition. If so, only atomic
|
||||||
assertions are valid. */
|
lookaround assertions are valid. */
|
||||||
|
|
||||||
meta = alasmeta[i].meta;
|
meta = alasmeta[i].meta;
|
||||||
if (prev_expect_cond_assert > 0 &&
|
if (prev_expect_cond_assert > 0 &&
|
||||||
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
|
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
|
||||||
{
|
{
|
||||||
errorcode = ERR28; /* Assertion expected */
|
errorcode = (meta == META_LOOKAHEAD_NA || meta == META_LOOKBEHIND_NA)?
|
||||||
|
ERR98 : ERR28; /* (Atomic) assertion expected */
|
||||||
goto FAILED;
|
goto FAILED;
|
||||||
}
|
}
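
The check in this hunk depends on the four atomic lookaround codes forming one contiguous range, with the two non-atomic codes lying outside it, so that a non-atomic assertion used as a condition is rejected with its own error. A minimal sketch of that logic follows. The enum values, error constants and function name are hypothetical stand-ins; the real META_ codes and ERRnn values are defined in the PCRE2 sources and are not reproduced here.

```c
/* Hypothetical stand-ins: only the relative order of the values matters. */
enum toy_meta {
  TOY_NOCAPTURE,
  TOY_LOOKAHEAD,        /* first code of the atomic lookaround range */
  TOY_LOOKAHEADNOT,
  TOY_LOOKBEHIND,
  TOY_LOOKBEHINDNOT,    /* last code of the atomic lookaround range */
  TOY_LOOKAHEAD_NA,     /* non-atomic codes sit outside the range */
  TOY_LOOKBEHIND_NA
};

enum toy_error {
  TOY_OK,
  TOY_ERR_ASSERTION,        /* plays the role of ERR28 */
  TOY_ERR_ATOMIC_ASSERTION  /* plays the role of ERR98 */
};

/* Returns TOY_OK when meta may be used as the condition of a conditional
   group, otherwise the error that the range test would select. */
static enum toy_error toy_check_condition(enum toy_meta meta)
{
  if (meta < TOY_LOOKAHEAD || meta > TOY_LOOKBEHINDNOT)
    {
    return (meta == TOY_LOOKAHEAD_NA || meta == TOY_LOOKBEHIND_NA)?
      TOY_ERR_ATOMIC_ASSERTION : TOY_ERR_ASSERTION;
    }
  return TOY_OK;
}
```

Under these assumptions, toy_check_condition(TOY_LOOKBEHIND) is accepted, TOY_NOCAPTURE yields the general "assertion expected" error, and both _NA codes yield the "atomic assertion expected" error.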
|
||||||
|
|
||||||
/* The lookaround alphabetic synonyms can be almost entirely handled by
|
/* The lookaround alphabetic synonyms can mostly be handled by jumping
|
||||||
jumping to the code that handles the traditional symbolic forms. */
|
to the code that handles the traditional symbolic forms. */
|
||||||
|
|
||||||
switch(meta)
|
switch(meta)
|
||||||
{
|
{
|
||||||
|
@ -3721,11 +3744,17 @@ while (ptr < ptrend)
|
||||||
case META_LOOKAHEAD:
|
case META_LOOKAHEAD:
|
||||||
goto POSITIVE_LOOK_AHEAD;
|
goto POSITIVE_LOOK_AHEAD;
|
||||||
|
|
||||||
|
case META_LOOKAHEAD_NA:
|
||||||
|
*parsed_pattern++ = meta;
|
||||||
|
ptr++;
|
||||||
|
goto POST_ASSERTION;
|
||||||
|
|
||||||
case META_LOOKAHEADNOT:
|
case META_LOOKAHEADNOT:
|
||||||
goto NEGATIVE_LOOK_AHEAD;
|
goto NEGATIVE_LOOK_AHEAD;
|
||||||
|
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
|
case META_LOOKBEHIND_NA:
|
||||||
*parsed_pattern++ = meta;
|
*parsed_pattern++ = meta;
|
||||||
ptr--;
|
ptr--;
|
||||||
goto POST_LOOKBEHIND;
|
goto POST_LOOKBEHIND;
|
||||||
|
@ -4429,7 +4458,7 @@ while (ptr < ptrend)
|
||||||
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
|
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
|
||||||
META_LOOKBEHIND : META_LOOKBEHINDNOT;
|
META_LOOKBEHIND : META_LOOKBEHINDNOT;
|
||||||
|
|
||||||
POST_LOOKBEHIND: /* Come from (*plb: and (*nlb: */
|
POST_LOOKBEHIND: /* Come from (*plb: (*naplb: and (*nlb: */
|
||||||
*has_lookbehind = TRUE;
|
*has_lookbehind = TRUE;
|
||||||
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
|
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
|
||||||
PUTOFFSET(offset, parsed_pattern);
|
PUTOFFSET(offset, parsed_pattern);
|
||||||
|
@ -6300,6 +6329,11 @@ for (;; pptr++)
|
||||||
cb->assert_depth += 1;
|
cb->assert_depth += 1;
|
||||||
goto GROUP_PROCESS;
|
goto GROUP_PROCESS;
|
||||||
|
|
||||||
|
case META_LOOKAHEAD_NA:
|
||||||
|
bravalue = OP_ASSERT_NA;
|
||||||
|
cb->assert_depth += 1;
|
||||||
|
goto GROUP_PROCESS;
|
||||||
|
|
||||||
/* Optimize (?!) to (*FAIL) unless it is quantified - which is a weird
|
/* Optimize (?!) to (*FAIL) unless it is quantified - which is a weird
|
||||||
thing to do, but Perl allows all assertions to be quantified, and when
|
thing to do, but Perl allows all assertions to be quantified, and when
|
||||||
they contain capturing parentheses there may be a potential use for
|
they contain capturing parentheses there may be a potential use for
|
||||||
|
@ -6331,6 +6365,11 @@ for (;; pptr++)
|
||||||
cb->assert_depth += 1;
|
cb->assert_depth += 1;
|
||||||
goto GROUP_PROCESS;
|
goto GROUP_PROCESS;
|
||||||
|
|
||||||
|
case META_LOOKBEHIND_NA:
|
||||||
|
bravalue = OP_ASSERTBACK_NA;
|
||||||
|
cb->assert_depth += 1;
|
||||||
|
goto GROUP_PROCESS;
|
||||||
|
|
||||||
case META_ATOMIC:
|
case META_ATOMIC:
|
||||||
bravalue = OP_ONCE;
|
bravalue = OP_ONCE;
|
||||||
goto GROUP_PROCESS_NOTE_EMPTY;
|
goto GROUP_PROCESS_NOTE_EMPTY;
|
||||||
|
@ -7931,7 +7970,10 @@ length = 2 + 2*LINK_SIZE + skipunits;
|
||||||
/* Remember if this is a lookbehind assertion, and if it is, save its length
|
/* Remember if this is a lookbehind assertion, and if it is, save its length
|
||||||
and skip over the pattern offset. */
|
and skip over the pattern offset. */
|
||||||
|
|
||||||
lookbehind = *code == OP_ASSERTBACK || *code == OP_ASSERTBACK_NOT;
|
lookbehind = *code == OP_ASSERTBACK ||
|
||||||
|
*code == OP_ASSERTBACK_NOT ||
|
||||||
|
*code == OP_ASSERTBACK_NA;
|
||||||
|
|
||||||
if (lookbehind)
|
if (lookbehind)
|
||||||
{
|
{
|
||||||
lookbehindlength = META_DATA(pptr[-1]);
|
lookbehindlength = META_DATA(pptr[-1]);
|
||||||
|
@ -8802,8 +8844,10 @@ for (;; pptr++)
|
||||||
case META_COND_VERSION:
|
case META_COND_VERSION:
|
||||||
case META_LOOKAHEAD:
|
case META_LOOKAHEAD:
|
||||||
case META_LOOKAHEADNOT:
|
case META_LOOKAHEADNOT:
|
||||||
|
case META_LOOKAHEAD_NA:
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
|
case META_LOOKBEHIND_NA:
|
||||||
case META_NOCAPTURE:
|
case META_NOCAPTURE:
|
||||||
case META_SCRIPT_RUN:
|
case META_SCRIPT_RUN:
|
||||||
nestlevel++;
|
nestlevel++;
|
||||||
|
@ -9064,6 +9108,7 @@ for (;; pptr++)
|
||||||
|
|
||||||
case META_LOOKAHEAD:
|
case META_LOOKAHEAD:
|
||||||
case META_LOOKAHEADNOT:
|
case META_LOOKAHEADNOT:
|
||||||
|
case META_LOOKAHEAD_NA:
|
||||||
pptr = parsed_skip(pptr + 1, PSKIP_KET);
|
pptr = parsed_skip(pptr + 1, PSKIP_KET);
|
||||||
if (pptr == NULL) goto PARSED_SKIP_FAILED;
|
if (pptr == NULL) goto PARSED_SKIP_FAILED;
|
||||||
|
|
||||||
|
@ -9102,6 +9147,7 @@ for (;; pptr++)
|
||||||
|
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
|
case META_LOOKBEHIND_NA:
|
||||||
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
|
if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
|
||||||
return -1;
|
return -1;
|
||||||
if (max - branchlength > extra) extra = max - branchlength;
|
if (max - branchlength > extra) extra = max - branchlength;
|
||||||
|
@ -9453,6 +9499,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
|
||||||
case META_KET:
|
case META_KET:
|
||||||
case META_LOOKAHEAD:
|
case META_LOOKAHEAD:
|
||||||
case META_LOOKAHEADNOT:
|
case META_LOOKAHEADNOT:
|
||||||
|
case META_LOOKAHEAD_NA:
|
||||||
case META_NOCAPTURE:
|
case META_NOCAPTURE:
|
||||||
case META_PLUS:
|
case META_PLUS:
|
||||||
case META_PLUS_PLUS:
|
case META_PLUS_PLUS:
|
||||||
|
@ -9514,6 +9561,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
|
||||||
|
|
||||||
case META_LOOKBEHIND:
|
case META_LOOKBEHIND:
|
||||||
case META_LOOKBEHINDNOT:
|
case META_LOOKBEHINDNOT:
|
||||||
|
case META_LOOKBEHIND_NA:
|
||||||
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
|
if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount, NULL, cb))
|
||||||
return errorcode;
|
return errorcode;
|
||||||
break;
|
break;
|
||||||
|
|
|
@ -173,6 +173,8 @@ static const uint8_t coptable[] = {
|
||||||
0, /* Assert not */
|
0, /* Assert not */
|
||||||
0, /* Assert behind */
|
0, /* Assert behind */
|
||||||
0, /* Assert behind not */
|
0, /* Assert behind not */
|
||||||
|
0, /* NA assert */
|
||||||
|
0, /* NA assert behind */
|
||||||
0, /* ONCE */
|
0, /* ONCE */
|
||||||
0, /* SCRIPT_RUN */
|
0, /* SCRIPT_RUN */
|
||||||
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
|
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
|
||||||
|
@ -248,6 +250,8 @@ static const uint8_t poptable[] = {
|
||||||
0, /* Assert not */
|
0, /* Assert not */
|
||||||
0, /* Assert behind */
|
0, /* Assert behind */
|
||||||
0, /* Assert behind not */
|
0, /* Assert behind not */
|
||||||
|
0, /* NA assert */
|
||||||
|
0, /* NA assert behind */
|
||||||
0, /* ONCE */
|
0, /* ONCE */
|
||||||
0, /* SCRIPT_RUN */
|
0, /* SCRIPT_RUN */
|
||||||
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
|
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
|
||||||
|
|
|
@ -185,6 +185,7 @@ static const unsigned char compile_error_texts[] =
|
||||||
"(*alpha_assertion) not recognized\0"
|
"(*alpha_assertion) not recognized\0"
|
||||||
"script runs require Unicode support, which this version of PCRE2 does not have\0"
|
"script runs require Unicode support, which this version of PCRE2 does not have\0"
|
||||||
"too many capturing groups (maximum 65535)\0"
|
"too many capturing groups (maximum 65535)\0"
|
||||||
|
"atomic assertion expected after (?( or (?(?C)\0"
|
||||||
;
|
;
|
||||||
|
|
||||||
/* Match-time and UTF error texts are in the same format. */
|
/* Match-time and UTF error texts are in the same format. */
|
||||||
|
|
|
@ -883,12 +883,16 @@ a positive value. */
|
||||||
#define STRING_atomic0 "atomic\0"
|
#define STRING_atomic0 "atomic\0"
|
||||||
#define STRING_pla0 "pla\0"
|
#define STRING_pla0 "pla\0"
|
||||||
#define STRING_plb0 "plb\0"
|
#define STRING_plb0 "plb\0"
|
||||||
|
#define STRING_napla0 "napla\0"
|
||||||
|
#define STRING_naplb0 "naplb\0"
|
||||||
#define STRING_nla0 "nla\0"
|
#define STRING_nla0 "nla\0"
|
||||||
#define STRING_nlb0 "nlb\0"
|
#define STRING_nlb0 "nlb\0"
|
||||||
#define STRING_sr0 "sr\0"
|
#define STRING_sr0 "sr\0"
|
||||||
#define STRING_asr0 "asr\0"
|
#define STRING_asr0 "asr\0"
|
||||||
#define STRING_positive_lookahead0 "positive_lookahead\0"
|
#define STRING_positive_lookahead0 "positive_lookahead\0"
|
||||||
#define STRING_positive_lookbehind0 "positive_lookbehind\0"
|
#define STRING_positive_lookbehind0 "positive_lookbehind\0"
|
||||||
|
#define STRING_non_atomic_positive_lookahead0 "non_atomic_positive_lookahead\0"
|
||||||
|
#define STRING_non_atomic_positive_lookbehind0 "non_atomic_positive_lookbehind\0"
|
||||||
#define STRING_negative_lookahead0 "negative_lookahead\0"
|
#define STRING_negative_lookahead0 "negative_lookahead\0"
|
||||||
#define STRING_negative_lookbehind0 "negative_lookbehind\0"
|
#define STRING_negative_lookbehind0 "negative_lookbehind\0"
|
||||||
#define STRING_script_run0 "script_run\0"
|
#define STRING_script_run0 "script_run\0"
|
||||||
|
@ -1173,12 +1177,16 @@ only. */
|
||||||
#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
|
#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
|
||||||
#define STRING_pla0 STR_p STR_l STR_a "\0"
|
#define STRING_pla0 STR_p STR_l STR_a "\0"
|
||||||
#define STRING_plb0 STR_p STR_l STR_b "\0"
|
#define STRING_plb0 STR_p STR_l STR_b "\0"
|
||||||
|
#define STRING_napla0 STR_n STR_a STR_p STR_l STR_a "\0"
|
||||||
|
#define STRING_naplb0 STR_n STR_a STR_p STR_l STR_b "\0"
|
||||||
#define STRING_nla0 STR_n STR_l STR_a "\0"
|
#define STRING_nla0 STR_n STR_l STR_a "\0"
|
||||||
#define STRING_nlb0 STR_n STR_l STR_b "\0"
|
#define STRING_nlb0 STR_n STR_l STR_b "\0"
|
||||||
#define STRING_sr0 STR_s STR_r "\0"
|
#define STRING_sr0 STR_s STR_r "\0"
|
||||||
#define STRING_asr0 STR_a STR_s STR_r "\0"
|
#define STRING_asr0 STR_a STR_s STR_r "\0"
|
||||||
#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
||||||
#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
||||||
|
#define STRING_non_atomic_positive_lookahead0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
||||||
|
#define STRING_non_atomic_positive_lookbehind0 STR_n STR_o STR_n STR_UNDERSCORE STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
||||||
#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
|
||||||
#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
|
||||||
#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
|
#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
|
||||||
|
@ -1303,7 +1311,7 @@ enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
|
||||||
Starting from 1 (i.e. after OP_END), the values up to OP_EOD must correspond in
|
Starting from 1 (i.e. after OP_END), the values up to OP_EOD must correspond in
|
||||||
order to the list of escapes immediately above. Furthermore, values up to
|
order to the list of escapes immediately above. Furthermore, values up to
|
||||||
OP_DOLLM must not be changed without adjusting the table called autoposstab in
|
OP_DOLLM must not be changed without adjusting the table called autoposstab in
|
||||||
pcre2_auto_possess.c
|
pcre2_auto_possess.c.
|
||||||
|
|
||||||
Whenever this list is updated, the two macro definitions that follow must be
|
Whenever this list is updated, the two macro definitions that follow must be
|
||||||
updated to match. The possessification table called "opcode_possessify" in
|
updated to match. The possessification table called "opcode_possessify" in
|
||||||
|
@ -1501,80 +1509,81 @@ enum {
|
||||||
OP_KETRMIN, /* 123 order. They are for groups that repeat for ever. */
|
OP_KETRMIN, /* 123 order. They are for groups that repeat for ever. */
|
||||||
OP_KETRPOS, /* 124 Possessive unlimited repeat. */
|
OP_KETRPOS, /* 124 Possessive unlimited repeat. */
|
||||||
|
|
||||||
/* The assertions must come before BRA, CBRA, ONCE, and COND, and the four
|
/* The assertions must come before BRA, CBRA, ONCE, and COND. */
|
||||||
asserts must remain in order. */
|
|
||||||
|
|
||||||
OP_REVERSE, /* 125 Move pointer back - used in lookbehind assertions */
|
OP_REVERSE, /* 125 Move pointer back - used in lookbehind assertions */
|
||||||
OP_ASSERT, /* 126 Positive lookahead */
|
OP_ASSERT, /* 126 Positive lookahead */
|
||||||
OP_ASSERT_NOT, /* 127 Negative lookahead */
|
OP_ASSERT_NOT, /* 127 Negative lookahead */
|
||||||
OP_ASSERTBACK, /* 128 Positive lookbehind */
|
OP_ASSERTBACK, /* 128 Positive lookbehind */
|
||||||
OP_ASSERTBACK_NOT, /* 129 Negative lookbehind */
|
OP_ASSERTBACK_NOT, /* 129 Negative lookbehind */
|
||||||
|
OP_ASSERT_NA, /* 130 Positive non-atomic lookahead */
|
||||||
|
OP_ASSERTBACK_NA, /* 131 Positive non-atomic lookbehind */
|
||||||
|
|
||||||
/* ONCE, SCRIPT_RUN, BRA, BRAPOS, CBRA, CBRAPOS, and COND must come
|
/* ONCE, SCRIPT_RUN, BRA, BRAPOS, CBRA, CBRAPOS, and COND must come
|
||||||
immediately after the assertions, with ONCE first, as there's a test for >=
|
immediately after the assertions, with ONCE first, as there's a test for >=
|
||||||
ONCE for a subpattern that isn't an assertion. The POS versions must
|
ONCE for a subpattern that isn't an assertion. The POS versions must
|
||||||
immediately follow the non-POS versions in each case. */
|
immediately follow the non-POS versions in each case. */
|
||||||
|
|
||||||
OP_ONCE, /* 130 Atomic group, contains captures */
|
OP_ONCE, /* 132 Atomic group, contains captures */
|
||||||
OP_SCRIPT_RUN, /* 131 Non-capture, but check characters' scripts */
|
OP_SCRIPT_RUN, /* 133 Non-capture, but check characters' scripts */
|
||||||
OP_BRA, /* 132 Start of non-capturing bracket */
|
OP_BRA, /* 134 Start of non-capturing bracket */
|
||||||
OP_BRAPOS, /* 133 Ditto, with unlimited, possessive repeat */
|
OP_BRAPOS, /* 135 Ditto, with unlimited, possessive repeat */
|
||||||
OP_CBRA, /* 134 Start of capturing bracket */
|
OP_CBRA, /* 136 Start of capturing bracket */
|
||||||
OP_CBRAPOS, /* 135 Ditto, with unlimited, possessive repeat */
|
OP_CBRAPOS, /* 137 Ditto, with unlimited, possessive repeat */
|
||||||
OP_COND, /* 136 Conditional group */
|
OP_COND, /* 138 Conditional group */
|
||||||
|
|
||||||
/* These five must follow the previous five, in the same order. There's a
|
/* These five must follow the previous five, in the same order. There's a
|
||||||
check for >= SBRA to distinguish the two sets. */
|
check for >= SBRA to distinguish the two sets. */
|
||||||
|
|
||||||
OP_SBRA, /* 137 Start of non-capturing bracket, check empty */
|
OP_SBRA, /* 139 Start of non-capturing bracket, check empty */
|
||||||
OP_SBRAPOS, /* 138 Ditto, with unlimited, possessive repeat */
|
OP_SBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */
|
||||||
OP_SCBRA, /* 139 Start of capturing bracket, check empty */
|
OP_SCBRA, /* 141 Start of capturing bracket, check empty */
|
||||||
OP_SCBRAPOS, /* 140 Ditto, with unlimited, possessive repeat */
|
OP_SCBRAPOS, /* 142 Ditto, with unlimited, possessive repeat */
|
||||||
OP_SCOND, /* 141 Conditional group, check empty */
|
OP_SCOND, /* 143 Conditional group, check empty */
|
||||||
|
|
||||||
/* The next two pairs must (respectively) be kept together. */
|
/* The next two pairs must (respectively) be kept together. */
|
||||||
|
|
||||||
OP_CREF, /* 142 Used to hold a capture number as condition */
|
OP_CREF, /* 144 Used to hold a capture number as condition */
|
||||||
OP_DNCREF, /* 143 Used to point to duplicate names as a condition */
|
OP_DNCREF, /* 145 Used to point to duplicate names as a condition */
|
||||||
OP_RREF, /* 144 Used to hold a recursion number as condition */
|
OP_RREF, /* 146 Used to hold a recursion number as condition */
|
||||||
OP_DNRREF, /* 145 Used to point to duplicate names as a condition */
|
OP_DNRREF, /* 147 Used to point to duplicate names as a condition */
|
||||||
OP_FALSE, /* 146 Always false (used by DEFINE and VERSION) */
|
OP_FALSE, /* 148 Always false (used by DEFINE and VERSION) */
|
||||||
OP_TRUE, /* 147 Always true (used by VERSION) */
|
OP_TRUE, /* 149 Always true (used by VERSION) */
|
||||||
|
|
||||||
OP_BRAZERO, /* 148 These two must remain together and in this */
|
OP_BRAZERO, /* 150 These two must remain together and in this */
|
||||||
OP_BRAMINZERO, /* 149 order. */
|
OP_BRAMINZERO, /* 151 order. */
|
||||||
OP_BRAPOSZERO, /* 150 */
|
OP_BRAPOSZERO, /* 152 */
|
||||||
|
|
||||||
/* These are backtracking control verbs */
|
/* These are backtracking control verbs */
|
||||||
|
|
||||||
OP_MARK, /* 151 always has an argument */
|
OP_MARK, /* 153 always has an argument */
|
||||||
OP_PRUNE, /* 152 */
|
OP_PRUNE, /* 154 */
|
||||||
OP_PRUNE_ARG, /* 153 same, but with argument */
|
OP_PRUNE_ARG, /* 155 same, but with argument */
|
||||||
OP_SKIP, /* 154 */
|
OP_SKIP, /* 156 */
|
||||||
OP_SKIP_ARG, /* 155 same, but with argument */
|
OP_SKIP_ARG, /* 157 same, but with argument */
|
||||||
OP_THEN, /* 156 */
|
OP_THEN, /* 158 */
|
||||||
OP_THEN_ARG, /* 157 same, but with argument */
|
OP_THEN_ARG, /* 159 same, but with argument */
|
||||||
OP_COMMIT, /* 158 */
|
OP_COMMIT, /* 160 */
|
||||||
OP_COMMIT_ARG, /* 159 same, but with argument */
|
OP_COMMIT_ARG, /* 161 same, but with argument */
|
||||||
|
|
||||||
/* These are forced failure and success verbs. FAIL and ACCEPT do accept an
|
/* These are forced failure and success verbs. FAIL and ACCEPT do accept an
|
||||||
argument, but these cases can be compiled as, for example, (*MARK:X)(*FAIL)
|
argument, but these cases can be compiled as, for example, (*MARK:X)(*FAIL)
|
||||||
without the need for a special opcode. */
|
without the need for a special opcode. */
|
||||||
|
|
||||||
OP_FAIL, /* 160 */
|
OP_FAIL, /* 162 */
|
||||||
OP_ACCEPT, /* 161 */
|
OP_ACCEPT, /* 163 */
|
||||||
OP_ASSERT_ACCEPT, /* 162 Used inside assertions */
|
OP_ASSERT_ACCEPT, /* 164 Used inside assertions */
|
||||||
OP_CLOSE, /* 163 Used before OP_ACCEPT to close open captures */
|
OP_CLOSE, /* 165 Used before OP_ACCEPT to close open captures */
|
||||||
|
|
||||||
/* This is used to skip a subpattern with a {0} quantifier */
|
/* This is used to skip a subpattern with a {0} quantifier */
|
||||||
|
|
||||||
OP_SKIPZERO, /* 164 */
|
OP_SKIPZERO, /* 166 */
|
||||||
|
|
||||||
/* This is used to identify a DEFINE group during compilation so that it can
|
/* This is used to identify a DEFINE group during compilation so that it can
|
||||||
be checked for having only one branch. It is changed to OP_FALSE before
|
be checked for having only one branch. It is changed to OP_FALSE before
|
||||||
compilation finishes. */
|
compilation finishes. */
|
||||||
|
|
||||||
OP_DEFINE, /* 165 */
|
OP_DEFINE, /* 167 */
|
||||||
|
|
||||||
/* This is not an opcode, but is used to check that tables indexed by opcode
|
/* This is not an opcode, but is used to check that tables indexed by opcode
|
||||||
are the correct length, in order to catch updating errors - there have been
|
are the correct length, in order to catch updating errors - there have been
|
||||||
|
@ -1587,7 +1596,7 @@ enum {
|
||||||
/* *** NOTE NOTE NOTE *** Whenever the list above is updated, the two macro
|
/* *** NOTE NOTE NOTE *** Whenever the list above is updated, the two macro
|
||||||
definitions that follow must also be updated to match. There are also tables
|
definitions that follow must also be updated to match. There are also tables
|
||||||
called "opcode_possessify" in pcre2_compile.c and "coptable" and "poptable" in
|
called "opcode_possessify" in pcre2_compile.c and "coptable" and "poptable" in
|
||||||
pcre2_dfa_exec.c that must be updated. */
|
pcre2_dfa_match.c that must be updated. */
|
||||||
|
|
||||||
|
|
||||||
/* This macro defines textual names for all the opcodes. These are used only
|
/* This macro defines textual names for all the opcodes. These are used only
|
||||||
|
@ -1620,7 +1629,9 @@ some cases doesn't actually use these names at all). */
|
||||||
"class", "nclass", "xclass", "Ref", "Refi", "DnRef", "DnRefi", \
|
"class", "nclass", "xclass", "Ref", "Refi", "DnRef", "DnRefi", \
|
||||||
"Recurse", "Callout", "CalloutStr", \
|
"Recurse", "Callout", "CalloutStr", \
|
||||||
"Alt", "Ket", "KetRmax", "KetRmin", "KetRpos", \
|
"Alt", "Ket", "KetRmax", "KetRmin", "KetRpos", \
|
||||||
"Reverse", "Assert", "Assert not", "AssertB", "AssertB not", \
|
"Reverse", "Assert", "Assert not", \
|
||||||
|
"Assert back", "Assert back not", \
|
||||||
|
"Non-atomic assert", "Non-atomic assert back", \
|
||||||
"Once", \
|
"Once", \
|
||||||
"Script run", \
|
"Script run", \
|
||||||
"Bra", "BraPos", "CBra", "CBraPos", \
|
"Bra", "BraPos", "CBra", "CBraPos", \
|
||||||
|
@ -1705,6 +1716,8 @@ in UTF-8 mode. The code that uses this table must know about such things. */
|
||||||
1+LINK_SIZE, /* Assert not */ \
|
1+LINK_SIZE, /* Assert not */ \
|
||||||
1+LINK_SIZE, /* Assert behind */ \
|
1+LINK_SIZE, /* Assert behind */ \
|
||||||
1+LINK_SIZE, /* Assert behind not */ \
|
1+LINK_SIZE, /* Assert behind not */ \
|
||||||
|
1+LINK_SIZE, /* NA Assert */ \
|
||||||
|
1+LINK_SIZE, /* NA Assert behind */ \
|
||||||
1+LINK_SIZE, /* ONCE */ \
|
1+LINK_SIZE, /* ONCE */ \
|
||||||
1+LINK_SIZE, /* SCRIPT_RUN */ \
|
1+LINK_SIZE, /* SCRIPT_RUN */ \
|
||||||
1+LINK_SIZE, /* BRA */ \
|
1+LINK_SIZE, /* BRA */ \
|
||||||
|
|
|
@ -5127,6 +5127,8 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
||||||
|
|
||||||
case OP_ASSERT:
|
case OP_ASSERT:
|
||||||
case OP_ASSERTBACK:
|
case OP_ASSERTBACK:
|
||||||
|
case OP_ASSERT_NA:
|
||||||
|
case OP_ASSERTBACK_NA:
|
||||||
Lframe_type = GF_NOCAPTURE | Fop;
|
Lframe_type = GF_NOCAPTURE | Fop;
|
||||||
for (;;)
|
for (;;)
|
||||||
{
|
{
|
||||||
|
@ -5497,10 +5499,20 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
||||||
case OP_SCOND:
|
case OP_SCOND:
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* Positive assertions are like OP_ONCE, except that in addition the
|
/* Non-atomic positive assertions are like OP_BRA, except that the
|
||||||
subject pointer must be put back to where it was at the start of the
|
subject pointer must be put back to where it was at the start of the
|
||||||
assertion. */
|
assertion. */
|
||||||
|
|
||||||
|
case OP_ASSERT_NA:
|
||||||
|
case OP_ASSERTBACK_NA:
|
||||||
|
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
|
||||||
|
Feptr = P->eptr;
|
||||||
|
break;
|
||||||
|
|
||||||
|
/* Atomic positive assertions are like OP_ONCE, except that in addition
|
||||||
|
the subject pointer must be put back to where it was at the start of the
|
||||||
|
assertion. */
|
||||||
|
|
||||||
case OP_ASSERT:
|
case OP_ASSERT:
|
||||||
case OP_ASSERTBACK:
|
case OP_ASSERTBACK:
|
||||||
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
|
if (Feptr > mb->last_used_ptr) mb->last_used_ptr = Feptr;
|
||||||
|
|
|
@ -392,6 +392,8 @@ for(;;)
|
||||||
case OP_ASSERT_NOT:
|
case OP_ASSERT_NOT:
|
||||||
case OP_ASSERTBACK:
|
case OP_ASSERTBACK:
|
||||||
case OP_ASSERTBACK_NOT:
|
case OP_ASSERTBACK_NOT:
|
||||||
|
case OP_ASSERT_NA:
|
||||||
|
case OP_ASSERTBACK_NA:
|
||||||
case OP_ONCE:
|
case OP_ONCE:
|
||||||
case OP_SCRIPT_RUN:
|
case OP_SCRIPT_RUN:
|
||||||
case OP_COND:
|
case OP_COND:
|
||||||
|
|
|
@ -240,6 +240,8 @@ for (;;)
|
||||||
case OP_ASSERT_NOT:
|
case OP_ASSERT_NOT:
|
||||||
case OP_ASSERTBACK:
|
case OP_ASSERTBACK:
|
||||||
case OP_ASSERTBACK_NOT:
|
case OP_ASSERTBACK_NOT:
|
||||||
|
case OP_ASSERT_NA:
|
||||||
|
case OP_ASSERTBACK_NA:
|
||||||
do cc += GET(cc, 1); while (*cc == OP_ALT);
|
do cc += GET(cc, 1); while (*cc == OP_ALT);
|
||||||
/* Fall through */
|
/* Fall through */
|
||||||
|
|
||||||
|
@ -1089,6 +1091,7 @@ do
|
||||||
case OP_ONCE:
|
case OP_ONCE:
|
||||||
case OP_SCRIPT_RUN:
|
case OP_SCRIPT_RUN:
|
||||||
case OP_ASSERT:
|
case OP_ASSERT:
|
||||||
|
case OP_ASSERT_NA:
|
||||||
rc = set_start_bits(re, tcode, utf);
|
rc = set_start_bits(re, tcode, utf);
|
||||||
if (rc == SSB_FAIL || rc == SSB_UNKNOWN) return rc;
|
if (rc == SSB_FAIL || rc == SSB_UNKNOWN) return rc;
|
||||||
if (rc == SSB_DONE) try_next = FALSE; else
|
if (rc == SSB_DONE) try_next = FALSE; else
|
||||||
|
@ -1131,6 +1134,7 @@ do
|
||||||
case OP_ASSERT_NOT:
|
case OP_ASSERT_NOT:
|
||||||
case OP_ASSERTBACK:
|
case OP_ASSERTBACK:
|
||||||
case OP_ASSERTBACK_NOT:
|
case OP_ASSERTBACK_NOT:
|
||||||
|
case OP_ASSERTBACK_NA:
|
||||||
do tcode += GET(tcode, 1); while (*tcode == OP_ALT);
|
do tcode += GET(tcode, 1); while (*tcode == OP_ALT);
|
||||||
tcode += 1 + LINK_SIZE;
|
tcode += 1 + LINK_SIZE;
|
||||||
break;
|
break;
|
||||||
|
|
|
@ -5653,4 +5653,33 @@ a)"xI
|
||||||
# Multiplication overflow
|
# Multiplication overflow
|
||||||
/(X{65535})(?<=\1{32770})/
|
/(X{65535})(?<=\1{32770})/
|
||||||
|
|
||||||
|
# ---- Non-atomic assertion tests ----
|
||||||
|
|
||||||
|
# Expect error: not allowed as a condition
|
||||||
|
/(?(*napla:xx)bc)/
|
||||||
|
|
||||||
|
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||||
|
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||||
|
|
||||||
|
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||||
|
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||||
|
|
||||||
|
/(*plb:(.)..|(.)...)(\1|\2)/
|
||||||
|
abcdb\=offset=4
|
||||||
|
abcda\=offset=4
|
||||||
|
|
||||||
|
/(*naplb:(.)..|(.)...)(\1|\2)/
|
||||||
|
abcdb\=offset=4
|
||||||
|
abcda\=offset=4
|
||||||
|
|
||||||
|
/(*non_atomic_positive_lookahead:ab)/B
|
||||||
|
|
||||||
|
/(*non_atomic_positive_lookbehind:ab)/B
|
||||||
|
|
||||||
|
/(*pla:ab+)/B
|
||||||
|
|
||||||
|
/(*napla:ab+)/B
|
||||||
|
|
||||||
|
# ----
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
|
|
|
@ -11117,7 +11117,7 @@ Matched, but too many substrings
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
Bra
|
Bra
|
||||||
Brazero
|
Brazero
|
||||||
AssertB
|
Assert back
|
||||||
Reverse
|
Reverse
|
||||||
CBra 1
|
CBra 1
|
||||||
abc
|
abc
|
||||||
|
@ -13346,7 +13346,7 @@ Failed: error 144 at offset 5: subpattern name must start with a non-digit
|
||||||
Ket
|
Ket
|
||||||
red
|
red
|
||||||
\b
|
\b
|
||||||
AssertB
|
Assert back
|
||||||
Reverse
|
Reverse
|
||||||
\w
|
\w
|
||||||
Ket
|
Ket
|
||||||
|
@ -13403,7 +13403,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
|
||||||
Once
|
Once
|
||||||
\s*+
|
\s*+
|
||||||
Ket
|
Ket
|
||||||
AssertB
|
Assert back
|
||||||
Reverse
|
Reverse
|
||||||
\w
|
\w
|
||||||
Ket
|
Ket
|
||||||
|
@ -16619,7 +16619,7 @@ No match
|
||||||
/(?<=(?=.){4,5}x)/B
|
/(?<=(?=.){4,5}x)/B
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
Bra
|
Bra
|
||||||
AssertB
|
Assert back
|
||||||
Reverse
|
Reverse
|
||||||
Assert
|
Assert
|
||||||
Any
|
Any
|
||||||
|
@ -17086,6 +17086,87 @@ Failed: error 187 at offset 15: lookbehind assertion is too long
|
||||||
/(X{65535})(?<=\1{32770})/
|
/(X{65535})(?<=\1{32770})/
|
||||||
Failed: error 187 at offset 10: lookbehind assertion is too long
|
Failed: error 187 at offset 10: lookbehind assertion is too long
|
||||||
|
|
||||||
|
# ---- Non-atomic assertion tests ----
|
||||||
|
|
||||||
|
# Expect error: not allowed as a condition
|
||||||
|
/(?(*napla:xx)bc)/
|
||||||
|
Failed: error 198 at offset 9: atomic assertion expected after (?( or (?(?C)
|
||||||
|
|
||||||
|
/\A(*pla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||||
|
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||||
|
No match
|
||||||
|
|
||||||
|
/\A(*napla:.*\b(\w++))(?>.*?\b\1\b){3}/
|
||||||
|
word1 word3 word1 word2 word3 word2 word2 word1 word3 word4
|
||||||
|
0: word1 word3 word1 word2 word3 word2 word2 word1 word3
|
||||||
|
1: word3
|
||||||
|
|
||||||
|
/(*plb:(.)..|(.)...)(\1|\2)/
|
||||||
|
abcdb\=offset=4
|
||||||
|
0: b
|
||||||
|
1: b
|
||||||
|
2: <unset>
|
||||||
|
3: b
|
||||||
|
abcda\=offset=4
|
||||||
|
No match
|
||||||
|
|
||||||
|
/(*naplb:(.)..|(.)...)(\1|\2)/
|
||||||
|
abcdb\=offset=4
|
||||||
|
0: b
|
||||||
|
1: b
|
||||||
|
2: <unset>
|
||||||
|
3: b
|
||||||
|
abcda\=offset=4
|
||||||
|
0: a
|
||||||
|
1: <unset>
|
||||||
|
2: a
|
||||||
|
3: a
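
The two lookbehind results above differ only in whether the assertion can be re-entered after the rest of the pattern fails. The sketch below is a hand-coded toy model of that behaviour for these two patterns at a single start position. It is not PCRE2 code: the function name and the reduction of each branch to "capture the character N positions back" are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of /(*plb:(.)..|(.)...)(\1|\2)/ (atomic = true) and of
   /(*naplb:(.)..|(.)...)(\1|\2)/ (atomic = false), tried at one start
   offset only. On success the matched character is stored in *out. */
static bool toy_match(const char *subject, size_t start, bool atomic,
  char *out)
{
  /* Branch 1 is (.).. and looks back 3; branch 2 is (.)... and looks back 4. */
  static const size_t lookback[2] = { 3, 4 };
  size_t len = strlen(subject);

  for (size_t b = 0; b < 2; b++)
    {
    if (start < lookback[b]) continue;   /* not enough text behind */
    char captured = subject[start - lookback[b]];

    /* Continuation (\1|\2): only the group set by this branch can match,
       because the other group is unset. */
    if (start < len && subject[start] == captured)
      {
      *out = captured;
      return true;
      }

    /* An atomic assertion is not re-entered once one branch has succeeded,
       so the second branch is never tried. */
    if (atomic) return false;
    }
  return false;
}
```

With subject "abcda" and start offset 4, the atomic form stops after the first branch captures "b" and the continuation fails, while the non-atomic form goes on to the second branch, captures "a" and matches, which agrees with the outputs listed above.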
|
||||||
|
|
||||||
|
/(*non_atomic_positive_lookahead:ab)/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
Non-atomic assert
|
||||||
|
ab
|
||||||
|
Ket
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
/(*non_atomic_positive_lookbehind:ab)/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
Non-atomic assert back
|
||||||
|
Reverse
|
||||||
|
ab
|
||||||
|
Ket
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
/(*pla:ab+)/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
Assert
|
||||||
|
a
|
||||||
|
b++
|
||||||
|
Ket
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
/(*napla:ab+)/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
Non-atomic assert
|
||||||
|
a
|
||||||
|
b+
|
||||||
|
Ket
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
# ----
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
||||||
Error -62: bad serialized data
|
Error -62: bad serialized data
|
||||||
|
|
|
@ -4017,7 +4017,7 @@ MK: a\x{12345}b\x{09}(d)c
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
Bra
|
Bra
|
||||||
\b
|
\b
|
||||||
AssertB
|
Assert back
|
||||||
Reverse
|
Reverse
|
||||||
prop Xwd
|
prop Xwd
|
||||||
Ket
|
Ket
|
||||||
|
@ -4196,7 +4196,7 @@ Failed: error 125 at offset 2: lookbehind assertion is not fixed length
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
Bra
|
Bra
|
||||||
^
|
^
|
||||||
AssertB not
|
Assert back not
|
||||||
Assert
|
Assert
|
||||||
\x{10385c}
|
\x{10385c}
|
||||||
Ket
|
Ket
|
||||||
|
@ -4828,7 +4828,7 @@ MK: ABC
|
||||||
/(?<!)(*sr:)/B
|
/(?<!)(*sr:)/B
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
Bra
|
Bra
|
||||||
AssertB not
|
Assert back not
|
||||||
Ket
|
Ket
|
||||||
Script run
|
Script run
|
||||||
Ket
|
Ket
|
||||||
|
@ -4839,7 +4839,7 @@ MK: ABC
|
||||||
/(?<=abc(?=X(*sr:BXY)CCC)XBXYCCC)./B
|
/(?<=abc(?=X(*sr:BXY)CCC)XBXYCCC)./B
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
Bra
|
Bra
|
||||||
AssertB
|
Assert back
|
||||||
Reverse
|
Reverse
|
||||||
abc
|
abc
|
||||||
Assert
|
Assert
|
||||||
|
|