Documentation update.
parent
b55ef12cc1
commit
77ef3e66ab
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "27 December 2016" "PCRE2 10.23"
|
.TH PCRE2PATTERN 3 "18 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -138,20 +138,23 @@ the application to apply the JIT optimization by calling
|
||||||
\fBpcre2_jit_compile()\fP is ignored.
|
\fBpcre2_jit_compile()\fP is ignored.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Setting match and recursion limits"
|
.SS "Setting match and backtracking depth limits"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The caller of \fBpcre2_match()\fP can set a limit on the number of times the
|
The pcre2_match() function contains a counter that is incremented every time it
|
||||||
internal \fBmatch()\fP function is called and on the maximum depth of
|
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
|
||||||
recursive calls. These facilities are provided to catch runaway matches that
|
this counter, which therefore limits the amount of computing resource used for
|
||||||
are provoked by patterns with huge matching trees (a typical example is a
|
a match. The maximum depth of nested backtracking can also be limited, and this
|
||||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
restricts the amount of heap memory that is used.
|
||||||
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
|
.P
|
||||||
gives an error return. The limits can also be set by items at the start of the
|
These facilities are provided to catch runaway matches that are provoked by
|
||||||
pattern of the form
|
patterns with huge matching trees (a typical example is a pattern with nested
|
||||||
|
unlimited repeats applied to a long string that does not match). When one of
|
||||||
|
these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
|
||||||
|
can also be set by items at the start of the pattern of the form
|
||||||
.sp
|
.sp
|
||||||
(*LIMIT_MATCH=d)
|
(*LIMIT_MATCH=d)
|
||||||
(*LIMIT_RECURSION=d)
|
(*LIMIT_DEPTH=d)
|
||||||
.sp
|
.sp
|
||||||
where d is any number of decimal digits. However, the value of the setting must
|
where d is any number of decimal digits. However, the value of the setting must
|
||||||
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
||||||
|
@ -159,10 +162,14 @@ for it to have any effect. In other words, the pattern writer can lower the
|
||||||
limits set by the programmer, but not raise them. If there is more than one
|
limits set by the programmer, but not raise them. If there is more than one
|
||||||
setting of one of these limits, the lower value is used.
|
setting of one of these limits, the lower value is used.
|
||||||
.P
|
.P
|
||||||
|
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||||
|
still recognized for backwards compatibility.
|
||||||
|
.P
|
||||||
The match limit is used (but in a different way) when JIT is being used, but it
|
The match limit is used (but in a different way) when JIT is being used, but it
|
||||||
is not relevant, and is ignored, when matching with \fBpcre2_dfa_match()\fP.
|
is not relevant, and is ignored, when matching with \fBpcre2_dfa_match()\fP.
|
||||||
However, the recursion limit is relevant for DFA matching, which does use some
|
However, the depth limit is relevant for DFA matching, which uses function
|
||||||
function recursion, in particular, for recursions within the pattern.
|
recursion for recursions within the pattern. In this case, the depth limit
|
||||||
|
controls the amount of system stack that is used.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="newlines"></a>
|
.\" HTML <a name="newlines"></a>
|
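Reviewer note on the hunk above: the rule that a pattern can lower, but never raise, the caller's limits can be modelled in a few lines. This is a plain Python sketch of the rule as the text states it, not PCRE2 code; the function name `effective_limits` and its parameters are hypothetical, and it only looks at limit items at the very start of the pattern.

```python
import re

# Leading items of the form (*LIMIT_MATCH=d), (*LIMIT_DEPTH=d) or the
# older spelling (*LIMIT_RECURSION=d), where d is decimal digits.
_ITEM = re.compile(r"\(\*(LIMIT_MATCH|LIMIT_DEPTH|LIMIT_RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_match, caller_depth):
    """Return (match_limit, depth_limit) after applying leading pattern items.

    Hypothetical helper: a pattern item only takes effect when it is lower
    than the caller's value, and repeated items resolve to the lowest one.
    """
    limits = {"LIMIT_MATCH": caller_match, "LIMIT_DEPTH": caller_depth}
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        # LIMIT_RECURSION is the pre-10.30 name for LIMIT_DEPTH.
        name = "LIMIT_DEPTH" if m.group(1) == "LIMIT_RECURSION" else m.group(1)
        limits[name] = min(limits[name], int(m.group(2)))
        pos = m.end()
    return limits["LIMIT_MATCH"], limits["LIMIT_DEPTH"]
```

Taking the minimum covers both statements in the text at once: a pattern value above the caller's limit has no effect, and with several settings the lowest wins.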
||||||
|
@ -206,8 +213,8 @@ The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
||||||
what the \eR escape sequence matches. By default, this is any Unicode newline
|
what the \eR escape sequence matches. By default, this is any Unicode newline
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the
|
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||||
description of \eR in the section entitled
|
section and the description of \eR in the section entitled
|
||||||
.\" HTML <a href="#newlineseq">
|
.\" HTML <a href="#newlineseq">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
"Newline sequences"
|
"Newline sequences"
|
||||||
|
@ -230,7 +237,7 @@ corresponding to PCRE2_BSR_UNICODE.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||||
character code rather than ASCII or Unicode (typically a mainframe system). In
|
character code instead of ASCII or Unicode (typically a mainframe system). In
|
||||||
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
||||||
environment these characters may have different code values, and there are no
|
environment these characters may have different code values, and there are no
|
||||||
code points greater than 255.
|
code points greater than 255.
|
||||||
|
@ -297,11 +304,11 @@ character that is not a number or a letter, it takes away any special meaning
|
||||||
that character may have. This use of backslash as an escape character applies
|
that character may have. This use of backslash as an escape character applies
|
||||||
both inside and outside character classes.
|
both inside and outside character classes.
|
||||||
.P
|
.P
|
||||||
For example, if you want to match a * character, you write \e* in the pattern.
|
For example, if you want to match a * character, you must write \e* in the
|
||||||
This escaping action applies whether or not the following character would
|
pattern. This escaping action applies whether or not the following character
|
||||||
otherwise be interpreted as a metacharacter, so it is always safe to precede a
|
would otherwise be interpreted as a metacharacter, so it is always safe to
|
||||||
non-alphanumeric with backslash to specify that it stands for itself. In
|
precede a non-alphanumeric with backslash to specify that it stands for itself.
|
||||||
particular, if you want to match a backslash, you write \e\e.
|
In particular, if you want to match a backslash, you write \e\e.
|
||||||
.P
|
.P
|
||||||
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
||||||
backslash. All other characters (in particular, those whose codepoints are
|
backslash. All other characters (in particular, those whose codepoints are
|
||||||
|
@ -331,7 +338,7 @@ An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
|
||||||
by \eE later in the pattern, the literal interpretation continues to the end of
|
by \eE later in the pattern, the literal interpretation continues to the end of
|
||||||
the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
|
the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
|
||||||
a character class, this causes an error, because the character class is not
|
a character class, this causes an error, because the character class is not
|
||||||
terminated.
|
terminated by a closing square bracket.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="digitsafterbackslash"></a>
|
.\" HTML <a name="digitsafterbackslash"></a>
|
||||||
|
@ -459,9 +466,9 @@ a hexadecimal digit appears between \ex{ and }, or if there is no terminating
|
||||||
.P
|
.P
|
||||||
If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
|
If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
|
||||||
described only when it is followed by two hexadecimal digits. Otherwise, it
|
described only when it is followed by two hexadecimal digits. Otherwise, it
|
||||||
matches a literal "x" character. In this mode mode, support for code points
|
matches a literal "x" character. In this mode, support for code points greater
|
||||||
greater than 256 is provided by \eu, which must be followed by four hexadecimal
|
than 256 is provided by \eu, which must be followed by four hexadecimal digits;
|
||||||
digits; otherwise it matches a literal "u" character.
|
otherwise it matches a literal "u" character.
|
||||||
.P
|
.P
|
||||||
Characters whose value is less than 256 can be defined by either of the two
|
Characters whose value is less than 256 can be defined by either of the two
|
||||||
syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
|
syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
|
||||||
|
@ -475,12 +482,10 @@ the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
|
||||||
Characters that are specified using octal or hexadecimal numbers are
|
Characters that are specified using octal or hexadecimal numbers are
|
||||||
limited to certain values, as follows:
|
limited to certain values, as follows:
|
||||||
.sp
|
.sp
|
||||||
8-bit non-UTF mode less than 0x100
|
8-bit non-UTF mode no greater than 0xff
|
||||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
16-bit non-UTF mode no greater than 0xffff
|
||||||
16-bit non-UTF mode less than 0x10000
|
32-bit non-UTF mode no greater than 0xffffffff
|
||||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||||
32-bit non-UTF mode less than 0x100000000
|
|
||||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
|
||||||
.sp
|
.sp
|
||||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
||||||
"surrogate" codepoints), and 0xffef.
|
"surrogate" codepoints), and 0xffef.
|
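Reviewer note on the hunk above: the reworded table changes "less than" bounds to inclusive "no greater than" bounds and folds the three UTF rows into one. A small Python sketch of the new table, assuming the invalid values are exactly those the surrounding text lists; the function names are hypothetical and this is not how PCRE2 itself performs the check.

```python
def max_code_point(width, utf):
    """Largest value allowed for an octal/hex escape, per the new table."""
    if utf:
        return 0x10FFFF  # all UTF modes share the Unicode limit
    return {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[width]


def is_allowed(value, width, utf):
    """True if `value` is within the table's limit for this mode.

    In UTF modes the value must also be a valid codepoint; the invalid
    values used here are the ones listed in the text (surrogates, 0xffef).
    """
    if value < 0 or value > max_code_point(width, utf):
        return False
    if utf and (0xD800 <= value <= 0xDFFF or value == 0xFFEF):
        return False
    return True
```

Note that the surrogate check applies only in UTF modes: in 16-bit non-UTF mode 0xd800 is an ordinary code unit value.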
||||||
|
@ -506,7 +511,7 @@ In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
|
||||||
handler and used to modify the case of following characters. By default, PCRE2
|
handler and used to modify the case of following characters. By default, PCRE2
|
||||||
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
||||||
is set, \eU matches a "U" character, and \eu can be used to define a character
|
is set, \eU matches a "U" character, and \eu can be used to define a character
|
||||||
by code point, as described in the previous section.
|
by code point, as described above.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Absolute and relative back references"
|
.SS "Absolute and relative back references"
|
||||||
|
@ -714,7 +719,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
|
||||||
sequences that match characters with specific properties are available. In
|
sequences that match characters with specific properties are available. In
|
||||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||||
characters whose codepoints are less than 256, but they do work in this mode.
|
characters whose codepoints are less than 256, but they do work in this mode.
|
||||||
The extra escape sequences are:
|
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
|
||||||
|
may be encountered. These are all treated as being in the Common script and
|
||||||
|
with an unassigned type. The extra escape sequences are:
|
||||||
.sp
|
.sp
|
||||||
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
||||||
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
||||||
|
@ -2224,15 +2231,8 @@ except that it does not cause the current matching position to be changed.
|
||||||
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
||||||
contains capturing subpatterns within it, these are counted for the purposes of
|
contains capturing subpatterns within it, these are counted for the purposes of
|
||||||
numbering the capturing subpatterns in the whole pattern. However, substring
|
numbering the capturing subpatterns in the whole pattern. However, substring
|
||||||
capturing is carried out only for positive assertions. (Perl sometimes, but not
|
capturing is normally carried out only for positive assertions (but see the
|
||||||
always, does do capturing in negative assertions.)
|
discussion of conditional subpatterns below).
|
||||||
.P
|
|
||||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
|
||||||
succeeds, but failure to match later in the pattern causes backtracking over
|
|
||||||
this assertion, the captures within the assertion are reset only if no higher
|
|
||||||
numbered captures are already set. This is, unfortunately, a fundamental
|
|
||||||
limitation of the current implementation; it may get removed in a future
|
|
||||||
reworking.
|
|
||||||
.P
|
.P
|
||||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||||
it makes no sense to assert the same thing several times, the side effect of
|
it makes no sense to assert the same thing several times, the side effect of
|
||||||
|
@ -2619,6 +2619,11 @@ presence of at least one letter in the subject. If a letter is found, the
|
||||||
subject is matched against the first alternative; otherwise it is matched
|
subject is matched against the first alternative; otherwise it is matched
|
||||||
against the second. This pattern matches strings in one of the two forms
|
against the second. This pattern matches strings in one of the two forms
|
||||||
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
||||||
|
.P
|
||||||
|
For Perl compatibility, if an assertion that is a condition contains capturing
|
||||||
|
subpatterns, any capturing that occurs is retained afterwards, for both
|
||||||
|
positive and negative assertions. (Compare non-conditional assertions, when
|
||||||
|
captures are retained only for positive assertions.)
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="comments"></a>
|
.\" HTML <a name="comments"></a>
|
||||||
|
@ -2798,88 +2803,53 @@ is the actual recursive call.
|
||||||
.SS "Differences in recursion processing between PCRE2 and Perl"
|
.SS "Differences in recursion processing between PCRE2 and Perl"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
|
Some former differences between PCRE2 and Perl no longer exist.
|
||||||
(like Python, but unlike Perl), a recursive subpattern call is always treated
|
|
||||||
as an atomic group. That is, once it has matched some of the subject string, it
|
|
||||||
is never re-entered, even if it contains untried alternatives and there is a
|
|
||||||
subsequent matching failure. This can be illustrated by the following pattern,
|
|
||||||
which purports to match a palindromic string that contains an odd number of
|
|
||||||
characters (for example, "a", "aba", "abcba", "abcdcba"):
|
|
||||||
.sp
|
|
||||||
^(.|(.)(?1)\e2)$
|
|
||||||
.sp
|
|
||||||
The idea is that it either matches a single character, or two identical
|
|
||||||
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
|
|
||||||
it does not if the pattern is longer than three characters. Consider the
|
|
||||||
subject string "abcba":
|
|
||||||
.P
|
.P
|
||||||
At the top level, the first character is matched, but as it is not at the end
|
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
|
||||||
of the string, the first alternative fails; the second alternative is taken
|
a recursive subpattern call was always treated as an atomic group. That is,
|
||||||
and the recursion kicks in. The recursive call to subpattern 1 successfully
|
once it had matched some of the subject string, it was never re-entered, even
|
||||||
matches the next character ("b"). (Note that the beginning and end of line
|
if it contained untried alternatives and there was a subsequent matching
|
||||||
tests are not part of the recursion).
|
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||||
.P
|
.P
|
||||||
Back at the top level, the next character ("c") is compared with what
|
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||||
subpattern 2 matched, which was "a". This fails. Because the recursion is
|
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||||
treated as an atomic group, there are now no backtracking points, and so the
|
is a matching failure later in the pattern. This is now compatible with the way
|
||||||
entire match fails. (Perl is able, at this point, to re-enter the recursion and
|
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||||
try the second alternative.) However, if the pattern is written with the
|
enclose it in an atomic group.
|
||||||
alternatives in the other order, things are different:
|
|
||||||
.sp
|
|
||||||
^((.)(?1)\e2|.)$
|
|
||||||
.sp
|
|
||||||
This time, the recursing alternative is tried first, and continues to recurse
|
|
||||||
until it runs out of characters, at which point the recursion fails. But this
|
|
||||||
time we do have another alternative to try at the higher level. That is the big
|
|
||||||
difference: in the previous case the remaining alternative is at a deeper
|
|
||||||
recursion level, which PCRE2 cannot use.
|
|
||||||
.P
|
.P
|
||||||
To change the pattern so that it matches all palindromic strings, not just
|
Supporting backtracking into recursions simplifies certain types of recursive
|
||||||
those with an odd number of characters, it is tempting to change the pattern to
|
pattern. For example, this pattern matches palindromic strings:
|
||||||
this:
|
|
||||||
.sp
|
.sp
|
||||||
^((.)(?1)\e2|.?)$
|
^((.)(?1)\e2|.?)$
|
||||||
.sp
|
.sp
|
||||||
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
|
The second branch in the group matches a single central character in the
|
||||||
deeper recursion has matched a single character, it cannot be entered again in
|
palindrome when there are an odd number of characters, or nothing when there
|
||||||
order to match an empty string. The solution is to separate the two cases, and
|
are an even number of characters, but in order to work it has to be able to try
|
||||||
write out the odd and even cases as alternatives at the higher level:
|
the second case when the rest of the pattern match fails. If you want to match
|
||||||
|
typical palindromic phrases, the pattern has to ignore all non-word characters,
|
||||||
|
which can be done like this:
|
||||||
.sp
|
.sp
|
||||||
^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
|
^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$
|
||||||
.sp
|
|
||||||
If you want to match typical palindromic phrases, the pattern has to ignore all
|
|
||||||
non-word characters, which can be done like this:
|
|
||||||
.sp
|
|
||||||
^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
|
|
||||||
.sp
|
.sp
|
||||||
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
||||||
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
|
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
|
||||||
use of the possessive quantifier *+ to avoid backtracking into sequences of
|
avoid backtracking into sequences of non-word characters. Without this, PCRE2
|
||||||
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
|
takes a great deal longer (ten times or more) to match typical phrases, and
|
||||||
or more) to match typical phrases, and Perl takes so long that you think it has
|
Perl takes so long that you think it has gone into a loop.
|
||||||
gone into a loop.
|
|
||||||
.P
|
.P
|
||||||
\fBWARNING\fP: The palindrome-matching patterns above work only if the subject
|
Another way in which PCRE2 and Perl used to differ in their recursion
|
||||||
string does not start with a palindrome that is shorter than the entire string.
|
processing is in the handling of captured values. Formerly in Perl, when a
|
||||||
For example, although "abcba" is correctly matched, if the subject is "ababa",
|
subpattern was called recursively or as a subpattern (see the next section), it
|
||||||
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
|
had no access to any values that were captured outside the recursion, whereas
|
||||||
the end of the string does not follow. Once again, it cannot jump back into the
|
in PCRE2 these values can be referenced. Consider this pattern:
|
||||||
recursion to try other alternatives, so the entire match fails.
|
|
||||||
.P
|
|
||||||
The second way in which PCRE2 and Perl differ in their recursion processing is
|
|
||||||
in the handling of captured values. In Perl, when a subpattern is called
|
|
||||||
recursively or as a subpattern (see the next section), it has no access to any
|
|
||||||
values that were captured outside the recursion, whereas in PCRE2 these values
|
|
||||||
can be referenced. Consider this pattern:
|
|
||||||
.sp
|
.sp
|
||||||
^(.)(\e1|a(?2))
|
^(.)(\e1|a(?2))
|
||||||
.sp
|
.sp
|
||||||
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
|
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||||
then in the second group, when the back reference \e1 fails to match "b", the
|
the second group, when the back reference \e1 fails to match "b", the second
|
||||||
second alternative matches "a" and then recurses. In the recursion, \e1 does
|
alternative matches "a" and then recurses. In the recursion, \e1 does now match
|
||||||
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
|
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||||
match because inside the recursive call \e1 cannot access the externally set
|
later versions (I tried 5.024) it now works.
|
||||||
value.
|
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="subpatternsassubroutines"></a>
|
.\" HTML <a name="subpatternsassubroutines"></a>
|
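Reviewer note on the hunk above: the behavioural change for the palindrome pattern ^((.)(?1)\e2|.?)$ can be seen in a toy backtracking matcher. The Python sketch below hard-codes that one pattern; it is a model written for this note, not PCRE2's matcher. The names `group_ends` and `matches` are hypothetical, and "atomic" is modelled by letting a recursive call offer only its first successful match.

```python
from itertools import islice


def group_ends(s, pos, atomic):
    """Yield every end position at which group 1 can match from `pos`.

    Group 1 is ((.)(?1)\\2|.?). With atomic=True a recursive call (?1)
    yields only its first match and cannot be re-entered, which models
    the behaviour before release 10.30.
    """
    # First alternative: (.) (?1) \2
    if pos < len(s):
        c = s[pos]
        inner = group_ends(s, pos + 1, atomic)
        if atomic:
            inner = islice(inner, 1)  # no backtracking into the recursion
        for mid in inner:
            if mid < len(s) and s[mid] == c:
                yield mid + 1
    # Second alternative: .? (greedy, so one character before none)
    if pos < len(s):
        yield pos + 1
    yield pos


def matches(s, atomic=False):
    """Anchored match: some way for group 1 to consume the whole string."""
    return any(end == len(s) for end in group_ends(s, 0, atomic))
```

With `atomic=False` the matcher accepts "abba", because the innermost recursion can be re-entered to match the empty string after first matching a single character. With `atomic=True` the same subject fails, which is the old limitation the removed text described.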
||||||
|
@ -2908,11 +2878,10 @@ matches "sense and sensibility" and "response and responsibility", but not
|
||||||
is used, it does match "sense and responsibility" as well as the other two
|
is used, it does match "sense and responsibility" as well as the other two
|
||||||
strings. Another example is given in the discussion of DEFINE above.
|
strings. Another example is given in the discussion of DEFINE above.
|
||||||
.P
|
.P
|
||||||
All subroutine calls, whether recursive or not, are always treated as atomic
|
Like recursions, subroutine calls used to be treated as atomic, but this
|
||||||
groups. That is, once a subroutine has matched some of the subject string, it
|
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
|
||||||
is never re-entered, even if it contains untried alternatives and there is a
|
occur. However, any capturing parentheses that are set during the subroutine
|
||||||
subsequent matching failure. Any capturing parentheses that are set during the
|
call revert to their previous values afterwards.
|
||||||
subroutine call revert to their previous values afterwards.
|
|
||||||
.P
|
.P
|
||||||
Processing options such as case-independence are fixed when a subpattern is
|
Processing options such as case-independence are fixed when a subpattern is
|
||||||
defined, so if it is used as a subroutine, such options cannot be changed for
|
defined, so if it is used as a subroutine, such options cannot be changed for
|
||||||
|
@ -3025,16 +2994,10 @@ The doubling is removed before the string is passed to the callout function.
|
||||||
.SH "BACKTRACKING CONTROL"
|
.SH "BACKTRACKING CONTROL"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
|
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||||
are still described in the Perl documentation as "experimental and subject to
|
terminology) that modify the behaviour of backtracking during matching. They
|
||||||
change or removal in a future version of Perl". It goes on to say: "Their usage
|
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||||
in production code should be noted to avoid problems during upgrades." The same
|
possibly behaving differently depending on whether or not a name is present.
|
||||||
remarks apply to the PCRE2 features described in this section.
|
|
||||||
.P
|
|
||||||
The new verbs make use of what was previously invalid syntax: an opening
|
|
||||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
|
||||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
|
||||||
depending on whether or not a name is present.
|
|
||||||
.P
|
.P
|
||||||
By default, for compatibility with Perl, a name is any sequence of characters
|
By default, for compatibility with Perl, a name is any sequence of characters
|
||||||
that does not include a closing parenthesis. The name is not processed in
|
that does not include a closing parenthesis. The name is not processed in
|
||||||
|
@ -3061,7 +3024,7 @@ not there. Any number of these verbs may occur in a pattern.
|
||||||
.P
|
.P
|
||||||
Since these verbs are specifically related to backtracking, most of them can be
|
Since these verbs are specifically related to backtracking, most of them can be
|
||||||
used only when the pattern is to be matched using the traditional matching
|
used only when the pattern is to be matched using the traditional matching
|
||||||
function, because these use a backtracking algorithm. With the exception of
|
function, because that uses a backtracking algorithm. With the exception of
|
||||||
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
||||||
control verbs cause an error if encountered by the DFA matching function.
|
control verbs cause an error if encountered by the DFA matching function.
|
||||||
.P
|
.P
|
||||||
|
@ -3215,11 +3178,11 @@ to ensure that the match is always attempted.
|
||||||
The following verbs do nothing when they are encountered. Matching continues
|
The following verbs do nothing when they are encountered. Matching continues
|
||||||
with what follows, but if there is no subsequent match, causing a backtrack to
|
with what follows, but if there is no subsequent match, causing a backtrack to
|
||||||
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
||||||
the verb. However, when one of these verbs appears inside an atomic group
|
the verb. However, when one of these verbs appears inside an atomic group or in
|
||||||
(which includes any group that is called as a subroutine) or in an assertion
|
an assertion that is true, its effect is confined to that group, because once
|
||||||
that is true, its effect is confined to that group, because once the group has
|
the group has been matched, there is never any backtracking into it. In this
|
||||||
been matched, there is never any backtracking into it. In this situation,
|
situation, backtracking has to jump to the left of the entire atomic group or
|
||||||
backtracking has to jump to the left of the entire atomic group or assertion.
|
assertion.
|
||||||
.P
|
.P
|
||||||
These verbs differ in exactly what kind of failure occurs when backtracking
|
These verbs differ in exactly what kind of failure occurs when backtracking
|
||||||
reaches them. The behaviour described below is what happens when the verb is
|
reaches them. The behaviour described below is what happens when the verb is
|
||||||
|
@ -3279,8 +3242,8 @@ possessive quantifier, but there are some uses of (*PRUNE) that cannot be
|
||||||
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
|
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
|
||||||
as (*COMMIT).
|
as (*COMMIT).
|
||||||
.P
|
.P
|
||||||
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
|
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
|
||||||
It is like (*MARK:NAME) in that the name is remembered for passing back to the
|
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||||
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
||||||
ignoring those set by (*PRUNE) or (*THEN).
|
ignoring those set by (*PRUNE) or (*THEN).
|
||||||
.sp
|
.sp
|
||||||
|
@ -3482,6 +3445,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 27 December 2016
|
Last updated: 18 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|