Documentation update.
parent
b55ef12cc1
commit
77ef3e66ab
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "27 December 2016" "PCRE2 10.23"
|
||||
.TH PCRE2PATTERN 3 "18 March 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -138,20 +138,23 @@ the application to apply the JIT optimization by calling
|
|||
\fBpcre2_jit_compile()\fP is ignored.
|
||||
.
|
||||
.
|
||||
.SS "Setting match and recursion limits"
|
||||
.SS "Setting match and backtracking depth limits"
|
||||
.rs
|
||||
.sp
|
||||
The caller of \fBpcre2_match()\fP can set a limit on the number of times the
|
||||
internal \fBmatch()\fP function is called and on the maximum depth of
|
||||
recursive calls. These facilities are provided to catch runaway matches that
|
||||
are provoked by patterns with huge matching trees (a typical example is a
|
||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
||||
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
|
||||
gives an error return. The limits can also be set by items at the start of the
|
||||
pattern of the form
|
||||
The \fBpcre2_match()\fP function contains a counter that is incremented every time it
|
||||
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
|
||||
this counter, which therefore limits the amount of computing resource used for
|
||||
a match. The maximum depth of nested backtracking can also be limited, and this
|
||||
restricts the amount of heap memory that is used.
|
||||
.P
|
||||
These facilities are provided to catch runaway matches that are provoked by
|
||||
patterns with huge matching trees (a typical example is a pattern with nested
|
||||
unlimited repeats applied to a long string that does not match). When one of
|
||||
these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
|
||||
can also be set by items at the start of the pattern of the form
|
||||
.sp
|
||||
(*LIMIT_MATCH=d)
|
||||
(*LIMIT_RECURSION=d)
|
||||
(*LIMIT_DEPTH=d)
|
||||
.sp
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
||||
|
@ -159,10 +162,14 @@ for it to have any effect. In other words, the pattern writer can lower the
|
|||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
.P
|
||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||
still recognized for backwards compatibility.
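The "lower but never raise" rule described above can be sketched as follows. This is only an illustration of the rule, not PCRE2's own code: the function name effective_limits and its parameters are hypothetical, and the sketch assumes that the limit items are the only items at the very start of the pattern (other leading items are not handled).

```python
import re

# Hypothetical helper: recognises only the three limit items named in the text.
_LIMIT_ITEM = re.compile(r"\(\*(LIMIT_MATCH|LIMIT_DEPTH|LIMIT_RECURSION)=(\d+)\)")


def effective_limits(pattern, match_limit, depth_limit):
    """Return (match_limit, depth_limit) after applying leading limit items.

    match_limit and depth_limit stand for the values set (or defaulted)
    by the caller. A pattern item can lower a limit but never raise it,
    and with several settings of the same limit the lowest value wins.
    """
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        name, value = item.group(1), int(item.group(2))
        if name == "LIMIT_MATCH":
            match_limit = min(match_limit, value)
        else:
            # LIMIT_RECURSION is the pre-10.30 name for LIMIT_DEPTH.
            depth_limit = min(depth_limit, value)
        pos = item.end()
    return match_limit, depth_limit
```

With caller limits of 1000 and 3, the pattern "(*LIMIT_MATCH=100)(*LIMIT_RECURSION=5)abc" lowers the match limit to 100, but the depth setting of 5 has no effect because it is not less than the caller's value.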
|
||||
.P
|
||||
The match limit is used (but in a different way) when JIT is being used, but it
|
||||
is not relevant, and is ignored, when matching with \fBpcre2_dfa_match()\fP.
|
||||
However, the recursion limit is relevant for DFA matching, which does use some
|
||||
function recursion, in particular, for recursions within the pattern.
|
||||
However, the depth limit is relevant for DFA matching, which uses function
|
||||
recursion for recursions within the pattern. In this case, the depth limit
|
||||
controls the amount of system stack that is used.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="newlines"></a>
|
||||
|
@ -206,8 +213,8 @@ The newline convention affects where the circumflex and dollar assertions are
|
|||
true. It also affects the interpretation of the dot metacharacter when
|
||||
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
||||
what the \eR escape sequence matches. By default, this is any Unicode newline
|
||||
sequence, for Perl compatibility. However, this can be changed; see the
|
||||
description of \eR in the section entitled
|
||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||
section and the description of \eR in the section entitled
|
||||
.\" HTML <a href="#newlineseq">
|
||||
.\" </a>
|
||||
"Newline sequences"
|
||||
|
@ -230,7 +237,7 @@ corresponding to PCRE2_BSR_UNICODE.
|
|||
.rs
|
||||
.sp
|
||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||
character code rather than ASCII or Unicode (typically a mainframe system). In
|
||||
character code instead of ASCII or Unicode (typically a mainframe system). In
|
||||
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
||||
environment these characters may have different code values, and there are no
|
||||
code points greater than 255.
|
||||
|
@ -297,11 +304,11 @@ character that is not a number or a letter, it takes away any special meaning
|
|||
that character may have. This use of backslash as an escape character applies
|
||||
both inside and outside character classes.
|
||||
.P
|
||||
For example, if you want to match a * character, you write \e* in the pattern.
|
||||
This escaping action applies whether or not the following character would
|
||||
otherwise be interpreted as a metacharacter, so it is always safe to precede a
|
||||
non-alphanumeric with backslash to specify that it stands for itself. In
|
||||
particular, if you want to match a backslash, you write \e\e.
|
||||
For example, if you want to match a * character, you must write \e* in the
|
||||
pattern. This escaping action applies whether or not the following character
|
||||
would otherwise be interpreted as a metacharacter, so it is always safe to
|
||||
precede a non-alphanumeric with backslash to specify that it stands for itself.
|
||||
In particular, if you want to match a backslash, you write \e\e.
|
||||
.P
|
||||
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
||||
backslash. All other characters (in particular, those whose codepoints are
|
||||
|
@ -331,7 +338,7 @@ An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
|
|||
by \eE later in the pattern, the literal interpretation continues to the end of
|
||||
the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
|
||||
a character class, this causes an error, because the character class is not
|
||||
terminated.
|
||||
terminated by a closing square bracket.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="digitsafterbackslash"></a>
|
||||
|
@ -459,9 +466,9 @@ a hexadecimal digit appears between \ex{ and }, or if there is no terminating
|
|||
.P
|
||||
If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
|
||||
described only when it is followed by two hexadecimal digits. Otherwise, it
|
||||
matches a literal "x" character. In this mode mode, support for code points
|
||||
greater than 256 is provided by \eu, which must be followed by four hexadecimal
|
||||
digits; otherwise it matches a literal "u" character.
|
||||
matches a literal "x" character. In this mode, support for code points greater
|
||||
than 255 is provided by \eu, which must be followed by four hexadecimal digits;
|
||||
otherwise it matches a literal "u" character.
|
||||
.P
|
||||
Characters whose value is less than 256 can be defined by either of the two
|
||||
syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
|
||||
|
@ -475,12 +482,10 @@ the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
|
|||
Characters that are specified using octal or hexadecimal numbers are
|
||||
limited to certain values, as follows:
|
||||
.sp
|
||||
8-bit non-UTF mode less than 0x100
|
||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
||||
16-bit non-UTF mode less than 0x10000
|
||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
||||
32-bit non-UTF mode less than 0x100000000
|
||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
||||
8-bit non-UTF mode no greater than 0xff
|
||||
16-bit non-UTF mode no greater than 0xffff
|
||||
32-bit non-UTF mode no greater than 0xffffffff
|
||||
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||
.sp
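The limits in the table above can be expressed as a small check. This is a sketch for illustration only: the function names are hypothetical, and the only invalid codepoints it models in UTF modes are those in the surrogate range 0xd800 to 0xdfff.

```python
def max_escape_value(code_unit_bits, utf):
    """Largest value an octal or hexadecimal escape may specify."""
    if code_unit_bits not in (8, 16, 32):
        raise ValueError("code unit width must be 8, 16, or 32")
    if utf:
        # All UTF modes share the Unicode upper limit.
        return 0x10FFFF
    # Non-UTF modes are limited by the code unit width.
    return (1 << code_unit_bits) - 1


def escape_value_allowed(value, code_unit_bits, utf):
    """True if value is within range and, in UTF modes, not a surrogate."""
    if value < 0 or value > max_escape_value(code_unit_bits, utf):
        return False
    if utf and 0xD800 <= value <= 0xDFFF:
        return False
    return True
```

For example, 0x100 is rejected in 8-bit non-UTF mode but accepted in 8-bit UTF-8 mode, and 0xd800 is accepted in 16-bit non-UTF mode but rejected in any UTF mode.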
|
||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
||||
"surrogate" codepoints), and 0xffef.
|
||||
|
@ -506,7 +511,7 @@ In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
|
|||
handler and used to modify the case of following characters. By default, PCRE2
|
||||
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
||||
is set, \eU matches a "U" character, and \eu can be used to define a character
|
||||
by code point, as described in the previous section.
|
||||
by code point, as described above.
|
||||
.
|
||||
.
|
||||
.SS "Absolute and relative back references"
|
||||
|
@ -714,7 +719,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
|
|||
sequences that match characters with specific properties are available. In
|
||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||
characters whose codepoints are less than 256, but they do work in this mode.
|
||||
The extra escape sequences are:
|
||||
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
|
||||
may be encountered. These are all treated as being in the Common script and
|
||||
with an unassigned type. The extra escape sequences are:
|
||||
.sp
|
||||
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
||||
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
||||
|
@ -2224,15 +2231,8 @@ except that it does not cause the current matching position to be changed.
|
|||
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
||||
contains capturing subpatterns within it, these are counted for the purposes of
|
||||
numbering the capturing subpatterns in the whole pattern. However, substring
|
||||
capturing is carried out only for positive assertions. (Perl sometimes, but not
|
||||
always, does do capturing in negative assertions.)
|
||||
.P
|
||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
||||
succeeds, but failure to match later in the pattern causes backtracking over
|
||||
this assertion, the captures within the assertion are reset only if no higher
|
||||
numbered captures are already set. This is, unfortunately, a fundamental
|
||||
limitation of the current implementation; it may get removed in a future
|
||||
reworking.
|
||||
capturing is normally carried out only for positive assertions (but see the
|
||||
discussion of conditional subpatterns below).
|
||||
.P
|
||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||
it makes no sense to assert the same thing several times, the side effect of
|
||||
|
@ -2619,6 +2619,11 @@ presence of at least one letter in the subject. If a letter is found, the
|
|||
subject is matched against the first alternative; otherwise it is matched
|
||||
against the second. This pattern matches strings in one of the two forms
|
||||
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
||||
.P
|
||||
For Perl compatibility, if an assertion that is a condition contains capturing
|
||||
subpatterns, any capturing that occurs is retained afterwards, for both
|
||||
positive and negative assertions. (Compare non-conditional assertions, when
|
||||
captures are retained only for positive assertions.)
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="comments"></a>
|
||||
|
@ -2798,88 +2803,53 @@ is the actual recursive call.
|
|||
.SS "Differences in recursion processing between PCRE2 and Perl"
|
||||
.rs
|
||||
.sp
|
||||
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
|
||||
(like Python, but unlike Perl), a recursive subpattern call is always treated
|
||||
as an atomic group. That is, once it has matched some of the subject string, it
|
||||
is never re-entered, even if it contains untried alternatives and there is a
|
||||
subsequent matching failure. This can be illustrated by the following pattern,
|
||||
which purports to match a palindromic string that contains an odd number of
|
||||
characters (for example, "a", "aba", "abcba", "abcdcba"):
|
||||
.sp
|
||||
^(.|(.)(?1)\e2)$
|
||||
.sp
|
||||
The idea is that it either matches a single character, or two identical
|
||||
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
|
||||
it does not if the pattern is longer than three characters. Consider the
|
||||
subject string "abcba":
|
||||
Some former differences between PCRE2 and Perl no longer exist.
|
||||
.P
|
||||
At the top level, the first character is matched, but as it is not at the end
|
||||
of the string, the first alternative fails; the second alternative is taken
|
||||
and the recursion kicks in. The recursive call to subpattern 1 successfully
|
||||
matches the next character ("b"). (Note that the beginning and end of line
|
||||
tests are not part of the recursion).
|
||||
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
|
||||
a recursive subpattern call was always treated as an atomic group. That is,
|
||||
once it had matched some of the subject string, it was never re-entered, even
|
||||
if it contained untried alternatives and there was a subsequent matching
|
||||
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||
.P
|
||||
Back at the top level, the next character ("c") is compared with what
|
||||
subpattern 2 matched, which was "a". This fails. Because the recursion is
|
||||
treated as an atomic group, there are now no backtracking points, and so the
|
||||
entire match fails. (Perl is able, at this point, to re-enter the recursion and
|
||||
try the second alternative.) However, if the pattern is written with the
|
||||
alternatives in the other order, things are different:
|
||||
.sp
|
||||
^((.)(?1)\e2|.)$
|
||||
.sp
|
||||
This time, the recursing alternative is tried first, and continues to recurse
|
||||
until it runs out of characters, at which point the recursion fails. But this
|
||||
time we do have another alternative to try at the higher level. That is the big
|
||||
difference: in the previous case the remaining alternative is at a deeper
|
||||
recursion level, which PCRE2 cannot use.
|
||||
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||
is a matching failure later in the pattern. This is now compatible with the way
|
||||
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||
enclose it in an atomic group.
|
||||
.P
|
||||
To change the pattern so that it matches all palindromic strings, not just
|
||||
those with an odd number of characters, it is tempting to change the pattern to
|
||||
this:
|
||||
Supporting backtracking into recursions simplifies certain types of recursive
|
||||
pattern. For example, this pattern matches palindromic strings:
|
||||
.sp
|
||||
^((.)(?1)\e2|.?)$
|
||||
.sp
|
||||
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
|
||||
deeper recursion has matched a single character, it cannot be entered again in
|
||||
order to match an empty string. The solution is to separate the two cases, and
|
||||
write out the odd and even cases as alternatives at the higher level:
|
||||
The second branch in the group matches a single central character in the
|
||||
palindrome when there are an odd number of characters, or nothing when there
|
||||
are an even number of characters, but in order to work it has to be able to try
|
||||
the second case when the rest of the pattern match fails. If you want to match
|
||||
typical palindromic phrases, the pattern has to ignore all non-word characters,
|
||||
which can be done like this:
|
||||
.sp
|
||||
^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
|
||||
.sp
|
||||
If you want to match typical palindromic phrases, the pattern has to ignore all
|
||||
non-word characters, which can be done like this:
|
||||
.sp
|
||||
^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
|
||||
^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$
|
||||
.sp
|
||||
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
||||
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
|
||||
use of the possessive quantifier *+ to avoid backtracking into sequences of
|
||||
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
|
||||
or more) to match typical phrases, and Perl takes so long that you think it has
|
||||
gone into a loop.
|
||||
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
|
||||
avoid backtracking into sequences of non-word characters. Without this, PCRE2
|
||||
takes a great deal longer (ten times or more) to match typical phrases, and
|
||||
Perl takes so long that you think it has gone into a loop.
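When testing the phrase-matching pattern, it can help to have an independent statement of what it is meant to accept. The sketch below is not a translation of the pattern and does not use PCRE2. It is a plain reference check, under the assumptions that word characters are ASCII letters, digits, and underscore, and that matching is caseless. The function name is hypothetical.

```python
import re

# Assumption: "non-word" means anything outside ASCII [A-Za-z0-9_].
_NON_WORD = re.compile(r"\W+", re.ASCII)


def is_palindromic_phrase(subject):
    """Reference check: ignore non-word characters and case, then compare
    the remaining characters with their reverse."""
    letters = _NON_WORD.sub("", subject).lower()
    return letters == letters[::-1]
```

A subject that this function accepts but the pattern rejects (or the reverse) is worth examining when experimenting with changes to the pattern.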
|
||||
.P
|
||||
\fBWARNING\fP: The palindrome-matching patterns above work only if the subject
|
||||
string does not start with a palindrome that is shorter than the entire string.
|
||||
For example, although "abcba" is correctly matched, if the subject is "ababa",
|
||||
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
|
||||
the end of the string does not follow. Once again, it cannot jump back into the
|
||||
recursion to try other alternatives, so the entire match fails.
|
||||
.P
|
||||
The second way in which PCRE2 and Perl differ in their recursion processing is
|
||||
in the handling of captured values. In Perl, when a subpattern is called
|
||||
recursively or as a subpattern (see the next section), it has no access to any
|
||||
values that were captured outside the recursion, whereas in PCRE2 these values
|
||||
can be referenced. Consider this pattern:
|
||||
Another way in which PCRE2 and Perl used to differ in their recursion
|
||||
processing is in the handling of captured values. Formerly in Perl, when a
|
||||
subpattern was called recursively or as a subroutine (see the next section), it
|
||||
had no access to any values that were captured outside the recursion, whereas
|
||||
in PCRE2 these values can be referenced. Consider this pattern:
|
||||
.sp
|
||||
^(.)(\e1|a(?2))
|
||||
.sp
|
||||
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
|
||||
then in the second group, when the back reference \e1 fails to match "b", the
|
||||
second alternative matches "a" and then recurses. In the recursion, \e1 does
|
||||
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
|
||||
match because inside the recursive call \e1 cannot access the externally set
|
||||
value.
|
||||
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||
the second group, when the back reference \e1 fails to match "b", the second
|
||||
alternative matches "a" and then recurses. In the recursion, \e1 does now match
|
||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||
later versions (I tried 5.024) it now works.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="subpatternsassubroutines"></a>
|
||||
|
@ -2908,11 +2878,10 @@ matches "sense and sensibility" and "response and responsibility", but not
|
|||
is used, it does match "sense and responsibility" as well as the other two
|
||||
strings. Another example is given in the discussion of DEFINE above.
|
||||
.P
|
||||
All subroutine calls, whether recursive or not, are always treated as atomic
|
||||
groups. That is, once a subroutine has matched some of the subject string, it
|
||||
is never re-entered, even if it contains untried alternatives and there is a
|
||||
subsequent matching failure. Any capturing parentheses that are set during the
|
||||
subroutine call revert to their previous values afterwards.
|
||||
Like recursions, subroutine calls used to be treated as atomic, but this
|
||||
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
|
||||
occur. However, any capturing parentheses that are set during the subroutine
|
||||
call revert to their previous values afterwards.
|
||||
.P
|
||||
Processing options such as case-independence are fixed when a subpattern is
|
||||
defined, so if it is used as a subroutine, such options cannot be changed for
|
||||
|
@ -3025,16 +2994,10 @@ The doubling is removed before the string is passed to the callout function.
|
|||
.SH "BACKTRACKING CONTROL"
|
||||
.rs
|
||||
.sp
|
||||
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
|
||||
are still described in the Perl documentation as "experimental and subject to
|
||||
change or removal in a future version of Perl". It goes on to say: "Their usage
|
||||
in production code should be noted to avoid problems during upgrades." The same
|
||||
remarks apply to the PCRE2 features described in this section.
|
||||
.P
|
||||
The new verbs make use of what was previously invalid syntax: an opening
|
||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
||||
depending on whether or not a name is present.
|
||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||
terminology) that modify the behaviour of backtracking during matching. They
|
||||
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||
possibly behaving differently depending on whether or not a name is present.
|
||||
.P
|
||||
By default, for compatibility with Perl, a name is any sequence of characters
|
||||
that does not include a closing parenthesis. The name is not processed in
|
||||
|
@ -3061,7 +3024,7 @@ not there. Any number of these verbs may occur in a pattern.
|
|||
.P
|
||||
Since these verbs are specifically related to backtracking, most of them can be
|
||||
used only when the pattern is to be matched using the traditional matching
|
||||
function, because these use a backtracking algorithm. With the exception of
|
||||
function, because that uses a backtracking algorithm. With the exception of
|
||||
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
||||
control verbs cause an error if encountered by the DFA matching function.
|
||||
.P
|
||||
|
@ -3215,11 +3178,11 @@ to ensure that the match is always attempted.
|
|||
The following verbs do nothing when they are encountered. Matching continues
|
||||
with what follows, but if there is no subsequent match, causing a backtrack to
|
||||
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
||||
the verb. However, when one of these verbs appears inside an atomic group
|
||||
(which includes any group that is called as a subroutine) or in an assertion
|
||||
that is true, its effect is confined to that group, because once the group has
|
||||
been matched, there is never any backtracking into it. In this situation,
|
||||
backtracking has to jump to the left of the entire atomic group or assertion.
|
||||
the verb. However, when one of these verbs appears inside an atomic group or in
|
||||
an assertion that is true, its effect is confined to that group, because once
|
||||
the group has been matched, there is never any backtracking into it. In this
|
||||
situation, backtracking has to jump to the left of the entire atomic group or
|
||||
assertion.
|
||||
.P
|
||||
These verbs differ in exactly what kind of failure occurs when backtracking
|
||||
reaches them. The behaviour described below is what happens when the verb is
|
||||
|
@ -3279,8 +3242,8 @@ possessive quantifier, but there are some uses of (*PRUNE) that cannot be
|
|||
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
|
||||
as (*COMMIT).
|
||||
.P
|
||||
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
|
||||
It is like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
|
||||
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
||||
ignoring those set by (*PRUNE) or (*THEN).
|
||||
.sp
|
||||
|
@ -3482,6 +3445,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 27 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
Last updated: 18 March 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
.fi
|
||||
|
|