Documentation update.

Philip.Hazel 2017-03-19 14:22:50 +00:00
parent b55ef12cc1
commit 77ef3e66ab
1 changed files with 99 additions and 136 deletions


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "27 December 2016" "PCRE2 10.23"
.TH PCRE2PATTERN 3 "18 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -138,20 +138,23 @@ the application to apply the JIT optimization by calling
\fBpcre2_jit_compile()\fP is ignored.
.
.
.SS "Setting match and recursion limits"
.SS "Setting match and backtracking depth limits"
.rs
.sp
The caller of \fBpcre2_match()\fP can set a limit on the number of times the
internal \fBmatch()\fP function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
gives an error return. The limits can also be set by items at the start of the
pattern of the form
The \fBpcre2_match()\fP function contains a counter that is incremented every time it
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
this counter, which therefore limits the amount of computing resource used for
a match. The maximum depth of nested backtracking can also be limited, and this
restricts the amount of heap memory that is used.
.P
These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees (a typical example is a pattern with nested
unlimited repeats applied to a long string that does not match). When one of
these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
can also be set by items at the start of the pattern of the form
.sp
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
(*LIMIT_DEPTH=d)
.sp
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
@ -159,10 +162,14 @@ for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
.P
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. The old name is
still recognized for backwards compatibility.
.P
The match limit is used (but in a different way) when JIT is being used, but it
is not relevant, and is ignored, when matching with \fBpcre2_dfa_match()\fP.
However, the recursion limit is relevant for DFA matching, which does use some
function recursion, in particular, for recursions within the pattern.
However, the depth limit is relevant for DFA matching, which uses function
recursion for recursions within the pattern. In this case, the depth limit
controls the amount of system stack that is used.
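
As a rough illustration of the rule described above (a pattern can lower the caller's limits but never raise them, and the lowest of several settings wins), here is a minimal Python sketch. The name effective_limits and its parameters are hypothetical and are not part of the PCRE2 API. The sketch also assumes that the limit items are the only items at the very start of the pattern.

```python
import re

# Hypothetical model of the limit rule; this is NOT PCRE2 code.
# LIMIT_RECURSION is accepted as the pre-10.30 name for LIMIT_DEPTH.
_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(MATCH|DEPTH|RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_match_limit, caller_depth_limit):
    """Return the limits in force after reading leading (*LIMIT_...=d) items."""
    limits = {"MATCH": caller_match_limit, "DEPTH": caller_depth_limit}
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break  # first thing that is not a limit item ends the scan
        name = "DEPTH" if item.group(1) == "RECURSION" else item.group(1)
        # A setting only has an effect if it is lower than the current value.
        limits[name] = min(limits[name], int(item.group(2)))
        pos = item.end()
    return limits
```

For example, with a caller match limit of 1000, a pattern that starts with (*LIMIT_MATCH=500)(*LIMIT_MATCH=9999) ends up with a match limit of 500, while (*LIMIT_DEPTH=9999) leaves the caller's depth limit unchanged.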
.
.
.\" HTML <a name="newlines"></a>
@ -206,8 +213,8 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
what the \eR escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the
description of \eR in the section entitled
sequence, for Perl compatibility. However, this can be changed; see the next
section and the description of \eR in the section entitled
.\" HTML <a href="#newlineseq">
.\" </a>
"Newline sequences"
@ -230,7 +237,7 @@ corresponding to PCRE2_BSR_UNICODE.
.rs
.sp
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code rather than ASCII or Unicode (typically a mainframe system). In
character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
@ -297,11 +304,11 @@ character that is not a number or a letter, it takes away any special meaning
that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
.P
For example, if you want to match a * character, you write \e* in the pattern.
This escaping action applies whether or not the following character would
otherwise be interpreted as a metacharacter, so it is always safe to precede a
non-alphanumeric with backslash to specify that it stands for itself. In
particular, if you want to match a backslash, you write \e\e.
For example, if you want to match a * character, you must write \e* in the
pattern. This escaping action applies whether or not the following character
would otherwise be interpreted as a metacharacter, so it is always safe to
precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \e\e.
.P
In a UTF mode, only ASCII numbers and letters have any special meaning after a
backslash. All other characters (in particular, those whose codepoints are
@ -331,7 +338,7 @@ An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
by \eE later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
a character class, this causes an error, because the character class is not
terminated.
terminated by a closing square bracket.
.
.
.\" HTML <a name="digitsafterbackslash"></a>
@ -459,9 +466,9 @@ a hexadecimal digit appears between \ex{ and }, or if there is no terminating
.P
If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode mode, support for code points
greater than 256 is provided by \eu, which must be followed by four hexadecimal
digits; otherwise it matches a literal "u" character.
matches a literal "x" character. In this mode, support for code points greater
than 255 is provided by \eu, which must be followed by four hexadecimal digits;
otherwise it matches a literal "u" character.
.P
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
@ -475,12 +482,10 @@ the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
.sp
  8-bit non-UTF mode    less than 0x100
  8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
  16-bit non-UTF mode   less than 0x10000
  16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
  32-bit non-UTF mode   less than 0x100000000
  32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
  8-bit non-UTF mode    no greater than 0xff
  16-bit non-UTF mode   no greater than 0xffff
  32-bit non-UTF mode   no greater than 0xffffffff
  All UTF modes         no greater than 0x10ffff and a valid codepoint
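
The revised limits can be restated as a small check. The following Python sketch is only an illustration: the function name and parameters are hypothetical, and it models just the upper bounds in the table together with the surrogate range 0xd800 to 0xdfff, not every validity rule that PCRE2 may apply.

```python
def numeric_escape_in_range(value, code_unit_width, utf):
    """Check a code point given by an octal or hexadecimal escape.

    value            the numeric value written in the pattern
    code_unit_width  8, 16 or 32 (which PCRE2 library is in use)
    utf              True when a UTF mode is set
    """
    if code_unit_width not in (8, 16, 32):
        raise ValueError("code unit width must be 8, 16 or 32")
    if utf:
        # All UTF modes share one limit and exclude the surrogates.
        is_surrogate = 0xD800 <= value <= 0xDFFF
        return 0 <= value <= 0x10FFFF and not is_surrogate
    # Non-UTF modes: anything that fits in one code unit.
    return 0 <= value <= (1 << code_unit_width) - 1
```

Under these assumptions, 0x100 is rejected in 8-bit non-UTF mode but accepted in 8-bit UTF-8 mode, and 0xffffffff is accepted only in 32-bit non-UTF mode.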
.sp
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
"surrogate" codepoints), and 0xffef.
@ -506,7 +511,7 @@ In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \eU matches a "U" character, and \eu can be used to define a character
by code point, as described in the previous section.
by code point, as described above.
.
.
.SS "Absolute and relative back references"
@ -714,7 +719,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
The extra escape sequences are:
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
with an unassigned type. The extra escape sequences are:
.sp
\ep{\fIxx\fP} a character with the \fIxx\fP property
\eP{\fIxx\fP} a character without the \fIxx\fP property
@ -2224,15 +2231,8 @@ except that it does not cause the current matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an assertion
contains capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
capturing is carried out only for positive assertions. (Perl sometimes, but not
always, does do capturing in negative assertions.)
.P
WARNING: If a positive assertion containing one or more capturing subpatterns
succeeds, but failure to match later in the pattern causes backtracking over
this assertion, the captures within the assertion are reset only if no higher
numbered captures are already set. This is, unfortunately, a fundamental
limitation of the current implementation; it may get removed in a future
reworking.
capturing is normally carried out only for positive assertions (but see the
discussion of conditional subpatterns below).
.P
For compatibility with Perl, most assertion subpatterns may be repeated; though
it makes no sense to assert the same thing several times, the side effect of
@ -2619,6 +2619,11 @@ presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
.P
For Perl compatibility, if an assertion that is a condition contains capturing
subpatterns, any capturing that occurs is retained afterwards, for both
positive and negative assertions. (Compare non-conditional assertions, for
which captures are retained only when the assertion is positive.)
.
.
.\" HTML <a name="comments"></a>
@ -2798,88 +2803,53 @@ is the actual recursive call.
.SS "Differences in recursion processing between PCRE2 and Perl"
.rs
.sp
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
(like Python, but unlike Perl), a recursive subpattern call is always treated
as an atomic group. That is, once it has matched some of the subject string, it
is never re-entered, even if it contains untried alternatives and there is a
subsequent matching failure. This can be illustrated by the following pattern,
which purports to match a palindromic string that contains an odd number of
characters (for example, "a", "aba", "abcba", "abcdcba"):
.sp
^(.|(.)(?1)\e2)$
.sp
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
it does not if the pattern is longer than three characters. Consider the
subject string "abcba":
Some former differences between PCRE2 and Perl no longer exist.
.P
At the top level, the first character is matched, but as it is not at the end
of the string, the first alternative fails; the second alternative is taken
and the recursion kicks in. The recursive call to subpattern 1 successfully
matches the next character ("b"). (Note that the beginning and end of line
tests are not part of the recursion).
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
a recursive subpattern call was always treated as an atomic group. That is,
once it had matched some of the subject string, it was never re-entered, even
if it contained untried alternatives and there was a subsequent matching
failure. (Historical note: PCRE implemented recursion before Perl did.)
.P
Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion is
treated as an atomic group, there are now no backtracking points, and so the
entire match fails. (Perl is able, at this point, to re-enter the recursion and
try the second alternative.) However, if the pattern is written with the
alternatives in the other order, things are different:
.sp
^((.)(?1)\e2|.)$
.sp
This time, the recursing alternative is tried first, and continues to recurse
until it runs out of characters, at which point the recursion fails. But this
time we do have another alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at a deeper
recursion level, which PCRE2 cannot use.
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Perl works. If you want a subroutine call to be atomic, you must explicitly
enclose it in an atomic group.
.P
To change the pattern so that it matches all palindromic strings, not just
those with an odd number of characters, it is tempting to change the pattern to
this:
Supporting backtracking into recursions simplifies certain types of recursive
pattern. For example, this pattern matches palindromic strings:
.sp
^((.)(?1)\e2|.?)$
.sp
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
deeper recursion has matched a single character, it cannot be entered again in
order to match an empty string. The solution is to separate the two cases, and
write out the odd and even cases as alternatives at the higher level:
The second branch in the group matches a single central character in the
palindrome when there is an odd number of characters, or nothing when there
is an even number of characters, but in order to work it has to be able to try
the second case when the rest of the pattern match fails. If you want to match
typical palindromic phrases, the pattern has to ignore all non-word characters,
which can be done like this:
.sp
^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
.sp
If you want to match typical palindromic phrases, the pattern has to ignore all
non-word characters, which can be done like this:
.sp
^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$
.sp
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
use of the possessive quantifier *+ to avoid backtracking into sequences of
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
or more) to match typical phrases, and Perl takes so long that you think it has
gone into a loop.
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
avoid backtracking into sequences of non-word characters. Without this, PCRE2
takes a great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
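
To see why backtracking into the recursion matters for the simpler pattern ^((.)(?1)\e2|.?)$, the following Python sketch models group 1 as a generator of possible end positions. It is a toy model written for this illustration, not how PCRE2 is implemented. The atomic flag imitates the pre-10.30 behaviour, in which a recursive call is never re-entered. The function names are hypothetical, and the subject is assumed to contain no newlines.

```python
def group1_ends(subject, pos, atomic):
    """Yield, in matching order, each end position for group 1 from pos."""
    # First alternative: (.)(?1)\2
    if pos < len(subject):
        captured = subject[pos]  # what group 2 captures at this level
        for end in group1_ends(subject, pos + 1, atomic):
            if subject.startswith(captured, end):  # back reference \2
                yield end + 1
            if atomic:
                break  # old behaviour: only the first recursive match is used
    # Second alternative: .? (greedy, so one character first, then nothing)
    if pos < len(subject):
        yield pos + 1
    yield pos


def matches_palindrome_pattern(subject, atomic=False):
    """True if ^(group 1)$ can match the whole subject."""
    return any(end == len(subject)
               for end in group1_ends(subject, 0, atomic))
```

In this model, "abba" matches when recursions can be re-entered, but not when they are atomic, because the innermost calls have already taken a single character and cannot be asked to match an empty string instead.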
.P
\fBWARNING\fP: The palindrome-matching patterns above work only if the subject
string does not start with a palindrome that is shorter than the entire string.
For example, although "abcba" is correctly matched, if the subject is "ababa",
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
the end of the string does not follow. Once again, it cannot jump back into the
recursion to try other alternatives, so the entire match fails.
.P
The second way in which PCRE2 and Perl differ in their recursion processing is
in the handling of captured values. In Perl, when a subpattern is called
recursively or as a subpattern (see the next section), it has no access to any
values that were captured outside the recursion, whereas in PCRE2 these values
can be referenced. Consider this pattern:
Another way in which PCRE2 and Perl used to differ in their recursion
processing is in the handling of captured values. Formerly in Perl, when a
subpattern was called recursively or as a subroutine (see the next section), it
had no access to any values that were captured outside the recursion, whereas
in PCRE2 these values can be referenced. Consider this pattern:
.sp
^(.)(\e1|a(?2))
.sp
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
then in the second group, when the back reference \e1 fails to match "b", the
second alternative matches "a" and then recurses. In the recursion, \e1 does
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
match because inside the recursive call \e1 cannot access the externally set
value.
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the back reference \e1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \e1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but it
succeeds in later versions (5.024 was tested).
.
.
.\" HTML <a name="subpatternsassubroutines"></a>
@ -2908,11 +2878,10 @@ matches "sense and sensibility" and "response and responsibility", but not
is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above.
.P
All subroutine calls, whether recursive or not, are always treated as atomic
groups. That is, once a subroutine has matched some of the subject string, it
is never re-entered, even if it contains untried alternatives and there is a
subsequent matching failure. Any capturing parentheses that are set during the
subroutine call revert to their previous values afterwards.
Like recursions, subroutine calls used to be treated as atomic, but this
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
occur. However, any capturing parentheses that are set during the subroutine
call revert to their previous values afterwards.
.P
Processing options such as case-independence are fixed when a subpattern is
defined, so if it is used as a subroutine, such options cannot be changed for
@ -3025,16 +2994,10 @@ The doubling is removed before the string is passed to the callout function.
.SH "BACKTRACKING CONTROL"
.rs
.sp
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
are still described in the Perl documentation as "experimental and subject to
change or removal in a future version of Perl". It goes on to say: "Their usage
in production code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE2 features described in this section.
.P
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
(*VERB:NAME). Some verbs take either form, possibly behaving differently
depending on whether or not a name is present.
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
possibly behaving differently depending on whether or not a name is present.
.P
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
@ -3061,7 +3024,7 @@ not there. Any number of these verbs may occur in a pattern.
.P
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
function, because these use a backtracking algorithm. With the exception of
function, because that uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function.
.P
@ -3215,11 +3178,11 @@ to ensure that the match is always attempted.
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is no subsequent match, causing a backtrack to
the verb, a failure is forced. That is, backtracking cannot pass to the left of
the verb. However, when one of these verbs appears inside an atomic group
(which includes any group that is called as a subroutine) or in an assertion
that is true, its effect is confined to that group, because once the group has
been matched, there is never any backtracking into it. In this situation,
backtracking has to jump to the left of the entire atomic group or assertion.
the verb. However, when one of these verbs appears inside an atomic group or in
an assertion that is true, its effect is confined to that group, because once
the group has been matched, there is never any backtracking into it. In this
situation, backtracking has to jump to the left of the entire atomic group or
assertion.
.P
These verbs differ in exactly what kind of failure occurs when backtracking
reaches them. The behaviour described below is what happens when the verb is
@ -3279,8 +3242,8 @@ possessive quantifier, but there are some uses of (*PRUNE) that cannot be
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
.P
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
It is like (*MARK:NAME) in that the name is remembered for passing back to the
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by (*PRUNE) or (*THEN).
.sp
@ -3482,6 +3445,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 27 December 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 18 March 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi