From 77ef3e66ab5ce439e6b3f0bf3a987400ac77a23f Mon Sep 17 00:00:00 2001 From: "Philip.Hazel" Date: Sun, 19 Mar 2017 14:22:50 +0000 Subject: [PATCH] Documentation update. --- doc/pcre2pattern.3 | 235 +++++++++++++++++++-------------------------- 1 file changed, 99 insertions(+), 136 deletions(-) diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 4c869c1..2325e0c 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "27 December 2016" "PCRE2 10.23" +.TH PCRE2PATTERN 3 "18 March 2017" "PCRE2 10.30" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -138,20 +138,23 @@ the application to apply the JIT optimization by calling \fBpcre2_jit_compile()\fP is ignored. . . -.SS "Setting match and recursion limits" +.SS "Setting match and backtracking depth limits" .rs .sp -The caller of \fBpcre2_match()\fP can set a limit on the number of times the -internal \fBmatch()\fP function is called and on the maximum depth of -recursive calls. These facilities are provided to catch runaway matches that -are provoked by patterns with huge matching trees (a typical example is a -pattern with nested unlimited repeats) and to avoid running out of system stack -by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP -gives an error return. The limits can also be set by items at the start of the -pattern of the form +The \fBpcre2_match()\fP function contains a counter that is incremented every time it +goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on +this counter, which therefore limits the amount of computing resource used for +a match. The maximum depth of nested backtracking can also be limited, and this +restricts the amount of heap memory that is used.
+.P +These facilities are provided to catch runaway matches that are provoked by +patterns with huge matching trees (a typical example is a pattern with nested +unlimited repeats applied to a long string that does not match). When one of +these limits is reached, \fBpcre2_match()\fP gives an error return. The limits +can also be set by items at the start of the pattern of the form .sp (*LIMIT_MATCH=d) - (*LIMIT_RECURSION=d) + (*LIMIT_DEPTH=d) .sp where d is any number of decimal digits. However, the value of the setting must be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP @@ -159,10 +162,14 @@ for it to have any effect. In other words, the pattern writer can lower the limits set by the programmer, but not raise them. If there is more than one setting of one of these limits, the lower value is used. .P +Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is +still recognized for backwards compatibility. +.P The match limit is used (but in a different way) when JIT is being used, but it is not relevant, and is ignored, when matching with \fBpcre2_dfa_match()\fP. -However, the recursion limit is relevant for DFA matching, which does use some -function recursion, in particular, for recursions within the pattern. +However, the depth limit is relevant for DFA matching, which uses function +recursion for recursions within the pattern. In this case, the depth limit +controls the amount of system stack that is used. . . .\" HTML @@ -206,8 +213,8 @@ The newline convention affects where the circumflex and dollar assertions are true. It also affects the interpretation of the dot metacharacter when PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect what the \eR escape sequence matches. By default, this is any Unicode newline -sequence, for Perl compatibility. However, this can be changed; see the -description of \eR in the section entitled +sequence, for Perl compatibility. 
However, this can be changed; see the next +section and the description of \eR in the section entitled .\" HTML .\" "Newline sequences" @@ -230,7 +237,7 @@ corresponding to PCRE2_BSR_UNICODE. .rs .sp PCRE2 can be compiled to run in an environment that uses EBCDIC as its -character code rather than ASCII or Unicode (typically a mainframe system). In +character code instead of ASCII or Unicode (typically a mainframe system). In the sections below, character code values are ASCII or Unicode; in an EBCDIC environment these characters may have different code values, and there are no code points greater than 255. @@ -297,11 +304,11 @@ character that is not a number or a letter, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes. .P -For example, if you want to match a * character, you write \e* in the pattern. -This escaping action applies whether or not the following character would -otherwise be interpreted as a metacharacter, so it is always safe to precede a -non-alphanumeric with backslash to specify that it stands for itself. In -particular, if you want to match a backslash, you write \e\e. +For example, if you want to match a * character, you must write \e* in the +pattern. This escaping action applies whether or not the following character +would otherwise be interpreted as a metacharacter, so it is always safe to +precede a non-alphanumeric with backslash to specify that it stands for itself. +In particular, if you want to match a backslash, you write \e\e. .P In a UTF mode, only ASCII numbers and letters have any special meaning after a backslash. All other characters (in particular, those whose codepoints are @@ -331,7 +338,7 @@ An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed by \eE later in the pattern, the literal interpretation continues to the end of the pattern (that is, \eE is assumed at the end). 
If the isolated \eQ is inside a character class, this causes an error, because the character class is not -terminated. +terminated by a closing square bracket. . . .\" HTML @@ -459,9 +466,9 @@ a hexadecimal digit appears between \ex{ and }, or if there is no terminating .P If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just described only when it is followed by two hexadecimal digits. Otherwise, it -matches a literal "x" character. In this mode mode, support for code points -greater than 256 is provided by \eu, which must be followed by four hexadecimal -digits; otherwise it matches a literal "u" character. +matches a literal "x" character. In this mode, support for code points greater +than 256 is provided by \eu, which must be followed by four hexadecimal digits; +otherwise it matches a literal "u" character. .P Characters whose value is less than 256 can be defined by either of the two syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in @@ -475,12 +482,10 @@ the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows: .sp - 8-bit non-UTF mode less than 0x100 - 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint - 16-bit non-UTF mode less than 0x10000 - 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint - 32-bit non-UTF mode less than 0x100000000 - 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint + 8-bit non-UTF mode no greater than 0xff + 16-bit non-UTF mode no greater than 0xffff + 32-bit non-UTF mode no greater than 0xffffffff + All UTF modes no greater than 0x10ffff and a valid codepoint .sp Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef. @@ -506,7 +511,7 @@ In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string handler and used to modify the case of following characters. 
By default, PCRE2 does not support these escape sequences. However, if the PCRE2_ALT_BSUX option is set, \eU matches a "U" character, and \eu can be used to define a character -by code point, as described in the previous section. +by code point, as described above. . . .SS "Absolute and relative back references" @@ -714,7 +719,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape sequences that match characters with specific properties are available. In 8-bit non-UTF-8 mode, these sequences are of course limited to testing characters whose codepoints are less than 256, but they do work in this mode. -The extra escape sequences are: +In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit) +may be encountered. These are all treated as being in the Common script and +with an unassigned type. The extra escape sequences are: .sp \ep{\fIxx\fP} a character with the \fIxx\fP property \eP{\fIxx\fP} a character without the \fIxx\fP property @@ -2224,15 +2231,8 @@ except that it does not cause the current matching position to be changed. Assertion subpatterns are not capturing subpatterns. If such an assertion contains capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring -capturing is carried out only for positive assertions. (Perl sometimes, but not -always, does do capturing in negative assertions.) -.P -WARNING: If a positive assertion containing one or more capturing subpatterns -succeeds, but failure to match later in the pattern causes backtracking over -this assertion, the captures within the assertion are reset only if no higher -numbered captures are already set. This is, unfortunately, a fundamental -limitation of the current implementation; it may get removed in a future -reworking. +capturing is normally carried out only for positive assertions (but see the +discussion of conditional subpatterns below). 
.P For compatibility with Perl, most assertion subpatterns may be repeated; though it makes no sense to assert the same thing several times, the side effect of @@ -2619,6 +2619,11 @@ presence of at least one letter in the subject. If a letter is found, the subject is matched against the first alternative; otherwise it is matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. +.P +For Perl compatibility, if an assertion that is a condition contains capturing +subpatterns, any capturing that occurs is retained afterwards, for both +positive and negative assertions. (Compare non-conditional assertions, when +captures are retained only for positive assertions.) . . .\" HTML @@ -2798,88 +2803,53 @@ is the actual recursive call. .SS "Differences in recursion processing between PCRE2 and Perl" .rs .sp -Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2 -(like Python, but unlike Perl), a recursive subpattern call is always treated -as an atomic group. That is, once it has matched some of the subject string, it -is never re-entered, even if it contains untried alternatives and there is a -subsequent matching failure. This can be illustrated by the following pattern, -which purports to match a palindromic string that contains an odd number of -characters (for example, "a", "aba", "abcba", "abcdcba"): -.sp - ^(.|(.)(?1)\e2)$ -.sp -The idea is that it either matches a single character, or two identical -characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2 -it does not if the pattern is longer than three characters. Consider the -subject string "abcba": +Some former differences between PCRE2 and Perl no longer exist. .P -At the top level, the first character is matched, but as it is not at the end -of the string, the first alternative fails; the second alternative is taken -and the recursion kicks in. 
The recursive call to subpattern 1 successfully -matches the next character ("b"). (Note that the beginning and end of line -tests are not part of the recursion). +Before release 10.30, recursion processing in PCRE2 differed from Perl in that +a recursive subpattern call was always treated as an atomic group. That is, +once it had matched some of the subject string, it was never re-entered, even +if it contained untried alternatives and there was a subsequent matching +failure. (Historical note: PCRE implemented recursion before Perl did.) .P -Back at the top level, the next character ("c") is compared with what -subpattern 2 matched, which was "a". This fails. Because the recursion is -treated as an atomic group, there are now no backtracking points, and so the -entire match fails. (Perl is able, at this point, to re-enter the recursion and -try the second alternative.) However, if the pattern is written with the -alternatives in the other order, things are different: -.sp - ^((.)(?1)\e2|.)$ -.sp -This time, the recursing alternative is tried first, and continues to recurse -until it runs out of characters, at which point the recursion fails. But this -time we do have another alternative to try at the higher level. That is the big -difference: in the previous case the remaining alternative is at a deeper -recursion level, which PCRE2 cannot use. +Starting with release 10.30, recursive subroutine calls are no longer treated +as atomic. That is, they can be re-entered to try unused alternatives if there +is a matching failure later in the pattern. This is now compatible with the way +Perl works. If you want a subroutine call to be atomic, you must explicitly +enclose it in an atomic group. .P -To change the pattern so that it matches all palindromic strings, not just -those with an odd number of characters, it is tempting to change the pattern to -this: +Supporting backtracking into recursions simplifies certain types of recursive +pattern. 
For example, this pattern matches palindromic strings: .sp ^((.)(?1)\e2|.?)$ .sp -Again, this works in Perl, but not in PCRE2, and for the same reason. When a -deeper recursion has matched a single character, it cannot be entered again in -order to match an empty string. The solution is to separate the two cases, and -write out the odd and even cases as alternatives at the higher level: +The second branch in the group matches a single central character in the +palindrome when there are an odd number of characters, or nothing when there +are an even number of characters, but in order to work it has to be able to try +the second case when the rest of the pattern match fails. If you want to match +typical palindromic phrases, the pattern has to ignore all non-word characters, +which can be done like this: .sp - ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.)) -.sp -If you want to match typical palindromic phrases, the pattern has to ignore all -non-word characters, which can be done like this: -.sp - ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$ + ^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$ .sp If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A -man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the -use of the possessive quantifier *+ to avoid backtracking into sequences of -non-word characters. Without this, PCRE2 takes a great deal longer (ten times -or more) to match typical phrases, and Perl takes so long that you think it has -gone into a loop. +man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to +avoid backtracking into sequences of non-word characters. Without this, PCRE2 +takes a great deal longer (ten times or more) to match typical phrases, and +Perl takes so long that you think it has gone into a loop. .P -\fBWARNING\fP: The palindrome-matching patterns above work only if the subject -string does not start with a palindrome that is shorter than the entire string. 
-For example, although "abcba" is correctly matched, if the subject is "ababa", -PCRE2 finds the palindrome "aba" at the start, then fails at top level because -the end of the string does not follow. Once again, it cannot jump back into the -recursion to try other alternatives, so the entire match fails. -.P -The second way in which PCRE2 and Perl differ in their recursion processing is -in the handling of captured values. In Perl, when a subpattern is called -recursively or as a subpattern (see the next section), it has no access to any -values that were captured outside the recursion, whereas in PCRE2 these values -can be referenced. Consider this pattern: +Another way in which PCRE2 and Perl used to differ in their recursion +processing is in the handling of captured values. Formerly in Perl, when a +subpattern was called recursively or as a subroutine (see the next section), it +had no access to any values that were captured outside the recursion, whereas +in PCRE2 these values can be referenced. Consider this pattern: .sp ^(.)(\e1|a(?2)) .sp -In PCRE2, this pattern matches "bab". The first capturing parentheses match "b", -then in the second group, when the back reference \e1 fails to match "b", the -second alternative matches "a" and then recurses. In the recursion, \e1 does -now match "b" and so the whole match succeeds. In Perl, the pattern fails to -match because inside the recursive call \e1 cannot access the externally set -value. +This pattern matches "bab". The first capturing parentheses match "b", then in +the second group, when the back reference \e1 fails to match "b", the second +alternative matches "a" and then recurses. In the recursion, \e1 does now match +"b" and so the whole match succeeds. This match used to fail in Perl, but in +later versions (I tried 5.024) it now works. . .
.\" HTML @@ -2908,11 +2878,10 @@ matches "sense and sensibility" and "response and responsibility", but not is used, it does match "sense and responsibility" as well as the other two strings. Another example is given in the discussion of DEFINE above. .P -All subroutine calls, whether recursive or not, are always treated as atomic -groups. That is, once a subroutine has matched some of the subject string, it -is never re-entered, even if it contains untried alternatives and there is a -subsequent matching failure. Any capturing parentheses that are set during the -subroutine call revert to their previous values afterwards. +Like recursions, subroutine calls used to be treated as atomic, but this +changed at PCRE2 release 10.30, so backtracking into subroutine calls can now +occur. However, any capturing parentheses that are set during the subroutine +call revert to their previous values afterwards. .P Processing options such as case-independence are fixed when a subpattern is defined, so if it is used as a subroutine, such options cannot be changed for @@ -3025,16 +2994,10 @@ The doubling is removed before the string is passed to the callout function. .SH "BACKTRACKING CONTROL" .rs .sp -Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which -are still described in the Perl documentation as "experimental and subject to -change or removal in a future version of Perl". It goes on to say: "Their usage -in production code should be noted to avoid problems during upgrades." The same -remarks apply to the PCRE2 features described in this section. -.P -The new verbs make use of what was previously invalid syntax: an opening -parenthesis followed by an asterisk. They are generally of the form (*VERB) or -(*VERB:NAME). Some verbs take either form, possibly behaving differently -depending on whether or not a name is present. 
+There are a number of special "Backtracking Control Verbs" (to use Perl's +terminology) that modify the behaviour of backtracking during matching. They +are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form, +possibly behaving differently depending on whether or not a name is present. .P By default, for compatibility with Perl, a name is any sequence of characters that does not include a closing parenthesis. The name is not processed in @@ -3061,7 +3024,7 @@ not there. Any number of these verbs may occur in a pattern. .P Since these verbs are specifically related to backtracking, most of them can be used only when the pattern is to be matched using the traditional matching -function, because these use a backtracking algorithm. With the exception of +function, because that uses a backtracking algorithm. With the exception of (*FAIL), which behaves like a failing negative assertion, the backtracking control verbs cause an error if encountered by the DFA matching function. .P @@ -3215,11 +3178,11 @@ to ensure that the match is always attempted. The following verbs do nothing when they are encountered. Matching continues with what follows, but if there is no subsequent match, causing a backtrack to the verb, a failure is forced. That is, backtracking cannot pass to the left of -the verb. However, when one of these verbs appears inside an atomic group -(which includes any group that is called as a subroutine) or in an assertion -that is true, its effect is confined to that group, because once the group has -been matched, there is never any backtracking into it. In this situation, -backtracking has to jump to the left of the entire atomic group or assertion. +the verb. However, when one of these verbs appears inside an atomic group or in +an assertion that is true, its effect is confined to that group, because once +the group has been matched, there is never any backtracking into it. 
In this +situation, backtracking has to jump to the left of the entire atomic group or +assertion. .P These verbs differ in exactly what kind of failure occurs when backtracking reaches them. The behaviour described below is what happens when the verb is @@ -3279,8 +3242,8 @@ possessive quantifier, but there are some uses of (*PRUNE) that cannot be expressed in any other way. In an anchored pattern (*PRUNE) has the same effect as (*COMMIT). .P -The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE). -It is like (*MARK:NAME) in that the name is remembered for passing back to the +The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is +like (*MARK:NAME) in that the name is remembered for passing back to the caller. However, (*SKIP:NAME) searches only for names set with (*MARK), ignoring those set by (*PRUNE) or (*THEN). .sp @@ -3482,6 +3445,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 27 December 2016 -Copyright (c) 1997-2016 University of Cambridge. +Last updated: 18 March 2017 +Copyright (c) 1997-2017 University of Cambridge. .fi
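
The recursion hunk in this patch replaces atomic recursive calls with calls that can be re-entered on a later failure. The difference can be illustrated with a toy model. The sketch below is Python, not PCRE2 code: it hand-simulates the palindrome pattern ^((.)(?1)\2|.?)$ with generators, and the names group1, match_palindrome, and the atomic flag are hypothetical helpers invented for this illustration. It assumes subjects without newlines, so that dot matches every character.

```python
from itertools import islice


def group1(subject, pos, atomic):
    # Hand-written model of group 1 in ^((.)(?1)\2|.?)$.
    # This is an illustration only, NOT PCRE2's implementation.
    # Yields, in backtracking order, every offset at which the
    # group can end when it starts matching at pos.
    if pos < len(subject):
        # First branch: (.)(?1)\2
        inner = group1(subject, pos + 1, atomic)
        if atomic:
            # Assumed pre-10.30 behaviour: the recursive call is
            # atomic, so only its first successful match is kept.
            inner = islice(inner, 1)
        for end in inner:
            # \2 must match the character captured by (.)
            if end < len(subject) and subject[end] == subject[pos]:
                yield end + 1
        # Second branch: greedy .? consuming one character
        yield pos + 1
    # Second branch: .? matching the empty string
    yield pos


def match_palindrome(subject, atomic=False):
    # The top-level group is not a recursive call, so all of its
    # results are tried against the final $ in either mode.
    return any(end == len(subject) for end in group1(subject, 0, atomic))
```

In this model, match_palindrome("abba") succeeds, while match_palindrome("abba", atomic=True) fails because the innermost call cannot be re-entered to try the empty alternative. The subject "ababa" behaves the same way, which corresponds to the WARNING paragraph that the patch removes.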