Documentation update.

Philip.Hazel 2018-08-03 16:56:54 +00:00
parent b196143523
commit c722bf2399
3 changed files with 889 additions and 809 deletions

@ -249,7 +249,7 @@ is used.
<P>
The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
opening brace. However, it does not affect what the \R escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility. However, this can be changed; see the next section and the
@ -357,7 +357,7 @@ of the pattern.
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \Q and \E. This is different from Perl in
that $ and @ are handled as literals in \Q...\E sequences in PCRE2, whereas
in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
backslash interpolation" on any backslashes between \Q and \E which, its
documentation says, "may lead to confusing results". PCRE2 treats a backslash
between \Q and \E just like any other character. Note the following examples:
@ -400,7 +400,7 @@ these escapes are as follows:
\o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh.. (default mode)
\N{U+hhh..} character with Unicode code point hhh..
\N{U+hhh..} character with Unicode code point hhh..
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
</pre>
Note that when \N is not followed by an opening brace (curly bracket) it has
@ -590,7 +590,7 @@ Another use of backslash is for specifying generic character types:
\D any character that is not a decimal digit
\h any horizontal white space character
\H any character that is not a horizontal white space character
\N any character that is not a newline
\N any character that is not a newline
\s any white space character
\S any character that is not a white space character
\v any vertical white space character
@ -600,8 +600,8 @@ Another use of backslash is for specifying generic character types:
</pre>
The \N escape sequence has the same meaning as
<a href="#fullstopdot">the "." metacharacter</a>
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \N. Note that when \N is followed by an opening brace it has a
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \N. Note that when \N is followed by an opening brace it has a
different meaning. See the section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
@ -1030,8 +1030,8 @@ grapheme cluster", and treats the sequence as an atomic group
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Instead it introduced various emoji-specific properties. PCRE2 uses only the
Extended Pictographic property.
</P>
@ -1316,7 +1316,7 @@ special meaning in a character class.
<P>
The escape sequence \N when not followed by an opening brace behaves like a
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
it matches any character except one that signifies the end of a line.
it matches any character except one that signifies the end of a line.
</P>
<P>
When \N is followed by an opening brace it has a different meaning. See the
@ -1642,7 +1642,7 @@ documentation. The option letters are:
xx for PCRE2_EXTENDED_MORE
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the relevant letters with a hyphen, for
unset these options by preceding the relevant letters with a hyphen, for
example (?-im). The two "extended" options are not independent; unsetting either
one cancels the effects of both of them.
</P>
@ -1654,9 +1654,9 @@ appears both before and after the hyphen, the option is unset. An empty options
setting "(?)" is allowed. Needless to say, it has no effect.
</P>
<P>
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
the circumflex to cause some options to be re-instated, but a hyphen may not
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
the circumflex to cause some options to be re-instated, but a hyphen may not
appear.
</P>
<P>
@ -1813,41 +1813,68 @@ duplicate named subpatterns, as described in the next section.
<br><a name="SEC16" href="#TOC1">NAMED SUBPATTERNS</a><br>
<P>
Identifying capturing parentheses by number is simple, but it can be very hard
to keep track of the numbers in complicated regular expressions. Furthermore,
if an expression is modified, the numbers may change. To help with this
difficulty, PCRE2 supports the naming of subpatterns. This feature was not
added to Perl until release 5.10. Python had the feature earlier, and PCRE1
to keep track of the numbers in complicated patterns. Furthermore, if an
expression is modified, the numbers may change. To help with this difficulty,
PCRE2 supports the naming of capturing subpatterns. This feature was not added
to Perl until release 5.10. Python had the feature earlier, and PCRE1
introduced it at release 4.0, using the Python syntax. PCRE2 supports both the
Perl and the Python syntax. Perl allows identically numbered subpatterns to
have different names, but PCRE2 does not.
Perl and the Python syntax.
</P>
<P>
In PCRE2, a subpattern can be named in one of three ways: (?&#60;name&#62;...) or
(?'name'...) as in Perl, or (?P&#60;name&#62;...) as in Python. References to capturing
parentheses from other parts of the pattern, such as
In PCRE2, a capturing subpattern can be named in one of three ways:
(?&#60;name&#62;...) or (?'name'...) as in Perl, or (?P&#60;name&#62;...) as in Python. Names
consist of up to 32 alphanumeric characters and underscores, but must start
with a non-digit. References to capturing parentheses from other parts of the
pattern, such as
<a href="#backreferences">backreferences,</a>
<a href="#recursion">recursion,</a>
and
<a href="#conditions">conditions,</a>
can be made by name as well as by number.
can all be made by name as well as by number.
</P>
<P>
Names consist of up to 32 alphanumeric characters and underscores, but must
start with a non-digit. Named capturing parentheses are still allocated numbers
as well as names, exactly as if the names were not present. The PCRE2 API
provides function calls for extracting the name-to-number translation table
from a compiled pattern. There are also convenience functions for extracting a
captured substring by name.
Named capturing parentheses are allocated numbers as well as names, exactly as
if the names were not present. In both PCRE2 and Perl, capturing subpatterns
are primarily identified by numbers; any names are just aliases for these
numbers. The PCRE2 API provides function calls for extracting the complete
name-to-number translation table from a compiled pattern, as well as
convenience functions for extracting captured substrings by name.
</P>
<P>
By default, a name must be unique within a pattern, but it is possible to relax
this constraint by setting the PCRE2_DUPNAMES option at compile time.
(Duplicate names are also always permitted for subpatterns with the same
number, set up as described in the previous section.) Duplicate names can be
useful for patterns where only one instance of the named parentheses can match.
Suppose you want to match the name of a weekday, either as a 3-letter
abbreviation or as the full name, and in both cases you want to extract the
abbreviation. This pattern (ignoring the line breaks) does the job:
<b>Warning:</b> When more than one subpattern has the same number, as described
in the previous section, a name given to one of them applies to all of them.
Perl allows identically numbered subpatterns to have different names. Consider
this pattern, where there are two capturing subpatterns, both numbered 1:
<pre>
(?|(?&#60;AA&#62;aa)|(?&#60;BB&#62;bb))
</pre>
Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
a successful match, both names yield the same value (either "aa" or "bb").
</P>
<P>
In an attempt to reduce confusion, PCRE2 does not allow the same group number
to be associated with more than one name. The example above provokes a
compile-time error. However, there is still scope for confusion. Consider this
pattern:
<pre>
(?|(?&#60;AA&#62;aa)|(bb))
</pre>
Although the second subpattern number 1 is not explicitly named, the name AA is
still an alias for subpattern 1. Whether the pattern matches "aa" or "bb", a
reference by name to group AA yields the matched string.
</P>
<P>
By default, a name must be unique within a pattern, except that duplicate names
are permitted for subpatterns with the same number, for example:
<pre>
(?|(?&#60;AA&#62;aa)|(?&#60;AA&#62;bb))
</pre>
The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
option at compile time, or by the use of (?J) within the pattern. Duplicate
names can be useful for patterns where only one instance of the named
parentheses can match. Suppose you want to match the name of a weekday, either
as a 3-letter abbreviation or as the full name, and in both cases you want to
extract the abbreviation. This pattern (ignoring the line breaks) does the job:
<pre>
(?&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
(?&#60;DN&#62;Tue)(?:sday)?|
@ -1856,13 +1883,11 @@ abbreviation. This pattern (ignoring the line breaks) does the job:
(?&#60;DN&#62;Sat)(?:urday)?
</pre>
There are five capturing substrings, but only one is ever set after a match.
(An alternative way of solving this problem is to use a "branch reset"
subpattern, as described in the previous section.)
</P>
<P>
The convenience functions for extracting the data by name return the substring
for the first (and in this example, the only) subpattern of that name that
matched. This saves searching to find which numbered subpattern it was.
matched. This saves searching to find which numbered subpattern it was. (An
alternative way of solving this problem is to use a "branch reset" subpattern,
as described in the previous section.)
</P>
<P>
If you make a backreference to a non-unique named subpattern from elsewhere in
@ -1878,8 +1903,7 @@ for the reference. For example, this pattern matches both "foofoo" and
<P>
If you make a subroutine call to a non-unique named subpattern, the one that
corresponds to the first occurrence of the name is used. In the absence of
duplicate numbers (see the previous section) this is the one with the lowest
number.
duplicate numbers this is the one with the lowest number.
</P>
<P>
If you use a named reference in a condition
@ -1893,14 +1917,6 @@ handling named subpatterns, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
<b>Warning:</b> You cannot use different names to distinguish between two
subpatterns with the same number because PCRE2 uses only the numbers when
matching. For this reason, an error is given at compile time if different names
are given to subpatterns with the same number. However, you can always give the
same name to subpatterns with the same number, even when PCRE2_DUPNAMES is not
set.
</P>
<br><a name="SEC17" href="#TOC1">REPETITION</a><br>
<P>
Repetition is specified by quantifiers, which can follow any of the following
@ -2327,14 +2343,14 @@ the subject string is as it was before the assertion was processed.
<P>
Assertion subpatterns are not capturing subpatterns. If an assertion contains
capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. Within each branch of
numbering the capturing subpatterns in the whole pattern. Within each branch of
an assertion, locally captured substrings may be referenced in the usual way.
For example, a sequence such as (.)\g{-1} can be used to check that two
For example, a sequence such as (.)\g{-1} can be used to check that two
adjacent characters are the same.
</P>
<P>
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
@ -2348,7 +2364,7 @@ assertion has failed. If the assertion is being used as a condition in a
<a href="#conditions">conditional subpattern</a>
(see below), captured substrings are retained, because matching continues with
the "no" branch of the condition. For other failing negative assertions,
control passes to the previous backtracking point, thus discarding any captured
control passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
</P>
<P>
@ -2957,10 +2973,12 @@ later versions (I tried 5.024) it now works.
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates like a
subroutine in a programming language. The called subpattern may be defined
before or after the reference. A numbered reference can be absolute or
relative, as in these examples:
name) is used outside the parentheses to which it refers, it operates a bit
like a subroutine in a programming language. More accurately, PCRE2 treats the
referenced subpattern as an independent subpattern which it tries to match at
the current matching position. The called subpattern may be defined before or
after the reference. A numbered reference can be absolute or relative, as in
these examples:
<pre>
(...(absolute)...)...(?2)...
(...(relative)...)...(?-1)...
@ -2993,6 +3011,13 @@ different calls. For example, consider this pattern:
</pre>
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
</P>
<P>
The behaviour of
<a href="#backtrackcontrol">backtracking control verbs</a>
in subpatterns when called as subroutines is described in the section entitled
<a href="#btsub">"Backtracking verbs in subroutines"</a>
below.
<a name="onigurumasubroutines"></a></P>
<br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<P>
@ -3111,7 +3136,7 @@ are faulted.
</P>
<P>
A closing parenthesis can be included in a name either as \) or between \Q
and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
@ -3157,7 +3182,7 @@ in the
documentation.
</P>
<P>
Experiments with Perl suggest that it too has similar optimizations, and like
Experiments with Perl suggest that it too has similar optimizations, and like
PCRE2, turning them off can change the result of a match.
</P>
<br><b>
@ -3185,7 +3210,7 @@ the outer parentheses.
<pre>
(*FAIL) or (*FAIL:NAME)
</pre>
This verb causes a matching failure, forcing backtracking to occur. It may be
This verb causes a matching failure, forcing backtracking to occur. It may be
abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
documentation notes that it is probably useful only when combined with (?{}) or
(??{}). Those are, of course, Perl features that are not present in PCRE2. The
@ -3197,7 +3222,7 @@ A match with the string "aaaa" always fails, but the callout is taken before
each backtrack happens (in this example, 10 times).
</P>
<P>
(*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
(*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
(*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.
</P>
<br><b>
@ -3220,7 +3245,7 @@ matching path is passed back to the caller as described in the section entitled
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. This applies to all instances of (*MARK), including those inside
assertions and atomic groups. (There are differences in those cases when
assertions and atomic groups. (There are differences in those cases when
(*MARK) is used in conjunction with (*SKIP) as described below.)
</P>
<P>
@ -3300,7 +3325,7 @@ the current starting point, or not at all. For example:
a+(*COMMIT)b
</pre>
This matches "xxaab" but not "aacaab". It can be thought of as a kind of
dynamic anchor, or "I've started, so I must finish."
dynamic anchor, or "I've started, so I must finish."
</P>
<P>
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
@ -3524,7 +3549,7 @@ subpattern.
(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
without any further processing; captured strings and a (*MARK) name (if set)
are retained. In a standalone negative assertion, (*ACCEPT) causes the
assertion to fail without any further processing; captured substrings and any
assertion to fail without any further processing; captured substrings and any
(*MARK) name are discarded.
</P>
<P>
@ -3533,11 +3558,11 @@ a positive assertion and false for a negative one; captured substrings are
retained in both cases.
</P>
<P>
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that their effect is confined to the assertion,
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that their effect is confined to the assertion,
because lookaround assertions are atomic. A backtrack that occurs after an
assertion is complete does not jump back into the assertion. Note in particular
that a (*MARK) name that is set in an assertion is not "seen" by an instance of
assertion is complete does not jump back into the assertion. Note in particular
that a (*MARK) name that is set in an assertion is not "seen" by an instance of
(*SKIP:NAME) later in the pattern.
</P>
<P>

File diff suppressed because it is too large
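
The revised NAMED SUBPATTERNS text states that named groups are still allocated numbers, that names are aliases for those numbers, and that a name-to-number table can be extracted from a compiled pattern. A minimal sketch of that idea follows. It assumes Python's standard re module as a stand-in: re accepts the (?P<name>...) syntax, but it is not PCRE2, and the pattern and variable names are illustrative only.

```python
import re

# (?P<name>...) is the Python-style naming syntax that PCRE2 also accepts.
# The pattern itself is a made-up example, not taken from the commit.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# groupindex plays the role of a name-to-number translation table:
# each name maps to the number the group would have had anyway.
name_to_number = dict(pattern.groupindex)

match = pattern.match("2018-08")

# Extracting by name and by the aliased number gives the same substring.
by_name = match.group("month")
by_number = match.group(name_to_number["month"])

print(name_to_number, by_name, by_number)
```

Python's re rejects duplicate group names and has no branch-reset groups, so the (?|...) and PCRE2_DUPNAMES cases described in the documentation cannot be reproduced with this stand-in.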

@ -218,7 +218,7 @@ is used.
.P
The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
opening brace. However, it does not affect what the \eR escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility. However, this can be changed; see the next section and the
@ -331,7 +331,7 @@ of the pattern.
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \eQ and \eE. This is different from Perl in
that $ and @ are handled as literals in \eQ...\eE sequences in PCRE2, whereas
in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
backslash interpolation" on any backslashes between \eQ and \eE which, its
documentation says, "may lead to confusing results". PCRE2 treats a backslash
between \eQ and \eE just like any other character. Note the following examples:
@ -377,7 +377,7 @@ these escapes are as follows:
\eo{ddd..} character with octal code ddd..
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. (default mode)
\eN{U+hhh..} character with Unicode code point hhh..
\eN{U+hhh..} character with Unicode code point hhh..
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
.sp
Note that when \eN is not followed by an opening brace (curly bracket) it has
@ -581,7 +581,7 @@ Another use of backslash is for specifying generic character types:
\eD any character that is not a decimal digit
\eh any horizontal white space character
\eH any character that is not a horizontal white space character
\eN any character that is not a newline
\eN any character that is not a newline
\es any white space character
\eS any character that is not a white space character
\ev any vertical white space character
@ -594,8 +594,8 @@ The \eN escape sequence has the same meaning as
.\" </a>
the "." metacharacter
.\"
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \eN. Note that when \eN is followed by an opening brace it has a
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \eN. Note that when \eN is followed by an opening brace it has a
different meaning. See the section entitled
.\" HTML <a href="#digitsafterbackslash">
.\" </a>
@ -1029,8 +1029,8 @@ grapheme cluster", and treats the sequence as an atomic group
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Instead it introduced various emoji-specific properties. PCRE2 uses only the
Extended Pictographic property.
.P
@ -1310,7 +1310,7 @@ special meaning in a character class.
.P
The escape sequence \eN when not followed by an opening brace behaves like a
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
it matches any character except one that signifies the end of a line.
it matches any character except one that signifies the end of a line.
.P
When \eN is followed by an opening brace it has a different meaning. See the
section entitled
@ -1643,7 +1643,7 @@ documentation. The option letters are:
xx for PCRE2_EXTENDED_MORE
.sp
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the relevant letters with a hyphen, for
unset these options by preceding the relevant letters with a hyphen, for
example (?-im). The two "extended" options are not independent; unsetting either
one cancels the effects of both of them.
.P
@ -1653,9 +1653,9 @@ permitted. Only one hyphen may appear in the options string. If a letter
appears both before and after the hyphen, the option is unset. An empty options
setting "(?)" is allowed. Needless to say, it has no effect.
.P
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
the circumflex to cause some options to be re-instated, but a hyphen may not
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
the circumflex to cause some options to be re-instated, but a hyphen may not
appear.
.P
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
@ -1815,17 +1815,18 @@ duplicate named subpatterns, as described in the next section.
.rs
.sp
Identifying capturing parentheses by number is simple, but it can be very hard
to keep track of the numbers in complicated regular expressions. Furthermore,
if an expression is modified, the numbers may change. To help with this
difficulty, PCRE2 supports the naming of subpatterns. This feature was not
added to Perl until release 5.10. Python had the feature earlier, and PCRE1
to keep track of the numbers in complicated patterns. Furthermore, if an
expression is modified, the numbers may change. To help with this difficulty,
PCRE2 supports the naming of capturing subpatterns. This feature was not added
to Perl until release 5.10. Python had the feature earlier, and PCRE1
introduced it at release 4.0, using the Python syntax. PCRE2 supports both the
Perl and the Python syntax. Perl allows identically numbered subpatterns to
have different names, but PCRE2 does not.
Perl and the Python syntax.
.P
In PCRE2, a subpattern can be named in one of three ways: (?<name>...) or
(?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing
parentheses from other parts of the pattern, such as
In PCRE2, a capturing subpattern can be named in one of three ways:
(?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. Names
consist of up to 32 alphanumeric characters and underscores, but must start
with a non-digit. References to capturing parentheses from other parts of the
pattern, such as
.\" HTML <a href="#backreferences">
.\" </a>
backreferences,
@ -1839,23 +1840,47 @@ and
.\" </a>
conditions,
.\"
can be made by name as well as by number.
can all be made by name as well as by number.
.P
Names consist of up to 32 alphanumeric characters and underscores, but must
start with a non-digit. Named capturing parentheses are still allocated numbers
as well as names, exactly as if the names were not present. The PCRE2 API
provides function calls for extracting the name-to-number translation table
from a compiled pattern. There are also convenience functions for extracting a
captured substring by name.
Named capturing parentheses are allocated numbers as well as names, exactly as
if the names were not present. In both PCRE2 and Perl, capturing subpatterns
are primarily identified by numbers; any names are just aliases for these
numbers. The PCRE2 API provides function calls for extracting the complete
name-to-number translation table from a compiled pattern, as well as
convenience functions for extracting captured substrings by name.
.P
By default, a name must be unique within a pattern, but it is possible to relax
this constraint by setting the PCRE2_DUPNAMES option at compile time.
(Duplicate names are also always permitted for subpatterns with the same
number, set up as described in the previous section.) Duplicate names can be
useful for patterns where only one instance of the named parentheses can match.
Suppose you want to match the name of a weekday, either as a 3-letter
abbreviation or as the full name, and in both cases you want to extract the
abbreviation. This pattern (ignoring the line breaks) does the job:
\fBWarning:\fP When more than one subpattern has the same number, as described
in the previous section, a name given to one of them applies to all of them.
Perl allows identically numbered subpatterns to have different names. Consider
this pattern, where there are two capturing subpatterns, both numbered 1:
.sp
(?|(?<AA>aa)|(?<BB>bb))
.sp
Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
a successful match, both names yield the same value (either "aa" or "bb").
.P
In an attempt to reduce confusion, PCRE2 does not allow the same group number
to be associated with more than one name. The example above provokes a
compile-time error. However, there is still scope for confusion. Consider this
pattern:
.sp
(?|(?<AA>aa)|(bb))
.sp
Although the second subpattern number 1 is not explicitly named, the name AA is
still an alias for subpattern 1. Whether the pattern matches "aa" or "bb", a
reference by name to group AA yields the matched string.
.P
By default, a name must be unique within a pattern, except that duplicate names
are permitted for subpatterns with the same number, for example:
.sp
(?|(?<AA>aa)|(?<AA>bb))
.sp
The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
option at compile time, or by the use of (?J) within the pattern. Duplicate
names can be useful for patterns where only one instance of the named
parentheses can match. Suppose you want to match the name of a weekday, either
as a 3-letter abbreviation or as the full name, and in both cases you want to
extract the abbreviation. This pattern (ignoring the line breaks) does the job:
.sp
(?<DN>Mon|Fri|Sun)(?:day)?|
(?<DN>Tue)(?:sday)?|
@ -1864,12 +1889,11 @@ abbreviation. This pattern (ignoring the line breaks) does the job:
(?<DN>Sat)(?:urday)?
.sp
There are five capturing substrings, but only one is ever set after a match.
(An alternative way of solving this problem is to use a "branch reset"
subpattern, as described in the previous section.)
.P
The convenience functions for extracting the data by name return the substring
for the first (and in this example, the only) subpattern of that name that
matched. This saves searching to find which numbered subpattern it was.
matched. This saves searching to find which numbered subpattern it was. (An
alternative way of solving this problem is to use a "branch reset" subpattern,
as described in the previous section.)
.P
If you make a backreference to a non-unique named subpattern from elsewhere in
the pattern, the subpatterns to which the name refers are checked in the order
@ -1882,8 +1906,7 @@ for the reference. For example, this pattern matches both "foofoo" and
.P
If you make a subroutine call to a non-unique named subpattern, the one that
corresponds to the first occurrence of the name is used. In the absence of
duplicate numbers (see the previous section) this is the one with the lowest
number.
duplicate numbers this is the one with the lowest number.
.P
If you use a named reference in a condition
test (see the
@ -1901,13 +1924,6 @@ handling named subpatterns, see the
\fBpcre2api\fP
.\"
documentation.
.P
\fBWarning:\fP You cannot use different names to distinguish between two
subpatterns with the same number because PCRE2 uses only the numbers when
matching. For this reason, an error is given at compile time if different names
are given to subpatterns with the same number. However, you can always give the
same name to subpatterns with the same number, even when PCRE2_DUPNAMES is not
set.
.
.
.SH REPETITION
@ -2336,13 +2352,13 @@ the subject string is as it was before the assertion was processed.
.P
Assertion subpatterns are not capturing subpatterns. If an assertion contains
capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. Within each branch of
numbering the capturing subpatterns in the whole pattern. Within each branch of
an assertion, locally captured substrings may be referenced in the usual way.
For example, a sequence such as (.)\eg{-1} can be used to check that two
For example, a sequence such as (.)\eg{-1} can be used to check that two
adjacent characters are the same.
.P
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
@ -2358,7 +2374,7 @@ conditional subpattern
.\"
(see below), captured substrings are retained, because matching continues with
the "no" branch of the condition. For other failing negative assertions,
control passes to the previous backtracking point, thus discarding any captured
control passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
.P
For compatibility with Perl, most assertion subpatterns may be repeated; though
@ -2982,10 +2998,12 @@ later versions (I tried 5.024) it now works.
.rs
.sp
If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates like a
subroutine in a programming language. The called subpattern may be defined
before or after the reference. A numbered reference can be absolute or
relative, as in these examples:
name) is used outside the parentheses to which it refers, it operates a bit
like a subroutine in a programming language. More accurately, PCRE2 treats the
referenced subpattern as an independent subpattern which it tries to match at
the current matching position. The called subpattern may be defined before or
after the reference. A numbered reference can be absolute or relative, as in
these examples:
.sp
(...(absolute)...)...(?2)...
(...(relative)...)...(?-1)...
@ -3016,6 +3034,18 @@ different calls. For example, consider this pattern:
.sp
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
.P
The behaviour of
.\" HTML <a href="#backtrackcontrol">
.\" </a>
backtracking control verbs
.\"
in subpatterns when called as subroutines is described in the section entitled
.\" HTML <a href="#btsub">
.\" </a>
"Backtracking verbs in subroutines"
.\"
below.
.
.
.\" HTML <a name="onigurumasubroutines"></a>
@ -3137,7 +3167,7 @@ only backslash items that are permitted are \eQ, \eE, and sequences such as
are faulted.
.P
A closing parenthesis can be included in a name either as \e) or between \eQ
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
@ -3194,7 +3224,7 @@ in the
.\"
documentation.
.P
Experiments with Perl suggest that it too has similar optimizations, and like
Experiments with Perl suggest that it too has similar optimizations, and like
PCRE2, turning them off can change the result of a match.
.
.
@ -3221,7 +3251,7 @@ the outer parentheses.
.sp
(*FAIL) or (*FAIL:NAME)
.sp
This verb causes a matching failure, forcing backtracking to occur. It may be
This verb causes a matching failure, forcing backtracking to occur. It may be
abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
documentation notes that it is probably useful only when combined with (?{}) or
(??{}). Those are, of course, Perl features that are not present in PCRE2. The
@ -3232,7 +3262,7 @@ nearest equivalent is the callout feature, as for example in this pattern:
A match with the string "aaaa" always fails, but the callout is taken before
each backtrack happens (in this example, 10 times).
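.P
The count of 10 can be checked by hand with a small model. The sketch below is
plain Python, not PCRE2, and assumes the pattern has the shape of a greedy a+
followed by a callout and then (*FAIL), with no start-of-match optimizations
applied. The function name count_callouts is hypothetical.

```python
def count_callouts(subject):
    """Count how often the point just after a+ is reached before failing."""
    reached = 0
    for start in range(len(subject)):
        # Longest run of "a" beginning at this start position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ first takes the whole run, then gives back one
        # character at a time; each attempt reaches the callout and fails.
        for _length in range(run, 0, -1):
            reached += 1
    return reached


print(count_callouts("aaaa"))  # 10, that is 4 + 3 + 2 + 1
```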
.P
(*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
(*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
(*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.
.
.
@ -3259,7 +3289,7 @@ in the
\fBpcre2api\fP
.\"
documentation. This applies to all instances of (*MARK), including those inside
assertions and atomic groups. (There are differences in those cases when
assertions and atomic groups. (There are differences in those cases when
(*MARK) is used in conjunction with (*SKIP) as described below.)
.P
As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have
@ -3336,7 +3366,7 @@ the current starting point, or not at all. For example:
a+(*COMMIT)b
.sp
This matches "xxaab" but not "aacaab". It can be thought of as a kind of
dynamic anchor, or "I've started, so I must finish."
dynamic anchor, or "I've started, so I must finish."
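.P
The two subjects above can be traced with a hand-written model of this one
pattern. The sketch below is plain Python, not PCRE2, and only mimics the
described behaviour: once a+ has matched and (*COMMIT) has been passed, a
failure ends the whole match attempt. The function name is hypothetical.

```python
def match_a_plus_commit_b(subject):
    """Model a+(*COMMIT)b: return the matched text, or None if no match."""
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            # a+ failed before (*COMMIT) was reached, so the next
            # starting position is tried as usual.
            continue
        if end < len(subject) and subject[end] == "b":
            return subject[start:end + 1]
        # Backtracking onto (*COMMIT): no shorter a+ and no later
        # starting position is tried.
        return None
    return None


print(match_a_plus_commit_b("xxaab"))   # aab
print(match_a_plus_commit_b("aacaab"))  # None
```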
.P
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
@ -3424,7 +3454,7 @@ following \fBpcre2test\fP examples:
data: abc
0: b
1: b
.sp
.sp
In the first example, the (*MARK) setting is in an atomic group, so it is not
seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
the second branch of the pattern to be tried at the first character position.
@ -3551,18 +3581,18 @@ subpattern.
(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
without any further processing; captured strings and a (*MARK) name (if set)
are retained. In a standalone negative assertion, (*ACCEPT) causes the
assertion to fail without any further processing; captured substrings and any
assertion to fail without any further processing; captured substrings and any
(*MARK) name are discarded.
.P
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
a positive assertion and false for a negative one; captured substrings are
retained in both cases.
.P
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that their effect is confined to the assertion,
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that their effect is confined to the assertion,
because lookaround assertions are atomic. A backtrack that occurs after an
assertion is complete does not jump back into the assertion. Note in particular
that a (*MARK) name that is set in an assertion is not "seen" by an instance of
assertion is complete does not jump back into the assertion. Note in particular
that a (*MARK) name that is set in an assertion is not "seen" by an instance of
(*SKIP:NAME) later in the pattern.
.P
The effect of (*THEN) is not allowed to escape beyond an assertion. If there