Documentation update

Philip.Hazel 2020-10-07 16:27:20 +00:00
parent fff544a1e9
commit 6d4936dc29
6 changed files with 1310 additions and 1255 deletions

@ -1492,10 +1492,13 @@ letters in the subject. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
PCRE2_UCP is set, Unicode properties are used for all characters with more than
one other case, and for all characters whose code points are greater than
U+007F. For lower valued characters with only one other case, a lookup table is
used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
used for all code points less than 256, and higher code points (available only
in 16-bit or 32-bit mode) are treated as not having another case.
U+007F. Note that there are two ASCII characters, K and S, that, in addition to
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
sign) and U+017F (long S) respectively. For lower valued characters with only
one other case, a lookup table is used for speed. When neither PCRE2_UTF nor
PCRE2_UCP is set, a lookup table is used for all code points less than 256, and
higher code points (available only in 16-bit or 32-bit mode) are treated as not
having another case.
<pre>
PCRE2_DOLLAR_ENDONLY
</pre>
@ -3956,7 +3959,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 19 March 2020
Last updated: 05 October 2020
<br>
Copyright &copy; 1997-2020 University of Cambridge.
<br>
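The K and S case equivalences added in the hunk above come from Unicode itself, so they can be checked outside PCRE2. The following is a minimal sketch using Python's standard library as a stand-in; it does not call PCRE2, and the helper names `casefold_equal` and `caseless_match` are illustrative only. Python's `re.ASCII` flag is used here as a rough analogue of matching with neither PCRE2_UTF nor PCRE2_UCP set.

```python
import re
import unicodedata

KELVIN_SIGN = "\u212a"  # case-equivalent with ASCII K and k
LONG_S = "\u017f"       # case-equivalent with ASCII S and s


def casefold_equal(a: str, b: str) -> bool:
    """Return True when two strings are equal under Unicode case folding."""
    return a.casefold() == b.casefold()


def caseless_match(pattern: str, subject: str, ascii_only: bool = False) -> bool:
    """Caseless full match with Python's re module (not PCRE2).

    ascii_only=True restricts case-insensitivity to ASCII letters.
    """
    flags = re.IGNORECASE | (re.ASCII if ascii_only else 0)
    return re.fullmatch(pattern, subject, flags) is not None


if __name__ == "__main__":
    for ch in (KELVIN_SIGN, LONG_S):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    print(casefold_equal("K", KELVIN_SIGN), casefold_equal("S", LONG_S))
    print(caseless_match("k", KELVIN_SIGN))
    print(caseless_match("k", KELVIN_SIGN, ascii_only=True))
```

The practical consequence is the same in any Unicode-aware engine: a caseless pattern containing k or s can match a non-ASCII character in the subject, which matters when the pattern is used to validate input that is later treated as ASCII.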

@ -16,10 +16,10 @@ please consult the man page, in case the conversion went wrong.
DIFFERENCES BETWEEN PCRE2 AND PERL
</b><br>
<P>
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
versions 5.26, but as both Perl and PCRE2 are continually changing, the
information may sometimes be out of date.
This document describes some of the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with respect to
Perl version 5.32.0, but as both Perl and PCRE2 are continually changing, the
information may at times be out of date.
</P>
<P>
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
@ -33,12 +33,15 @@ they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \b* (but not \b{3}), but these do not seem to have any use.
for example, \b* (but not \b{3}, though oddly it does allow ^{3}), but these
do not seem to have any use. PCRE2 does not allow any kind of quantifier on
non-lookaround assertions.
</P>
<P>
3. Capture groups that occur inside negative lookaround assertions are counted,
but their entries in the offsets vector are set only when a negative assertion
is a condition that has a matching branch (that is, the condition is false).
Perl may set such capture groups in other circumstances.
</P>
<P>
4. The following Perl escape sequences are not supported: \F, \l, \L, \u,
@ -56,10 +59,12 @@ interprets them.
built with Unicode support (the default). The properties that can be tested
with \p and \P are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&.
PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
documentation says "Because Perl hides the need for the user to understand the
internal representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates."
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
is limited. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation for details. The long synonyms for property names that Perl
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
to prefix any of these properties with "Is".
</P>
<P>
6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
@ -79,7 +84,8 @@ other character. Note the following examples:
\QA\B\E A\B A\B
\Q\\E \ \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
The \Q...\E sequence is recognized both inside and outside character classes
by both PCRE2 and Perl.
</P>
<P>
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
@ -94,13 +100,13 @@ to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
into subroutine calls is now supported, as in Perl.
</P>
<P>
9. If any of the backtracking control verbs are used in a group that is called
as a subroutine (whether or not recursively), their effect is confined to that
group; it does not extend to the surrounding pattern. This is not always the
case in Perl. In particular, if (*THEN) is present in a group that is called as
a subroutine, its action is limited to that group, even if the group does not
contain any | characters. Note that such groups are processed as anchored
at the point where they are tested.
9. In PCRE2, if any of the backtracking control verbs are used in a group that
is called as a subroutine (whether or not recursively), their effect is
confined to that group; it does not extend to the surrounding pattern. This is
not always the case in Perl. In particular, if (*THEN) is present in a group
that is called as a subroutine, its action is limited to that group, even if
the group does not contain any | characters. Note that such groups are
processed as anchored at the point where they are tested.
</P>
<P>
10. If a pattern contains more than one backtracking control verb, the first
@ -110,55 +116,56 @@ triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are cases where it differs.
</P>
<P>
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
</P>
<P>
12. There are some differences that are concerned with the settings of captured
11. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
"b".
</P>
<P>
13. PCRE2's handling of duplicate capture group numbers and names is not as
12. PCRE2's handling of duplicate capture group numbers and names is not as
general as Perl's. This is a consequence of the fact that PCRE2 works internally
just with numbers, using an external table to translate between numbers and
names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B), where the two
names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B)), where the two
capture groups have the same number but different names, is not supported, and
causes an error at compile time. If it were allowed, it would not be possible
to distinguish which group matched, because both names map to capture group
number 1. To avoid this confusing situation, an error is given at compile time.
</P>
<P>
14. Perl used to recognize comments in some places that PCRE2 does not, for
13. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a group. If the /x modifier is
set, Perl allowed white space between ( and ? though the latest Perls give an
error (for a while it was just deprecated). There may still be some cases where
Perl behaves differently.
</P>
<P>
15. Perl, when in warning mode, gives warnings for character classes such as
14. Perl, when in warning mode, gives warnings for character classes such as
[A-\d] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
warning features, so it gives an error in these cases because they are almost
certainly user mistakes.
</P>
<P>
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
15. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \p{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
in the release at the time of writing (5.32), \p{Lu} and \p{Ll} match all
letters, regardless of case, when case independence is specified.
</P>
<P>
16. From release 5.32.0, Perl locks out the use of \K in lookaround
assertions. In PCRE2, \K is acted on when it occurs in positive assertions,
but is ignored in negative assertions.
</P>
<P>
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
Perl 5.10 included new features that were not in earlier versions of Perl, some
of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.26:
list is with respect to Perl 5.32:
<br>
<br>
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
each alternative toplevel branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same length.
<br>
<br>
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
@ -203,7 +210,7 @@ different way and is not Perl-compatible.
<br>
<br>
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within
the start of a pattern. These set overall options that cannot be changed within
the pattern.
<br>
<br>
@ -239,7 +246,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 13 July 2019
Last updated: 06 October 2020
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
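The repeated-capture difference described in the list above (matching "aba" against /^(a(b)?)+$/) is easy to probe in any engine. The sketch below uses Python's `re` module, which is neither PCRE2 nor Perl; it is shown only because it keeps the value captured in an earlier iteration, which is the outcome the text attributes to PCRE2. The function name `captures` is hypothetical.

```python
import re

# Same pattern as in the compatibility note: the inner group (b)? takes part
# in the first iteration of the outer group but not in the last one.
PATTERN = re.compile(r"^(a(b)?)+$")


def captures(subject: str):
    """Return the tuple of captured groups, or None if there is no match."""
    m = PATTERN.match(subject)
    return None if m is None else m.groups()


if __name__ == "__main__":
    # In Python, group 2 still holds "b" from the first iteration.
    print(captures("aba"))
    # Group 2 never participates here, so it is reported as unset (None).
    print(captures("aa"))
```

Code that must behave identically under Perl and PCRE2 should avoid relying on the value of a capture group that did not take part in the final iteration of a repeat.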

@ -289,8 +289,11 @@ corresponding characters in the subject. As a trivial example, the pattern
The quick brown fox
</pre>
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option), letters are matched
independently of case.
caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
pattern), letters are matched independently of case. Note that there are two
ASCII characters, K and S, that, in addition to their lower case ASCII
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set.
</P>
<P>
The power of regular expressions comes from the ability to include wild cards,
@ -326,6 +329,20 @@ a character class the only metacharacters are:
[ POSIX character class (if followed by POSIX syntax)
] terminates the character class
</pre>
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern, other than in a character class, and characters between a #
outside a character class and the next newline, inclusive, are ignored. An
escaping backslash can be used to include a white space or a # character as
part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the same
applies, but in addition unescaped space and horizontal tab characters are
ignored inside a character class. Note: only these two characters are ignored,
not the full set of pattern white space characters that are ignored outside a
character class. Option settings can be changed within a pattern; see the
section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
The following sections describe the use of each of the metacharacters.
</P>
<br><a name="SEC5" href="#TOC1">BACKSLASH</a><br>
@ -343,16 +360,9 @@ precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \\.
</P>
<P>
In a UTF mode, only ASCII digits and letters have any special meaning after a
backslash. All other characters (in particular, those whose code points are
greater than 127) are treated as literals.
</P>
<P>
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern (other than in a character class), and characters between a #
outside a character class and the next newline, inclusive, are ignored. An
escaping backslash can be used to include a white space or # character as part
of the pattern.
Only ASCII digits and letters have any special meaning after a backslash. All
other characters (in particular, those whose code points are greater than 127)
are treated as literals.
</P>
<P>
If you want to treat all characters in a sequence as literals, you can do so by
@ -1165,8 +1175,9 @@ For example, when the pattern
matches "foobar", the first substring is still set to "foo".
</P>
<P>
Perl documents that the use of \K within assertions is "not well defined". In
PCRE2, \K is acted upon when it occurs inside positive assertions, but is
Perl used to document that the use of \K within lookaround assertions is "not
well defined", but from version 5.32.0 Perl does not support this usage at all.
In PCRE2, \K is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\K)
matches, the reported start of the match can be greater than the end of the
match. Using \K in a lookbehind assertion at the start of a pattern can also
@ -1443,7 +1454,10 @@ Characters in a class may be specified by their code points using \o, \x, or
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
class represent both their upper case and lower case versions, so for example,
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would.
match "A", whereas a caseful version would. Note that there are two ASCII
characters, K and S, that, in addition to their lower case ASCII equivalents,
are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
respectively when either PCRE2_UTF or PCRE2_UCP is set.
</P>
<P>
Characters that might indicate line breaks are never treated in any special way
@ -3838,7 +3852,7 @@ Cambridge, England.
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 February 2020
Last updated: 06 October 2020
<br>
Copyright &copy; 1997-2020 University of Cambridge.
<br>
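The PCRE2_EXTENDED rules moved in the hunk above (white space and # comments ignored outside a character class, with a backslash to keep them) have a close counterpart in Python's `re.VERBOSE` flag. The sketch below uses that flag as an analogy only; it does not exercise PCRE2, and Python has no equivalent of PCRE2_EXTENDED_MORE, so a space inside a character class always remains significant here. The names `PATTERN` and `matches` are illustrative.

```python
import re

# Verbose-mode pattern: layout white space and comments are ignored,
# a space inside [ ] still matches a space, and \# is a literal #.
PATTERN = re.compile(
    r"""
    \d{4}   # four digits
    -       # a literal hyphen
    \d{2}   # two digits
    [ ]     # white space inside a character class is kept
    \#      # an escaped # is part of the pattern, not a comment
    """,
    re.VERBOSE,
)


def matches(text: str) -> bool:
    """Return True when the whole of text matches PATTERN."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    print(matches("2020-10 #"))
    print(matches("2020-10#"))
```

The second call fails because the space required by the character class is missing from the subject, which shows that only the layout white space was discarded.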

@ -1463,11 +1463,14 @@ COMPILING A PATTERN
it can be changed within a pattern by a (?i) option setting. If either
PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
characters with more than one other case, and for all characters whose
code points are greater than U+007F. For lower valued characters with
only one other case, a lookup table is used for speed. When neither
PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used for all code
points less than 256, and higher code points (available only in 16-bit
or 32-bit mode) are treated as not having another case.
code points are greater than U+007F. Note that there are two ASCII
characters, K and S, that, in addition to their lower case ASCII equiv-
alents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long
S) respectively. For lower valued characters with only one other case,
a lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP
is set, a lookup table is used for all code points less than 256, and
higher code points (available only in 16-bit or 32-bit mode) are
treated as not having another case.
PCRE2_DOLLAR_ENDONLY
@ -3793,7 +3796,7 @@ AUTHOR
REVISION
Last updated: 19 March 2020
Last updated: 05 October 2020
Copyright (c) 1997-2020 University of Cambridge.
------------------------------------------------------------------------------
@ -4831,10 +4834,10 @@ NAME
DIFFERENCES BETWEEN PCRE2 AND PERL
This document describes the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with re-
spect to Perl versions 5.26, but as both Perl and PCRE2 are continually
changing, the information may sometimes be out of date.
This document describes some of the differences in the ways that PCRE2
and Perl handle regular expressions. The differences described here are
with respect to Perl version 5.32.0, but as both Perl and PCRE2 are
continually changing, the information may at times be out of date.
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
it does have are given in the pcre2unicode page.
@ -4845,12 +4848,15 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
serts that the next character is not "a" three times (in principle;
PCRE2 optimizes this to run the assertion just once). Perl allows some
repeat quantifiers on other assertions, for example, \b* (but not
\b{3}), but these do not seem to have any use.
\b{3}, though oddly it does allow ^{3}), but these do not seem to have
any use. PCRE2 does not allow any kind of quantifier on non-lookaround
assertions.
3. Capture groups that occur inside negative lookaround assertions are
counted, but their entries in the offsets vector are set only when a
negative assertion is a condition that has a matching branch (that is,
the condition is false).
the condition is false). Perl may set such capture groups in other
circumstances.
4. The following Perl escape sequences are not supported: \F, \l, \L,
\u, \U, and \N when followed by a character name. \N on its own, match-
@ -4866,11 +4872,11 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
is built with Unicode support (the default). The properties that can be
tested with \p and \P are limited to the general category properties
such as Lu and Nd, script names such as Greek or Han, and the derived
properties Any and L&. PCRE2 does support the Cs (surrogate) property,
which Perl does not; the Perl documentation says "Because Perl hides
the need for the user to understand the internal representation of Uni-
code characters, there is no need to implement the somewhat messy con-
cept of surrogates."
properties Any and L&. Both PCRE2 and Perl support the Cs (surrogate)
property, but in PCRE2 its use is limited. See the pcre2pattern docu-
mentation for details. The long synonyms for property names that Perl
supports (such as \p{Letter}) are not supported by PCRE2, nor is it
permitted to prefix any of these properties with "Is".
6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
in between are treated as literals. However, this is slightly different
@ -4892,7 +4898,7 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
\Q\\E \ \\E
The \Q...\E sequence is recognized both inside and outside character
classes.
classes by both PCRE2 and Perl.
7. Fairly obviously, PCRE2 does not support the (?{code}) and
(??{code}) constructions. However, PCRE2 does have a "callout" feature,
@ -4903,14 +4909,14 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
groups up to PCRE2 release 10.23, but from release 10.30 this changed,
and backtracking into subroutine calls is now supported, as in Perl.
9. If any of the backtracking control verbs are used in a group that is
called as a subroutine (whether or not recursively), their effect is
confined to that group; it does not extend to the surrounding pattern.
This is not always the case in Perl. In particular, if (*THEN) is
present in a group that is called as a subroutine, its action is lim-
ited to that group, even if the group does not contain any | charac-
ters. Note that such groups are processed as anchored at the point
where they are tested.
9. In PCRE2, if any of the backtracking control verbs are used in a
group that is called as a subroutine (whether or not recursively),
their effect is confined to that group; it does not extend to the sur-
rounding pattern. This is not always the case in Perl. In particular,
if (*THEN) is present in a group that is called as a subroutine, its
action is limited to that group, even if the group does not contain any
| characters. Note that such groups are processed as anchored at the
point where they are tested.
10. If a pattern contains more than one backtracking control verb, the
first one that is backtracked onto acts. For example, in the pattern
@ -4918,51 +4924,52 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
it is the same as PCRE2, but there are cases where it differs.
11. Most backtracking verbs in assertions have their normal actions.
They are not confined to the assertion.
12. There are some differences that are concerned with the settings of
11. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
set, but in PCRE2 it is set to "b".
13. PCRE2's handling of duplicate capture group numbers and names is
12. PCRE2's handling of duplicate capture group numbers and names is
not as general as Perl's. This is a consequence of the fact that PCRE2
works internally just with numbers, using an external table to trans-
late between numbers and names. In particular, a pattern such as
(?|(?<a>A)|(?<b>B), where the two capture groups have the same number
(?|(?<a>A)|(?<b>B)), where the two capture groups have the same number
but different names, is not supported, and causes an error at compile
time. If it were allowed, it would not be possible to distinguish which
group matched, because both names map to capture group number 1. To
avoid this confusing situation, an error is given at compile time.
14. Perl used to recognize comments in some places that PCRE2 does not,
13. Perl used to recognize comments in some places that PCRE2 does not,
for example, between the ( and ? at the start of a group. If the /x
modifier is set, Perl allowed white space between ( and ? though the
latest Perls give an error (for a while it was just deprecated). There
may still be some cases where Perl behaves differently.
15. Perl, when in warning mode, gives warnings for character classes
14. Perl, when in warning mode, gives warnings for character classes
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
als. PCRE2 has no warning features, so it gives an error in these cases
because they are almost certainly user mistakes.
16. In PCRE2, the upper/lower case character properties Lu and Ll are
15. In PCRE2, the upper/lower case character properties Lu and Ll are
not affected when case-independent matching is specified. For example,
\p{Lu} always matches an upper case letter. I think Perl has changed in
this respect; in the release at the time of writing (5.24), \p{Lu} and
this respect; in the release at the time of writing (5.32), \p{Lu} and
\p{Ll} match all letters, regardless of case, when case independence is
specified.
16. From release 5.32.0, Perl locks out the use of \K in lookaround as-
sertions. In PCRE2, \K is acted on when it occurs in positive asser-
tions, but is ignored in negative assertions.
17. PCRE2 provides some extensions to the Perl regular expression fa-
cilities. Perl 5.10 includes new features that are not in earlier ver-
sions of Perl, some of which (such as named parentheses) were in PCRE2
for some time before. This list is with respect to Perl 5.26:
cilities. Perl 5.10 included new features that were not in earlier
versions of Perl, some of which (such as named parentheses) were in
PCRE2 for some time before. This list is with respect to Perl 5.32:
(a) Although lookbehind assertions in PCRE2 must match fixed length
strings, each alternative branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same
length.
strings, each alternative toplevel branch of a lookbehind assertion can
match a different length of string. Perl requires them all to have the
same length.
(b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
ported in lookbehinds, provided that there is no possibility of refer-
@ -4997,7 +5004,7 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
different way and is not Perl-compatible.
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
at the start of a pattern that set overall options that cannot be
at the start of a pattern. These set overall options that cannot be
changed within the pattern.
(m) PCRE2 supports non-atomic positive lookaround assertions. This is
@ -5026,7 +5033,7 @@ AUTHOR
REVISION
Last updated: 13 July 2019
Last updated: 06 October 2020
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
@ -6353,8 +6360,12 @@ CHARACTERS AND METACHARACTERS
The quick brown fox
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option), letters are
matched independently of case.
caseless matching is specified (the PCRE2_CASELESS option or (?i)
within the pattern), letters are matched independently of case. Note
that there are two ASCII characters, K and S, that, in addition to
their lower case ASCII equivalents, are case-equivalent with Unicode
U+212A (Kelvin sign) and U+017F (long S) respectively when either
PCRE2_UTF or PCRE2_UCP is set.
The power of regular expressions comes from the ability to include wild
cards, character classes, alternatives, and repetitions in the pattern.
@ -6389,6 +6400,18 @@ CHARACTERS AND METACHARACTERS
[ POSIX character class (if followed by POSIX syntax)
] terminates the character class
If a pattern is compiled with the PCRE2_EXTENDED option, most white
space in the pattern, other than in a character class, and characters
between a # outside a character class and the next newline, inclusive,
are ignored. An escaping backslash can be used to include a white space
or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op-
tion is set, the same applies, but in addition unescaped space and hor-
izontal tab characters are ignored inside a character class. Note: only
these two characters are ignored, not the full set of pattern white
space characters that are ignored outside a character class. Option
settings can be changed within a pattern; see the section entitled "In-
ternal Option Setting" below.
The following sections describe the use of each of the metacharacters.
@ -6406,15 +6429,9 @@ BACKSLASH
that it stands for itself. In particular, if you want to match a back-
slash, you write \\.
In a UTF mode, only ASCII digits and letters have any special meaning
after a backslash. All other characters (in particular, those whose
code points are greater than 127) are treated as literals.
If a pattern is compiled with the PCRE2_EXTENDED option, most white
space in the pattern (other than in a character class), and characters
between a # outside a character class and the next newline, inclusive,
are ignored. An escaping backslash can be used to include a white space
or # character as part of the pattern.
Only ASCII digits and letters have any special meaning after a back-
slash. All other characters (in particular, those whose code points are
greater than 127) are treated as literals.
If you want to treat all characters in a sequence as literals, you can
do so by putting them between \Q and \E. This is different from Perl in
@ -7039,13 +7056,14 @@ BACKSLASH
matches "foobar", the first substring is still set to "foo".
Perl documents that the use of \K within assertions is "not well de-
fined". In PCRE2, \K is acted upon when it occurs inside positive as-
sertions, but is ignored in negative assertions. Note that when a pat-
tern such as (?=ab\K) matches, the reported start of the match can be
greater than the end of the match. Using \K in a lookbehind assertion
at the start of a pattern can also lead to odd effects. For example,
consider this pattern:
Perl used to document that the use of \K within lookaround assertions
is "not well defined", but from version 5.32.0 Perl does not support
this usage at all. In PCRE2, \K is acted upon when it occurs inside
positive assertions, but is ignored in negative assertions. Note that
when a pattern such as (?=ab\K) matches, the reported start of the
match can be greater than the end of the match. Using \K in a lookbe-
hind assertion at the start of a pattern can also lead to odd effects.
For example, consider this pattern:
(?<=\Kfoo)bar
@ -7301,7 +7319,10 @@ SQUARE BRACKETS AND CHARACTER CLASSES
letters in a class represent both their upper case and lower case ver-
sions, so for example, a caseless [aeiou] matches "A" as well as "a",
and a caseless [^aeiou] does not match "A", whereas a caseful version
would.
would. Note that there are two ASCII characters, K and S, that, in ad-
dition to their lower case ASCII equivalents, are case-equivalent with
Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
ther PCRE2_UTF or PCRE2_UCP is set.
Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending se-
@ -9559,7 +9580,7 @@ AUTHOR
REVISION
Last updated: 24 February 2020
Last updated: 06 October 2020
Copyright (c) 1997-2020 University of Cambridge.
------------------------------------------------------------------------------

@ -1,13 +1,13 @@
.TH PCRE2COMPAT 3 "13 July 2019" "PCRE2 10.34"
.TH PCRE2COMPAT 3 "06 October 2020" "PCRE2 10.36"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
.rs
.sp
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
versions 5.26, but as both Perl and PCRE2 are continually changing, the
information may sometimes be out of date.
This document describes some of the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with respect to
Perl version 5.32.0, but as both Perl and PCRE2 are continually changing, the
information may at times be out of date.
.P
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
have are given in the
@ -21,11 +21,14 @@ they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \eb* (but not \eb{3}), but these do not seem to have any use.
for example, \eb* (but not \eb{3}, though oddly it does allow ^{3}), but these
do not seem to have any use. PCRE2 does not allow any kind of quantifier on
non-lookaround assertions.
.P
3. Capture groups that occur inside negative lookaround assertions are counted,
but their entries in the offsets vector are set only when a negative assertion
is a condition that has a matching branch (that is, the condition is false).
Perl may set such capture groups in other circumstances.
.P
4. The following Perl escape sequences are not supported: \eF, \el, \eL, \eu,
\eU, and \eN when followed by a character name. \eN on its own, matching a
@ -41,10 +44,14 @@ interprets them.
built with Unicode support (the default). The properties that can be tested
with \ep and \eP are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&.
PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
documentation says "Because Perl hides the need for the user to understand the
internal representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates."
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
is limited. See the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation for details. The long synonyms for property names that Perl
supports (such as \ep{Letter}) are not supported by PCRE2, nor is it permitted
to prefix any of these properties with "Is".
.P
6. PCRE2 supports the \eQ...\eE escape for quoting substrings. Characters
in between are treated as literals. However, this is slightly different from
@ -65,7 +72,8 @@ other character. Note the following examples:
\eQA\eB\eE A\eB A\eB
\eQ\e\eE \e \e\eE
.sp
The \eQ...\eE sequence is recognized both inside and outside character classes.
The \eQ...\eE sequence is recognized both inside and outside character classes
by both PCRE2 and Perl.
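The PCRE2 column of the table above can be reproduced with a minimal sketch. It uses Python's re module as a stand-in engine (it is neither PCRE2 nor Perl and has no \eQ...\eE of its own), and expand_quoted is a hypothetical helper, not part of any PCRE2 API. It assumes the PCRE2 rule that quoting ends at the first \eE, and it deliberately ignores character classes and an escaped backslash directly before \eQ.

```python
import re


def expand_quoted(pattern: str) -> str:
    """Replace each \\Q...\\E span with an escaped literal (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        start = pattern.find(r"\Q", i)
        if start == -1:
            # No further quoted span: copy the rest unchanged.
            out.append(pattern[i:])
            break
        out.append(pattern[i:start])
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # An unterminated \Q quotes to the end of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        # Everything up to the first \E is literal text.
        out.append(re.escape(pattern[start + 2:end]))
        i = end + 2
    return "".join(out)


# Row "\QA\B\E": the quoted text is the three characters A, backslash, B.
print(re.fullmatch(expand_quoted(r"\QA\B\E"), "A\\B") is not None)  # True

# Row "\Q\\E": quoting stops at the first \E, leaving a single backslash.
print(re.fullmatch(expand_quoted(r"\Q\\E"), "\\") is not None)  # True
```

The sketch models only the PCRE2 interpretation; Perl's variable interpolation, which causes the differences shown in the Perl column, is not modelled.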
.P
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, PCRE2 does have a "callout" feature, which allows an
@ -79,13 +87,13 @@ documentation for details.
to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
into subroutine calls is now supported, as in Perl.
.P
9. If any of the backtracking control verbs are used in a group that is called
as a subroutine (whether or not recursively), their effect is confined to that
group; it does not extend to the surrounding pattern. This is not always the
case in Perl. In particular, if (*THEN) is present in a group that is called as
a subroutine, its action is limited to that group, even if the group does not
contain any | characters. Note that such groups are processed as anchored
at the point where they are tested.
9. In PCRE2, if any of the backtracking control verbs are used in a group that
is called as a subroutine (whether or not recursively), their effect is
confined to that group; it does not extend to the surrounding pattern. This is
not always the case in Perl. In particular, if (*THEN) is present in a group
that is called as a subroutine, its action is limited to that group, even if
the group does not contain any | characters. Note that such groups are
processed as anchored at the point where they are tested.
.P
10. If a pattern contains more than one backtracking control verb, the first
one that is backtracked onto acts. For example, in the pattern
@ -93,48 +101,49 @@ A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are cases where it differs.
.P
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
.P
12. There are some differences that are concerned with the settings of captured
11. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
"b".
.P
13. PCRE2's handling of duplicate capture group numbers and names is not as
12. PCRE2's handling of duplicate capture group numbers and names is not as
general as Perl's. This is a consequence of the fact that PCRE2 works internally
just with numbers, using an external table to translate between numbers and
names. In particular, a pattern such as (?|(?<a>A)|(?<b>B), where the two
names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)), where the two
capture groups have the same number but different names, is not supported, and
causes an error at compile time. If it were allowed, it would not be possible
to distinguish which group matched, because both names map to capture group
number 1. To avoid this confusing situation, an error is given at compile time.
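The ambiguity described above can be illustrated with a toy model. This is only a sketch: name_table and captures_by_number are hypothetical stand-ins for a name-to-number table and a capture vector, not PCRE2's real data structures.

```python
# Hypothetical table that (?|(?<a>A)|(?<b>B)) would need if it were allowed:
# both names translate to capture group number 1.
name_table = {"a": 1, "b": 1}

# Suppose the subject is "B", so the second alternative matched.
# Captures are stored by number only.
captures_by_number = {1: "B"}


def capture_by_name(name: str):
    """Look a capture up by name via its group number (toy model)."""
    return captures_by_number.get(name_table[name])


# Both lookups return the same string, so the caller cannot tell
# whether group "a" or group "b" was the one that matched.
print(capture_by_name("a") == capture_by_name("b"))  # True
```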
.P
14. Perl used to recognize comments in some places that PCRE2 does not, for
13. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a group. If the /x modifier is
set, Perl allowed white space between ( and ? though the latest Perls give an
error (for a while it was just deprecated). There may still be some cases where
Perl behaves differently.
.P
15. Perl, when in warning mode, gives warnings for character classes such as
14. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
warning features, so it gives an error in these cases because they are almost
certainly user mistakes.
.P
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
15. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \ep{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.24), \ep{Lu} and \ep{Ll} match all
in the release at the time of writing (5.32), \ep{Lu} and \ep{Ll} match all
letters, regardless of case, when case independence is specified.
.P
16. From release 5.32.0, Perl locks out the use of \eK in lookaround
assertions. In PCRE2, \eK is acted on when it occurs in positive assertions,
but is ignored in negative assertions.
.P
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
Perl 5.10 included new features that were not in earlier versions of Perl, some
of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.26:
list is with respect to Perl 5.32:
.sp
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
each alternative top-level branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same length.
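For an engine that requires every branch of a lookbehind to have the same length, one possible workaround is to give each top-level branch its own fixed-length lookbehind and alternate between them. The sketch below uses Python's re module as a stand-in engine (it is neither PCRE2 nor Perl, but it has a similar fixed-width restriction), and lookbehind_any is a hypothetical helper.

```python
import re


def lookbehind_any(branches):
    """Build an alternation of fixed-length lookbehinds, one per branch."""
    return "(?:" + "|".join(f"(?<={b})" for b in branches) + ")"


# Equivalent in effect to (?<=ab|c)x, which PCRE2 accepts directly.
pattern = re.compile(lookbehind_any(["ab", "c"]) + "x")

print(pattern.search("abx") is not None)  # True: "x" is preceded by "ab"
print(pattern.search("cx") is not None)   # True: "x" is preceded by "c"
print(pattern.search("bx") is None)       # True: neither branch applies

# The single-assertion form is rejected by this stand-in engine.
try:
    re.compile(r"(?<=ab|c)x")
    single_lookbehind_accepted = True
except re.error:
    single_lookbehind_accepted = False
```

Each branch must still match a fixed length on its own, which is the same condition that PCRE2 places on the branches of a lookbehind.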
.sp
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
@ -168,7 +177,7 @@ variable interpolation, but not general hooks on every match.
different way and is not Perl-compatible.
.sp
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within
the start of a pattern. These set overall options that cannot be changed within
the pattern.
.sp
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
@ -203,6 +212,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 13 July 2019
Last updated: 06 October 2020
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "05 October 2020" "PCRE2 10.35"
.TH PCRE2PATTERN 3 "06 October 2020" "PCRE2 10.35"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -1168,8 +1168,9 @@ For example, when the pattern
.sp
matches "foobar", the first substring is still set to "foo".
.P
Perl documents that the use of \eK within assertions is "not well defined". In
PCRE2, \eK is acted upon when it occurs inside positive assertions, but is
Perl used to document that the use of \eK within lookaround assertions is "not
well defined", but from version 5.32.0 Perl does not support this usage at all.
In PCRE2, \eK is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\eK)
matches, the reported start of the match can be greater than the end of the
match. Using \eK in a lookbehind assertion at the start of a pattern can also
@ -3897,6 +3898,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 05 October 2020
Last updated: 06 October 2020
Copyright (c) 1997-2020 University of Cambridge.
.fi