Documentation update

Philip.Hazel 2020-10-07 16:27:20 +00:00
parent fff544a1e9
commit 6d4936dc29
6 changed files with 1310 additions and 1255 deletions

@ -1492,10 +1492,13 @@ letters in the subject. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
PCRE2_UCP is set, Unicode properties are used for all characters with more than
one other case, and for all characters whose code points are greater than
U+007F. For lower valued characters with only one other case, a lookup table is
used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
used for all code points less than 256, and higher code points (available only
in 16-bit or 32-bit mode) are treated as not having another case.
U+007F. Note that there are two ASCII characters, K and S, that, in addition to
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
sign) and U+017F (long S) respectively. For lower valued characters with only
one other case, a lookup table is used for speed. When neither PCRE2_UTF nor
PCRE2_UCP is set, a lookup table is used for all code points less than 256, and
higher code points (available only in 16-bit or 32-bit mode) are treated as not
having another case.
<pre>
PCRE2_DOLLAR_ENDONLY
</pre>
@ -3956,7 +3959,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 19 March 2020
Last updated: 05 October 2020
<br>
Copyright &copy; 1997-2020 University of Cambridge.
<br>

@ -16,10 +16,10 @@ please consult the man page, in case the conversion went wrong.
DIFFERENCES BETWEEN PCRE2 AND PERL
</b><br>
<P>
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
versions 5.26, but as both Perl and PCRE2 are continually changing, the
information may sometimes be out of date.
This document describes some of the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with respect to
Perl version 5.32.0, but as both Perl and PCRE2 are continually changing, the
information may at times be out of date.
</P>
<P>
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
@ -33,12 +33,15 @@ they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \b* (but not \b{3}), but these do not seem to have any use.
for example, \b* (but not \b{3}, though oddly it does allow ^{3}), but these
do not seem to have any use. PCRE2 does not allow any kind of quantifier on
non-lookaround assertions.
</P>
<P>
3. Capture groups that occur inside negative lookaround assertions are counted,
but their entries in the offsets vector are set only when a negative assertion
is a condition that has a matching branch (that is, the condition is false).
is a condition that has a matching branch (that is, the condition is false).
Perl may set such capture groups in other circumstances.
</P>
<P>
4. The following Perl escape sequences are not supported: \F, \l, \L, \u,
@ -56,10 +59,12 @@ interprets them.
built with Unicode support (the default). The properties that can be tested
with \p and \P are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&.
PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
documentation says "Because Perl hides the need for the user to understand the
internal representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates."
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
is limited. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation for details. The long synonyms for property names that Perl
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
to prefix any of these properties with "Is".
</P>
<P>
6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
@ -79,7 +84,8 @@ other character. Note the following examples:
\QA\B\E A\B A\B
\Q\\E \ \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
The \Q...\E sequence is recognized both inside and outside character classes
by both PCRE2 and Perl.
</P>
<P>
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
@ -94,13 +100,13 @@ to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
into subroutine calls is now supported, as in Perl.
</P>
<P>
9. If any of the backtracking control verbs are used in a group that is called
as a subroutine (whether or not recursively), their effect is confined to that
group; it does not extend to the surrounding pattern. This is not always the
case in Perl. In particular, if (*THEN) is present in a group that is called as
a subroutine, its action is limited to that group, even if the group does not
contain any | characters. Note that such groups are processed as anchored
at the point where they are tested.
9. In PCRE2, if any of the backtracking control verbs are used in a group that
is called as a subroutine (whether or not recursively), their effect is
confined to that group; it does not extend to the surrounding pattern. This is
not always the case in Perl. In particular, if (*THEN) is present in a group
that is called as a subroutine, its action is limited to that group, even if
the group does not contain any | characters. Note that such groups are
processed as anchored at the point where they are tested.
</P>
<P>
10. If a pattern contains more than one backtracking control verb, the first
@ -110,55 +116,56 @@ triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are cases where it differs.
</P>
<P>
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
</P>
<P>
12. There are some differences that are concerned with the settings of captured
11. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
"b".
</P>
<P>
13. PCRE2's handling of duplicate capture group numbers and names is not as
12. PCRE2's handling of duplicate capture group numbers and names is not as
general as Perl's. This is a consequence of the fact the PCRE2 works internally
just with numbers, using an external table to translate between numbers and
names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B), where the two
names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B)), where the two
capture groups have the same number but different names, is not supported, and
causes an error at compile time. If it were allowed, it would not be possible
to distinguish which group matched, because both names map to capture group
number 1. To avoid this confusing situation, an error is given at compile time.
</P>
<P>
14. Perl used to recognize comments in some places that PCRE2 does not, for
13. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a group. If the /x modifier is
set, Perl allowed white space between ( and ? though the latest Perls give an
error (for a while it was just deprecated). There may still be some cases where
Perl behaves differently.
</P>
<P>
15. Perl, when in warning mode, gives warnings for character classes such as
14. Perl, when in warning mode, gives warnings for character classes such as
[A-\d] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
warning features, so it gives an error in these cases because they are almost
certainly user mistakes.
</P>
<P>
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
15. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \p{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
in the release at the time of writing (5.32), \p{Lu} and \p{Ll} match all
letters, regardless of case, when case independence is specified.
</P>
<P>
16. From release 5.32.0, Perl locks out the use of \K in lookaround
assertions. In PCRE2, \K is acted on when it occurs in positive assertions,
but is ignored in negative assertions.
</P>
<P>
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
Perl 5.10 included new features that were not in earlier versions of Perl, some
of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.26:
list is with respect to Perl 5.32:
<br>
<br>
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
each alternative toplevel branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same length.
<br>
<br>
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
@ -203,7 +210,7 @@ different way and is not Perl-compatible.
<br>
<br>
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within
the start of a pattern. These set overall options that cannot be changed within
the pattern.
<br>
<br>
@ -239,7 +246,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 13 July 2019
Last updated: 06 October 2020
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

@ -289,8 +289,11 @@ corresponding characters in the subject. As a trivial example, the pattern
The quick brown fox
</pre>
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option), letters are matched
independently of case.
caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
pattern), letters are matched independently of case. Note that there are two
ASCII characters, K and S, that, in addition to their lower case ASCII
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set.
</P>
<P>
The power of regular expressions comes from the ability to include wild cards,
@ -326,6 +329,20 @@ a character class the only metacharacters are:
[ POSIX character class (if followed by POSIX syntax)
] terminates the character class
</pre>
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern, other than in a character class, and characters between a #
outside a character class and the next newline, inclusive, are ignored. An
escaping backslash can be used to include a white space or a # character as
part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the same
applies, but in addition unescaped space and horizontal tab characters are
ignored inside a character class. Note: only these two characters are ignored,
not the full set of pattern white space characters that are ignored outside a
character class. Option settings can be changed within a pattern; see the
section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
The following sections describe the use of each of the metacharacters.
</P>
<br><a name="SEC5" href="#TOC1">BACKSLASH</a><br>
@ -343,16 +360,9 @@ precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \\.
</P>
<P>
In a UTF mode, only ASCII digits and letters have any special meaning after a
backslash. All other characters (in particular, those whose code points are
greater than 127) are treated as literals.
</P>
<P>
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern (other than in a character class), and characters between a #
outside a character class and the next newline, inclusive, are ignored. An
escaping backslash can be used to include a white space or # character as part
of the pattern.
Only ASCII digits and letters have any special meaning after a backslash. All
other characters (in particular, those whose code points are greater than 127)
are treated as literals.
</P>
<P>
If you want to treat all characters in a sequence as literals, you can do so by
@ -1165,8 +1175,9 @@ For example, when the pattern
matches "foobar", the first substring is still set to "foo".
</P>
<P>
Perl documents that the use of \K within assertions is "not well defined". In
PCRE2, \K is acted upon when it occurs inside positive assertions, but is
Perl used to document that the use of \K within lookaround assertions is "not
well defined", but from version 5.32.0 Perl does not support this usage at all.
In PCRE2, \K is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\K)
matches, the reported start of the match can be greater than the end of the
match. Using \K in a lookbehind assertion at the start of a pattern can also
@ -1443,7 +1454,10 @@ Characters in a class may be specified by their code points using \o, \x, or
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
class represent both their upper case and lower case versions, so for example,
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would.
match "A", whereas a caseful version would. Note that there are two ASCII
characters, K and S, that, in addition to their lower case ASCII equivalents,
are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
respectively when either PCRE2_UTF or PCRE2_UCP is set.
</P>
<P>
Characters that might indicate line breaks are never treated in any special way
@ -3838,7 +3852,7 @@ Cambridge, England.
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 February 2020
Last updated: 06 October 2020
<br>
Copyright &copy; 1997-2020 University of Cambridge.
<br>

File diff suppressed because it is too large

@ -1,13 +1,13 @@
.TH PCRE2COMPAT 3 "13 July 2019" "PCRE2 10.34"
.TH PCRE2COMPAT 3 "06 October 2020" "PCRE2 10.36"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
.rs
.sp
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
versions 5.26, but as both Perl and PCRE2 are continually changing, the
information may sometimes be out of date.
This document describes some of the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with respect to
Perl version 5.32.0, but as both Perl and PCRE2 are continually changing, the
information may at times be out of date.
.P
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
have are given in the
@ -21,11 +21,14 @@ they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \eb* (but not \eb{3}), but these do not seem to have any use.
for example, \eb* (but not \eb{3}, though oddly it does allow ^{3}), but these
do not seem to have any use. PCRE2 does not allow any kind of quantifier on
non-lookaround assertions.
.P
3. Capture groups that occur inside negative lookaround assertions are counted,
but their entries in the offsets vector are set only when a negative assertion
is a condition that has a matching branch (that is, the condition is false).
is a condition that has a matching branch (that is, the condition is false).
Perl may set such capture groups in other circumstances.
.P
4. The following Perl escape sequences are not supported: \eF, \el, \eL, \eu,
\eU, and \eN when followed by a character name. \eN on its own, matching a
@ -41,10 +44,14 @@ interprets them.
built with Unicode support (the default). The properties that can be tested
with \ep and \eP are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&.
PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
documentation says "Because Perl hides the need for the user to understand the
internal representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates."
Both PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its use
is limited. See the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation for details. The long synonyms for property names that Perl
supports (such as \ep{Letter}) are not supported by PCRE2, nor is it permitted
to prefix any of these properties with "Is".
.P
6. PCRE2 supports the \eQ...\eE escape for quoting substrings. Characters
in between are treated as literals. However, this is slightly different from
@ -65,7 +72,8 @@ other character. Note the following examples:
\eQA\eB\eE A\eB A\eB
\eQ\e\eE \e \e\eE
.sp
The \eQ...\eE sequence is recognized both inside and outside character classes.
The \eQ...\eE sequence is recognized both inside and outside character classes
by both PCRE2 and Perl.
.P
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, PCRE2 does have a "callout" feature, which allows an
@ -79,13 +87,13 @@ documentation for details.
to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
into subroutine calls is now supported, as in Perl.
.P
9. If any of the backtracking control verbs are used in a group that is called
as a subroutine (whether or not recursively), their effect is confined to that
group; it does not extend to the surrounding pattern. This is not always the
case in Perl. In particular, if (*THEN) is present in a group that is called as
a subroutine, its action is limited to that group, even if the group does not
contain any | characters. Note that such groups are processed as anchored
at the point where they are tested.
9. In PCRE2, if any of the backtracking control verbs are used in a group that
is called as a subroutine (whether or not recursively), their effect is
confined to that group; it does not extend to the surrounding pattern. This is
not always the case in Perl. In particular, if (*THEN) is present in a group
that is called as a subroutine, its action is limited to that group, even if
the group does not contain any | characters. Note that such groups are
processed as anchored at the point where they are tested.
.P
10. If a pattern contains more than one backtracking control verb, the first
one that is backtracked onto acts. For example, in the pattern
@ -93,48 +101,49 @@ A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are cases where it differs.
.P
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
.P
12. There are some differences that are concerned with the settings of captured
11. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
"b".
.P
13. PCRE2's handling of duplicate capture group numbers and names is not as
12. PCRE2's handling of duplicate capture group numbers and names is not as
general as Perl's. This is a consequence of the fact the PCRE2 works internally
just with numbers, using an external table to translate between numbers and
names. In particular, a pattern such as (?|(?<a>A)|(?<b>B), where the two
names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)), where the two
capture groups have the same number but different names, is not supported, and
causes an error at compile time. If it were allowed, it would not be possible
to distinguish which group matched, because both names map to capture group
number 1. To avoid this confusing situation, an error is given at compile time.
.P
14. Perl used to recognize comments in some places that PCRE2 does not, for
13. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a group. If the /x modifier is
set, Perl allowed white space between ( and ? though the latest Perls give an
error (for a while it was just deprecated). There may still be some cases where
Perl behaves differently.
.P
15. Perl, when in warning mode, gives warnings for character classes such as
14. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
warning features, so it gives an error in these cases because they are almost
certainly user mistakes.
.P
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
15. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \ep{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.24), \ep{Lu} and \ep{Ll} match all
in the release at the time of writing (5.32), \ep{Lu} and \ep{Ll} match all
letters, regardless of case, when case independence is specified.
.P
16. From release 5.32.0, Perl locks out the use of \eK in lookaround
assertions. In PCRE2, \eK is acted on when it occurs in positive assertions,
but is ignored in negative assertions.
.P
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
Perl 5.10 included new features that were not in earlier versions of Perl, some
of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.26:
list is with respect to Perl 5.32:
.sp
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
each alternative toplevel branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same length.
.sp
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
@ -168,7 +177,7 @@ variable interpolation, but not general hooks on every match.
different way and is not Perl-compatible.
.sp
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern that set overall options that cannot be changed within
the start of a pattern. These set overall options that cannot be changed within
the pattern.
.sp
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
@ -203,6 +212,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 13 July 2019
Last updated: 06 October 2020
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "05 October 2020" "PCRE2 10.35"
.TH PCRE2PATTERN 3 "06 October 2020" "PCRE2 10.35"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -1168,8 +1168,9 @@ For example, when the pattern
.sp
matches "foobar", the first substring is still set to "foo".
.P
Perl documents that the use of \eK within assertions is "not well defined". In
PCRE2, \eK is acted upon when it occurs inside positive assertions, but is
Perl used to document that the use of \eK within lookaround assertions is "not
well defined", but from version 5.32.0 Perl does not support this usage at all.
In PCRE2, \eK is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\eK)
matches, the reported start of the match can be greater than the end of the
match. Using \eK in a lookbehind assertion at the start of a pattern can also
@ -3897,6 +3898,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 05 October 2020
Last updated: 06 October 2020
Copyright (c) 1997-2020 University of Cambridge.
.fi