Documentation update.
parent 7fe5e441ff
commit 424bba4d15
@@ -1,4 +1,4 @@
-.TH PCRE2COMPAT 3 "18 October 2016" "PCRE2 10.23"
+.TH PCRE2COMPAT 3 "29 March 2017" "PCRE2 10.30"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@@ -6,7 +6,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
 This document describes the differences in the ways that PCRE2 and Perl handle
 regular expressions. The differences described here are with respect to Perl
-versions 5.10 and above.
+versions 5.24, but as both Perl and PCRE2 are continually changing, the
+information may sometimes be out of date.
 .P
 1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
 have are given in the
@@ -15,16 +16,17 @@ have are given in the
 .\"
 page.
 .P
-2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
-do not mean what you might think. For example, (?!a){3} does not assert that
-the next three characters are not "a". It just asserts that the next character
-is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
-just once). Perl allows repeat quantifiers on other assertions such as \eb, but
-these do not seem to have any use.
+2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
+they do not mean what you might think. For example, (?!a){3} does not assert
+that the next three characters are not "a". It just asserts that the next
+character is not "a" three times (in principle: PCRE2 optimizes this to run the
+assertion just once). Perl allows some repeat quantifiers on other assertions,
+for example, \eb* (but not \eb{3}), but these do not seem to have any use.
 .P
-3. Capturing subpatterns that occur inside negative lookahead assertions are
-counted, but their entries in the offsets vector are never set. Perl sometimes
-(but not always) sets its numerical variables from inside negative assertions.
+3. Capturing subpatterns that occur inside negative lookaround assertions are
+counted, but their entries in the offsets vector are set only if the assertion
+is a condition. Perl has changed its behaviour in this regard from time to
+time.
 .P
 4. The following Perl escape sequences are not supported: \el, \eu, \eL,
 \eU, and \eN when followed by a character name or Unicode value. (\eN on its
@@ -35,13 +37,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
 \eU and \eu are interpreted as ECMAScript interprets them.
 .P
 5. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
-built with Unicode support. The properties that can be tested with \ep and \eP
-are limited to the general category properties such as Lu and Nd, script names
-such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
-the Cs (surrogate) property, which Perl does not; the Perl documentation says
-"Because Perl hides the need for the user to understand the internal
-representation of Unicode characters, there is no need to implement the
-somewhat messy concept of surrogates."
+built with Unicode support (the default). The properties that can be tested
+with \ep and \eP are limited to the general category properties such as Lu and
+Nd, script names such as Greek or Han, and the derived properties Any and L&.
+PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
+documentation says "Because Perl hides the need for the user to understand the
+internal representation of Unicode characters, there is no need to implement
+the somewhat messy concept of surrogates."
 .P
 6. PCRE2 does support the \eQ...\eE escape for quoting substrings. Characters
 in between are treated as literals. This is slightly different from Perl in
@@ -60,29 +62,16 @@ Note the following examples:
 The \eQ...\eE sequence is recognized both inside and outside character classes.
 .P
 7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
-constructions. However, there is support for recursive patterns. This is not
-available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
-feature allows an external function to be called during pattern matching. See
-the
+constructions. However, there is support PCRE2's "callout" feature, which
+allows an external function to be called during pattern matching. See the
 .\" HREF
 \fBpcre2callout\fP
 .\"
 documentation for details.
 .P
-8. Subroutine calls (whether recursive or not) are treated as atomic groups.
-Atomic recursion is like Python, but unlike Perl. Captured values that are set
-outside a subroutine call can be referenced from inside in PCRE2, but not in
-Perl. There is a discussion that explains these differences in more detail in
-the
-.\" HTML <a href="pcre2pattern.html#recursiondifference">
-.\" </a>
-section on recursion differences from Perl
-.\"
-in the
-.\" HREF
-\fBpcre2pattern\fP
-.\"
-page.
+8. Subroutine calls (whether recursive or not) were treated as atomic groups up
+to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
+into subroutine calls is now supported, as in Perl.
 .P
 9. If any of the backtracking control verbs are used in a subpattern that is
 called as a subroutine (whether or not recursively), their effect is confined
@@ -130,13 +119,13 @@ certainly user mistakes.
 16. In PCRE2, the upper/lower case character properties Lu and Ll are not
 affected when case-independent matching is specified. For example, \ep{Lu}
 always matches an upper case letter. I think Perl has changed in this respect;
-in the release at the time of writing (5.16), \ep{Lu} and \ep{Ll} match all
+in the release at the time of writing (5.24), \ep{Lu} and \ep{Ll} match all
 letters, regardless of case, when case independence is specified.
 .P
 17. PCRE2 provides some extensions to the Perl regular expression facilities.
 Perl 5.10 includes new features that are not in earlier versions of Perl, some
-of which (such as named parentheses) have been in PCRE2 for some time. This
-list is with respect to Perl 5.10:
+of which (such as named parentheses) were in PCRE2 for some time before. This
+list is with respect to Perl 5.24:
 .sp
 (a) Although lookbehind assertions in PCRE2 must match fixed length strings,
 each alternative branch of a lookbehind assertion can match a different length
@@ -190,6 +179,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 18 October 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 29 March 2017
+Copyright (c) 1997-2017 University of Cambridge.
 .fi