Make /x more Perl-compatible by recognizing all of Unicode's "Pattern White Space" characters, not just the ASCII ones.
Philip.Hazel 2018-08-03 09:38:36 +00:00
parent 6e245572b8
commit b196143523
15 changed files with 1374 additions and 1205 deletions
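
For orientation, the character set this commit describes can be sketched as a standalone predicate. This is only an illustration, not the code in the commit: the function name is_pattern_white_space is hypothetical, and the ASCII range is hard-coded here, whereas PCRE2 consults its ctypes table (built with isspace()) for code points below 256. The (c|1) comparisons mirror the trick used in the source change further down: OR-ing in the low bit lets a single test cover an adjacent even/odd pair of code points.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Returns true if code point c is
   one of the "Pattern White Space" characters named in the ChangeLog entry.
   Assumption: an ASCII environment, so the isspace() characters are
   hard-coded rather than looked up in a character table. */
bool is_pattern_white_space(uint32_t c)
{
  /* Tab, linefeed, vertical tab, formfeed, carriage return, and space. */
  if ((c >= 0x09 && c <= 0x0D) || c == 0x20) return true;

  /* U+0085 (next line). */
  if (c == 0x85) return true;

  /* (c|1) == 0x200F is true for U+200E and U+200F only;
     (c|1) == 0x2029 is true for U+2028 and U+2029 only. */
  return (c | 1) == 0x200F || (c | 1) == 0x2029;
}
```

Under these assumptions, U+2002 (en space) is rejected even though \h matches it, which is the distinction the new tests in testinput4 are meant to check.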


@ -133,6 +133,13 @@ terminated by (*ACCEPT).
29. Add support for \N{U+dddd}, but not in EBCDIC environments.
30. Add support for (?^) for unsetting all imnsx options.
31. The PCRE2_EXTENDED (/x) option only ever discarded space characters whose
code point was less than 256 and that were recognized by the lookup table
generated by pcre2_maketables(), which uses isspace() to identify white space.
Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085,
U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by
Unicode as "Pattern White Space". This makes PCRE2 compatible with Perl.
Version 10.31 12-February-2018


@ -837,10 +837,10 @@ page for details.
</P>
<P>
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
option, the newline convention affects the recognition of white space and the
end of internal comments starting with #. The value is saved with the compiled
pattern for subsequent use by the JIT compiler and by the two interpreted
matching functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
option, the newline convention affects the recognition of the end of internal
comments starting with #. The value is saved with the compiled pattern for
subsequent use by the JIT compiler and by the two interpreted matching
functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
<br>
<br>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
@ -1424,9 +1424,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an
unescaped closing parenthesis terminates the name. A closing parenthesis can be
included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
skipped and #-comments are recognized in this mode, exactly as in the rest of
the pattern.
or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
whitespace in verb names is skipped and #-comments are recognized, exactly as
in the rest of the pattern.
<pre>
PCRE2_AUTO_CALLOUT
</pre>
@ -1510,15 +1510,36 @@ is not allowed within sequences such as (?&#62; that introduce various
parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
Ignorable white space is permitted between an item and a following quantifier
and between a quantifier and a following + that indicates possessiveness.
PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
a pattern by a (?x) option setting.
</P>
<P>
PCRE2_EXTENDED also causes characters between an unescaped # outside a
character class and the next newline, inclusive, to be ignored, which makes it
possible to include comments inside complicated patterns. Note that the end of
this type of comment is a literal newline sequence in the pattern; escape
sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
equivalent to Perl's /x option, and it can be changed within a pattern by a
(?x) option setting.
When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
white space only those characters with code points less than 256 that are
flagged as white space in its low-character table. The table is normally
created by
<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
which uses the <b>isspace()</b> function to identify space characters. In most
ASCII environments, the relevant characters are those with code points 0x0009
(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
(carriage return), and 0x0020 (space).
</P>
<P>
When PCRE2 is compiled with Unicode support, in addition to these characters,
five more Unicode "Pattern White Space" characters are recognized by
PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
separator). This set of characters is the same as recognized by Perl's /x
option. Note that the horizontal and vertical space characters that are matched
by the \h and \v escapes in patterns are a much bigger set.
</P>
<P>
As well as ignoring most white space, PCRE2_EXTENDED also causes characters
between an unescaped # outside a character class and the next newline,
inclusive, to be ignored, which makes it possible to include comments inside
complicated patterns. Note that the end of this type of comment is a literal
newline sequence in the pattern; escape sequences that happen to represent a
newline do not count.
</P>
<P>
Which characters are interpreted as newlines can be specified by a setting in
@ -1531,9 +1552,11 @@ built.
PCRE2_EXTENDED_MORE
</pre>
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
and horizontal tab characters are ignored inside a character class.
PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
changed within a pattern by a (?xx) option setting.
and horizontal tab characters are ignored inside a character class. Note: only
these two characters are ignored, not the full set of pattern white space
characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
equivalent to Perl's /xx option, and it can be changed within a pattern by a
(?xx) option setting.
<pre>
PCRE2_FIRSTLINE
</pre>
@ -3635,7 +3658,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 July 2018
Last updated: 03 August 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>


@ -1628,9 +1628,11 @@ alternative in the subpattern.
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P>
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
are Perl-compatible) can be changed from within the pattern by a sequence of
Perl option letters enclosed between "(?" and ")". The option letters are
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
changed from within the pattern by a sequence of letters enclosed between "(?"
and ")". These options are Perl-compatible, and are described in detail in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The option letters are:
<pre>
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
@ -2275,8 +2277,9 @@ unset value matches an empty string.
Because there may be many capturing parentheses in a pattern, all digits
following a backslash are taken as part of a potential backreference number.
If the pattern continues with a digit character, some delimiter must be used to
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
white space. Otherwise, the \g{ syntax or an empty comment (see
terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
option is set, this can be white space. Otherwise, the \g{ syntax or an empty
comment (see
<a href="#comments">"Comments"</a>
below) can be used.
</P>
@ -2744,12 +2747,12 @@ no part in the pattern matching.
<P>
The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. If the
PCRE2_EXTENDED option is set, an unescaped # character also introduces a
comment, which in this case continues to immediately after the next newline
character or character sequence in the pattern. Which characters are
interpreted as newlines is controlled by an option passed to the compiling
function or by a special sequence at the start of the pattern, as described in
the section entitled
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
also introduces a comment, which in this case continues to immediately after
the next newline character or character sequence in the pattern. Which
characters are interpreted as newlines is controlled by an option passed to the
compiling function or by a special sequence at the start of the pattern, as
described in the section entitled
<a href="#newlines">"Newline conventions"</a>
above. Note that the end of this type of comment is a literal newline sequence
in the pattern; escape sequences that happen to represent a newline do not
@ -3108,10 +3111,11 @@ are faulted.
</P>
<P>
A closing parenthesis can be included in a name either as \) or between \Q
and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
also set, unescaped whitespace in verb names is skipped, and #-comments are
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
PCRE2_ALT_VERBNAMES is also set.
</P>
<P>
The maximum length of a name is 255 in the 8-bit library and 65535 in the
@ -3590,7 +3594,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 July 2018
Last updated: 03 August 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>


@ -446,6 +446,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
Changes of these options within a group are automatically cancelled at the end
of the group.
<pre>
(?i) caseless
(?J) allow duplicate names
@ -632,7 +634,7 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 July 2018
Last updated: 01 August 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

File diff suppressed because it is too large


@ -1,4 +1,4 @@
.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
.TH PCRE2API 3 "03 August 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -775,10 +775,10 @@ sequence such as (*CRLF). See the
page for details.
.P
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
option, the newline convention affects the recognition of white space and the
end of internal comments starting with #. The value is saved with the compiled
pattern for subsequent use by the JIT compiler and by the two interpreted
matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
option, the newline convention affects the recognition of the end of internal
comments starting with #. The value is saved with the compiled pattern for
subsequent use by the JIT compiler and by the two interpreted matching
functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
.sp
.nf
.B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
@ -1356,9 +1356,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an
unescaped closing parenthesis terminates the name. A closing parenthesis can be
included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
skipped and #-comments are recognized in this mode, exactly as in the rest of
the pattern.
or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
whitespace in verb names is skipped and #-comments are recognized, exactly as
in the rest of the pattern.
.sp
PCRE2_AUTO_CALLOUT
.sp
@ -1445,14 +1445,35 @@ is not allowed within sequences such as (?> that introduce various
parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
Ignorable white space is permitted between an item and a following quantifier
and between a quantifier and a following + that indicates possessiveness.
PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
a pattern by a (?x) option setting.
.P
PCRE2_EXTENDED also causes characters between an unescaped # outside a
character class and the next newline, inclusive, to be ignored, which makes it
possible to include comments inside complicated patterns. Note that the end of
this type of comment is a literal newline sequence in the pattern; escape
sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
equivalent to Perl's /x option, and it can be changed within a pattern by a
(?x) option setting.
When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
white space only those characters with code points less than 256 that are
flagged as white space in its low-character table. The table is normally
created by
.\" HREF
\fBpcre2_maketables()\fP,
.\"
which uses the \fBisspace()\fP function to identify space characters. In most
ASCII environments, the relevant characters are those with code points 0x0009
(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
(carriage return), and 0x0020 (space).
.P
When PCRE2 is compiled with Unicode support, in addition to these characters,
five more Unicode "Pattern White Space" characters are recognized by
PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
separator). This set of characters is the same as recognized by Perl's /x
option. Note that the horizontal and vertical space characters that are matched
by the \eh and \ev escapes in patterns are a much bigger set.
.P
As well as ignoring most white space, PCRE2_EXTENDED also causes characters
between an unescaped # outside a character class and the next newline,
inclusive, to be ignored, which makes it possible to include comments inside
complicated patterns. Note that the end of this type of comment is a literal
newline sequence in the pattern; escape sequences that happen to represent a
newline do not count.
.P
Which characters are interpreted as newlines can be specified by a setting in
the compile context that is passed to \fBpcre2_compile()\fP or by a special
@ -1467,9 +1488,11 @@ built.
PCRE2_EXTENDED_MORE
.sp
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
and horizontal tab characters are ignored inside a character class.
PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
changed within a pattern by a (?xx) option setting.
and horizontal tab characters are ignored inside a character class. Note: only
these two characters are ignored, not the full set of pattern white space
characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
equivalent to Perl's /xx option, and it can be changed within a pattern by a
(?xx) option setting.
.sp
PCRE2_FIRSTLINE
.sp
@ -3641,6 +3664,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 27 July 2018
Last updated: 03 August 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "28 July 2018" "PCRE2 10.32"
.TH PCRE2PATTERN 3 "03 August 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -1627,9 +1627,13 @@ alternative in the subpattern.
.rs
.sp
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
are Perl-compatible) can be changed from within the pattern by a sequence of
Perl option letters enclosed between "(?" and ")". The option letters are
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
changed from within the pattern by a sequence of letters enclosed between "(?"
and ")". These options are Perl-compatible, and are described in detail in the
.\" HREF
\fBpcre2api\fP
.\"
documentation. The option letters are:
.sp
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
@ -2273,8 +2277,9 @@ unset value matches an empty string.
Because there may be many capturing parentheses in a pattern, all digits
following a backslash are taken as part of a potential backreference number.
If the pattern continues with a digit character, some delimiter must be used to
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
white space. Otherwise, the \eg{ syntax or an empty comment (see
terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
comment (see
.\" HTML <a href="#comments">
.\" </a>
"Comments"
@ -2762,12 +2767,12 @@ no part in the pattern matching.
.P
The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. If the
PCRE2_EXTENDED option is set, an unescaped # character also introduces a
comment, which in this case continues to immediately after the next newline
character or character sequence in the pattern. Which characters are
interpreted as newlines is controlled by an option passed to the compiling
function or by a special sequence at the start of the pattern, as described in
the section entitled
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
also introduces a comment, which in this case continues to immediately after
the next newline character or character sequence in the pattern. Which
characters are interpreted as newlines is controlled by an option passed to the
compiling function or by a special sequence at the start of the pattern, as
described in the section entitled
.\" HTML <a href="#newlines">
.\" </a>
"Newline conventions"
@ -3132,10 +3137,11 @@ only backslash items that are permitted are \eQ, \eE, and sequences such as
are faulted.
.P
A closing parenthesis can be included in a name either as \e) or between \eQ
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
also set, unescaped whitespace in verb names is skipped, and #-comments are
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
PCRE2_ALT_VERBNAMES is also set.
.P
The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
@ -3614,6 +3620,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 28 July 2018
Last updated: 03 August 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "28 July 2018" "PCRE2 10.32"
.TH PCRE2SYNTAX 3 "01 August 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -421,6 +421,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
.
.SH "OPTION SETTING"
.rs
Changes of these options within a group are automatically cancelled at the end
of the group.
.sp
(?i) caseless
(?J) allow duplicate names
@ -619,6 +621,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 28 July 2018
Last updated: 01 August 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi


@ -2468,11 +2468,17 @@ while (ptr < ptrend)
/* EITHER: not both options set */
((options & (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) !=
(PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) ||
/* OR: character > 255 */
c > 255 ||
/* OR: not a # comment or white space */
(c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0)
))
#ifdef SUPPORT_UNICODE
/* OR: character > 255 AND not Unicode Pattern White Space */
(c > 255 && (c|1) != 0x200f && (c|1) != 0x2029) ||
#endif
/* OR: not a # comment or isspace() white space */
(c < 256 && c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0
#ifdef SUPPORT_UNICODE
/* and not CHAR_NEL when Unicode is supported */
&& c != CHAR_NEL
#endif
)))
{
PCRE2_SIZE verbnamelength;
@ -2554,11 +2560,18 @@ while (ptr < ptrend)
/* Skip over whitespace and # comments in extended mode. Note that c is a
character, not a code unit, so we must not use MAX_255 to test its size
because MAX_255 tests code units and is assumed TRUE in 8-bit mode. */
because MAX_255 tests code units and is assumed TRUE in 8-bit mode. The
whitespace characters are those designated as "Pattern White Space" by
Unicode, which are the isspace() characters plus CHAR_NEL (newline), which is
U+0085 in Unicode, plus U+200E, U+200F, U+2028, and U+2029. These are a
subset of space characters that match \h and \v. */
if ((options & PCRE2_EXTENDED) != 0)
{
if (c < 256 && (cb->ctypes[c] & ctype_space) != 0) continue;
#ifdef SUPPORT_UNICODE
if (c == CHAR_NEL || (c|1) == 0x200f || (c|1) == 0x2029) continue;
#endif
if (c == CHAR_NUMBER_SIGN)
{
while (ptr < ptrend)

2
testdata/testinput1 vendored

@ -6257,5 +6257,5 @@ ef) x/x,mark
\= Expect no match
aBCDEF
AbCDe f
# End of testinput1

15
testdata/testinput4 vendored

@ -2293,5 +2293,20 @@
/[\N{U+1234}]/utf
\x{1234}
# Test the full list of Unicode "Pattern White Space" characters that are to
# be ignored by /x. The pattern lines below may show up oddly in text editors
# or when listed to the screen. Note that characters such as U+2002, which are
# matched as space by \h and \v are *not* "Pattern White Space".
/A…B/x,utf
AB
/AB/x,utf
A\x{2002}B
\= Expect no match
AB
# -------
# End of testinput4

14
testdata/testinput5 vendored

@ -2091,4 +2091,18 @@
/\N{U}/
# This tests the non-UTF Unicode NEL pattern whitespace character, only
# recognized by PCRE2 with /x when there is Unicode support.
/A
…B/x
AB
# This tests Unicode Pattern White Space characters in verb names when they
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
# with code points greater than 255 between A, B, and C in the pattern.
/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
abc
# End of testinput5


@ -9920,5 +9920,5 @@ No match, mark = X
No match
AbCDe f
No match
# End of testinput1

18
testdata/testoutput4 vendored
View File

@ -3711,5 +3711,23 @@ No match
/[\N{U+1234}]/utf
\x{1234}
0: \x{1234}
# Test the full list of Unicode "Pattern White Space" characters that are to
# be ignored by /x. The pattern lines below may show up oddly in text editors
# or when listed to the screen. Note that characters such as U+2002, which are
# matched as space by \h and \v are *not* "Pattern White Space".
/A…B/x,utf
AB
0: AB
/AB/x,utf
A\x{2002}B
0: A\x{2002}B
\= Expect no match
AB
No match
# -------
# End of testinput4

17
testdata/testoutput5 vendored

@ -4756,4 +4756,21 @@ Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
/\N{U}/
Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u
# This tests the non-UTF Unicode NEL pattern whitespace character, only
# recognized by PCRE2 with /x when there is Unicode support.
/A
…B/x
AB
0: AB
# This tests Unicode Pattern White Space characters in verb names when they
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
# with code points greater than 255 between A, B, and C in the pattern.
/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
abc
0: abc
MK: ABC
# End of testinput5