Make /x more Perl-compatible by recognizing all of Unicode's "Pattern White Space" characters, not just the ASCII ones.
Philip.Hazel 2018-08-03 09:38:36 +00:00
parent 6e245572b8
commit b196143523
15 changed files with 1374 additions and 1205 deletions


@ -133,6 +133,13 @@ terminated by (*ACCEPT).
29. Add support for \N{U+dddd}, but not in EBCDIC environments. 29. Add support for \N{U+dddd}, but not in EBCDIC environments.
30. Add support for (?^) for unsetting all imnsx options. 30. Add support for (?^) for unsetting all imnsx options.
31. The PCRE2_EXTENDED (/x) option only ever discarded space characters whose
code point was less than 256 and that were recognized by the lookup table
generated by pcre2_maketables(), which uses isspace() to identify white space.
Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085,
U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by
Unicode as "Pattern White Space". This makes PCRE2 compatible with Perl.
Version 10.31 12-February-2018 Version 10.31 12-February-2018


@ -837,10 +837,10 @@ page for details.
</P> </P>
<P> <P>
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
option, the newline convention affects the recognition of white space and the option, the newline convention affects the recognition of the end of internal
end of internal comments starting with #. The value is saved with the compiled comments starting with #. The value is saved with the compiled pattern for
pattern for subsequent use by the JIT compiler and by the two interpreted subsequent use by the JIT compiler and by the two interpreted matching
matching functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>. functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
<br> <br>
<br> <br>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b> <b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
@ -1424,9 +1424,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an option is set, normal backslash processing is applied to verb names and only an
unescaped closing parenthesis terminates the name. A closing parenthesis can be unescaped closing parenthesis terminates the name. A closing parenthesis can be
included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
skipped and #-comments are recognized in this mode, exactly as in the rest of whitespace in verb names is skipped and #-comments are recognized, exactly as
the pattern. in the rest of the pattern.
<pre> <pre>
PCRE2_AUTO_CALLOUT PCRE2_AUTO_CALLOUT
</pre> </pre>
@ -1510,15 +1510,36 @@ is not allowed within sequences such as (?&#62; that introduce various
parenthesized subpatterns, nor within numerical quantifiers such as {1,3}. parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
Ignorable white space is permitted between an item and a following quantifier Ignorable white space is permitted between an item and a following quantifier
and between a quantifier and a following + that indicates possessiveness. and between a quantifier and a following + that indicates possessiveness.
PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
a pattern by a (?x) option setting.
</P> </P>
<P> <P>
PCRE2_EXTENDED also causes characters between an unescaped # outside a When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
character class and the next newline, inclusive, to be ignored, which makes it white space only those characters with code points less than 256 that are
possible to include comments inside complicated patterns. Note that the end of flagged as white space in its low-character table. The table is normally
this type of comment is a literal newline sequence in the pattern; escape created by
sequences that happen to represent a newline do not count. PCRE2_EXTENDED is <a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
equivalent to Perl's /x option, and it can be changed within a pattern by a which uses the <b>isspace()</b> function to identify space characters. In most
(?x) option setting. ASCII environments, the relevant characters are those with code points 0x0009
(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
(carriage return), and 0x0020 (space).
</P>
<P>
When PCRE2 is compiled with Unicode support, in addition to these characters,
five more Unicode "Pattern White Space" characters are recognized by
PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
separator). This set of characters is the same as recognized by Perl's /x
option. Note that the horizontal and vertical space characters that are matched
by the \h and \v escapes in patterns are a much bigger set.
</P>
<P>
As well as ignoring most white space, PCRE2_EXTENDED also causes characters
between an unescaped # outside a character class and the next newline,
inclusive, to be ignored, which makes it possible to include comments inside
complicated patterns. Note that the end of this type of comment is a literal
newline sequence in the pattern; escape sequences that happen to represent a
newline do not count.
</P> </P>
<P> <P>
Which characters are interpreted as newlines can be specified by a setting in Which characters are interpreted as newlines can be specified by a setting in
@ -1531,9 +1552,11 @@ built.
PCRE2_EXTENDED_MORE PCRE2_EXTENDED_MORE
</pre> </pre>
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
and horizontal tab characters are ignored inside a character class. and horizontal tab characters are ignored inside a character class. Note: only
PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be these two characters are ignored, not the full set of pattern white space
changed within a pattern by a (?xx) option setting. characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
equivalent to Perl's /xx option, and it can be changed within a pattern by a
(?xx) option setting.
<pre> <pre>
PCRE2_FIRSTLINE PCRE2_FIRSTLINE
</pre> </pre>
@ -3635,7 +3658,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 27 July 2018 Last updated: 03 August 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>


@ -1628,9 +1628,11 @@ alternative in the subpattern.
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br> <br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P> <P>
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
are Perl-compatible) can be changed from within the pattern by a sequence of changed from within the pattern by a sequence of letters enclosed between "(?"
Perl option letters enclosed between "(?" and ")". The option letters are and ")". These options are Perl-compatible, and are described in detail in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The option letters are:
<pre> <pre>
i for PCRE2_CASELESS i for PCRE2_CASELESS
m for PCRE2_MULTILINE m for PCRE2_MULTILINE
@ -2275,8 +2277,9 @@ unset value matches an empty string.
Because there may be many capturing parentheses in a pattern, all digits Because there may be many capturing parentheses in a pattern, all digits
following a backslash are taken as part of a potential backreference number. following a backslash are taken as part of a potential backreference number.
If the pattern continues with a digit character, some delimiter must be used to If the pattern continues with a digit character, some delimiter must be used to
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
white space. Otherwise, the \g{ syntax or an empty comment (see option is set, this can be white space. Otherwise, the \g{ syntax or an empty
comment (see
<a href="#comments">"Comments"</a> <a href="#comments">"Comments"</a>
below) can be used. below) can be used.
</P> </P>
@ -2744,12 +2747,12 @@ no part in the pattern matching.
<P> <P>
The sequence (?# marks the start of a comment that continues up to the next The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. If the closing parenthesis. Nested parentheses are not permitted. If the
PCRE2_EXTENDED option is set, an unescaped # character also introduces a PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
comment, which in this case continues to immediately after the next newline also introduces a comment, which in this case continues to immediately after
character or character sequence in the pattern. Which characters are the next newline character or character sequence in the pattern. Which
interpreted as newlines is controlled by an option passed to the compiling characters are interpreted as newlines is controlled by an option passed to the
function or by a special sequence at the start of the pattern, as described in compiling function or by a special sequence at the start of the pattern, as
the section entitled described in the section entitled
<a href="#newlines">"Newline conventions"</a> <a href="#newlines">"Newline conventions"</a>
above. Note that the end of this type of comment is a literal newline sequence above. Note that the end of this type of comment is a literal newline sequence
in the pattern; escape sequences that happen to represent a newline do not in the pattern; escape sequences that happen to represent a newline do not
@ -3108,10 +3111,11 @@ are faulted.
</P> </P>
<P> <P>
A closing parenthesis can be included in a name either as \) or between \Q A closing parenthesis can be included in a name either as \) or between \Q
and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
also set, unescaped whitespace in verb names is skipped, and #-comments are PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not skipped, and #-comments are recognized, exactly as in the rest of the pattern.
affect verb names unless PCRE2_ALT_VERBNAMES is also set. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
PCRE2_ALT_VERBNAMES is also set.
</P> </P>
<P> <P>
The maximum length of a name is 255 in the 8-bit library and 65535 in the The maximum length of a name is 255 in the 8-bit library and 65535 in the
@ -3590,7 +3594,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 July 2018 Last updated: 03 August 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>


@ -446,6 +446,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
</P> </P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br> <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P> <P>
Changes of these options within a group are automatically cancelled at the end
of the group.
<pre> <pre>
(?i) caseless (?i) caseless
(?J) allow duplicate names (?J) allow duplicate names
@ -632,7 +634,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 July 2018 Last updated: 01 August 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>

File diff suppressed because it is too large.


@ -1,4 +1,4 @@
.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32" .TH PCRE2API 3 "03 August 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -775,10 +775,10 @@ sequence such as (*CRLF). See the
page for details. page for details.
.P .P
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
option, the newline convention affects the recognition of white space and the option, the newline convention affects the recognition of the end of internal
end of internal comments starting with #. The value is saved with the compiled comments starting with #. The value is saved with the compiled pattern for
pattern for subsequent use by the JIT compiler and by the two interpreted subsequent use by the JIT compiler and by the two interpreted matching
matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP. functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
.sp .sp
.nf .nf
.B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP, .B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
@ -1356,9 +1356,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an option is set, normal backslash processing is applied to verb names and only an
unescaped closing parenthesis terminates the name. A closing parenthesis can be unescaped closing parenthesis terminates the name. A closing parenthesis can be
included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
skipped and #-comments are recognized in this mode, exactly as in the rest of whitespace in verb names is skipped and #-comments are recognized, exactly as
the pattern. in the rest of the pattern.
.sp .sp
PCRE2_AUTO_CALLOUT PCRE2_AUTO_CALLOUT
.sp .sp
@ -1445,14 +1445,35 @@ is not allowed within sequences such as (?> that introduce various
parenthesized subpatterns, nor within numerical quantifiers such as {1,3}. parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
Ignorable white space is permitted between an item and a following quantifier Ignorable white space is permitted between an item and a following quantifier
and between a quantifier and a following + that indicates possessiveness. and between a quantifier and a following + that indicates possessiveness.
PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
a pattern by a (?x) option setting.
.P .P
PCRE2_EXTENDED also causes characters between an unescaped # outside a When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
character class and the next newline, inclusive, to be ignored, which makes it white space only those characters with code points less than 256 that are
possible to include comments inside complicated patterns. Note that the end of flagged as white space in its low-character table. The table is normally
this type of comment is a literal newline sequence in the pattern; escape created by
sequences that happen to represent a newline do not count. PCRE2_EXTENDED is .\" HREF
equivalent to Perl's /x option, and it can be changed within a pattern by a \fBpcre2_maketables()\fP,
(?x) option setting. .\"
which uses the \fBisspace()\fP function to identify space characters. In most
ASCII environments, the relevant characters are those with code points 0x0009
(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
(carriage return), and 0x0020 (space).
.P
When PCRE2 is compiled with Unicode support, in addition to these characters,
five more Unicode "Pattern White Space" characters are recognized by
PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
separator). This set of characters is the same as recognized by Perl's /x
option. Note that the horizontal and vertical space characters that are matched
by the \eh and \ev escapes in patterns are a much bigger set.
.P
As well as ignoring most white space, PCRE2_EXTENDED also causes characters
between an unescaped # outside a character class and the next newline,
inclusive, to be ignored, which makes it possible to include comments inside
complicated patterns. Note that the end of this type of comment is a literal
newline sequence in the pattern; escape sequences that happen to represent a
newline do not count.
.P .P
Which characters are interpreted as newlines can be specified by a setting in Which characters are interpreted as newlines can be specified by a setting in
the compile context that is passed to \fBpcre2_compile()\fP or by a special the compile context that is passed to \fBpcre2_compile()\fP or by a special
@ -1467,9 +1488,11 @@ built.
PCRE2_EXTENDED_MORE PCRE2_EXTENDED_MORE
.sp .sp
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
and horizontal tab characters are ignored inside a character class. and horizontal tab characters are ignored inside a character class. Note: only
PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be these two characters are ignored, not the full set of pattern white space
changed within a pattern by a (?xx) option setting. characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
equivalent to Perl's /xx option, and it can be changed within a pattern by a
(?xx) option setting.
.sp .sp
PCRE2_FIRSTLINE PCRE2_FIRSTLINE
.sp .sp
@ -3641,6 +3664,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 27 July 2018 Last updated: 03 August 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "28 July 2018" "PCRE2 10.32" .TH PCRE2PATTERN 3 "03 August 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -1627,9 +1627,13 @@ alternative in the subpattern.
.rs .rs
.sp .sp
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
are Perl-compatible) can be changed from within the pattern by a sequence of changed from within the pattern by a sequence of letters enclosed between "(?"
Perl option letters enclosed between "(?" and ")". The option letters are and ")". These options are Perl-compatible, and are described in detail in the
.\" HREF
\fBpcre2api\fP
.\"
documentation. The option letters are:
.sp .sp
i for PCRE2_CASELESS i for PCRE2_CASELESS
m for PCRE2_MULTILINE m for PCRE2_MULTILINE
@ -2273,8 +2277,9 @@ unset value matches an empty string.
Because there may be many capturing parentheses in a pattern, all digits Because there may be many capturing parentheses in a pattern, all digits
following a backslash are taken as part of a potential backreference number. following a backslash are taken as part of a potential backreference number.
If the pattern continues with a digit character, some delimiter must be used to If the pattern continues with a digit character, some delimiter must be used to
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
white space. Otherwise, the \eg{ syntax or an empty comment (see option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
comment (see
.\" HTML <a href="#comments"> .\" HTML <a href="#comments">
.\" </a> .\" </a>
"Comments" "Comments"
@ -2762,12 +2767,12 @@ no part in the pattern matching.
.P .P
The sequence (?# marks the start of a comment that continues up to the next The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. If the closing parenthesis. Nested parentheses are not permitted. If the
PCRE2_EXTENDED option is set, an unescaped # character also introduces a PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
comment, which in this case continues to immediately after the next newline also introduces a comment, which in this case continues to immediately after
character or character sequence in the pattern. Which characters are the next newline character or character sequence in the pattern. Which
interpreted as newlines is controlled by an option passed to the compiling characters are interpreted as newlines is controlled by an option passed to the
function or by a special sequence at the start of the pattern, as described in compiling function or by a special sequence at the start of the pattern, as
the section entitled described in the section entitled
.\" HTML <a href="#newlines"> .\" HTML <a href="#newlines">
.\" </a> .\" </a>
"Newline conventions" "Newline conventions"
@ -3132,10 +3137,11 @@ only backslash items that are permitted are \eQ, \eE, and sequences such as
are faulted. are faulted.
.P .P
A closing parenthesis can be included in a name either as \e) or between \eQ A closing parenthesis can be included in a name either as \e) or between \eQ
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
also set, unescaped whitespace in verb names is skipped, and #-comments are PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not skipped, and #-comments are recognized, exactly as in the rest of the pattern.
affect verb names unless PCRE2_ALT_VERBNAMES is also set. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
PCRE2_ALT_VERBNAMES is also set.
.P .P
The maximum length of a name is 255 in the 8-bit library and 65535 in the The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries. If the name is empty, that is, if the closing 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
@ -3614,6 +3620,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 28 July 2018 Last updated: 03 August 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "28 July 2018" "PCRE2 10.32" .TH PCRE2SYNTAX 3 "01 August 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -421,6 +421,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
. .
.SH "OPTION SETTING" .SH "OPTION SETTING"
.rs .rs
Changes of these options within a group are automatically cancelled at the end
of the group.
.sp .sp
(?i) caseless (?i) caseless
(?J) allow duplicate names (?J) allow duplicate names
@ -619,6 +621,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 28 July 2018 Last updated: 01 August 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -2468,11 +2468,17 @@ while (ptr < ptrend)
         /* EITHER: not both options set */
         ((options & (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) !=
                     (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) ||
-        /* OR: character > 255 */
-        c > 255 ||
-        /* OR: not a # comment or white space */
-        (c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0)
-       ))
+#ifdef SUPPORT_UNICODE
+        /* OR: character > 255 AND not Unicode Pattern White Space */
+        (c > 255 && (c|1) != 0x200f && (c|1) != 0x2029) ||
+#endif
+        /* OR: not a # comment or isspace() white space */
+        (c < 256 && c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0
+#ifdef SUPPORT_UNICODE
+        /* and not CHAR_NEL when Unicode is supported */
+        && c != CHAR_NEL
+#endif
+       )))
     {
     PCRE2_SIZE verbnamelength;
@ -2554,11 +2560,18 @@ while (ptr < ptrend)
   /* Skip over whitespace and # comments in extended mode. Note that c is a
   character, not a code unit, so we must not use MAX_255 to test its size
-  because MAX_255 tests code units and is assumed TRUE in 8-bit mode. */
+  because MAX_255 tests code units and is assumed TRUE in 8-bit mode. The
+  whitespace characters are those designated as "Pattern White Space" by
+  Unicode, which are the isspace() characters plus CHAR_NEL (newline), which is
+  U+0085 in Unicode, plus U+200E, U+200F, U+2028, and U+2029. These are a
+  subset of space characters that match \h and \v. */
   if ((options & PCRE2_EXTENDED) != 0)
     {
     if (c < 256 && (cb->ctypes[c] & ctype_space) != 0) continue;
+#ifdef SUPPORT_UNICODE
+    if (c == CHAR_NEL || (c|1) == 0x200f || (c|1) == 0x2029) continue;
+#endif
     if (c == CHAR_NUMBER_SIGN)
       {
       while (ptr < ptrend)

testdata/testinput1

@ -6257,5 +6257,5 @@ ef) x/x,mark
\= Expect no match \= Expect no match
aBCDEF aBCDEF
AbCDe f AbCDe f
# End of testinput1 # End of testinput1

testdata/testinput4

@ -2293,5 +2293,20 @@
/[\N{U+1234}]/utf /[\N{U+1234}]/utf
\x{1234} \x{1234}
# Test the full list of Unicode "Pattern White Space" characters that are to
# be ignored by /x. The pattern lines below may show up oddly in text editors
# or when listed to the screen. Note that characters such as U+2002, which are
# matched as space by \h and \v are *not* "Pattern White Space".
/A…B/x,utf
AB
/AB/x,utf
A\x{2002}B
\= Expect no match
AB
# -------
# End of testinput4 # End of testinput4

testdata/testinput5

@ -2091,4 +2091,18 @@
/\N{U}/ /\N{U}/
# This tests the non-UTF Unicode NEL pattern whitespace character, only
# recognized by PCRE2 with /x when there is Unicode support.
/A
…B/x
AB
# This tests Unicode Pattern White Space characters in verb names when they
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
# with code points greater than 255 between A, B, and C in the pattern.
/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
abc
# End of testinput5 # End of testinput5


@ -9920,5 +9920,5 @@ No match, mark = X
No match No match
AbCDe f AbCDe f
No match No match
# End of testinput1 # End of testinput1

testdata/testoutput4

@ -3711,5 +3711,23 @@ No match
/[\N{U+1234}]/utf /[\N{U+1234}]/utf
\x{1234} \x{1234}
0: \x{1234} 0: \x{1234}
# Test the full list of Unicode "Pattern White Space" characters that are to
# be ignored by /x. The pattern lines below may show up oddly in text editors
# or when listed to the screen. Note that characters such as U+2002, which are
# matched as space by \h and \v are *not* "Pattern White Space".
/A…B/x,utf
AB
0: AB
/AB/x,utf
A\x{2002}B
0: A\x{2002}B
\= Expect no match
AB
No match
# -------
# End of testinput4 # End of testinput4

testdata/testoutput5

@ -4756,4 +4756,21 @@ Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
/\N{U}/ /\N{U}/
Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u
# This tests the non-UTF Unicode NEL pattern whitespace character, only
# recognized by PCRE2 with /x when there is Unicode support.
/A
…B/x
AB
0: AB
# This tests Unicode Pattern White Space characters in verb names when they
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
# with code points greater than 255 between A, B, and C in the pattern.
/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
abc
0: abc
MK: ABC
# End of testinput5 # End of testinput5