Make /x more Perl-compatible by recognizing all of Unicode's "Pattern White
Space" characters, not just the ASCII ones.
parent 6e245572b8
commit b196143523
@@ -134,6 +134,13 @@ terminated by (*ACCEPT).
 
 30. Add support for (?^) for unsetting all imnsx options.
 
+31. The PCRE2_EXTENDED (/x) option only ever discarded space characters whose
+code point was less than 256 and that were recognized by the lookup table
+generated by pcre2_maketables(), which uses isspace() to identify white space.
+Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085,
+U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by
+Unicode as "Pattern White Space". This makes PCRE2 compatible with Perl.
+
 
 Version 10.31 12-February-2018
 ------------------------------
|
|
|
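The set of characters described in ChangeLog entry 31 can be written out as a small predicate. This is only an illustrative sketch: `is_extended_white_space` and its `unicode_support` parameter are hypothetical names, and the ASCII cases are hard-coded here, whereas PCRE2 itself consults a table built with isspace(), so the real set may differ in unusual locales.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Returns true if code point c would
   be discarded as white space under /x according to the ChangeLog entry. */
bool is_extended_white_space(uint32_t c, bool unicode_support)
{
  assert(c <= 0x10FFFF);  /* precondition: a valid Unicode code point */
  switch (c)
    {
    /* The usual isspace() characters in an ASCII environment. */
    case 0x0009: case 0x000A: case 0x000B:
    case 0x000C: case 0x000D: case 0x0020:
      return true;
    /* The five extra "Pattern White Space" characters. */
    case 0x0085: case 0x200E: case 0x200F: case 0x2028: case 0x2029:
      return unicode_support;
    default:
      return false;
    }
}
```

Note that characters such as U+00A0 (no-break space) are not in the set, even though \h matches them.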
@@ -837,10 +837,10 @@ page for details.
 </P>
 <P>
 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
-option, the newline convention affects the recognition of white space and the
-end of internal comments starting with #. The value is saved with the compiled
-pattern for subsequent use by the JIT compiler and by the two interpreted
-matching functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
+option, the newline convention affects the recognition of the end of internal
+comments starting with #. The value is saved with the compiled pattern for
+subsequent use by the JIT compiler and by the two interpreted matching
+functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
 <br>
 <br>
 <b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
@@ -1424,9 +1424,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
 option is set, normal backslash processing is applied to verb names and only an
 unescaped closing parenthesis terminates the name. A closing parenthesis can be
 included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
-or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
-skipped and #-comments are recognized in this mode, exactly as in the rest of
-the pattern.
+or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+whitespace in verb names is skipped and #-comments are recognized, exactly as
+in the rest of the pattern.
 <pre>
 PCRE2_AUTO_CALLOUT
 </pre>
@@ -1510,15 +1510,36 @@ is not allowed within sequences such as (?> that introduce various
 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
 Ignorable white space is permitted between an item and a following quantifier
 and between a quantifier and a following + that indicates possessiveness.
+PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
+a pattern by a (?x) option setting.
 </P>
 <P>
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which makes it
-possible to include comments inside complicated patterns. Note that the end of
-this type of comment is a literal newline sequence in the pattern; escape
-sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
-equivalent to Perl's /x option, and it can be changed within a pattern by a
-(?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
+white space only those characters with code points less than 256 that are
+flagged as white space in its low-character table. The table is normally
+created by
+<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
+which uses the <b>isspace()</b> function to identify space characters. In most
+ASCII environments, the relevant characters are those with code points 0x0009
+(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
+(carriage return), and 0x0020 (space).
+</P>
+<P>
+When PCRE2 is compiled with Unicode support, in addition to these characters,
+five more Unicode "Pattern White Space" characters are recognized by
+PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
+U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
+separator). This set of characters is the same as recognized by Perl's /x
+option. Note that the horizontal and vertical space characters that are matched
+by the \h and \v escapes in patterns are a much bigger set.
+</P>
+<P>
+As well as ignoring most white space, PCRE2_EXTENDED also causes characters
+between an unescaped # outside a character class and the next newline,
+inclusive, to be ignored, which makes it possible to include comments inside
+complicated patterns. Note that the end of this type of comment is a literal
+newline sequence in the pattern; escape sequences that happen to represent a
+newline do not count.
 </P>
 <P>
 Which characters are interpreted as newlines can be specified by a setting in
@@ -1531,9 +1552,11 @@ built.
 PCRE2_EXTENDED_MORE
 </pre>
 This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
-and horizontal tab characters are ignored inside a character class.
-PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
-changed within a pattern by a (?xx) option setting.
+and horizontal tab characters are ignored inside a character class. Note: only
+these two characters are ignored, not the full set of pattern white space
+characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
+equivalent to Perl's /xx option, and it can be changed within a pattern by a
+(?xx) option setting.
 <pre>
 PCRE2_FIRSTLINE
 </pre>
|
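The narrower rule for character classes under PCRE2_EXTENDED_MORE can be illustrated with a short sketch. The function below is hypothetical and is not how PCRE2 parses classes; it only shows which characters the documentation says are dropped (unescaped space and horizontal tab) and which are kept (everything else, including linefeed). It deliberately ignores \Q...\E quoting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not PCRE2's code. Copies the body of a character
   class (the text between '[' and ']') to out, dropping unescaped space and
   horizontal tab. out must have room for the whole body plus a NUL.
   Returns the number of bytes written, excluding the terminating NUL. */
size_t strip_class_white_space(const char *body, char *out)
{
  assert(body != NULL && out != NULL);
  size_t n = 0;
  for (size_t i = 0; body[i] != '\0'; i++)
    {
    if (body[i] == '\\' && body[i + 1] != '\0')
      {
      out[n++] = body[i++];   /* keep the backslash */
      out[n++] = body[i];     /* and the character it escapes */
      }
    else if (body[i] != ' ' && body[i] != '\t')
      out[n++] = body[i];
    }
  out[n] = '\0';
  return n;
}
```

Under this sketch the class body `a b<tab>c<LF>d\ e` keeps the linefeed and the escaped space but loses the plain space and the tab.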
@@ -3635,7 +3658,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 27 July 2018
+Last updated: 03 August 2018
 <br>
 Copyright © 1997-2018 University of Cambridge.
 <br>
|
|
|
@@ -1628,9 +1628,11 @@ alternative in the subpattern.
 <br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
 <P>
 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
-PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
-are Perl-compatible) can be changed from within the pattern by a sequence of
-Perl option letters enclosed between "(?" and ")". The option letters are
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
+changed from within the pattern by a sequence of letters enclosed between "(?"
+and ")". These options are Perl-compatible, and are described in detail in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+documentation. The option letters are:
 <pre>
 i for PCRE2_CASELESS
 m for PCRE2_MULTILINE
@@ -2275,8 +2277,9 @@ unset value matches an empty string.
 Because there may be many capturing parentheses in a pattern, all digits
 following a backslash are taken as part of a potential backreference number.
 If the pattern continues with a digit character, some delimiter must be used to
-terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
-white space. Otherwise, the \g{ syntax or an empty comment (see
+terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option is set, this can be white space. Otherwise, the \g{ syntax or an empty
+comment (see
 <a href="#comments">"Comments"</a>
 below) can be used.
 </P>
@@ -2744,12 +2747,12 @@ no part in the pattern matching.
 <P>
 The sequence (?# marks the start of a comment that continues up to the next
 closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces a
-comment, which in this case continues to immediately after the next newline
-character or character sequence in the pattern. Which characters are
-interpreted as newlines is controlled by an option passed to the compiling
-function or by a special sequence at the start of the pattern, as described in
-the section entitled
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
 <a href="#newlines">"Newline conventions"</a>
 above. Note that the end of this type of comment is a literal newline sequence
 in the pattern; escape sequences that happen to represent a newline do not
@@ -3108,10 +3111,11 @@ are faulted.
 </P>
 <P>
 A closing parenthesis can be included in a name either as \) or between \Q
-and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
-also set, unescaped whitespace in verb names is skipped, and #-comments are
-recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
-affect verb names unless PCRE2_ALT_VERBNAMES is also set.
+and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
+PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
+skipped, and #-comments are recognized, exactly as in the rest of the pattern.
+PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
+PCRE2_ALT_VERBNAMES is also set.
 </P>
 <P>
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
@@ -3590,7 +3594,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 July 2018
+Last updated: 03 August 2018
 <br>
 Copyright © 1997-2018 University of Cambridge.
 <br>
|
|
|
@@ -446,6 +446,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
 </P>
 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
 <P>
+Changes of these options within a group are automatically cancelled at the end
+of the group.
 <pre>
 (?i) caseless
 (?J) allow duplicate names
@@ -632,7 +634,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 July 2018
+Last updated: 01 August 2018
 <br>
 Copyright © 1997-2018 University of Cambridge.
 <br>
|
|
103
doc/pcre2.txt
|
@@ -869,10 +869,10 @@ PCRE2 CONTEXTS
 
 When a pattern is compiled with the PCRE2_EXTENDED or
 PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
-tion of white space and the end of internal comments starting with #.
-The value is saved with the compiled pattern for subsequent use by the
-JIT compiler and by the two interpreted matching functions,
-pcre2_match() and pcre2_dfa_match().
+tion of the end of internal comments starting with #. The value is
+saved with the compiled pattern for subsequent use by the JIT compiler
+and by the two interpreted matching functions, pcre2_match() and
+pcre2_dfa_match().
 
 int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
 uint32_t value);
@@ -1413,9 +1413,9 @@ COMPILING A PATTERN
 processing is applied to verb names and only an unescaped closing
 parenthesis terminates the name. A closing parenthesis can be included
 in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
-PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names
-is skipped and #-comments are recognized in this mode, exactly as in
-the rest of the pattern.
+PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+whitespace in verb names is skipped and #-comments are recognized,
+exactly as in the rest of the pattern.
 
 PCRE2_AUTO_CALLOUT
 
@@ -1498,15 +1498,34 @@ COMPILING A PATTERN
 introduce various parenthesized subpatterns, nor within numerical quan-
 tifiers such as {1,3}. Ignorable white space is permitted between an
 item and a following quantifier and between a quantifier and a follow-
-ing + that indicates possessiveness.
+ing + that indicates possessiveness. PCRE2_EXTENDED is equivalent to
+Perl's /x option, and it can be changed within a pattern by a (?x)
+option setting.
 
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which
-makes it possible to include comments inside complicated patterns. Note
-that the end of this type of comment is a literal newline sequence in
-the pattern; escape sequences that happen to represent a newline do not
-count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
-changed within a pattern by a (?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
+nizes as white space only those characters with code points less than
+256 that are flagged as white space in its low-character table. The ta-
+ble is normally created by pcre2_maketables(), which uses the isspace()
+function to identify space characters. In most ASCII environments, the
+relevant characters are those with code points 0x0009 (tab), 0x000A
+(linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage
+return), and 0x0020 (space).
+
+When PCRE2 is compiled with Unicode support, in addition to these char-
+acters, five more Unicode "Pattern White Space" characters are recog-
+nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
+right mark), U+200F (right-to-left mark), U+2028 (line separator), and
+U+2029 (paragraph separator). This set of characters is the same as
+recognized by Perl's /x option. Note that the horizontal and vertical
+space characters that are matched by the \h and \v escapes in patterns
+are a much bigger set.
+
+As well as ignoring most white space, PCRE2_EXTENDED also causes char-
+acters between an unescaped # outside a character class and the next
+newline, inclusive, to be ignored, which makes it possible to include
+comments inside complicated patterns. Note that the end of this type of
+comment is a literal newline sequence in the pattern; escape sequences
+that happen to represent a newline do not count.
 
 Which characters are interpreted as newlines can be specified by a set-
 ting in the compile context that is passed to pcre2_compile() or by a
@@ -1518,7 +1537,9 @@ COMPILING A PATTERN
 
 This option has the effect of PCRE2_EXTENDED, but, in addition,
 unescaped space and horizontal tab characters are ignored inside a
-character class. PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx
+character class. Note: only these two characters are ignored, not the
+full set of pattern white space characters that are ignored outside a
+character class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx
 option, and it can be changed within a pattern by a (?xx) option set-
 ting.
 
@@ -3521,7 +3542,7 @@ AUTHOR
 
 REVISION
 
-Last updated: 27 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
 
@@ -7192,9 +7213,10 @@ INTERNAL OPTION SETTING
 
 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
 PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
-(which are Perl-compatible) can be changed from within the pattern by a
-sequence of Perl option letters enclosed between "(?" and ")". The
-option letters are
+can be changed from within the pattern by a sequence of letters
+enclosed between "(?" and ")". These options are Perl-compatible, and
+are described in detail in the pcre2api documentation. The option let-
+ters are:
 
 i for PCRE2_CASELESS
 m for PCRE2_MULTILINE
@@ -7814,8 +7836,9 @@ BACKREFERENCES
 its following a backslash are taken as part of a potential backrefer-
 ence number. If the pattern continues with a digit character, some
 delimiter must be used to terminate the backreference. If the
-PCRE2_EXTENDED option is set, this can be white space. Otherwise, the
-\g{ syntax or an empty comment (see "Comments" below) can be used.
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this can be white
+space. Otherwise, the \g{ syntax or an empty comment (see "Comments"
+below) can be used.
 
 Recursive backreferences
 
@@ -8245,17 +8268,17 @@ COMMENTS
 
 The sequence (?# marks the start of a comment that continues up to the
 next closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces
-a comment, which in this case continues to immediately after the next
-newline character or character sequence in the pattern. Which charac-
-ters are interpreted as newlines is controlled by an option passed to
-the compiling function or by a special sequence at the start of the
-pattern, as described in the section entitled "Newline conventions"
-above. Note that the end of this type of comment is a literal newline
-sequence in the pattern; escape sequences that happen to represent a
-newline do not count. For example, consider this pattern when
-PCRE2_EXTENDED is set, and the default newline convention (a single
-linefeed character) is in force:
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
+character also introduces a comment, which in this case continues to
+immediately after the next newline character or character sequence in
+the pattern. Which characters are interpreted as newlines is controlled
+by an option passed to the compiling function or by a special sequence
+at the start of the pattern, as described in the section entitled "New-
+line conventions" above. Note that the end of this type of comment is a
+literal newline sequence in the pattern; escape sequences that happen
+to represent a newline do not count. For example, consider this pattern
+when PCRE2_EXTENDED is set, and the default newline convention (a sin-
+gle linefeed character) is in force:
 
 abc #comment \n still comment
 
|
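The rule that only a literal newline ends a # comment can be shown with a small scanner. This is a sketch under stated assumptions, not PCRE2's implementation: `skip_hash_comment` is a hypothetical name, the pattern is treated as plain bytes, and the default newline convention (a single linefeed, 0x0A) is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not PCRE2's code. Given the offset of an unescaped
   '#', returns the offset just past the comment. Only a literal linefeed
   byte ends the comment; the two-character escape made of a backslash
   followed by 'n' is ordinary comment text. */
size_t skip_hash_comment(const char *pattern, size_t length, size_t start)
{
  assert(pattern != NULL && start < length && pattern[start] == '#');
  size_t i = start;
  while (i < length && pattern[i] != '\n') i++;
  return (i < length) ? i + 1 : length;   /* the newline is consumed too */
}
```

Applied to the documentation's example, where the pattern text contains a backslash and the letter n rather than a linefeed byte, the scan runs to the end of the pattern, so " still comment" is part of the comment.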
@@ -8602,10 +8625,10 @@ BACKTRACKING CONTROL
 
 A closing parenthesis can be included in a name either as \) or between
 \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
-option is also set, unescaped whitespace in verb names is skipped, and
-#-comments are recognized, exactly as in the rest of the pattern.
-PCRE2_EXTENDED does not affect verb names unless PCRE2_ALT_VERBNAMES is
-also set.
+or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
+names is skipped, and #-comments are recognized, exactly as in the rest
+of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
+verb names unless PCRE2_ALT_VERBNAMES is also set.
 
 The maximum length of a name is 255 in the 8-bit library and 65535 in
 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
@@ -9049,7 +9072,7 @@ AUTHOR
 
 REVISION
 
-Last updated: 28 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
 
@@ -10151,6 +10174,8 @@ COMMENT
 
 
 OPTION SETTING
+Changes of these options within a group are automatically cancelled at
+the end of the group.
 
 (?i) caseless
 (?J) allow duplicate names
@@ -10337,7 +10362,7 @@ AUTHOR
 
 REVISION
 
-Last updated: 28 July 2018
+Last updated: 01 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
 
|
|
|
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
+.TH PCRE2API 3 "03 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -775,10 +775,10 @@ sequence such as (*CRLF). See the
 page for details.
 .P
 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
-option, the newline convention affects the recognition of white space and the
-end of internal comments starting with #. The value is saved with the compiled
-pattern for subsequent use by the JIT compiler and by the two interpreted
-matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
+option, the newline convention affects the recognition of the end of internal
+comments starting with #. The value is saved with the compiled pattern for
+subsequent use by the JIT compiler and by the two interpreted matching
+functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
 .sp
 .nf
 .B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
@@ -1356,9 +1356,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
 option is set, normal backslash processing is applied to verb names and only an
 unescaped closing parenthesis terminates the name. A closing parenthesis can be
 included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
-or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
-skipped and #-comments are recognized in this mode, exactly as in the rest of
-the pattern.
+or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+whitespace in verb names is skipped and #-comments are recognized, exactly as
+in the rest of the pattern.
 .sp
 PCRE2_AUTO_CALLOUT
 .sp
@@ -1445,14 +1445,35 @@ is not allowed within sequences such as (?> that introduce various
 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
 Ignorable white space is permitted between an item and a following quantifier
 and between a quantifier and a following + that indicates possessiveness.
+PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
+a pattern by a (?x) option setting.
 .P
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which makes it
-possible to include comments inside complicated patterns. Note that the end of
-this type of comment is a literal newline sequence in the pattern; escape
-sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
-equivalent to Perl's /x option, and it can be changed within a pattern by a
-(?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
+white space only those characters with code points less than 256 that are
+flagged as white space in its low-character table. The table is normally
+created by
+.\" HREF
+\fBpcre2_maketables()\fP,
+.\"
+which uses the \fBisspace()\fP function to identify space characters. In most
+ASCII environments, the relevant characters are those with code points 0x0009
+(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
+(carriage return), and 0x0020 (space).
+.P
+When PCRE2 is compiled with Unicode support, in addition to these characters,
+five more Unicode "Pattern White Space" characters are recognized by
+PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
+U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
+separator). This set of characters is the same as recognized by Perl's /x
+option. Note that the horizontal and vertical space characters that are matched
+by the \eh and \ev escapes in patterns are a much bigger set.
+.P
+As well as ignoring most white space, PCRE2_EXTENDED also causes characters
+between an unescaped # outside a character class and the next newline,
+inclusive, to be ignored, which makes it possible to include comments inside
+complicated patterns. Note that the end of this type of comment is a literal
+newline sequence in the pattern; escape sequences that happen to represent a
+newline do not count.
 .P
 Which characters are interpreted as newlines can be specified by a setting in
 the compile context that is passed to \fBpcre2_compile()\fP or by a special
@@ -1467,9 +1488,11 @@ built.
 PCRE2_EXTENDED_MORE
 .sp
 This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
-and horizontal tab characters are ignored inside a character class.
-PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
-changed within a pattern by a (?xx) option setting.
+and horizontal tab characters are ignored inside a character class. Note: only
+these two characters are ignored, not the full set of pattern white space
+characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
+equivalent to Perl's /xx option, and it can be changed within a pattern by a
+(?xx) option setting.
 .sp
 PCRE2_FIRSTLINE
 .sp
@@ -3641,6 +3664,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 27 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
|
|
|
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "28 July 2018" "PCRE2 10.32"
+.TH PCRE2PATTERN 3 "03 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -1627,9 +1627,13 @@ alternative in the subpattern.
 .rs
 .sp
 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
-PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
-are Perl-compatible) can be changed from within the pattern by a sequence of
-Perl option letters enclosed between "(?" and ")". The option letters are
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
+changed from within the pattern by a sequence of letters enclosed between "(?"
+and ")". These options are Perl-compatible, and are described in detail in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation. The option letters are:
 .sp
 i for PCRE2_CASELESS
 m for PCRE2_MULTILINE
@@ -2273,8 +2277,9 @@ unset value matches an empty string.
 Because there may be many capturing parentheses in a pattern, all digits
 following a backslash are taken as part of a potential backreference number.
 If the pattern continues with a digit character, some delimiter must be used to
-terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
-white space. Otherwise, the \eg{ syntax or an empty comment (see
+terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
+comment (see
 .\" HTML <a href="#comments">
 .\" </a>
 "Comments"
@@ -2762,12 +2767,12 @@ no part in the pattern matching.
 .P
 The sequence (?# marks the start of a comment that continues up to the next
 closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces a
-comment, which in this case continues to immediately after the next newline
-character or character sequence in the pattern. Which characters are
-interpreted as newlines is controlled by an option passed to the compiling
-function or by a special sequence at the start of the pattern, as described in
-the section entitled
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
 .\" HTML <a href="#newlines">
 .\" </a>
 "Newline conventions"
@@ -3132,10 +3137,11 @@ only backslash items that are permitted are \eQ, \eE, and sequences such as
 are faulted.
 .P
 A closing parenthesis can be included in a name either as \e) or between \eQ
-and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
-also set, unescaped whitespace in verb names is skipped, and #-comments are
-recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
-affect verb names unless PCRE2_ALT_VERBNAMES is also set.
+and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
+PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
+skipped, and #-comments are recognized, exactly as in the rest of the pattern.
+PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
+PCRE2_ALT_VERBNAMES is also set.
 .P
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
@@ -3614,6 +3620,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 28 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
|
|
|
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "28 July 2018" "PCRE2 10.32"
+.TH PCRE2SYNTAX 3 "01 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -421,6 +421,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
.
.SH "OPTION SETTING"
.rs
Changes of these options within a group are automatically cancelled at the end
of the group.
.sp
(?i) caseless
(?J) allow duplicate names

@ -619,6 +621,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 28 July 2018
Last updated: 01 August 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

@ -2468,11 +2468,17 @@ while (ptr < ptrend)
/* EITHER: not both options set */
((options & (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) !=
(PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) ||
/* OR: character > 255 */
c > 255 ||
/* OR: not a # comment or white space */
(c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0)
))
#ifdef SUPPORT_UNICODE
/* OR: character > 255 AND not Unicode Pattern White Space */
(c > 255 && (c|1) != 0x200f && (c|1) != 0x2029) ||
#endif
/* OR: not a # comment or isspace() white space */
(c < 256 && c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0
#ifdef SUPPORT_UNICODE
/* and not CHAR_NEL when Unicode is supported */
&& c != CHAR_NEL
#endif
)))
{
PCRE2_SIZE verbnamelength;

@ -2554,11 +2560,18 @@ while (ptr < ptrend)
/* Skip over whitespace and # comments in extended mode. Note that c is a
character, not a code unit, so we must not use MAX_255 to test its size
because MAX_255 tests code units and is assumed TRUE in 8-bit mode. */
because MAX_255 tests code units and is assumed TRUE in 8-bit mode. The
whitespace characters are those designated as "Pattern White Space" by
Unicode, which are the isspace() characters plus CHAR_NEL (newline), which is
U+0085 in Unicode, plus U+200E, U+200F, U+2028, and U+2029. These are a
subset of space characters that match \h and \v. */

if ((options & PCRE2_EXTENDED) != 0)
{
if (c < 256 && (cb->ctypes[c] & ctype_space) != 0) continue;
#ifdef SUPPORT_UNICODE
if (c == CHAR_NEL || (c|1) == 0x200f || (c|1) == 0x2029) continue;
#endif
if (c == CHAR_NUMBER_SIGN)
{
while (ptr < ptrend)
|
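The white space test in the hunk above relies on a small bit trick: `(c|1)` sets the low bit, so the pairs U+200E/U+200F and U+2028/U+2029 each need only one comparison. The following is a minimal standalone sketch of that test, not code from PCRE2. The function names `is_pattern_white_space`, `count_skipped`, and `check_examples` are hypothetical, and `isspace()` is assumed as a stand-in for the `cb->ctypes[]` table lookup used in the real source.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; it is not a PCRE2 function.
   isspace() stands in for the cb->ctypes[] table lookup shown in the diff. */
int is_pattern_white_space(uint32_t c)
{
  /* Code points below 256: whatever isspace() accepts in the current locale
     (tab, LF, VT, FF, CR and space in the default "C" locale). */
  if (c < 256 && isspace((int)c)) return 1;

  /* U+0085 (NEL) is not an isspace() character in the "C" locale, so it is
     tested explicitly, as the patch does with CHAR_NEL. */
  if (c == 0x85) return 1;

  /* (c|1) sets the low bit, so each even/odd pair collapses onto its odd
     member: U+200E/U+200F and U+2028/U+2029. */
  return (c | 1) == 0x200f || (c | 1) == 0x2029;
}

/* Count how many code points of a pattern would be discarded as white space. */
size_t count_skipped(const uint32_t *pattern, size_t length)
{
  size_t skipped = 0;
  for (size_t i = 0; i < length; i++)
    if (is_pattern_white_space(pattern[i])) skipped++;
  return skipped;
}

/* Spot checks in the spirit of the test cases added by the commit. */
void check_examples(void)
{
  assert(is_pattern_white_space(0x200e));   /* LEFT-TO-RIGHT MARK is skipped */
  assert(!is_pattern_white_space(0x2002));  /* EN SPACE is not skipped */
}
```

Calling `check_examples()` from a `main()` reproduces the distinction the new tests draw: U+200E is ignored under /x, while U+2002 (which \h matches) is kept as a literal.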
@ -2294,4 +2294,19 @@
/[\N{U+1234}]/utf
\x{1234}

# Test the full list of Unicode "Pattern White Space" characters that are to
# be ignored by /x. The pattern lines below may show up oddly in text editors
# or when listed to the screen. Note that characters such as U+2002, which are
# matched as space by \h and \v are *not* "Pattern White Space".

/A
B/x,utf
AB

/A B/x,utf
A\x{2002}B
\= Expect no match
AB

# -------

# End of testinput4

@ -2091,4 +2091,18 @@
/\N{U}/

# This tests the non-UTF Unicode NEL pattern whitespace character, only
# recognized by PCRE2 with /x when there is Unicode support.

/A
…B/x
AB

# This tests Unicode Pattern White Space characters in verb names when they
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
# with code points greater than 255 between A, B, and C in the pattern.

/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
abc

# End of testinput5

@ -3712,4 +3712,22 @@ No match
\x{1234}
0: \x{1234}

# Test the full list of Unicode "Pattern White Space" characters that are to
# be ignored by /x. The pattern lines below may show up oddly in text editors
# or when listed to the screen. Note that characters such as U+2002, which are
# matched as space by \h and \v are *not* "Pattern White Space".

/A
B/x,utf
AB
0: AB

/A B/x,utf
A\x{2002}B
0: A\x{2002}B
\= Expect no match
AB
No match

# -------

# End of testinput4

@ -4756,4 +4756,21 @@ Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
/\N{U}/
Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u

# This tests the non-UTF Unicode NEL pattern whitespace character, only
# recognized by PCRE2 with /x when there is Unicode support.

/A
…B/x
AB
0: AB

# This tests Unicode Pattern White Space characters in verb names when they
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
# with code points greater than 255 between A, B, and C in the pattern.

/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
abc
0: abc
MK: ABC

# End of testinput5