Make /x more Perl-compatible by recognizing all of Unicode's "Pattern White
Space" characters, not just the ASCII ones.
Parent: 6e245572b8
Commit: b196143523
@@ -134,6 +134,13 @@ terminated by (*ACCEPT).
 
 30. Add support for (?^) for unsetting all imnsx options.
 
+31. The PCRE2_EXTENDED (/x) option only ever discarded space characters whose
+code point was less than 256 and that were recognized by the lookup table
+generated by pcre2_maketables(), which uses isspace() to identify white space.
+Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085,
+U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by
+Unicode as "Pattern White Space". This makes PCRE2 compatible with Perl.
+
 
 Version 10.31 12-February-2018
 ------------------------------
|
|
|
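The ChangeLog entry refers to a lookup table covering code points below 256, filled in by asking isspace() about each value. The following is only a rough sketch of that idea, not PCRE2's actual table code: the flag name SKETCH_CTYPE_SPACE and both function names are hypothetical, and the real tables produced by pcre2_maketables() have their own layout.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical flag bit for this sketch; not PCRE2's real ctype_space. */
#define SKETCH_CTYPE_SPACE 0x01u

/* Flag every byte value for which isspace() is true in the current locale. */
static void build_space_table(unsigned char table[256])
{
    for (int c = 0; c < 256; c++)
        table[c] = isspace(c) ? SKETCH_CTYPE_SPACE : 0;
}

/* Count the entries that carry the space flag. */
static int count_space_entries(const unsigned char table[256])
{
    int n = 0;
    for (int c = 0; c < 256; c++)
        if (table[c] & SKETCH_CTYPE_SPACE) n++;
    return n;
}

/* Usage: in the "C" locale only the six standard white space characters
   are flagged, so 0x85 (NEL) is not in the table. */
void space_table_demo(void)
{
    unsigned char table[256];
    setlocale(LC_ALL, "C");
    build_space_table(table);
    assert(count_space_entries(table) == 6);
    assert(table[0x85] == 0);
}
```

Because such a table depends only on isspace(), it cannot flag U+0085 in the "C" locale, which is why the new code tests for it separately when Unicode support is compiled.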
@ -837,10 +837,10 @@ page for details.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
|
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
|
||||||
option, the newline convention affects the recognition of white space and the
|
option, the newline convention affects the recognition of the end of internal
|
||||||
end of internal comments starting with #. The value is saved with the compiled
|
comments starting with #. The value is saved with the compiled pattern for
|
||||||
pattern for subsequent use by the JIT compiler and by the two interpreted
|
subsequent use by the JIT compiler and by the two interpreted matching
|
||||||
matching functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
|
functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
|
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||||
|
@ -1424,9 +1424,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
|
||||||
option is set, normal backslash processing is applied to verb names and only an
|
option is set, normal backslash processing is applied to verb names and only an
|
||||||
unescaped closing parenthesis terminates the name. A closing parenthesis can be
|
unescaped closing parenthesis terminates the name. A closing parenthesis can be
|
||||||
included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
|
included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
|
||||||
or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
|
or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
|
||||||
skipped and #-comments are recognized in this mode, exactly as in the rest of
|
whitespace in verb names is skipped and #-comments are recognized, exactly as
|
||||||
the pattern.
|
in the rest of the pattern.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_AUTO_CALLOUT
|
PCRE2_AUTO_CALLOUT
|
||||||
</pre>
|
</pre>
|
||||||
|
@@ -1510,15 +1510,36 @@ is not allowed within sequences such as (?> that introduce various
 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
 Ignorable white space is permitted between an item and a following quantifier
 and between a quantifier and a following + that indicates possessiveness.
+PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
+a pattern by a (?x) option setting.
 </P>
 <P>
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which makes it
-possible to include comments inside complicated patterns. Note that the end of
-this type of comment is a literal newline sequence in the pattern; escape
-sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
-equivalent to Perl's /x option, and it can be changed within a pattern by a
-(?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
+white space only those characters with code points less than 256 that are
+flagged as white space in its low-character table. The table is normally
+created by
+<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
+which uses the <b>isspace()</b> function to identify space characters. In most
+ASCII environments, the relevant characters are those with code points 0x0009
+(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
+(carriage return), and 0x0020 (space).
+</P>
+<P>
+When PCRE2 is compiled with Unicode support, in addition to these characters,
+five more Unicode "Pattern White Space" characters are recognized by
+PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
+U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
+separator). This set of characters is the same as recognized by Perl's /x
+option. Note that the horizontal and vertical space characters that are matched
+by the \h and \v escapes in patterns are a much bigger set.
+</P>
+<P>
+As well as ignoring most white space, PCRE2_EXTENDED also causes characters
+between an unescaped # outside a character class and the next newline,
+inclusive, to be ignored, which makes it possible to include comments inside
+complicated patterns. Note that the end of this type of comment is a literal
+newline sequence in the pattern; escape sequences that happen to represent a
+newline do not count.
+</P>
 <P>
 Which characters are interpreted as newlines can be specified by a setting in
|
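As an illustration of the two character sets described in this hunk, here is a minimal sketch of a predicate for what PCRE2_EXTENDED ignores. It is not taken from PCRE2: the function names and the unicode_support parameter (standing in for a build with Unicode support) are hypothetical, and the six ASCII characters are hard-coded on the assumption of an ordinary ASCII environment rather than read from the library's table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The six characters isspace() typically flags in an ASCII "C" locale:
   tab, linefeed, vertical tab, formfeed, carriage return, space. */
static bool is_ascii_space(uint32_t c)
{
    return c == 0x0020 || (c >= 0x0009 && c <= 0x000D);
}

/* The five additional Pattern White Space characters listed in the text. */
static bool is_extra_pattern_space(uint32_t c)
{
    return c == 0x0085 || c == 0x200E || c == 0x200F ||
           c == 0x2028 || c == 0x2029;
}

/* unicode_support models whether the library was built with Unicode. */
static bool x_ignores(uint32_t c, bool unicode_support)
{
    return is_ascii_space(c) || (unicode_support && is_extra_pattern_space(c));
}

/* Usage: NEL is ignored only with Unicode support; EN SPACE (U+2002) is
   matched by \h but is never ignorable pattern white space. */
void x_ignores_demo(void)
{
    assert(x_ignores(0x0085, true));
    assert(!x_ignores(0x0085, false));
    assert(!x_ignores(0x2002, true));
}
```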
@@ -1531,9 +1552,11 @@ built.
 PCRE2_EXTENDED_MORE
 </pre>
 This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
-and horizontal tab characters are ignored inside a character class.
-PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
-changed within a pattern by a (?xx) option setting.
+and horizontal tab characters are ignored inside a character class. Note: only
+these two characters are ignored, not the full set of pattern white space
+characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
+equivalent to Perl's /xx option, and it can be changed within a pattern by a
+(?xx) option setting.
 <pre>
 PCRE2_FIRSTLINE
 </pre>
|
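The note added in this hunk says that inside a character class PCRE2_EXTENDED_MORE ignores only space and horizontal tab. A small sketch of that rule follows; the function name is hypothetical and the code is an illustration of the documented behaviour, not PCRE2's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inside a character class under /xx, only space and tab are ignorable. */
static bool xx_ignores_in_class(uint32_t c)
{
    return c == 0x0020 || c == 0x0009;
}

/* Usage: linefeed and U+2028 are ignorable outside a class (with Unicode
   support) but not inside one. */
void xx_in_class_demo(void)
{
    assert(xx_ignores_in_class(' '));
    assert(xx_ignores_in_class('\t'));
    assert(!xx_ignores_in_class('\n'));
    assert(!xx_ignores_in_class(0x2028));
}
```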
@ -3635,7 +3658,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 27 July 2018
|
Last updated: 03 August 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -1628,9 +1628,11 @@ alternative in the subpattern.
|
||||||
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
|
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
|
||||||
<P>
|
<P>
|
||||||
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
|
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
|
||||||
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
|
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
|
||||||
are Perl-compatible) can be changed from within the pattern by a sequence of
|
changed from within the pattern by a sequence of letters enclosed between "(?"
|
||||||
Perl option letters enclosed between "(?" and ")". The option letters are
|
and ")". These options are Perl-compatible, and are described in detail in the
|
||||||
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
|
documentation. The option letters are:
|
||||||
<pre>
|
<pre>
|
||||||
i for PCRE2_CASELESS
|
i for PCRE2_CASELESS
|
||||||
m for PCRE2_MULTILINE
|
m for PCRE2_MULTILINE
|
||||||
|
@ -2275,8 +2277,9 @@ unset value matches an empty string.
|
||||||
Because there may be many capturing parentheses in a pattern, all digits
|
Because there may be many capturing parentheses in a pattern, all digits
|
||||||
following a backslash are taken as part of a potential backreference number.
|
following a backslash are taken as part of a potential backreference number.
|
||||||
If the pattern continues with a digit character, some delimiter must be used to
|
If the pattern continues with a digit character, some delimiter must be used to
|
||||||
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
|
terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
|
||||||
white space. Otherwise, the \g{ syntax or an empty comment (see
|
option is set, this can be white space. Otherwise, the \g{ syntax or an empty
|
||||||
|
comment (see
|
||||||
<a href="#comments">"Comments"</a>
|
<a href="#comments">"Comments"</a>
|
||||||
below) can be used.
|
below) can be used.
|
||||||
</P>
|
</P>
|
||||||
|
@@ -2744,12 +2747,12 @@ no part in the pattern matching.
 <P>
 The sequence (?# marks the start of a comment that continues up to the next
 closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces a
-comment, which in this case continues to immediately after the next newline
-character or character sequence in the pattern. Which characters are
-interpreted as newlines is controlled by an option passed to the compiling
-function or by a special sequence at the start of the pattern, as described in
-the section entitled
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
 <a href="#newlines">"Newline conventions"</a>
 above. Note that the end of this type of comment is a literal newline sequence
 in the pattern; escape sequences that happen to represent a newline do not
|
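To make the #-comment rule in this hunk concrete, here is a simplified, hypothetical sketch that removes ignorable white space and #-comments from a pattern string. It is not PCRE2's parser. It assumes the newline convention is a single linefeed, handles only the six ASCII white space characters, treats a class as everything from [ to the next unescaped ], and does not know about \Q...\E or (?# comments. The name strip_extended is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII tab, linefeed, vertical tab, formfeed, carriage return, space. */
static int is_ascii_space(char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Copy pat to out without ignorable white space and #-comments.
   out must have room for strlen(pat) + 1 bytes. Returns the new length. */
static size_t strip_extended(const char *pat, char *out)
{
    size_t n = 0;
    int in_class = 0;
    for (size_t i = 0; pat[i] != '\0'; i++) {
        char c = pat[i];
        if (c == '\\' && pat[i + 1] != '\0') {   /* keep escaped character */
            out[n++] = c;
            out[n++] = pat[++i];
            continue;
        }
        if (in_class) {                          /* classes are copied as-is */
            if (c == ']') in_class = 0;
            out[n++] = c;
            continue;
        }
        if (c == '[') { in_class = 1; out[n++] = c; continue; }
        if (is_ascii_space(c)) continue;         /* ignorable white space */
        if (c == '#') {                          /* comment up to literal LF */
            while (pat[i] != '\0' && pat[i] != '\n') i++;
            if (pat[i] == '\0') break;
            continue;                            /* the LF itself is dropped */
        }
        out[n++] = c;
    }
    out[n] = '\0';
    return n;
}

/* Usage: the comment ends at the literal newline, and a # inside a class
   is an ordinary character. */
void strip_extended_demo(void)
{
    char buf[32];
    strip_extended("a b # c\n d", buf);
    assert(strcmp(buf, "abd") == 0);
    strip_extended("[ #]x # y", buf);
    assert(strcmp(buf, "[ #]x") == 0);
}
```

Note that an escape sequence such as \n inside a comment is two ordinary characters here, so it does not end the comment, which matches the documented rule that only a literal newline sequence counts.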
@ -3108,10 +3111,11 @@ are faulted.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A closing parenthesis can be included in a name either as \) or between \Q
|
A closing parenthesis can be included in a name either as \) or between \Q
|
||||||
and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
|
and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
|
||||||
also set, unescaped whitespace in verb names is skipped, and #-comments are
|
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
|
||||||
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
|
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
|
||||||
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
|
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
|
||||||
|
PCRE2_ALT_VERBNAMES is also set.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||||
|
@ -3590,7 +3594,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 28 July 2018
|
Last updated: 03 August 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -446,6 +446,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
|
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
Changes of these options within a group are automatically cancelled at the end
|
||||||
|
of the group.
|
||||||
<pre>
|
<pre>
|
||||||
(?i) caseless
|
(?i) caseless
|
||||||
(?J) allow duplicate names
|
(?J) allow duplicate names
|
||||||
|
@ -632,7 +634,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 28 July 2018
|
Last updated: 01 August 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
2273
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
|
.TH PCRE2API 3 "03 August 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -775,10 +775,10 @@ sequence such as (*CRLF). See the
|
||||||
page for details.
|
page for details.
|
||||||
.P
|
.P
|
||||||
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
|
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
|
||||||
option, the newline convention affects the recognition of white space and the
|
option, the newline convention affects the recognition of the end of internal
|
||||||
end of internal comments starting with #. The value is saved with the compiled
|
comments starting with #. The value is saved with the compiled pattern for
|
||||||
pattern for subsequent use by the JIT compiler and by the two interpreted
|
subsequent use by the JIT compiler and by the two interpreted matching
|
||||||
matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
|
functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
.B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
|
.B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
|
||||||
|
@ -1356,9 +1356,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
|
||||||
option is set, normal backslash processing is applied to verb names and only an
|
option is set, normal backslash processing is applied to verb names and only an
|
||||||
unescaped closing parenthesis terminates the name. A closing parenthesis can be
|
unescaped closing parenthesis terminates the name. A closing parenthesis can be
|
||||||
included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
|
included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
|
||||||
or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
|
or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
|
||||||
skipped and #-comments are recognized in this mode, exactly as in the rest of
|
whitespace in verb names is skipped and #-comments are recognized, exactly as
|
||||||
the pattern.
|
in the rest of the pattern.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_AUTO_CALLOUT
|
PCRE2_AUTO_CALLOUT
|
||||||
.sp
|
.sp
|
||||||
|
@ -1445,14 +1445,35 @@ is not allowed within sequences such as (?> that introduce various
|
||||||
parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
|
parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
|
||||||
Ignorable white space is permitted between an item and a following quantifier
|
Ignorable white space is permitted between an item and a following quantifier
|
||||||
and between a quantifier and a following + that indicates possessiveness.
|
and between a quantifier and a following + that indicates possessiveness.
|
||||||
|
PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
|
||||||
|
a pattern by a (?x) option setting.
|
||||||
.P
|
.P
|
||||||
PCRE2_EXTENDED also causes characters between an unescaped # outside a
|
When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
|
||||||
character class and the next newline, inclusive, to be ignored, which makes it
|
white space only those characters with code points less than 256 that are
|
||||||
possible to include comments inside complicated patterns. Note that the end of
|
flagged as white space in its low-character table. The table is normally
|
||||||
this type of comment is a literal newline sequence in the pattern; escape
|
created by
|
||||||
sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
|
.\" HREF
|
||||||
equivalent to Perl's /x option, and it can be changed within a pattern by a
|
\fBpcre2_maketables()\fP,
|
||||||
(?x) option setting.
|
.\"
|
||||||
|
which uses the \fBisspace()\fP function to identify space characters. In most
|
||||||
|
ASCII environments, the relevant characters are those with code points 0x0009
|
||||||
|
(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
|
||||||
|
(carriage return), and 0x0020 (space).
|
||||||
|
.P
|
||||||
|
When PCRE2 is compiled with Unicode support, in addition to these characters,
|
||||||
|
five more Unicode "Pattern White Space" characters are recognized by
|
||||||
|
PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
|
||||||
|
U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
|
||||||
|
separator). This set of characters is the same as recognized by Perl's /x
|
||||||
|
option. Note that the horizontal and vertical space characters that are matched
|
||||||
|
by the \eh and \ev escapes in patterns are a much bigger set.
|
||||||
|
.P
|
||||||
|
As well as ignoring most white space, PCRE2_EXTENDED also causes characters
|
||||||
|
between an unescaped # outside a character class and the next newline,
|
||||||
|
inclusive, to be ignored, which makes it possible to include comments inside
|
||||||
|
complicated patterns. Note that the end of this type of comment is a literal
|
||||||
|
newline sequence in the pattern; escape sequences that happen to represent a
|
||||||
|
newline do not count.
|
||||||
.P
|
.P
|
||||||
Which characters are interpreted as newlines can be specified by a setting in
|
Which characters are interpreted as newlines can be specified by a setting in
|
||||||
the compile context that is passed to \fBpcre2_compile()\fP or by a special
|
the compile context that is passed to \fBpcre2_compile()\fP or by a special
|
||||||
|
@ -1467,9 +1488,11 @@ built.
|
||||||
PCRE2_EXTENDED_MORE
|
PCRE2_EXTENDED_MORE
|
||||||
.sp
|
.sp
|
||||||
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
|
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
|
||||||
and horizontal tab characters are ignored inside a character class.
|
and horizontal tab characters are ignored inside a character class. Note: only
|
||||||
PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
|
these two characters are ignored, not the full set of pattern white space
|
||||||
changed within a pattern by a (?xx) option setting.
|
characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
|
||||||
|
equivalent to Perl's /xx option, and it can be changed within a pattern by a
|
||||||
|
(?xx) option setting.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_FIRSTLINE
|
PCRE2_FIRSTLINE
|
||||||
.sp
|
.sp
|
||||||
|
@ -3641,6 +3664,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 27 July 2018
|
Last updated: 03 August 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "28 July 2018" "PCRE2 10.32"
|
.TH PCRE2PATTERN 3 "03 August 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -1627,9 +1627,13 @@ alternative in the subpattern.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
|
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
|
||||||
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
|
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
|
||||||
are Perl-compatible) can be changed from within the pattern by a sequence of
|
changed from within the pattern by a sequence of letters enclosed between "(?"
|
||||||
Perl option letters enclosed between "(?" and ")". The option letters are
|
and ")". These options are Perl-compatible, and are described in detail in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2api\fP
|
||||||
|
.\"
|
||||||
|
documentation. The option letters are:
|
||||||
.sp
|
.sp
|
||||||
i for PCRE2_CASELESS
|
i for PCRE2_CASELESS
|
||||||
m for PCRE2_MULTILINE
|
m for PCRE2_MULTILINE
|
||||||
|
@ -2273,8 +2277,9 @@ unset value matches an empty string.
|
||||||
Because there may be many capturing parentheses in a pattern, all digits
|
Because there may be many capturing parentheses in a pattern, all digits
|
||||||
following a backslash are taken as part of a potential backreference number.
|
following a backslash are taken as part of a potential backreference number.
|
||||||
If the pattern continues with a digit character, some delimiter must be used to
|
If the pattern continues with a digit character, some delimiter must be used to
|
||||||
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
|
terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
|
||||||
white space. Otherwise, the \eg{ syntax or an empty comment (see
|
option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
|
||||||
|
comment (see
|
||||||
.\" HTML <a href="#comments">
|
.\" HTML <a href="#comments">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
"Comments"
|
"Comments"
|
||||||
|
@ -2762,12 +2767,12 @@ no part in the pattern matching.
|
||||||
.P
|
.P
|
||||||
The sequence (?# marks the start of a comment that continues up to the next
|
The sequence (?# marks the start of a comment that continues up to the next
|
||||||
closing parenthesis. Nested parentheses are not permitted. If the
|
closing parenthesis. Nested parentheses are not permitted. If the
|
||||||
PCRE2_EXTENDED option is set, an unescaped # character also introduces a
|
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
|
||||||
comment, which in this case continues to immediately after the next newline
|
also introduces a comment, which in this case continues to immediately after
|
||||||
character or character sequence in the pattern. Which characters are
|
the next newline character or character sequence in the pattern. Which
|
||||||
interpreted as newlines is controlled by an option passed to the compiling
|
characters are interpreted as newlines is controlled by an option passed to the
|
||||||
function or by a special sequence at the start of the pattern, as described in
|
compiling function or by a special sequence at the start of the pattern, as
|
||||||
the section entitled
|
described in the section entitled
|
||||||
.\" HTML <a href="#newlines">
|
.\" HTML <a href="#newlines">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
"Newline conventions"
|
"Newline conventions"
|
||||||
|
@ -3132,10 +3137,11 @@ only backslash items that are permitted are \eQ, \eE, and sequences such as
|
||||||
are faulted.
|
are faulted.
|
||||||
.P
|
.P
|
||||||
A closing parenthesis can be included in a name either as \e) or between \eQ
|
A closing parenthesis can be included in a name either as \e) or between \eQ
|
||||||
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
|
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
|
||||||
also set, unescaped whitespace in verb names is skipped, and #-comments are
|
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
|
||||||
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
|
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
|
||||||
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
|
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
|
||||||
|
PCRE2_ALT_VERBNAMES is also set.
|
||||||
.P
|
.P
|
||||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||||
|
@ -3614,6 +3620,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 28 July 2018
|
Last updated: 03 August 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "28 July 2018" "PCRE2 10.32"
|
.TH PCRE2SYNTAX 3 "01 August 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -421,6 +421,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||||
.
|
.
|
||||||
.SH "OPTION SETTING"
|
.SH "OPTION SETTING"
|
||||||
.rs
|
.rs
|
||||||
|
Changes of these options within a group are automatically cancelled at the end
|
||||||
|
of the group.
|
||||||
.sp
|
.sp
|
||||||
(?i) caseless
|
(?i) caseless
|
||||||
(?J) allow duplicate names
|
(?J) allow duplicate names
|
||||||
|
@ -619,6 +621,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 28 July 2018
|
Last updated: 01 August 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@@ -2468,11 +2468,17 @@ while (ptr < ptrend)
 /* EITHER: not both options set */
 ((options & (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) !=
 (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) ||
-/* OR: character > 255 */
-c > 255 ||
-/* OR: not a # comment or white space */
-(c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0)
-))
+#ifdef SUPPORT_UNICODE
+/* OR: character > 255 AND not Unicode Pattern White Space */
+(c > 255 && (c|1) != 0x200f && (c|1) != 0x2029) ||
+#endif
+/* OR: not a # comment or isspace() white space */
+(c < 256 && c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0
+#ifdef SUPPORT_UNICODE
+/* and not CHAR_NEL when Unicode is supported */
+&& c != CHAR_NEL
+#endif
+)))
 {
 PCRE2_SIZE verbnamelength;
 
@@ -2554,11 +2560,18 @@ while (ptr < ptrend)
 
 /* Skip over whitespace and # comments in extended mode. Note that c is a
 character, not a code unit, so we must not use MAX_255 to test its size
-because MAX_255 tests code units and is assumed TRUE in 8-bit mode. */
+because MAX_255 tests code units and is assumed TRUE in 8-bit mode. The
+whitespace characters are those designated as "Pattern White Space" by
+Unicode, which are the isspace() characters plus CHAR_NEL (newline), which is
+U+0085 in Unicode, plus U+200E, U+200F, U+2028, and U+2029. These are a
+subset of space characters that match \h and \v. */
 
 if ((options & PCRE2_EXTENDED) != 0)
 {
 if (c < 256 && (cb->ctypes[c] & ctype_space) != 0) continue;
+#ifdef SUPPORT_UNICODE
+if (c == CHAR_NEL || (c|1) == 0x200f || (c|1) == 0x2029) continue;
+#endif
 if (c == CHAR_NUMBER_SIGN)
 {
 while (ptr < ptrend)
|
|
|
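The `(c|1) == 0x200f` and `(c|1) == 0x2029` tests in the hunks above rely on each pair of code points differing only in the lowest bit: setting that bit maps U+200E and U+200F to 0x200F, and U+2028 and U+2029 to 0x2029. The sketch below isolates that trick and checks it against an explicit list; the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two comparisons cover four code points by forcing the low bit on. */
static bool is_mark_or_separator_fast(uint32_t c)
{
    return (c | 1) == 0x200F || (c | 1) == 0x2029;
}

/* The same set written out explicitly, for comparison. */
static bool is_mark_or_separator_explicit(uint32_t c)
{
    return c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
}

/* Usage: the two forms agree for every Unicode code point. */
void low_bit_trick_demo(void)
{
    for (uint32_t c = 0; c <= 0x10FFFF; c++)
        assert(is_mark_or_separator_fast(c) == is_mark_or_separator_explicit(c));
}
```

U+0085 has no such partner in the set, which is presumably why the source compares it separately as CHAR_NEL.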
@ -2294,4 +2294,19 @@
|
||||||
/[\N{U+1234}]/utf
|
/[\N{U+1234}]/utf
|
||||||
\x{1234}
|
\x{1234}
|
||||||
|
|
||||||
|
# Test the full list of Unicode "Pattern White Space" characters that are to
|
||||||
|
# be ignored by /x. The pattern lines below may show up oddly in text editors
|
||||||
|
# or when listed to the screen. Note that characters such as U+2002, which are
|
||||||
|
# matched as space by \h and \v are *not* "Pattern White Space".
|
||||||
|
|
||||||
|
/A
B/x,utf
|
||||||
|
AB
|
||||||
|
|
||||||
|
/A B/x,utf
|
||||||
|
A\x{2002}B
|
||||||
|
\= Expect no match
|
||||||
|
AB
|
||||||
|
|
||||||
|
# -------
|
||||||
|
|
||||||
# End of testinput4
|
# End of testinput4
|
||||||
|
|
|
@ -2091,4 +2091,18 @@
|
||||||
|
|
||||||
/\N{U}/
|
/\N{U}/
|
||||||
|
|
||||||
|
# This tests the non-UTF Unicode NEL pattern whitespace character, only
|
||||||
|
# recognized by PCRE2 with /x when there is Unicode support.
|
||||||
|
|
||||||
|
/A
|
||||||
|
…B/x
|
||||||
|
AB
|
||||||
|
|
||||||
|
# This tests Unicode Pattern White Space characters in verb names when they
|
||||||
|
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
|
||||||
|
# with code points greater than 255 between A, B, and C in the pattern.
|
||||||
|
|
||||||
|
/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
|
||||||
|
abc
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|
|
@ -3712,4 +3712,22 @@ No match
|
||||||
\x{1234}
|
\x{1234}
|
||||||
0: \x{1234}
|
0: \x{1234}
|
||||||
|
|
||||||
|
# Test the full list of Unicode "Pattern White Space" characters that are to
|
||||||
|
# be ignored by /x. The pattern lines below may show up oddly in text editors
|
||||||
|
# or when listed to the screen. Note that characters such as U+2002, which are
|
||||||
|
# matched as space by \h and \v are *not* "Pattern White Space".
|
||||||
|
|
||||||
|
/A
B/x,utf
|
||||||
|
AB
|
||||||
|
0: AB
|
||||||
|
|
||||||
|
/A B/x,utf
|
||||||
|
A\x{2002}B
|
||||||
|
0: A\x{2002}B
|
||||||
|
\= Expect no match
|
||||||
|
AB
|
||||||
|
No match
|
||||||
|
|
||||||
|
# -------
|
||||||
|
|
||||||
# End of testinput4
|
# End of testinput4
|
||||||
|
|
|
@ -4756,4 +4756,21 @@ Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
|
||||||
/\N{U}/
|
/\N{U}/
|
||||||
Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u
|
Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u
|
||||||
|
|
||||||
|
# This tests the non-UTF Unicode NEL pattern whitespace character, only
|
||||||
|
# recognized by PCRE2 with /x when there is Unicode support.
|
||||||
|
|
||||||
|
/A
|
||||||
|
…B/x
|
||||||
|
AB
|
||||||
|
0: AB
|
||||||
|
|
||||||
|
# This tests Unicode Pattern White Space characters in verb names when they
|
||||||
|
# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
|
||||||
|
# with code points greater than 255 between A, B, and C in the pattern.
|
||||||
|
|
||||||
|
/(*: A‎B
C)abc/x,utf,mark,alt_verbnames
|
||||||
|
abc
|
||||||
|
0: abc
|
||||||
|
MK: ABC
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|