Implement PCRE2_EXTRA_ALT_BSUX to support ECMAScript 6's \u{hhh..} syntax.

Philip.Hazel 2019-02-12 17:50:19 +00:00
parent d90de8b053
commit 8c8deae8eb
26 changed files with 1310 additions and 1112 deletions

@ -125,6 +125,9 @@ processing or a crash could result.
names, as Perl does. There was a small bug in this new code, found by
ClusterFuzz 12950, fixed before release.
31. Implemented PCRE2_EXTRA_ALT_BSUX to support ECMAScript 6's \u{hhh}
construct.
Version 10.32 10-September-2018
-------------------------------

@ -86,7 +86,12 @@ PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options.
</P>
<P>
The yield of the function is a pointer to a private data structure that
Additional options may be set in the compile context via the
<a href="pcre2_set_compile_extra_options.html"><b>pcre2_set_compile_extra_options</b></a>
function.
</P>
<P>
The yield of this function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
</P>
<P>

@ -20,7 +20,7 @@ SYNOPSIS
</P>
<P>
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>extra_options</i>);</b>
<b> uint32_t <i>extra_options</i>);</b>
</P>
<br><b>
DESCRIPTION
@ -31,6 +31,7 @@ housed in a compile context. It completely replaces all the bits. The extra
options are:
<pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
PCRE2_EXTRA_ALT_BSUX Extended alternate \u, \U, and \x handling
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \r as \n
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines

@ -1298,7 +1298,7 @@ are needed. The <b>pcre2_code_copy_with_tables()</b> provides this facility.
Copies of both the code and the tables are made, with the new code pointing to
the new tables. The memory for the new tables is automatically freed when
<b>pcre2_code_free()</b> is called for the new copy of the compiled code. If
<b>pcre2_code_copy_withy_tables()</b> is called with a NULL argument, it returns
<b>pcre2_code_copy_with_tables()</b> is called with a NULL argument, it returns
NULL.
</P>
<P>
@ -1315,7 +1315,7 @@ PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
</P>
<P>
The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit
settings that affect the compilation. It should be zero if no options are
settings that affect the compilation. It should be zero if none of them are
required. The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others as well) can
also be set and unset from within the pattern (see the detailed description in
@ -1330,8 +1330,9 @@ compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
options can be set at the time of matching as well as at compile time.
</P>
<P>
Other, less frequently required compile-time parameters (for example, the
newline setting) can be provided in a compile context (as described
Some additional options and less frequently required compile-time parameters
(for example, the newline setting) can be provided in a compile context (as
described
<a href="#compilecontext">above).</a>
</P>
<P>
@ -1384,7 +1385,13 @@ This code fragment shows a typical straightforward call to
&errorcode, /* for error code */
&erroffset, /* for error offset */
NULL); /* no compile context */
</pre>
</PRE>
</P>
<br><b>
Main compile options
</b><br>
<P>
The following names for option bits are defined in the <b>pcre2.h</b> header
file:
<pre>
@ -1424,6 +1431,14 @@ hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, as in Perl, a hexadecimal number is always expected after
\x, but it may have zero, one, or two digits (so, for example, \xz matches a
binary zero character followed by z).
</P>
<P>
ECMAScript 6 added additional functionality to \u. This can be accessed using
the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile options"
<a href="#extracompileoptions">below).</a>
Note that this alternative escape handling applies only to patterns. Neither of
these options affects the processing of replacement strings passed to
<b>pcre2_substitute()</b>.
<pre>
PCRE2_ALT_CIRCUMFLEX
</pre>
@ -1830,9 +1845,8 @@ characters with code points greater than 127.
Extra compile options
</b><br>
<P>
Unlike the main compile-time options, the extra options are not saved with the
compiled pattern. The option bits that can be set in a compile context by
calling the <b>pcre2_set_compile_extra_options()</b> function are as follows:
The option bits that can be set in a compile context by calling the
<b>pcre2_set_compile_extra_options()</b> function are as follows:
<pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
</pre>
@ -1857,6 +1871,14 @@ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
incorporated in the compiled pattern. However, they can only match subject
characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
<pre>
PCRE2_EXTRA_ALT_BSUX
</pre>
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and \x in
the way that ECMAScript (aka JavaScript) does. Additional functionality was
defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of
PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} as a hexadecimal
character code, where hhh.. is any number of hexadecimal digits.
<pre>
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
</pre>
@ -3382,7 +3404,8 @@ capture groups and letters within \Q...\E quoted sequences.
<P>
Note that case forcing sequences such as \U...\E do not nest. For example,
the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no
effect.
effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX options do
not apply to replacement strings.
</P>
<P>
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
@ -3784,7 +3807,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 04 February 2019
Last updated: 12 February 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

@ -47,8 +47,9 @@ non-newline character, and \N{U+dd..}, matching a Unicode code point, are
supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if the PCRE2_ALT_BSUX option is set, \U and \u
are interpreted as ECMAScript interprets them.
generated by default. However, if either of the PCRE2_ALT_BSUX or
PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are interpreted as ECMAScript
interprets them.
</P>
<P>
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
@ -233,7 +234,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 03 February 2019
Last updated: 12 February 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

@ -399,12 +399,33 @@ environment, these escapes are as follows:
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
\N{U+hhh..} character with Unicode hex code point hhh..
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
</pre>
There are some legacy applications where the escape sequence \r is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
By default, after \x that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \x{ and }. If a character other than a
hexadecimal digit appears between \x{ and }, or if there is no terminating },
an error occurs.
</P>
<P>
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \x or by an octal sequence. There is no difference in the way
they are handled. For example, \xdc is exactly the same as \x{dc} or \334.
However, using the braced versions does make such sequences easier to read.
</P>
<P>
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
by { is not recognized. Only if \x is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 256 is provided
by \u, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
</P>
<P>
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\u{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits. This syntax is from ECMAScript
6.
</P>
<P>
The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
@ -414,6 +435,12 @@ Note that when \N is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline.
</P>
<P>
There are some legacy applications where the escape sequence \r is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
</P>
<P>
The precise effect of \cx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
@ -500,28 +527,6 @@ Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
digits are ever read.
</P>
<P>
By default, after \x that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \x{ and }. If a character other than
a hexadecimal digit appears between \x{ and }, or if there is no terminating
}, an error occurs.
</P>
<P>
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode, support for code points greater
than 256 is provided by \u, which must be followed by four hexadecimal digits;
otherwise it matches a literal "u" character. This syntax makes PCRE2 behave
like ECMAscript (aka JavaScript). Code points greater than 0xFFFF are not
supported.
</P>
<P>
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no difference in
the way they are handled. For example, \xdc is exactly the same as \x{dc} (or
\u00dc in PCRE2_ALT_BSUX mode).
</P>
<br><b>
Constraints on character values
</b><br>
@ -560,9 +565,10 @@ Unsupported escape sequences
<P>
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \U matches a "U" character, and \u can be used to define a character
by code point, as described above.
does not support these escape sequences in patterns. However, if either of the
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U"
character, and \u can be used to define a character by code point, as
described above.
</P>
<br><b>
Absolute and relative backreferences
@ -3721,7 +3727,7 @@ Cambridge, England.
</P>
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
<P>
Last updated: 04 February 2019
Last updated: 12 February 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
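
Reviewer note: the \u rules documented in the pattern hunks above can be sketched as a small standalone parser. This is only an illustration of the documented behaviour, not the code in PCRE2's check_escape(). The names `hex_value` and `parse_u_escape` and the `extra` flag are hypothetical, and giving up on values above 0x10FFFF is an assumption of the sketch (the library may well report an error instead).

```c
#include <stddef.h>
#include <stdint.h>

/* Return the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical helper. p points just past "\u" in a NUL-terminated pattern.
   extra is non-zero for EXTRA_ALT_BSUX behaviour, zero for plain ALT_BSUX.
   On success the code point is stored in *cp and the number of characters
   consumed after "\u" is returned. A return of 0 means "not an escape", so
   the caller would treat the "u" as a literal character. */
size_t parse_u_escape(const char *p, int extra, uint32_t *cp)
{
  uint32_t value = 0;
  size_t i;

  /* \u{hhh..}: any number of hex digits, recognized only in extra mode. */
  if (extra && p[0] == '{')
    {
    for (i = 1; hex_value(p[i]) >= 0; i++)
      {
      value = (value << 4) | (uint32_t)hex_value(p[i]);
      if (value > 0x10FFFFu) return 0;  /* assumption: give up on overflow */
      }
    if (i > 1 && p[i] == '}')
      {
      *cp = value;
      return i + 1;
      }
    return 0;  /* no digits or no closing brace */
    }

  /* \uhhhh: exactly four hex digits in either mode. */
  for (i = 0; i < 4; i++)
    {
    int h = hex_value(p[i]);
    if (h < 0) return 0;
    value = (value << 4) | (uint32_t)h;
    }
  *cp = value;
  return 4;
}
```

With this sketch, the text after `\u` in `\u00dc` yields code point 0xdc in either mode, while `{1F600}` is accepted only when `extra` is non-zero.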

@ -58,7 +58,8 @@ documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P>
This table applies to ASCII and Unicode environments.
This table applies to ASCII and Unicode environments. An unrecognized escape
sequence causes an error.
<pre>
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII printing character
@ -70,12 +71,25 @@ This table applies to ASCII and Unicode environments.
\0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh
\x{hh..} character with hex code hh..
</pre>
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:
<pre>
\U the character "U"
\uhhhh character with hex code hhhh
\u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
</pre>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be
recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits
or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
matches a literal "u".
</P>
<P>
Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
@ -86,13 +100,6 @@ also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \N not followed by an opening
curly bracket has a different meaning (see below).
</P>
<P>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
@ -660,7 +667,7 @@ Cambridge, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
Last updated: 03 February 2019
Last updated: 11 February 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
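
Reviewer note: the difference between default \x handling and ALT_BSUX mode, as summarized in the syntax hunk above, can be sketched as follows. The function name `parse_x_escape` and its return convention are hypothetical, the braced form \x{hh..} is deliberately left out, and this is not the library's implementation.

```c
/* Return the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical helper. p points just past "\x" in a NUL-terminated pattern.
   Default mode: read zero, one, or two hex digits and return how many were
   consumed; with zero digits the character value is 0.
   ALT_BSUX mode: exactly two hex digits are required; otherwise -1 is
   returned and the caller would treat the "x" as a literal character. */
int parse_x_escape(const char *p, int alt_bsux, unsigned int *ch)
{
  unsigned int value = 0;
  int n = 0;

  while (n < 2 && hex_value(p[n]) >= 0)
    {
    value = (value << 4) | (unsigned int)hex_value(p[n]);
    n++;
    }

  if (alt_bsux && n != 2) return -1;  /* literal "x" in ALT_BSUX mode */

  *ch = value;
  return n;
}
```

In default mode the sketch reproduces the documented case where \xz is a binary zero followed by z: no digits are consumed and the value is 0.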

@ -609,6 +609,7 @@ for a description of the effects of these options.
escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF
/x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE
extra_alt_bsux set PCRE2_EXTRA_ALT_BSUX
firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
@ -2075,7 +2076,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 03 February 2019
Last updated: 11 February 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

File diff suppressed because it is too large.

@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30"
.TH PCRE2_COMPILE 3 "11 February 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -73,7 +73,13 @@ The option bits are:
PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options.
.P
The yield of the function is a pointer to a private data structure that
Additional options may be set in the compile context via the
.\" HREF
\fBpcre2_set_compile_extra_options\fP
.\"
function.
.P
The yield of this function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
.P
There is a complete description of the PCRE2 native API, with more detail on

@ -1,4 +1,4 @@
.TH PCRE2_SET_COMPILE_EXTRA_OPTIONS 3 "21 September 2018" "PCRE2 10.33"
.TH PCRE2_SET_COMPILE_EXTRA_OPTIONS 3 "11 February 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -8,7 +8,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.PP
.nf
.B int pcre2_set_compile_extra_options(pcre2_compile_context *\fIccontext\fP,
.B " PCRE2_SIZE \fIextra_options\fP);"
.B " uint32_t \fIextra_options\fP);"
.fi
.
.SH DESCRIPTION
@ -21,6 +21,9 @@ options are:
.\" JOIN
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \ex{d800} to \ex{dfff}
in UTF-8 and UTF-32 modes
.\" JOIN
PCRE2_EXTRA_ALT_BSUX Extended alternate \eu, \eU, and \ex
handling
.\" JOIN
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
a literal following character

@ -1,4 +1,4 @@
.TH PCRE2API 3 "04 February 2019" "PCRE2 10.33"
.TH PCRE2API 3 "12 February 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -1231,7 +1231,7 @@ are needed. The \fBpcre2_code_copy_with_tables()\fP provides this facility.
Copies of both the code and the tables are made, with the new code pointing to
the new tables. The memory for the new tables is automatically freed when
\fBpcre2_code_free()\fP is called for the new copy of the compiled code. If
\fBpcre2_code_copy_withy_tables()\fP is called with a NULL argument, it returns
\fBpcre2_code_copy_with_tables()\fP is called with a NULL argument, it returns
NULL.
.P
NOTE: When one of the matching functions is called, pointers to the compiled
@ -1252,7 +1252,7 @@ below.
.\"
.P
The \fIoptions\fP argument for \fBpcre2_compile()\fP contains various bit
settings that affect the compilation. It should be zero if no options are
settings that affect the compilation. It should be zero if none of them are
required. The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others as well) can
also be set and unset from within the pattern (see the detailed description in
@ -1267,8 +1267,9 @@ contents of the \fIoptions\fP argument specifies their settings at the start of
compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
options can be set at the time of matching as well as at compile time.
.P
Other, less frequently required compile-time parameters (for example, the
newline setting) can be provided in a compile context (as described
Some additional options and less frequently required compile-time parameters
(for example, the newline setting) can be provided in a compile context (as
described
.\" HTML <a href="#compilecontext">
.\" </a>
above).
@ -1325,6 +1326,11 @@ This code fragment shows a typical straightforward call to
&erroffset, /* for error offset */
NULL); /* no compile context */
.sp
.
.
.SS "Main compile options"
.rs
.sp
The following names for option bits are defined in the \fBpcre2.h\fP header
file:
.sp
@ -1361,6 +1367,16 @@ hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, as in Perl, a hexadecimal number is always expected after
\ex, but it may have zero, one, or two digits (so, for example, \exz matches a
binary zero character followed by z).
.P
ECMAScript 6 added additional functionality to \eu. This can be accessed using
the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile options"
.\" HTML <a href="#extracompileoptions">
.\" </a>
below).
.\"
Note that this alternative escape handling applies only to patterns. Neither of
these options affects the processing of replacement strings passed to
\fBpcre2_substitute()\fP.
.sp
PCRE2_ALT_CIRCUMFLEX
.sp
@ -1788,9 +1804,8 @@ characters with code points greater than 127.
.SS "Extra compile options"
.rs
.sp
Unlike the main compile-time options, the extra options are not saved with the
compiled pattern. The option bits that can be set in a compile context by
calling the \fBpcre2_set_compile_extra_options()\fP function are as follows:
The option bits that can be set in a compile context by calling the
\fBpcre2_set_compile_extra_options()\fP function are as follows:
.sp
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
.sp
@ -1813,6 +1828,14 @@ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
incorporated in the compiled pattern. However, they can only match subject
characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
.sp
PCRE2_EXTRA_ALT_BSUX
.sp
The original option PCRE2_ALT_BSUX causes PCRE2 to process \eU, \eu, and \ex in
the way that ECMAScript (aka JavaScript) does. Additional functionality was
defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of
PCRE2_ALT_BSUX, but in addition it recognizes \eu{hhh..} as a hexadecimal
character code, where hhh.. is any number of hexadecimal digits.
.sp
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
.sp
@ -3383,7 +3406,8 @@ capture groups and letters within \eQ...\eE quoted sequences.
.P
Note that case forcing sequences such as \eU...\eE do not nest. For example,
the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
effect.
effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX options do
not apply to replacement strings.
.P
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to capture group substitution. The syntax is similar to that used
@ -3792,6 +3816,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 04 February 2019
Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "03 February 2019" "PCRE2 10.33"
.TH PCRE2COMPAT 3 "12 February 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -33,8 +33,9 @@ non-newline character, and \eN{U+dd..}, matching a Unicode code point, are
supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if the PCRE2_ALT_BSUX option is set, \eU and \eu
are interpreted as ECMAScript interprets them.
generated by default. However, if either of the PCRE2_ALT_BSUX or
PCRE2_EXTRA_ALT_BSUX options is set, \eU and \eu are interpreted as ECMAScript
interprets them.
.P
5. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested
@ -198,6 +199,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 February 2019
Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "04 February 2019" "PCRE2 10.33"
.TH PCRE2PATTERN 3 "12 February 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -373,12 +373,30 @@ environment, these escapes are as follows:
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh..
\eN{U+hhh..} character with Unicode hex code point hhh..
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
.sp
There are some legacy applications where the escape sequence \er is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
pattern is converted to \en so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
By default, after \ex that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \ex{ and }. If a character other than a
hexadecimal digit appears between \ex{ and }, or if there is no terminating },
an error occurs.
.P
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \ex or by an octal sequence. There is no difference in the way
they are handled. For example, \exdc is exactly the same as \ex{dc} or \e334.
However, using the braced versions does make such sequences easier to read.
.P
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \ex followed
by { is not recognized. Only if \ex is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 256 is provided
by \eu, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
.P
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\eu{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits. This syntax is from ECMAScript
6.
.P
The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
@ -386,6 +404,11 @@ is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
Note that when \eN is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline.
.P
There are some legacy applications where the escape sequence \er is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
pattern is converted to \en so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
.P
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
@ -477,25 +500,6 @@ for themselves. For example, outside a character class:
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
digits are ever read.
.P
By default, after \ex that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \ex{ and }. If a character other than
a hexadecimal digit appears between \ex{ and }, or if there is no terminating
}, an error occurs.
.P
If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode, support for code points greater
than 256 is provided by \eu, which must be followed by four hexadecimal digits;
otherwise it matches a literal "u" character. This syntax makes PCRE2 behave
like ECMAscript (aka JavaScript). Code points greater than 0xFFFF are not
supported.
.P
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
\eu00dc in PCRE2_ALT_BSUX mode).
.
.
.SS "Constraints on character values"
@ -534,9 +538,10 @@ character class, these sequences have different meanings.
.sp
In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \eU matches a "U" character, and \eu can be used to define a character
by code point, as described above.
does not support these escape sequences in patterns. However, if either of the
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \eU matches a "U"
character, and \eu can be used to define a character by code point, as
described above.
.
.
.SS "Absolute and relative backreferences"
@ -3758,6 +3763,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 04 February 2019
Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi
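
Reviewer note: the \ecx rule quoted as context in the pattern hunks (convert a lower case letter to upper case, then invert bit 6, hex 40) is compact enough to show directly. The helper name `control_escape` is hypothetical, and the sketch assumes an ASCII environment, as the quoted paragraph does.

```c
/* Hypothetical helper: compute the character matched by \cx for an ASCII
   character x. Lower case letters are first converted to upper case, then
   bit 6 (hex 40) is inverted. */
unsigned char control_escape(unsigned char x)
{
  if (x >= 'a' && x <= 'z')
    x = (unsigned char)(x - 'a' + 'A');
  return (unsigned char)(x ^ 0x40);
}
```

For instance, 'A' (hex 41) maps to hex 01 and 'z' maps to hex 1A, matching the range given in the documentation.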

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "03 February 2019" "PCRE2 10.33"
.TH PCRE2SYNTAX 3 "11 February 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -22,7 +22,8 @@ documentation. This document contains a quick-reference summary of the syntax.
.SH "ESCAPED CHARACTERS"
.rs
.sp
This table applies to ASCII and Unicode environments.
This table applies to ASCII and Unicode environments. An unrecognized escape
sequence causes an error.
.sp
\ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any ASCII printing character
@ -34,12 +35,24 @@ This table applies to ASCII and Unicode environments.
\e0dd character with octal code 0dd
\eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd..
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\eN{U+hh..} character with Unicode code point hh.. (Unicode mode only)
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\exhh character with hex code hh
\ex{hh..} character with hex code hh..
.sp
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:
.sp
\eU the character "U"
\euhhhh character with hex code hhhh
\eu{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
.sp
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but in ALT_BSUX mode \ex must be followed by two hexadecimal digits to be
recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits
or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
matches a literal "u".
.P
Note that \e0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
.\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
@ -54,12 +67,6 @@ documentation, where details of escape processing in EBCDIC environments are
also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \eN not followed by an opening
curly bracket has a different meaning (see below).
.P
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
.
.
.SH "CHARACTER TYPES"
@ -647,6 +654,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 February 2019
Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "03 February 2019" "PCRE 10.33"
.TH PCRE2TEST 1 "11 February 2019" "PCRE 10.33"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -568,6 +568,7 @@ for a description of the effects of these options.
escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF
/x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE
extra_alt_bsux set PCRE2_EXTRA_ALT_BSUX
firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
@ -2056,6 +2057,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 February 2019
Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -547,6 +547,7 @@ PATTERN MODIFIERS
escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF
/x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE
extra_alt_bsux set PCRE2_EXTRA_ALT_BSUX
firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
@ -1887,5 +1888,5 @@ AUTHOR
REVISION
Last updated: 03 February 2019
Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge.


@ -150,6 +150,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_EXTRA_MATCH_WORD 0x00000004u /* C */
#define PCRE2_EXTRA_MATCH_LINE 0x00000008u /* C */
#define PCRE2_EXTRA_ESCAPED_CR_IS_LF 0x00000010u /* C */
#define PCRE2_EXTRA_ALT_BSUX 0x00000020u /* C */
/* These are for pcre2_jit_compile(). */


@ -764,7 +764,7 @@ are allowed. */
#define PUBLIC_COMPILE_EXTRA_OPTIONS \
(PUBLIC_LITERAL_COMPILE_EXTRA_OPTIONS| \
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES|PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL| \
PCRE2_EXTRA_ESCAPED_CR_IS_LF)
PCRE2_EXTRA_ESCAPED_CR_IS_LF|PCRE2_EXTRA_ALT_BSUX)
/* Compile time error code numbers. They are given names so that they can more
easily be tracked. When a new number is added, the tables called eint1 and
@ -1459,7 +1459,8 @@ Returns: zero => a data character
int
PRIV(check_escape)(PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend, uint32_t *chptr,
int *errorcodeptr, uint32_t options, BOOL isclass, compile_block *cb)
int *errorcodeptr, uint32_t options, uint32_t extra_options, BOOL isclass,
compile_block *cb)
{
BOOL utf = (options & PCRE2_UTF) != 0;
PCRE2_SPTR ptr = *ptrptr;
@ -1495,8 +1496,7 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
if (i > 0)
{
c = (uint32_t)i;
if (cb != NULL && c == CHAR_CR &&
(cb->cx->extra_options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)
if (c == CHAR_CR && (extra_options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)
c = CHAR_LF;
}
else /* Negative table entry */
@ -1551,22 +1551,28 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
/* Escapes that need further processing, including those that are unknown, have
a zero entry in the lookup table. When called from pcre2_substitute(), only \c,
\o, and \x are recognized (and \u when BSUX is set). */
\o, and \x are recognized (\u and \U can never appear as they are used for case
forcing). */
else
{
int s;
PCRE2_SPTR oldptr;
BOOL overflow;
int s;
BOOL alt_bsux =
((options & PCRE2_ALT_BSUX) | (extra_options & PCRE2_EXTRA_ALT_BSUX)) != 0;
/* Filter calls from pcre2_substitute(). */
if (cb == NULL && c != CHAR_c && c != CHAR_o && c != CHAR_x &&
(c != CHAR_u || (options & PCRE2_ALT_BSUX) != 0))
if (cb == NULL)
{
*errorcodeptr = ERR3;
return 0;
}
if (c != CHAR_c && c != CHAR_o && c != CHAR_x)
{
*errorcodeptr = ERR3;
return 0;
}
alt_bsux = FALSE; /* Do not modify \x handling */
}
switch (c)
{
@ -1579,40 +1585,74 @@ else
*errorcodeptr = ERR37;
break;
/* \u is unrecognized when PCRE2_ALT_BSUX is not set. When it is treated
specially, \u must be followed by four hex digits. Otherwise it is a
lowercase u letter. */
/* \u is unrecognized when neither PCRE2_ALT_BSUX nor PCRE2_EXTRA_ALT_BSUX
is set. Otherwise, \u must be followed by exactly four hex digits or, if
PCRE2_EXTRA_ALT_BSUX is set, by any number of hex digits in braces.
Otherwise it is a lowercase u letter. This gives some compatibility with
ECMAScript (aka JavaScript). */
case CHAR_u:
if ((options & PCRE2_ALT_BSUX) == 0) *errorcodeptr = ERR37; else
if (!alt_bsux) *errorcodeptr = ERR37; else
{
uint32_t xc;
if (ptrend - ptr < 4) break; /* Less than 4 chars */
if ((cc = XDIGIT(ptr[0])) == 0xff) break; /* Not a hex digit */
if ((xc = XDIGIT(ptr[1])) == 0xff) break; /* Not a hex digit */
cc = (cc << 4) | xc;
if ((xc = XDIGIT(ptr[2])) == 0xff) break; /* Not a hex digit */
cc = (cc << 4) | xc;
if ((xc = XDIGIT(ptr[3])) == 0xff) break; /* Not a hex digit */
c = (cc << 4) | xc;
ptr += 4;
if (*ptr == CHAR_LEFT_CURLY_BRACKET &&
(extra_options & PCRE2_EXTRA_ALT_BSUX) != 0)
{
PCRE2_SPTR hptr = ptr + 1;
cc = 0;
while (hptr < ptrend && (xc = XDIGIT(*hptr)) != 0xff)
{
if ((cc & 0xf0000000) != 0) /* Test for 32-bit overflow */
{
*errorcodeptr = ERR77;
ptr = hptr; /* Show where */
break; /* *hptr != } will cause another break below */
}
cc = (cc << 4) | xc;
hptr++;
}
if (hptr == ptr + 1 || /* No hex digits */
hptr >= ptrend || /* Hit end of input */
*hptr != CHAR_RIGHT_CURLY_BRACKET) /* No } terminator */
break; /* Hex escape not recognized */
c = cc; /* Accept the code point */
ptr = hptr + 1;
}
else /* Must be exactly 4 hex digits */
{
if (ptrend - ptr < 4) break; /* Less than 4 chars */
if ((cc = XDIGIT(ptr[0])) == 0xff) break; /* Not a hex digit */
if ((xc = XDIGIT(ptr[1])) == 0xff) break; /* Not a hex digit */
cc = (cc << 4) | xc;
if ((xc = XDIGIT(ptr[2])) == 0xff) break; /* Not a hex digit */
cc = (cc << 4) | xc;
if ((xc = XDIGIT(ptr[3])) == 0xff) break; /* Not a hex digit */
c = (cc << 4) | xc;
ptr += 4;
}
if (utf)
{
if (c > 0x10ffffU) *errorcodeptr = ERR77;
else
if (c >= 0xd800 && c <= 0xdfff &&
(cb->cx->extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)
*errorcodeptr = ERR73;
(extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)
*errorcodeptr = ERR73;
}
else if (c > MAX_NON_UTF_CHAR) *errorcodeptr = ERR77;
}
break;
/* \U is unrecognized unless PCRE2_ALT_BSUX is set, in which case it is an
upper case letter. */
/* \U is unrecognized unless PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set,
in which case it is an upper case letter. */
case CHAR_U:
if ((options & PCRE2_ALT_BSUX) == 0) *errorcodeptr = ERR37;
if (!alt_bsux) *errorcodeptr = ERR37;
break;
/* In a character class, \g is just a literal "g". Outside a character
@ -1791,8 +1831,8 @@ else
}
else if (ptr < ptrend && *ptr++ == CHAR_RIGHT_CURLY_BRACKET)
{
if (utf && c >= 0xd800 && c <= 0xdfff && (cb == NULL ||
(cb->cx->extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0))
if (utf && c >= 0xd800 && c <= 0xdfff &&
(extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)
{
ptr--;
*errorcodeptr = ERR73;
@ -1806,11 +1846,11 @@ else
}
break;
/* \x is complicated. When PCRE2_ALT_BSUX is set, \x must be followed by
two hexadecimal digits. Otherwise it is a lowercase x letter. */
/* When PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set, \x must be followed
by two hexadecimal digits. Otherwise it is a lowercase x letter. */
case CHAR_x:
if ((options & PCRE2_ALT_BSUX) != 0)
if (alt_bsux)
{
uint32_t xc;
if (ptrend - ptr < 2) break; /* Less than 2 characters */
@ -1818,9 +1858,9 @@ else
if ((xc = XDIGIT(ptr[1])) == 0xff) break; /* Not a hex digit */
c = (cc << 4) | xc;
ptr += 2;
} /* End PCRE2_ALT_BSUX handling */
}
/* Handle \x in Perl's style. \x{ddd} is a character number which can be
/* Handle \x in Perl's style. \x{ddd} is a character code which can be
greater than 0xff in UTF-8 or non-8bit mode, but only if the ddd are hex
digits. If not, { used to be treated as a data character. However, Perl
seems to read hex digits up to the first non-such, and ignore the rest, so
@ -1864,8 +1904,8 @@ else
}
else if (ptr < ptrend && *ptr++ == CHAR_RIGHT_CURLY_BRACKET)
{
if (utf && c >= 0xd800 && c <= 0xdfff && (cb == NULL ||
(cb->cx->extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0))
if (utf && c >= 0xd800 && c <= 0xdfff &&
(extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)
{
ptr--;
*errorcodeptr = ERR73;
@ -2438,6 +2478,7 @@ uint32_t *parsed_pattern = cb->parsed_pattern;
uint32_t *parsed_pattern_end = cb->parsed_pattern_end;
uint32_t meta_quantifier = 0;
uint32_t add_after_mark = 0;
uint32_t extra_options = cb->cx->extra_options;
uint16_t nest_depth = 0;
int after_manual_callout = 0;
int expect_cond_assert = 0;
@ -2461,12 +2502,12 @@ nest_save *top_nest, *end_nests;
/* Insert leading items for word and line matching (features provided for the
benefit of pcre2grep). */
if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_LINE) != 0)
if ((extra_options & PCRE2_EXTRA_MATCH_LINE) != 0)
{
*parsed_pattern++ = META_CIRCUMFLEX;
*parsed_pattern++ = META_NOCAPTURE;
}
else if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_WORD) != 0)
else if ((extra_options & PCRE2_EXTRA_MATCH_WORD) != 0)
{
*parsed_pattern++ = META_ESCAPE + ESC_b;
*parsed_pattern++ = META_NOCAPTURE;
@ -2631,7 +2672,7 @@ while (ptr < ptrend)
if ((options & PCRE2_ALT_VERBNAMES) != 0)
{
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options,
FALSE, cb);
cb->cx->extra_options, FALSE, cb);
if (errorcode != 0) goto FAILED;
}
else escape = 0; /* Treat all as literal */
@ -2821,11 +2862,11 @@ while (ptr < ptrend)
case CHAR_BACKSLASH:
tempptr = ptr;
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options,
FALSE, cb);
cb->cx->extra_options, FALSE, cb);
if (errorcode != 0)
{
ESCAPE_FAILED:
if ((cb->cx->extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
if ((extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
goto FAILED;
ptr = tempptr;
if (ptr >= ptrend) c = CHAR_BACKSLASH; else
@ -3382,12 +3423,12 @@ while (ptr < ptrend)
else
{
tempptr = ptr;
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
options, TRUE, cb);
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options,
cb->cx->extra_options, TRUE, cb);
if (errorcode != 0)
{
if ((cb->cx->extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
if ((extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
goto FAILED;
ptr = tempptr;
if (ptr >= ptrend) c = CHAR_BACKSLASH; else
@ -4545,12 +4586,12 @@ parsed_pattern = manage_callouts(ptr, &previous_callout, auto_callout,
/* Insert trailing items for word and line matching (features provided for the
benefit of pcre2grep). */
if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_LINE) != 0)
if ((extra_options & PCRE2_EXTRA_MATCH_LINE) != 0)
{
*parsed_pattern++ = META_KET;
*parsed_pattern++ = META_DOLLAR;
}
else if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_WORD) != 0)
else if ((extra_options & PCRE2_EXTRA_MATCH_WORD) != 0)
{
*parsed_pattern++ = META_KET;
*parsed_pattern++ = META_ESCAPE + ESC_b;
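
Reviewer's sketch (not part of the commit): the new \u handling in check_escape() can be read in isolation as the standalone function below. It is a minimal illustration under stated assumptions, not the library's code. The names hex_value() and parse_bsu() are hypothetical, plain char stands in for PCRE2's code-unit types, and an explicit bounds check is made before looking at the opening brace. The UTF range and surrogate checks that follow in the real code are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): value of a hex digit, or 0xff
if the character is not a hex digit. Plays the role of XDIGIT() in the diff. */
static uint32_t hex_value(char ch)
{
if (ch >= '0' && ch <= '9') return (uint32_t)(ch - '0');
if (ch >= 'a' && ch <= 'f') return (uint32_t)(ch - 'a' + 10);
if (ch >= 'A' && ch <= 'F') return (uint32_t)(ch - 'A' + 10);
return 0xff;
}

/* Hypothetical sketch of the \u logic. p points just after the 'u' and end
points at the end of the pattern. If extended is non-zero, the \u{hhh..} form
is accepted as well as the four-digit form.

Returns the number of characters consumed after the 'u' and stores the code
point in *out; returns 0 if the sequence is not recognized (the caller then
treats it as a literal "u"); returns -1 on 32-bit overflow. */
static int parse_bsu(const char *p, const char *end, int extended,
  uint32_t *out)
{
uint32_t cc = 0;
uint32_t xc;

assert(p != NULL && end != NULL && out != NULL && p <= end);

if (extended && p < end && *p == '{')
  {
  const char *h = p + 1;
  while (h < end && (xc = hex_value(*h)) != 0xff)
    {
    if ((cc & 0xf0000000u) != 0) return -1;  /* Next shift would overflow */
    cc = (cc << 4) | xc;
    h++;
    }
  if (h == p + 1 ||      /* No hex digits */
      h >= end ||        /* Hit end of input */
      *h != '}')         /* No } terminator */
    return 0;
  *out = cc;
  return (int)(h + 1 - p);
  }

/* Otherwise there must be exactly four hex digits. */
if (end - p < 4) return 0;
for (int i = 0; i < 4; i++)
  {
  if ((xc = hex_value(p[i])) == 0xff) return 0;
  cc = (cc << 4) | xc;
  }
*out = cc;
return 4;
}
```

Called on "{7a}" with extended set, the sketch yields code point 0x7a, which corresponds to the new test /^\u{7a}/extra_alt_bsux matching "zoo". With extended clear, the same input is not recognized, which corresponds to /^\u{7a}/alt_bsux matching the literal text "u{7a}". Leading zeros do not trigger the overflow test because the accumulated value stays at zero, as exercised by the \u{0000000000010ffff} test in testinput5.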


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge
New API code Copyright (c) 2016-2019 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -1942,7 +1942,7 @@ is available. */
extern int _pcre2_auto_possessify(PCRE2_UCHAR *, BOOL,
const compile_block *);
extern int _pcre2_check_escape(PCRE2_SPTR *, PCRE2_SPTR, uint32_t *,
int *, uint32_t, BOOL, compile_block *);
int *, uint32_t, uint32_t, BOOL, compile_block *);
extern PCRE2_SPTR _pcre2_extuni(uint32_t, PCRE2_SPTR, PCRE2_SPTR, PCRE2_SPTR,
BOOL, int *);
extern PCRE2_SPTR _pcre2_find_bracket(PCRE2_SPTR, BOOL, int);


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge
New API code Copyright (c) 2016-2019 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -129,7 +129,7 @@ for (; ptr < ptrend; ptr++)
ptr += 1; /* Must point after \ */
erc = PRIV(check_escape)(&ptr, ptrend, &ch, &errorcode,
code->overall_options, FALSE, NULL);
code->overall_options, code->extra_options, FALSE, NULL);
ptr -= 1; /* Back to last code unit of escape */
if (errorcode != 0)
{
@ -774,7 +774,7 @@ do
ptr++; /* Point after \ */
rc = PRIV(check_escape)(&ptr, repend, &ch, &errorcode,
code->overall_options, FALSE, NULL);
code->overall_options, code->extra_options, FALSE, NULL);
if (errorcode != 0) goto BADESCAPE;
switch(rc)


@ -646,6 +646,7 @@ static modstruct modlist[] = {
{ "expand", MOD_PAT, MOD_CTL, CTL_EXPAND, PO(control) },
{ "extended", MOD_PATP, MOD_OPT, PCRE2_EXTENDED, PO(options) },
{ "extended_more", MOD_PATP, MOD_OPT, PCRE2_EXTENDED_MORE, PO(options) },
{ "extra_alt_bsux", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ALT_BSUX, CO(extra_options) },
{ "find_limits", MOD_DAT, MOD_CTL, CTL_FINDLIMITS, DO(control) },
{ "firstline", MOD_PAT, MOD_OPT, PCRE2_FIRSTLINE, PO(options) },
{ "framesize", MOD_PAT, MOD_CTL, CTL_FRAMESIZE, PO(control) },
@ -4189,10 +4190,11 @@ show_compile_extra_options(uint32_t options, const char *before,
const char *after)
{
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s%s%s%s%s",
else fprintf(outfile, "%s%s%s%s%s%s%s%s",
before,
((options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0)? " allow_surrogate_escapes" : "",
((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "",
((options & PCRE2_EXTRA_ALT_BSUX) != 0)? " extra_alt_bsux" : "",
((options & PCRE2_EXTRA_MATCH_WORD) != 0)? " match_word" : "",
((options & PCRE2_EXTRA_MATCH_LINE) != 0)? " match_line" : "",
((options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)? " escaped_cr_is_lf" : "",

26
testdata/testinput2 vendored

@ -2408,13 +2408,13 @@
\= Expect no match
cat
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat
/TA]/
The ACTA] comes
/TA]/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/TA]/allow_empty_class,match_unset_backref,dupnames
The ACTA] comes
/(?2)[]a()b](abc)/
@ -2446,25 +2446,25 @@
/a[^]b/
/a[]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[]b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match
ab
/a[]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[]+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match
ab
/a[]*+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[]*+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match
ab
/a[^]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[^]b/allow_empty_class,match_unset_backref,dupnames
aXb
a\nb
\= Expect no match
ab
/a[^]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[^]+b/allow_empty_class,match_unset_backref,dupnames
aXb
a\nX\nXb
\= Expect no match
@ -2903,10 +2903,10 @@
xxxxabcde\=ps
xxxxabcde\=ph
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
/(\3)(\1)(a)/I,allow_empty_class,match_unset_backref,dupnames
cat
/(\3)(\1)(a)/I
@ -3418,6 +3418,14 @@
aU0041z
\= Expect no match
aAz
/^\u{7a}/alt_bsux
u{7a}
\= Expect no match
zoo
/^\u{7a}/extra_alt_bsux
zoo
/(?(?=c)c|d)++Y/B

7
testdata/testinput5 vendored

@ -333,13 +333,13 @@
/[[:a\x{100}b:]]/utf
/a[^]b/utf,alt_bsux,allow_empty_class,match_unset_backref
/a[^]b/utf,allow_empty_class,match_unset_backref
a\x{1234}b
a\nb
\= Expect no match
ab
/a[^]+b/utf,alt_bsux,allow_empty_class,match_unset_backref
/a[^]+b/utf,allow_empty_class,match_unset_backref
aXb
a\nX\nX\x{1234}b
\= Expect no match
@ -814,6 +814,9 @@
/\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref
/^\u{0000000000010ffff}/utf,extra_alt_bsux
\x{10ffff}
/^a+[a\x{200}]/B,utf
aa

31
testdata/testoutput2 vendored

@ -8774,7 +8774,7 @@ No match
cat
No match
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat
0: a
1:
@ -8785,7 +8785,7 @@ No match
The ACTA] comes
0: TA]
/TA]/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/TA]/allow_empty_class,match_unset_backref,dupnames
The ACTA] comes
0: TA]
@ -8833,22 +8833,22 @@ Failed: error 106 at offset 4: missing terminating ] for character class
/a[^]b/
Failed: error 106 at offset 5: missing terminating ] for character class
/a[]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[]b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match
ab
No match
/a[]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[]+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match
ab
No match
/a[]*+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[]*+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match
ab
No match
/a[^]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[^]b/allow_empty_class,match_unset_backref,dupnames
aXb
0: aXb
a\nb
@ -8857,7 +8857,7 @@ No match
ab
No match
/a[^]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/a[^]+b/allow_empty_class,match_unset_backref,dupnames
aXb
0: aXb
a\nX\nXb
@ -9971,17 +9971,17 @@ Partial match: abca
xxxxabcde\=ph
Partial match: abcde
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames
/(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat
0: a
1:
2:
3: a
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
/(\3)(\1)(a)/I,allow_empty_class,match_unset_backref,dupnames
Capture group count = 3
Max back reference = 3
Options: alt_bsux allow_empty_class dupnames match_unset_backref
Options: allow_empty_class dupnames match_unset_backref
Last code unit = 'a'
Subject length lower bound = 1
cat
@ -11364,6 +11364,17 @@ No match
\= Expect no match
aAz
No match
/^\u{7a}/alt_bsux
u{7a}
0: u{7a}
\= Expect no match
zoo
No match
/^\u{7a}/extra_alt_bsux
zoo
0: z
/(?(?=c)c|d)++Y/B
------------------------------------------------------------------


@ -798,7 +798,7 @@ No match
/[[:a\x{100}b:]]/utf
Failed: error 130 at offset 3: unknown POSIX class name
/a[^]b/utf,alt_bsux,allow_empty_class,match_unset_backref
/a[^]b/utf,allow_empty_class,match_unset_backref
a\x{1234}b
0: a\x{1234}b
a\nb
@ -807,7 +807,7 @@ Failed: error 130 at offset 3: unknown POSIX class name
ab
No match
/a[^]+b/utf,alt_bsux,allow_empty_class,match_unset_backref
/a[^]+b/utf,allow_empty_class,match_unset_backref
aXb
0: aXb
a\nX\nX\x{1234}b
@ -1734,6 +1734,10 @@ No match
/\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref
Failed: error 173 at offset 6: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
/^\u{0000000000010ffff}/utf,extra_alt_bsux
\x{10ffff}
0: \x{10ffff}
/^a+[a\x{200}]/B,utf
------------------------------------------------------------------
Bra