Implement PCRE2_EXTRA_ALT_BSUX to support ECMAScript 6's \u{hhh..} syntax.

Philip.Hazel 2019-02-12 17:50:19 +00:00
parent d90de8b053
commit 8c8deae8eb
26 changed files with 1310 additions and 1112 deletions


@ -125,6 +125,9 @@ processing or a crash could result.
names, as Perl does. There was a small bug in this new code, found by names, as Perl does. There was a small bug in this new code, found by
ClusterFuzz 12950, fixed before release. ClusterFuzz 12950, fixed before release.
31. Implemented PCRE2_EXTRA_ALT_BSUX to support ECMAScript 6's \u{hhh}
construct.
Version 10.32 10-September-2018 Version 10.32 10-September-2018
------------------------------- -------------------------------


@ -86,7 +86,12 @@ PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options. PCRE2_UTF, PCRE2_UCP and related options.
</P> </P>
<P> <P>
The yield of the function is a pointer to a private data structure that Additional options may be set in the compile context via the
<a href="pcre2_set_compile_extra_options.html"><b>pcre2_set_compile_extra_options</b></a>
function.
</P>
<P>
The yield of this function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected. contains the compiled pattern, or NULL if an error was detected.
</P> </P>
<P> <P>


@ -20,7 +20,7 @@ SYNOPSIS
</P> </P>
<P> <P>
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b> <b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>extra_options</i>);</b> <b> uint32_t <i>extra_options</i>);</b>
</P> </P>
<br><b> <br><b>
DESCRIPTION DESCRIPTION
@ -31,6 +31,7 @@ housed in a compile context. It completely replaces all the bits. The extra
options are: options are:
<pre> <pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
PCRE2_EXTRA_ALT_BSUX Extended alternate \u, \U, and \x handling
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \r as \n PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \r as \n
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines


@ -1298,7 +1298,7 @@ are needed. The <b>pcre2_code_copy_with_tables()</b> provides this facility.
Copies of both the code and the tables are made, with the new code pointing to Copies of both the code and the tables are made, with the new code pointing to
the new tables. The memory for the new tables is automatically freed when the new tables. The memory for the new tables is automatically freed when
<b>pcre2_code_free()</b> is called for the new copy of the compiled code. If <b>pcre2_code_free()</b> is called for the new copy of the compiled code. If
<b>pcre2_code_copy_withy_tables()</b> is called with a NULL argument, it returns <b>pcre2_code_copy_with_tables()</b> is called with a NULL argument, it returns
NULL. NULL.
</P> </P>
<P> <P>
@ -1315,7 +1315,7 @@ PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
</P> </P>
<P> <P>
The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit
settings that affect the compilation. It should be zero if no options are settings that affect the compilation. It should be zero if none of them are
required. The available options are described below. Some of them (in required. The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others as well) can particular, those that are compatible with Perl, but some others as well) can
also be set and unset from within the pattern (see the detailed description in also be set and unset from within the pattern (see the detailed description in
@ -1330,8 +1330,9 @@ compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
options can be set at the time of matching as well as at compile time. options can be set at the time of matching as well as at compile time.
</P> </P>
<P> <P>
Other, less frequently required compile-time parameters (for example, the Some additional options and less frequently required compile-time parameters
newline setting) can be provided in a compile context (as described (for example, the newline setting) can be provided in a compile context (as
described
<a href="#compilecontext">above).</a> <a href="#compilecontext">above).</a>
</P> </P>
<P> <P>
@ -1384,7 +1385,13 @@ This code fragment shows a typical straightforward call to
&errorcode, /* for error code */ &errorcode, /* for error code */
&erroffset, /* for error offset */ &erroffset, /* for error offset */
NULL); /* no compile context */ NULL); /* no compile context */
</pre>
</PRE>
</P>
<br><b>
Main compile options
</b><br>
<P>
The following names for option bits are defined in the <b>pcre2.h</b> header The following names for option bits are defined in the <b>pcre2.h</b> header
file: file:
<pre> <pre>
@ -1424,6 +1431,14 @@ hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, as in Perl, a hexadecimal number is always expected after to match. By default, as in Perl, a hexadecimal number is always expected after
\x, but it may have zero, one, or two digits (so, for example, \xz matches a \x, but it may have zero, one, or two digits (so, for example, \xz matches a
binary zero character followed by z). binary zero character followed by z).
</P>
<P>
ECMAScript 6 added further functionality to \u. This can be accessed using
the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile options"
<a href="#extracompileoptions">below).</a>
Note that this alternative escape handling applies only to patterns. Neither of
these options affects the processing of replacement strings passed to
<b>pcre2_substitute()</b>.
<pre> <pre>
PCRE2_ALT_CIRCUMFLEX PCRE2_ALT_CIRCUMFLEX
</pre> </pre>
@ -1830,9 +1845,8 @@ characters with code points greater than 127.
Extra compile options Extra compile options
</b><br> </b><br>
<P> <P>
Unlike the main compile-time options, the extra options are not saved with the The option bits that can be set in a compile context by calling the
compiled pattern. The option bits that can be set in a compile context by <b>pcre2_set_compile_extra_options()</b> function are as follows:
calling the <b>pcre2_set_compile_extra_options()</b> function are as follows:
<pre> <pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
</pre> </pre>
@ -1857,6 +1871,14 @@ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
incorporated in the compiled pattern. However, they can only match subject incorporated in the compiled pattern. However, they can only match subject
characters if the matching function is called with PCRE2_NO_UTF_CHECK set. characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
<pre>
PCRE2_EXTRA_ALT_BSUX
</pre>
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and \x in
the way that ECMAScript (aka JavaScript) does. Additional functionality was
defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of
PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} as a hexadecimal
character code, where hhh.. is any number of hexadecimal digits.
<pre> <pre>
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
</pre> </pre>
@ -3382,7 +3404,8 @@ capture groups and letters within \Q...\E quoted sequences.
<P> <P>
Note that case forcing sequences such as \U...\E do not nest. For example, Note that case forcing sequences such as \U...\E do not nest. For example,
the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no
effect. effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX options do
not apply to replacement strings.
</P> </P>
<P> <P>
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
@ -3784,7 +3807,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 04 February 2019 Last updated: 12 February 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -47,8 +47,9 @@ non-newline character, and \N{U+dd..}, matching a Unicode code point, are
supported. The escapes that modify the case of following letters are supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if the PCRE2_ALT_BSUX option is set, \U and \u generated by default. However, if either of the PCRE2_ALT_BSUX or
are interpreted as ECMAScript interprets them. PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are interpreted as ECMAScript
interprets them.
</P> </P>
<P> <P>
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
@ -233,7 +234,7 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 03 February 2019 Last updated: 12 February 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -399,12 +399,33 @@ environment, these escapes are as follows:
\xhh character with hex code hh \xhh character with hex code hh
\x{hhh..} character with hex code hhh.. \x{hhh..} character with hex code hhh..
\N{U+hhh..} character with Unicode hex code point hhh.. \N{U+hhh..} character with Unicode hex code point hhh..
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
</pre> </pre>
There are some legacy applications where the escape sequence \r is expected to By default, after \x that is not followed by {, from zero to two hexadecimal
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a digits are read (letters can be in upper or lower case). Any number of
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR hexadecimal digits may appear between \x{ and }. If a character other than a
(carriage return) character. hexadecimal digit appears between \x{ and }, or if there is no terminating },
an error occurs.
</P>
<P>
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \x or by an octal sequence. There is no difference in the way
they are handled. For example, \xdc is exactly the same as \x{dc} or \334.
However, using the braced versions does make such sequences easier to read.
</P>
<P>
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
by { is not recognized. Only if \x is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 256 is provided
by \u, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
</P>
<P>
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\u{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits. This syntax is from ECMAScript
6.
</P> </P>
<P> <P>
The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
@ -414,6 +435,12 @@ Note that when \N is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline. an entirely different meaning, matching any character that is not a newline.
</P> </P>
<P> <P>
There are some legacy applications where the escape sequence \r is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
</P>
<P>
The precise effect of \cx on ASCII characters is as follows: if x is a lower The precise effect of \cx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
@ -500,28 +527,6 @@ Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal must not be introduced by a leading zero, because no more than three octal
digits are ever read. digits are ever read.
</P> </P>
<P>
By default, after \x that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \x{ and }. If a character other than
a hexadecimal digit appears between \x{ and }, or if there is no terminating
}, an error occurs.
</P>
<P>
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode, support for code points greater
than 256 is provided by \u, which must be followed by four hexadecimal digits;
otherwise it matches a literal "u" character. This syntax makes PCRE2 behave
like ECMAscript (aka JavaScript). Code points greater than 0xFFFF are not
supported.
</P>
<P>
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no difference in
the way they are handled. For example, \xdc is exactly the same as \x{dc} (or
\u00dc in PCRE2_ALT_BSUX mode).
</P>
<br><b> <br><b>
Constraints on character values Constraints on character values
</b><br> </b><br>
@ -560,9 +565,10 @@ Unsupported escape sequences
<P> <P>
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2 handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option does not support these escape sequences in patterns. However, if either of the
is set, \U matches a "U" character, and \u can be used to define a character PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U"
by code point, as described above. character, and \u can be used to define a character by code point, as
described above.
</P> </P>
<br><b> <br><b>
Absolute and relative backreferences Absolute and relative backreferences
@ -3721,7 +3727,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC31" href="#TOC1">REVISION</a><br> <br><a name="SEC31" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 04 February 2019 Last updated: 12 February 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>
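
The \u rules described in the pcre2pattern hunks above can be sketched as a small standalone C function. This is only an illustration of the documented behaviour, not PCRE2's own code: the names hexval and parse_bsu, the extra flag, and the 8-digit cap are assumptions made for the example. The sketch does not model how the library reports values that are too large; it simply declines to recognize them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): value of a hex digit, or -1. */
static int hexval(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical sketch of \u handling in "ALT_BSUX mode".
 * p points just past the "\u" in a NUL-terminated pattern.
 * extra is non-zero to mimic PCRE2_EXTRA_ALT_BSUX, zero for PCRE2_ALT_BSUX.
 * Returns the number of characters consumed and stores the code point in
 * *cp, or returns 0 when the escape is not recognized, in which case the
 * caller treats it as a literal "u". */
static size_t parse_bsu(const char *p, int extra, uint32_t *cp) {
  uint32_t v = 0;
  size_t i;

  assert(p != NULL && cp != NULL);

  if (extra && p[0] == '{') {
    /* \u{hhh..}: one or more hex digits, then a closing brace.
     * Assumption: at most 8 digits, so v cannot overflow 32 bits. */
    for (i = 1; i <= 8 && hexval(p[i]) >= 0; i++)
      v = v * 16 + (uint32_t)hexval(p[i]);
    if (i > 1 && p[i] == '}') {
      *cp = v;
      return i + 1;
    }
    return 0;
  }

  /* \uhhhh: exactly four hex digits are required. */
  for (i = 0; i < 4; i++) {
    int h = hexval(p[i]);
    if (h < 0) return 0;
    v = v * 16 + (uint32_t)h;
  }
  *cp = v;
  return 4;
}
```

Under these assumptions, parse_bsu("{1f600}", 1, &cp) consumes 7 characters and stores 0x1F600, whereas the same input with extra set to 0 returns 0, so the sequence is read as a literal "u" followed by ordinary pattern text.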


@ -58,7 +58,8 @@ documentation. This document contains a quick-reference summary of the syntax.
</P> </P>
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br> <br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P> <P>
This table applies to ASCII and Unicode environments. This table applies to ASCII and Unicode environments. An unrecognized escape
sequence causes an error.
<pre> <pre>
\a alarm, that is, the BEL character (hex 07) \a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII printing character \cx "control-x", where x is any ASCII printing character
@ -70,12 +71,25 @@ This table applies to ASCII and Unicode environments.
\0dd character with octal code 0dd \0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference \ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd.. \o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\N{U+hh..} character with Unicode code point hh.. (Unicode mode only) \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh \xhh character with hex code hh
\x{hh..} character with hex code hh.. \x{hh..} character with hex code hh..
</pre> </pre>
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:
<pre>
\U the character "U"
\uhhhh character with hex code hhhh
\u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
</pre>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be
recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits
or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
matches a literal "u".
</P>
<P>
Note that \0dd is always an octal code. The treatment of backslash followed by Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a> <a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
@ -86,13 +100,6 @@ also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \N not followed by an opening supported in EBCDIC environments. Note that \N not followed by an opening
curly bracket has a different meaning (see below). curly bracket has a different meaning (see below).
</P> </P>
<P>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br> <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P> <P>
<pre> <pre>
@ -660,7 +667,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br> <br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 February 2019 Last updated: 11 February 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>
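
The difference between default \x handling and ALT_BSUX mode, as summarized in the pcre2syntax hunk above, can be sketched in the same way. Again this is a hypothetical standalone illustration, not PCRE2 code: hexval and parse_bsx are invented names, and the braced form \x{hh..} is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): value of a hex digit, or -1. */
static int hexval(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical sketch of unbraced \x handling.
 * p points just past the "\x" in a NUL-terminated pattern.
 * alt_bsux is non-zero to mimic "ALT_BSUX mode".
 * Default mode: zero, one, or two hex digits are read.
 * ALT_BSUX mode: exactly two hex digits are required.
 * Returns the number of digits consumed and stores the value in *cp, or
 * returns -1 when the sequence must be treated as a literal "x". */
static int parse_bsx(const char *p, int alt_bsux, uint32_t *cp) {
  uint32_t v = 0;
  int n = 0;

  assert(p != NULL && cp != NULL);

  while (n < 2 && hexval(p[n]) >= 0) {
    v = v * 16 + (uint32_t)hexval(p[n]);
    n++;
  }
  if (alt_bsux && n < 2) return -1;  /* not an escape in ALT_BSUX mode */
  *cp = v;
  return n;
}
```

With this sketch, parse_bsx("z", 0, &cp) consumes no digits and stores 0, which corresponds to \xz matching a binary zero followed by "z" in the default mode, while parse_bsx("z", 1, &cp) returns -1 to signal a literal "x".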


@ -609,6 +609,7 @@ for a description of the effects of these options.
escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF
/x extended set PCRE2_EXTENDED /x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE /xx extended_more set PCRE2_EXTENDED_MORE
extra_alt_bsux set PCRE2_EXTRA_ALT_BSUX
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE match_line set PCRE2_EXTRA_MATCH_LINE
@ -2075,7 +2076,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 February 2019 Last updated: 11 February 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>


@ -1296,7 +1296,7 @@ COMPILING A PATTERN
Copies of both the code and the tables are made, with the new code Copies of both the code and the tables are made, with the new code
pointing to the new tables. The memory for the new tables is automati- pointing to the new tables. The memory for the new tables is automati-
cally freed when pcre2_code_free() is called for the new copy of the cally freed when pcre2_code_free() is called for the new copy of the
compiled code. If pcre2_code_copy_withy_tables() is called with a NULL compiled code. If pcre2_code_copy_with_tables() is called with a NULL
argument, it returns NULL. argument, it returns NULL.
NOTE: When one of the matching functions is called, pointers to the NOTE: When one of the matching functions is called, pointers to the
@ -1310,7 +1310,7 @@ COMPILING A PATTERN
below. below.
The options argument for pcre2_compile() contains various bit settings The options argument for pcre2_compile() contains various bit settings
that affect the compilation. It should be zero if no options are that affect the compilation. It should be zero if none of them are
required. The available options are described below. Some of them (in required. The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others as particular, those that are compatible with Perl, but some others as
well) can also be set and unset from within the pattern (see the well) can also be set and unset from within the pattern (see the
@ -1322,9 +1322,9 @@ COMPILING A PATTERN
PCRE2_NO_UTF_CHECK options can be set at the time of matching as well PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
as at compile time. as at compile time.
Other, less frequently required compile-time parameters (for example, Some additional options and less frequently required compile-time
the newline setting) can be provided in a compile context (as described parameters (for example, the newline setting) can be provided in a com-
above). pile context (as described above).
If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme- If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
diately. Otherwise, the variables to which these point are set to an diately. Otherwise, the variables to which these point are set to an
@ -1371,6 +1371,9 @@ COMPILING A PATTERN
&erroffset, /* for error offset */ &erroffset, /* for error offset */
NULL); /* no compile context */ NULL); /* no compile context */
Main compile options
The following names for option bits are defined in the pcre2.h header The following names for option bits are defined in the pcre2.h header
file: file:
@ -1409,6 +1412,12 @@ COMPILING A PATTERN
always expected after \x, but it may have zero, one, or two digits (so, always expected after \x, but it may have zero, one, or two digits (so,
for example, \xz matches a binary zero character followed by z). for example, \xz matches a binary zero character followed by z).
ECMAScript 6 added further functionality to \u. This can be accessed
using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile
options" below). Note that this alternative escape handling applies
only to patterns. Neither of these options affects the processing of
replacement strings passed to pcre2_substitute().
PCRE2_ALT_CIRCUMFLEX PCRE2_ALT_CIRCUMFLEX
In multiline mode (when PCRE2_MULTILINE is set), the circumflex In multiline mode (when PCRE2_MULTILINE is set), the circumflex
@ -1804,10 +1813,8 @@ COMPILING A PATTERN
Extra compile options Extra compile options
Unlike the main compile-time options, the extra options are not saved The option bits that can be set in a compile context by calling the
with the compiled pattern. The option bits that can be set in a compile pcre2_set_compile_extra_options() function are as follows:
context by calling the pcre2_set_compile_extra_options() function are
as follows:
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
@ -1834,6 +1841,15 @@ COMPILING A PATTERN
only match subject characters if the matching function is called with only match subject characters if the matching function is called with
PCRE2_NO_UTF_CHECK set. PCRE2_NO_UTF_CHECK set.
PCRE2_EXTRA_ALT_BSUX
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and
\x in the way that ECMAScript (aka JavaScript) does. Additional func-
tionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has
the effect of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..}
as a hexadecimal character code, where hhh.. is any number of hexadeci-
mal digits.
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
This is a dangerous option. Use with care. By default, an unrecognized This is a dangerous option. Use with care. By default, an unrecognized
@ -3288,7 +3304,9 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
Note that case forcing sequences such as \U...\E do not nest. For exam- Note that case forcing sequences such as \U...\E do not nest. For exam-
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
\E has no effect. \E has no effect. Note also that the PCRE2_ALT_BSUX and
PCRE2_EXTRA_ALT_BSUX options do not apply to replacement
strings.
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to capture group substitution. The syntax is similar to flexibility to capture group substitution. The syntax is similar to
@ -3659,7 +3677,7 @@ AUTHOR
REVISION REVISION
Last updated: 04 February 2019 Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -4701,8 +4719,9 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
point, are supported. The escapes that modify the case of following point, are supported. The escapes that modify the case of following
letters are implemented by Perl's general string-handling and are not letters are implemented by Perl's general string-handling and are not
part of its pattern matching engine. If any of these are encountered by part of its pattern matching engine. If any of these are encountered by
PCRE2, an error is generated by default. However, if the PCRE2_ALT_BSUX PCRE2, an error is generated by default. However, if either of the
option is set, \U and \u are interpreted as ECMAScript interprets them. PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are
interpreted as ECMAScript interprets them.
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
is built with Unicode support (the default). The properties that can be is built with Unicode support (the default). The properties that can be
@ -4864,7 +4883,7 @@ AUTHOR
REVISION REVISION
Last updated: 03 February 2019 Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -6333,12 +6352,32 @@ BACKSLASH
\xhh character with hex code hh \xhh character with hex code hh
\x{hhh..} character with hex code hhh.. \x{hhh..} character with hex code hhh..
\N{U+hhh..} character with Unicode hex code point hhh.. \N{U+hhh..} character with Unicode hex code point hhh..
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
There are some legacy applications where the escape sequence \r is By default, after \x that is not followed by {, from zero to two hexa-
expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option decimal digits are read (letters can be in upper or lower case). Any
is set, \r in a pattern is converted to \n so that it matches a LF number of hexadecimal digits may appear between \x{ and }. If a charac-
(linefeed) instead of a CR (carriage return) character. ter other than a hexadecimal digit appears between \x{ and }, or if
there is no terminating }, an error occurs.
Characters whose code points are less than 256 can be defined by either
of the two syntaxes for \x or by an octal sequence. There is no differ-
ence in the way they are handled. For example, \xdc is exactly the same
as \x{dc} or \334. However, using the braced versions does make such
sequences easier to read.
Support is available for some ECMAScript (aka JavaScript) escape
sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the
sequence \x followed by { is not recognized. Only if \x is followed by
two hexadecimal digits is it recognized as a character escape. Other-
wise it is interpreted as a literal "x" character. In this mode, sup-
port for code points greater than 256 is provided by \u, which must be
followed by four hexadecimal digits; otherwise it is interpreted as a
literal "u" character.
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in
addition, \u{hhh..} is recognized as the character specified by hexa-
decimal code point. There may be any number of hexadecimal digits.
This syntax is from ECMAScript 6.
The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF
option is set, that is, when PCRE2 is operating in a Unicode mode. Perl option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
@ -6347,6 +6386,11 @@ BACKSLASH
brace (curly bracket) it has an entirely different meaning, matching brace (curly bracket) it has an entirely different meaning, matching
any character that is not a newline. any character that is not a newline.
There are some legacy applications where the escape sequence \r is
expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
is set, \r in a pattern is converted to \n so that it matches a LF
(linefeed) instead of a CR (carriage return) character.
The precise effect of \cx on ASCII characters is as follows: if x is a The precise effect of \cx on ASCII characters is as follows: if x is a
lower case letter, it is converted to upper case. Then bit 6 of the lower case letter, it is converted to upper case. Then bit 6 of the
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
@ -6429,25 +6473,6 @@ BACKSLASH
syntax must not be introduced by a leading zero, because no more than syntax must not be introduced by a leading zero, because no more than
three octal digits are ever read. three octal digits are ever read.
By default, after \x that is not followed by {, from zero to two hexa-
decimal digits are read (letters can be in upper or lower case). Any
number of hexadecimal digits may appear between \x{ and }. If a charac-
ter other than a hexadecimal digit appears between \x{ and }, or if
there is no terminating }, an error occurs.
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
just described only when it is followed by two hexadecimal digits. Oth-
erwise, it matches a literal "x" character. In this mode, support for
code points greater than 256 is provided by \u, which must be followed
by four hexadecimal digits; otherwise it matches a literal "u" charac-
ter. This syntax makes PCRE2 behave like ECMAscript (aka JavaScript).
Code points greater than 0xFFFF are not supported.
Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
ference in the way they are handled. For example, \xdc is exactly the
same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
Constraints on character values Constraints on character values
Characters that are specified using octal or hexadecimal numbers are Characters that are specified using octal or hexadecimal numbers are
@ -6480,9 +6505,10 @@ BACKSLASH
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
string handler and used to modify the case of following characters. By string handler and used to modify the case of following characters. By
default, PCRE2 does not support these escape sequences. However, if the default, PCRE2 does not support these escape sequences in patterns.
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX
used to define a character by code point, as described above. options is set, \U matches a "U" character, and \u can be used to
define a character by code point, as described above.
Absolute and relative backreferences Absolute and relative backreferences
@ -9332,7 +9358,7 @@ AUTHOR
REVISION REVISION
Last updated: 04 February 2019 Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -10203,7 +10229,8 @@ QUOTING
ESCAPED CHARACTERS ESCAPED CHARACTERS
This table applies to ASCII and Unicode environments. This table applies to ASCII and Unicode environments. An unrecognized
escape sequence causes an error.
\a alarm, that is, the BEL character (hex 07) \a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII printing character \cx "control-x", where x is any ASCII printing character
@ -10215,12 +10242,24 @@ ESCAPED CHARACTERS
\0dd character with octal code 0dd \0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference \ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd.. \o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\N{U+hh..} character with Unicode code point hh.. (Unicode mode only) \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh \xhh character with hex code hh
\x{hh..} character with hex code hh.. \x{hh..} character with hex code hh..
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:
\U the character "U"
\uhhhh character with hex code hhhh
\u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
When \x is not followed by {, from zero to two hexadecimal digits are
read, but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
its to be recognized as a hexadecimal escape; otherwise it matches a
literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by
four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
digits in curly brackets, it matches a literal "u".
Note that \0dd is always an octal code. The treatment of backslash fol- Note that \0dd is always an octal code. The treatment of backslash fol-
lowed by a non-zero digit is complicated; for details see the section lowed by a non-zero digit is complicated; for details see the section
"Non-printing characters" in the pcre2pattern documentation, where "Non-printing characters" in the pcre2pattern documentation, where
@ -10229,12 +10268,6 @@ ESCAPED CHARACTERS
EBCDIC environments. Note that \N not followed by an opening curly EBCDIC environments. Note that \N not followed by an opening curly
bracket has a different meaning (see below). bracket has a different meaning (see below).
When \x is not followed by {, from zero to two hexadecimal digits are
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
imal digits to be recognized as a hexadecimal escape; otherwise it
matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
lowed by four hexadecimal digits, it matches a literal "u".
CHARACTER TYPES CHARACTER TYPES
@ -10670,7 +10703,7 @@ AUTHOR
REVISION REVISION
Last updated: 03 February 2019 Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30" .TH PCRE2_COMPILE 3 "11 February 2019" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -73,7 +73,13 @@ The option bits are:
PCRE2 must be built with Unicode support (the default) in order to use PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options. PCRE2_UTF, PCRE2_UCP and related options.
.P .P
The yield of the function is a pointer to a private data structure that Additional options may be set in the compile context via the
.\" HREF
\fBpcre2_set_compile_extra_options\fP
.\"
function.
.P
The yield of this function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected. contains the compiled pattern, or NULL if an error was detected.
.P .P
There is a complete description of the PCRE2 native API, with more detail on There is a complete description of the PCRE2 native API, with more detail on


@ -1,4 +1,4 @@
.TH PCRE2_SET_COMPILE_EXTRA_OPTIONS 3 "21 September 2018" "PCRE2 10.33" .TH PCRE2_SET_COMPILE_EXTRA_OPTIONS 3 "11 February 2019" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -8,7 +8,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.PP .PP
.nf .nf
.B int pcre2_set_compile_extra_options(pcre2_compile_context *\fIccontext\fP, .B int pcre2_set_compile_extra_options(pcre2_compile_context *\fIccontext\fP,
.B " PCRE2_SIZE \fIextra_options\fP);" .B " uint32_t \fIextra_options\fP);"
.fi .fi
. .
.SH DESCRIPTION .SH DESCRIPTION
@ -21,6 +21,9 @@ options are:
.\" JOIN .\" JOIN
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \ex{df800} to \ex{dfff} PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \ex{df800} to \ex{dfff}
in UTF-8 and UTF-32 modes in UTF-8 and UTF-32 modes
.\" JOIN
PCRE2_EXTRA_ALT_BSUX Extended alternate \eu, \eU, and \ex
handling
.\" JOIN .\" JOIN
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
a literal following character a literal following character


@ -1,4 +1,4 @@
.TH PCRE2API 3 "04 February 2019" "PCRE2 10.33" .TH PCRE2API 3 "12 February 2019" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -1231,7 +1231,7 @@ are needed. The \fBpcre2_code_copy_with_tables()\fP provides this facility.
Copies of both the code and the tables are made, with the new code pointing to Copies of both the code and the tables are made, with the new code pointing to
the new tables. The memory for the new tables is automatically freed when the new tables. The memory for the new tables is automatically freed when
\fBpcre2_code_free()\fP is called for the new copy of the compiled code. If \fBpcre2_code_free()\fP is called for the new copy of the compiled code. If
\fBpcre2_code_copy_withy_tables()\fP is called with a NULL argument, it returns \fBpcre2_code_copy_with_tables()\fP is called with a NULL argument, it returns
NULL. NULL.
.P .P
NOTE: When one of the matching functions is called, pointers to the compiled NOTE: When one of the matching functions is called, pointers to the compiled
@ -1252,7 +1252,7 @@ below.
.\" .\"
.P .P
The \fIoptions\fP argument for \fBpcre2_compile()\fP contains various bit The \fIoptions\fP argument for \fBpcre2_compile()\fP contains various bit
settings that affect the compilation. It should be zero if no options are settings that affect the compilation. It should be zero if none of them are
required. The available options are described below. Some of them (in required. The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others as well) can particular, those that are compatible with Perl, but some others as well) can
also be set and unset from within the pattern (see the detailed description in also be set and unset from within the pattern (see the detailed description in
@ -1267,8 +1267,9 @@ contents of the \fIoptions\fP argument specifies their settings at the start of
compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
options can be set at the time of matching as well as at compile time. options can be set at the time of matching as well as at compile time.
.P .P
Other, less frequently required compile-time parameters (for example, the Some additional options and less frequently required compile-time parameters
newline setting) can be provided in a compile context (as described (for example, the newline setting) can be provided in a compile context (as
described
.\" HTML <a href="#compilecontext"> .\" HTML <a href="#compilecontext">
.\" </a> .\" </a>
above). above).
@ -1325,6 +1326,11 @@ This code fragment shows a typical straightforward call to
&erroffset, /* for error offset */ &erroffset, /* for error offset */
NULL); /* no compile context */ NULL); /* no compile context */
.sp .sp
.
.
.SS "Main compile options"
.rs
.sp
The following names for option bits are defined in the \fBpcre2.h\fP header The following names for option bits are defined in the \fBpcre2.h\fP header
file: file:
.sp .sp
@ -1361,6 +1367,16 @@ hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, as in Perl, a hexadecimal number is always expected after to match. By default, as in Perl, a hexadecimal number is always expected after
\ex, but it may have zero, one, or two digits (so, for example, \exz matches a \ex, but it may have zero, one, or two digits (so, for example, \exz matches a
binary zero character followed by z). binary zero character followed by z).
.P
ECMAScript 6 added additional functionality to \eu. This can be accessed using
the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile options"
.\" HTML <a href="#extracompileoptions">
.\" </a>
below).
.\"
Note that this alternative escape handling applies only to patterns. Neither of
these options affects the processing of replacement strings passed to
\fBpcre2_substitute()\fP.
.sp .sp
PCRE2_ALT_CIRCUMFLEX PCRE2_ALT_CIRCUMFLEX
.sp .sp
@ -1788,9 +1804,8 @@ characters with code points greater than 127.
.SS "Extra compile options" .SS "Extra compile options"
.rs .rs
.sp .sp
Unlike the main compile-time options, the extra options are not saved with the The option bits that can be set in a compile context by calling the
compiled pattern. The option bits that can be set in a compile context by \fBpcre2_set_compile_extra_options()\fP function are as follows:
calling the \fBpcre2_set_compile_extra_options()\fP function are as follows:
.sp .sp
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
.sp .sp
@ -1813,6 +1828,14 @@ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
incorporated in the compiled pattern. However, they can only match subject incorporated in the compiled pattern. However, they can only match subject
characters if the matching function is called with PCRE2_NO_UTF_CHECK set. characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
.sp
PCRE2_EXTRA_ALT_BSUX
.sp
The original option PCRE2_ALT_BSUX causes PCRE2 to process \eU, \eu, and \ex in
the way that ECMAScript (aka JavaScript) does. Additional functionality was
defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of
PCRE2_ALT_BSUX, but in addition it recognizes \eu{hhh..} as a hexadecimal
character code, where hhh.. is any number of hexadecimal digits.
.sp .sp
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
.sp .sp
@ -3383,7 +3406,8 @@ capture groups and letters within \eQ...\eE quoted sequences.
.P .P
Note that case forcing sequences such as \eU...\eE do not nest. For example, Note that case forcing sequences such as \eU...\eE do not nest. For example,
the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
effect. effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX options do
not apply to replacement strings.
.P .P
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to capture group substitution. The syntax is similar to that used flexibility to capture group substitution. The syntax is similar to that used
@ -3792,6 +3816,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 04 February 2019 Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "03 February 2019" "PCRE2 10.33" .TH PCRE2COMPAT 3 "12 February 2019" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL" .SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -33,8 +33,9 @@ non-newline character, and \eN{U+dd..}, matching a Unicode code point, are
supported. The escapes that modify the case of following letters are supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if the PCRE2_ALT_BSUX option is set, \eU and \eu generated by default. However, if either of the PCRE2_ALT_BSUX or
are interpreted as ECMAScript interprets them. PCRE2_EXTRA_ALT_BSUX options is set, \eU and \eu are interpreted as ECMAScript
interprets them.
.P .P
5. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is 5. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested built with Unicode support (the default). The properties that can be tested
@ -198,6 +199,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 February 2019 Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "04 February 2019" "PCRE2 10.33" .TH PCRE2PATTERN 3 "12 February 2019" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -373,12 +373,30 @@ environment, these escapes are as follows:
\exhh character with hex code hh \exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. \ex{hhh..} character with hex code hhh..
\eN{U+hhh..} character with Unicode hex code point hhh.. \eN{U+hhh..} character with Unicode hex code point hhh..
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
.sp .sp
There are some legacy applications where the escape sequence \er is expected to By default, after \ex that is not followed by {, from zero to two hexadecimal
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a digits are read (letters can be in upper or lower case). Any number of
pattern is converted to \en so that it matches a LF (linefeed) instead of a CR hexadecimal digits may appear between \ex{ and }. If a character other than a
(carriage return) character. hexadecimal digit appears between \ex{ and }, or if there is no terminating },
an error occurs.
.P
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \ex or by an octal sequence. There is no difference in the way
they are handled. For example, \exdc is exactly the same as \ex{dc} or \e334.
However, using the braced versions does make such sequences easier to read.
.P
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \ex followed
by { is not recognized. Only if \ex is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 255 is provided
by \eu, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
.P
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\eu{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits. This syntax is from ECMAScript
6.
.P .P
The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
@ -386,6 +404,11 @@ is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
Note that when \eN is not followed by an opening brace (curly bracket) it has Note that when \eN is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline. an entirely different meaning, matching any character that is not a newline.
.P .P
There are some legacy applications where the escape sequence \er is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
pattern is converted to \en so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
.P
The precise effect of \ecx on ASCII characters is as follows: if x is a lower The precise effect of \ecx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
@ -477,25 +500,6 @@ for themselves. For example, outside a character class:
Note that octal values of 100 or greater that are specified using this syntax Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal must not be introduced by a leading zero, because no more than three octal
digits are ever read. digits are ever read.
.P
By default, after \ex that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \ex{ and }. If a character other than
a hexadecimal digit appears between \ex{ and }, or if there is no terminating
}, an error occurs.
.P
If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode, support for code points greater
than 256 is provided by \eu, which must be followed by four hexadecimal digits;
otherwise it matches a literal "u" character. This syntax makes PCRE2 behave
like ECMAscript (aka JavaScript). Code points greater than 0xFFFF are not
supported.
.P
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
\eu00dc in PCRE2_ALT_BSUX mode).
. .
. .
.SS "Constraints on character values" .SS "Constraints on character values"
@ -534,9 +538,10 @@ character class, these sequences have different meanings.
.sp .sp
In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2 handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option does not support these escape sequences in patterns. However, if either of the
is set, \eU matches a "U" character, and \eu can be used to define a character PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \eU matches a "U"
by code point, as described above. character, and \eu can be used to define a character by code point, as
described above.
. .
. .
.SS "Absolute and relative backreferences" .SS "Absolute and relative backreferences"
@ -3758,6 +3763,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 04 February 2019 Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi
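
The \ex rules in the pcre2pattern text above (zero to two hex digits by default, exactly two in ALT_BSUX mode) can be sketched as a small standalone function. This is only an illustration, not the PCRE2 implementation: the names hex_value and decode_bsx are hypothetical, the braced \ex{...} form is assumed to be handled elsewhere by the caller, and a return value of -1 is this sketch's own convention for "treat as a literal x".

```c
#include <stddef.h>
#include <stdint.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical helper: decode the characters that follow "\x".
   s points just past the 'x'; len is the number of characters left.
   Default mode reads zero, one, or two hex digits and always succeeds.
   ALT_BSUX mode needs exactly two hex digits, otherwise the result is -1,
   meaning the sequence is a literal "x". On success the number of
   characters consumed is returned and *cp holds the character value. */
static int decode_bsx(const char *s, size_t len, int alt_bsux, uint32_t *cp)
{
  uint32_t value = 0;
  size_t i;

  if (alt_bsux)
    {
    if (len < 2 || hex_value(s[0]) < 0 || hex_value(s[1]) < 0) return -1;
    *cp = ((uint32_t)hex_value(s[0]) << 4) | (uint32_t)hex_value(s[1]);
    return 2;
    }

  for (i = 0; i < 2 && i < len && hex_value(s[i]) >= 0; i++)
    value = (value << 4) | (uint32_t)hex_value(s[i]);
  *cp = value;
  return (int)i;
}
```

With this sketch, the default-mode input "z" consumes nothing and yields the value zero, which corresponds to \exz matching a binary zero followed by z, while in ALT_BSUX mode the same input is reported as a literal "x".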


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "03 February 2019" "PCRE2 10.33" .TH PCRE2SYNTAX 3 "11 February 2019" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -22,7 +22,8 @@ documentation. This document contains a quick-reference summary of the syntax.
.SH "ESCAPED CHARACTERS" .SH "ESCAPED CHARACTERS"
.rs .rs
.sp .sp
This table applies to ASCII and Unicode environments. This table applies to ASCII and Unicode environments. An unrecognized escape
sequence causes an error.
.sp .sp
\ea alarm, that is, the BEL character (hex 07) \ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any ASCII printing character \ecx "control-x", where x is any ASCII printing character
@ -34,12 +35,24 @@ This table applies to ASCII and Unicode environments.
\e0dd character with octal code 0dd \e0dd character with octal code 0dd
\eddd character with octal code ddd, or backreference \eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd.. \eo{ddd..} character with octal code ddd..
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\eN{U+hh..} character with Unicode code point hh.. (Unicode mode only) \eN{U+hh..} character with Unicode code point hh.. (Unicode mode only)
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\exhh character with hex code hh \exhh character with hex code hh
\ex{hh..} character with hex code hh.. \ex{hh..} character with hex code hh..
.sp .sp
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:
.sp
\eU the character "U"
\euhhhh character with hex code hhhh
\eu{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
.sp
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but in ALT_BSUX mode \ex must be followed by two hexadecimal digits to be
recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits
or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
matches a literal "u".
.P
Note that \e0dd is always an octal code. The treatment of backslash followed by Note that \e0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section a non-zero digit is complicated; for details see the section
.\" HTML <a href="pcre2pattern.html#digitsafterbackslash"> .\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
@ -54,12 +67,6 @@ documentation, where details of escape processing in EBCDIC environments are
also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \eN not followed by an opening supported in EBCDIC environments. Note that \eN not followed by an opening
curly bracket has a different meaning (see below). curly bracket has a different meaning (see below).
.P
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
. .
. .
.SH "CHARACTER TYPES" .SH "CHARACTER TYPES"
@ -647,6 +654,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 February 2019 Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi
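
The \eu rules summarized in the pcre2syntax changes above can be illustrated with a standalone sketch. It is loosely modelled on the description in this commit rather than copied from PCRE2: decode_bsu and hex_value are hypothetical names, the return conventions (0 for "literal u", -1 for overflow) are assumptions of the sketch, and the later range checks against UTF limits and surrogates are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical helper: decode the characters that follow "\u" when one of
   the ALT_BSUX options is set. s points just past the 'u'; len is the
   number of characters left; extra is nonzero for EXTRA_ALT_BSUX behaviour.
   Returns the number of characters consumed and stores the code point in
   *cp, or returns 0 if the sequence is not an escape (a literal "u"), or
   -1 if a braced value does not fit in 32 bits. */
static int decode_bsu(const char *s, size_t len, int extra, uint32_t *cp)
{
  uint32_t value = 0;
  size_t i;

  if (extra && len > 0 && s[0] == '{')
    {
    for (i = 1; i < len && hex_value(s[i]) >= 0; i++)
      {
      if ((value & 0xf0000000u) != 0) return -1;   /* next shift overflows */
      value = (value << 4) | (uint32_t)hex_value(s[i]);
      }
    /* No digits, end of input, or no closing brace: not an escape. */
    if (i == 1 || i >= len || s[i] != '}') return 0;
    *cp = value;
    return (int)(i + 1);
    }

  /* Unbraced form: exactly four hex digits are required. */
  if (len < 4) return 0;
  for (i = 0; i < 4; i++)
    {
    int d = hex_value(s[i]);
    if (d < 0) return 0;
    value = (value << 4) | (uint32_t)d;
    }
  *cp = value;
  return 4;
}
```

Called with extra set to zero, the braced form falls through to the four-digit check and is rejected at the opening brace, which is the plain PCRE2_ALT_BSUX behaviour of treating \eu{...} as a literal "u" followed by ordinary characters.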


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "03 February 2019" "PCRE 10.33" .TH PCRE2TEST 1 "11 February 2019" "PCRE 10.33"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -568,6 +568,7 @@ for a description of the effects of these options.
escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF
/x extended set PCRE2_EXTENDED /x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE /xx extended_more set PCRE2_EXTENDED_MORE
extra_alt_bsux set PCRE2_EXTRA_ALT_BSUX
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE match_line set PCRE2_EXTRA_MATCH_LINE
@ -2056,6 +2057,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 February 2019 Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi


@ -547,6 +547,7 @@ PATTERN MODIFIERS
escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF
/x extended set PCRE2_EXTENDED /x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE /xx extended_more set PCRE2_EXTENDED_MORE
extra_alt_bsux set PCRE2_EXTRA_ALT_BSUX
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE match_line set PCRE2_EXTRA_MATCH_LINE
@ -1887,5 +1888,5 @@ AUTHOR
REVISION REVISION
Last updated: 03 February 2019 Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.


@ -150,6 +150,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_EXTRA_MATCH_WORD 0x00000004u /* C */ #define PCRE2_EXTRA_MATCH_WORD 0x00000004u /* C */
#define PCRE2_EXTRA_MATCH_LINE 0x00000008u /* C */ #define PCRE2_EXTRA_MATCH_LINE 0x00000008u /* C */
#define PCRE2_EXTRA_ESCAPED_CR_IS_LF 0x00000010u /* C */ #define PCRE2_EXTRA_ESCAPED_CR_IS_LF 0x00000010u /* C */
#define PCRE2_EXTRA_ALT_BSUX 0x00000020u /* C */
/* These are for pcre2_jit_compile(). */ /* These are for pcre2_jit_compile(). */
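
Because the new bit lives in the extra options word while PCRE2_ALT_BSUX lives in the main options word, the two must be tested against different variables. The sketch below illustrates that point under stated assumptions: the SKETCH_ macros are local stand-ins, the value used for PCRE2_ALT_BSUX is assumed rather than taken from this diff, and the function names and the is_pattern flag (modelling the fact that neither option applies to replacement strings) are hypothetical.

```c
#include <stdint.h>

/* Local stand-ins for the pcre2.h bits. */
#define SKETCH_ALT_BSUX        0x00000002u  /* assumed value of PCRE2_ALT_BSUX */
#define SKETCH_EXTRA_ALT_BSUX  0x00000020u  /* PCRE2_EXTRA_ALT_BSUX, as above */

/* Nonzero when the ECMAScript-style handling of \U, \u, and \x is active.
   Either option enables it, but only while a pattern is being compiled. */
static int alt_bsux_active(uint32_t options, uint32_t extra_options,
  int is_pattern)
{
  int active = ((options & SKETCH_ALT_BSUX) |
                (extra_options & SKETCH_EXTRA_ALT_BSUX)) != 0;
  return is_pattern ? active : 0;
}

/* Nonzero when the braced \u{hhh..} form should be recognized; this
   needs the extra option specifically. */
static int braced_u_allowed(uint32_t extra_options, int is_pattern)
{
  return is_pattern && (extra_options & SKETCH_EXTRA_ALT_BSUX) != 0;
}
```

Passing the value 0x20 in the main options word does not enable anything here, since that bit position has an unrelated meaning outside the extra options.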


@ -764,7 +764,7 @@ are allowed. */
#define PUBLIC_COMPILE_EXTRA_OPTIONS \ #define PUBLIC_COMPILE_EXTRA_OPTIONS \
(PUBLIC_LITERAL_COMPILE_EXTRA_OPTIONS| \ (PUBLIC_LITERAL_COMPILE_EXTRA_OPTIONS| \
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES|PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL| \ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES|PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL| \
PCRE2_EXTRA_ESCAPED_CR_IS_LF) PCRE2_EXTRA_ESCAPED_CR_IS_LF|PCRE2_EXTRA_ALT_BSUX)
/* Compile time error code numbers. They are given names so that they can more /* Compile time error code numbers. They are given names so that they can more
easily be tracked. When a new number is added, the tables called eint1 and easily be tracked. When a new number is added, the tables called eint1 and
@ -1459,7 +1459,8 @@ Returns: zero => a data character
int int
PRIV(check_escape)(PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend, uint32_t *chptr, PRIV(check_escape)(PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend, uint32_t *chptr,
int *errorcodeptr, uint32_t options, BOOL isclass, compile_block *cb) int *errorcodeptr, uint32_t options, uint32_t extra_options, BOOL isclass,
compile_block *cb)
{ {
BOOL utf = (options & PCRE2_UTF) != 0; BOOL utf = (options & PCRE2_UTF) != 0;
PCRE2_SPTR ptr = *ptrptr; PCRE2_SPTR ptr = *ptrptr;
@ -1495,8 +1496,7 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
if (i > 0) if (i > 0)
{ {
c = (uint32_t)i; c = (uint32_t)i;
if (cb != NULL && c == CHAR_CR && if (c == CHAR_CR && (extra_options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)
(cb->cx->extra_options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)
c = CHAR_LF; c = CHAR_LF;
} }
else /* Negative table entry */ else /* Negative table entry */
@ -1551,22 +1551,28 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
/* Escapes that need further processing, including those that are unknown, have /* Escapes that need further processing, including those that are unknown, have
a zero entry in the lookup table. When called from pcre2_substitute(), only \c, a zero entry in the lookup table. When called from pcre2_substitute(), only \c,
\o, and \x are recognized (and \u when BSUX is set). */ \o, and \x are recognized (\u and \U can never appear as they are used for case
forcing). */
else else
{ {
int s;
PCRE2_SPTR oldptr; PCRE2_SPTR oldptr;
BOOL overflow; BOOL overflow;
int s; BOOL alt_bsux =
((options & PCRE2_ALT_BSUX) | (extra_options & PCRE2_EXTRA_ALT_BSUX)) != 0;
/* Filter calls from pcre2_substitute(). */ /* Filter calls from pcre2_substitute(). */
if (cb == NULL && c != CHAR_c && c != CHAR_o && c != CHAR_x && if (cb == NULL)
(c != CHAR_u || (options & PCRE2_ALT_BSUX) != 0)) {
if (c != CHAR_c && c != CHAR_o && c != CHAR_x)
{ {
*errorcodeptr = ERR3; *errorcodeptr = ERR3;
return 0; return 0;
} }
alt_bsux = FALSE; /* Do not modify \x handling */
}
switch (c) switch (c)
{ {
@ -1579,14 +1585,46 @@ else
*errorcodeptr = ERR37; *errorcodeptr = ERR37;
break; break;
/* \u is unrecognized when PCRE2_ALT_BSUX is not set. When it is treated /* \u is unrecognized when neither PCRE2_ALT_BSUX nor PCRE2_EXTRA_ALT_BSUX
specially, \u must be followed by four hex digits. Otherwise it is a is set. Otherwise, \u must be followed by exactly four hex digits or, if
lowercase u letter. */ PCRE2_EXTRA_ALT_BSUX is set, by any number of hex digits in braces.
Otherwise it is a lowercase u letter. This gives some compatibility with
ECMAScript (aka JavaScript). */
case CHAR_u: case CHAR_u:
if ((options & PCRE2_ALT_BSUX) == 0) *errorcodeptr = ERR37; else if (!alt_bsux) *errorcodeptr = ERR37; else
{ {
uint32_t xc; uint32_t xc;
if (*ptr == CHAR_LEFT_CURLY_BRACKET &&
(extra_options & PCRE2_EXTRA_ALT_BSUX) != 0)
{
PCRE2_SPTR hptr = ptr + 1;
cc = 0;
while (hptr < ptrend && (xc = XDIGIT(*hptr)) != 0xff)
{
if ((cc & 0xf0000000) != 0) /* Test for 32-bit overflow */
{
*errorcodeptr = ERR77;
ptr = hptr; /* Show where */
break; /* *hptr != } will cause another break below */
}
cc = (cc << 4) | xc;
hptr++;
}
if (hptr == ptr + 1 || /* No hex digits */
hptr >= ptrend || /* Hit end of input */
*hptr != CHAR_RIGHT_CURLY_BRACKET) /* No } terminator */
break; /* Hex escape not recognized */
c = cc; /* Accept the code point */
ptr = hptr + 1;
}
else /* Must be exactly 4 hex digits */
{
if (ptrend - ptr < 4) break; /* Less than 4 chars */ if (ptrend - ptr < 4) break; /* Less than 4 chars */
if ((cc = XDIGIT(ptr[0])) == 0xff) break; /* Not a hex digit */ if ((cc = XDIGIT(ptr[0])) == 0xff) break; /* Not a hex digit */
if ((xc = XDIGIT(ptr[1])) == 0xff) break; /* Not a hex digit */ if ((xc = XDIGIT(ptr[1])) == 0xff) break; /* Not a hex digit */
@ -1596,23 +1634,25 @@ else
if ((xc = XDIGIT(ptr[3])) == 0xff) break; /* Not a hex digit */ if ((xc = XDIGIT(ptr[3])) == 0xff) break; /* Not a hex digit */
c = (cc << 4) | xc; c = (cc << 4) | xc;
ptr += 4; ptr += 4;
}
if (utf) if (utf)
{ {
if (c > 0x10ffffU) *errorcodeptr = ERR77; if (c > 0x10ffffU) *errorcodeptr = ERR77;
else else
if (c >= 0xd800 && c <= 0xdfff && if (c >= 0xd800 && c <= 0xdfff &&
(cb->cx->extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0) (extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)
*errorcodeptr = ERR73; *errorcodeptr = ERR73;
} }
else if (c > MAX_NON_UTF_CHAR) *errorcodeptr = ERR77; else if (c > MAX_NON_UTF_CHAR) *errorcodeptr = ERR77;
} }
break; break;
/* \U is unrecognized unless PCRE2_ALT_BSUX is set, in which case it is an /* \U is unrecognized unless PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set,
upper case letter. */ in which case it is an upper case letter. */
case CHAR_U: case CHAR_U:
if ((options & PCRE2_ALT_BSUX) == 0) *errorcodeptr = ERR37; if (!alt_bsux) *errorcodeptr = ERR37;
break; break;
/* In a character class, \g is just a literal "g". Outside a character /* In a character class, \g is just a literal "g". Outside a character
@ -1791,8 +1831,8 @@ else
} }
else if (ptr < ptrend && *ptr++ == CHAR_RIGHT_CURLY_BRACKET) else if (ptr < ptrend && *ptr++ == CHAR_RIGHT_CURLY_BRACKET)
{ {
if (utf && c >= 0xd800 && c <= 0xdfff && (cb == NULL || if (utf && c >= 0xd800 && c <= 0xdfff &&
(cb->cx->extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)) (extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)
{ {
ptr--; ptr--;
*errorcodeptr = ERR73; *errorcodeptr = ERR73;
@ -1806,11 +1846,11 @@ else
} }
break; break;
/* \x is complicated. When PCRE2_ALT_BSUX is set, \x must be followed by /* When PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set, \x must be followed
two hexadecimal digits. Otherwise it is a lowercase x letter. */ by two hexadecimal digits. Otherwise it is a lowercase x letter. */
case CHAR_x: case CHAR_x:
if ((options & PCRE2_ALT_BSUX) != 0) if (alt_bsux)
{ {
uint32_t xc; uint32_t xc;
if (ptrend - ptr < 2) break; /* Less than 2 characters */ if (ptrend - ptr < 2) break; /* Less than 2 characters */
@ -1818,9 +1858,9 @@ else
if ((xc = XDIGIT(ptr[1])) == 0xff) break; /* Not a hex digit */ if ((xc = XDIGIT(ptr[1])) == 0xff) break; /* Not a hex digit */
c = (cc << 4) | xc; c = (cc << 4) | xc;
ptr += 2; ptr += 2;
} /* End PCRE2_ALT_BSUX handling */ }
/* Handle \x in Perl's style. \x{ddd} is a character number which can be /* Handle \x in Perl's style. \x{ddd} is a character code which can be
greater than 0xff in UTF-8 or non-8bit mode, but only if the ddd are hex greater than 0xff in UTF-8 or non-8bit mode, but only if the ddd are hex
digits. If not, { used to be treated as a data character. However, Perl digits. If not, { used to be treated as a data character. However, Perl
seems to read hex digits up to the first non-such, and ignore the rest, so seems to read hex digits up to the first non-such, and ignore the rest, so
@ -1864,8 +1904,8 @@ else
} }
else if (ptr < ptrend && *ptr++ == CHAR_RIGHT_CURLY_BRACKET) else if (ptr < ptrend && *ptr++ == CHAR_RIGHT_CURLY_BRACKET)
{ {
if (utf && c >= 0xd800 && c <= 0xdfff && (cb == NULL || if (utf && c >= 0xd800 && c <= 0xdfff &&
(cb->cx->extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)) (extra_options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) == 0)
{ {
ptr--; ptr--;
*errorcodeptr = ERR73; *errorcodeptr = ERR73;
@ -2438,6 +2478,7 @@ uint32_t *parsed_pattern = cb->parsed_pattern;
uint32_t *parsed_pattern_end = cb->parsed_pattern_end; uint32_t *parsed_pattern_end = cb->parsed_pattern_end;
uint32_t meta_quantifier = 0; uint32_t meta_quantifier = 0;
uint32_t add_after_mark = 0; uint32_t add_after_mark = 0;
uint32_t extra_options = cb->cx->extra_options;
uint16_t nest_depth = 0; uint16_t nest_depth = 0;
int after_manual_callout = 0; int after_manual_callout = 0;
int expect_cond_assert = 0; int expect_cond_assert = 0;
@ -2461,12 +2502,12 @@ nest_save *top_nest, *end_nests;
/* Insert leading items for word and line matching (features provided for the /* Insert leading items for word and line matching (features provided for the
benefit of pcre2grep). */ benefit of pcre2grep). */
if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_LINE) != 0) if ((extra_options & PCRE2_EXTRA_MATCH_LINE) != 0)
{ {
*parsed_pattern++ = META_CIRCUMFLEX; *parsed_pattern++ = META_CIRCUMFLEX;
*parsed_pattern++ = META_NOCAPTURE; *parsed_pattern++ = META_NOCAPTURE;
} }
else if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_WORD) != 0) else if ((extra_options & PCRE2_EXTRA_MATCH_WORD) != 0)
{ {
*parsed_pattern++ = META_ESCAPE + ESC_b; *parsed_pattern++ = META_ESCAPE + ESC_b;
*parsed_pattern++ = META_NOCAPTURE; *parsed_pattern++ = META_NOCAPTURE;
@ -2631,7 +2672,7 @@ while (ptr < ptrend)
if ((options & PCRE2_ALT_VERBNAMES) != 0) if ((options & PCRE2_ALT_VERBNAMES) != 0)
{ {
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options, escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options,
FALSE, cb); cb->cx->extra_options, FALSE, cb);
if (errorcode != 0) goto FAILED; if (errorcode != 0) goto FAILED;
} }
else escape = 0; /* Treat all as literal */ else escape = 0; /* Treat all as literal */
@ -2821,11 +2862,11 @@ while (ptr < ptrend)
case CHAR_BACKSLASH: case CHAR_BACKSLASH:
tempptr = ptr; tempptr = ptr;
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options, escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options,
FALSE, cb); cb->cx->extra_options, FALSE, cb);
if (errorcode != 0) if (errorcode != 0)
{ {
ESCAPE_FAILED: ESCAPE_FAILED:
if ((cb->cx->extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0) if ((extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
goto FAILED; goto FAILED;
ptr = tempptr; ptr = tempptr;
if (ptr >= ptrend) c = CHAR_BACKSLASH; else if (ptr >= ptrend) c = CHAR_BACKSLASH; else
@ -3382,12 +3423,12 @@ while (ptr < ptrend)
else else
{ {
tempptr = ptr; tempptr = ptr;
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options,
options, TRUE, cb); cb->cx->extra_options, TRUE, cb);
if (errorcode != 0) if (errorcode != 0)
{ {
if ((cb->cx->extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0) if ((extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
goto FAILED; goto FAILED;
ptr = tempptr; ptr = tempptr;
if (ptr >= ptrend) c = CHAR_BACKSLASH; else if (ptr >= ptrend) c = CHAR_BACKSLASH; else
@ -4545,12 +4586,12 @@ parsed_pattern = manage_callouts(ptr, &previous_callout, auto_callout,
/* Insert trailing items for word and line matching (features provided for the /* Insert trailing items for word and line matching (features provided for the
benefit of pcre2grep). */ benefit of pcre2grep). */
if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_LINE) != 0) if ((extra_options & PCRE2_EXTRA_MATCH_LINE) != 0)
{ {
*parsed_pattern++ = META_KET; *parsed_pattern++ = META_KET;
*parsed_pattern++ = META_DOLLAR; *parsed_pattern++ = META_DOLLAR;
} }
else if ((cb->cx->extra_options & PCRE2_EXTRA_MATCH_WORD) != 0) else if ((extra_options & PCRE2_EXTRA_MATCH_WORD) != 0)
{ {
*parsed_pattern++ = META_KET; *parsed_pattern++ = META_KET;
*parsed_pattern++ = META_ESCAPE + ESC_b; *parsed_pattern++ = META_ESCAPE + ESC_b;
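
The braced form of \u added to check_escape() above can be read in isolation as a small scanner. The following is a minimal standalone sketch of that loop, not the PCRE2 code itself: the names hex_value, parse_braced_hex, parse_braced_cstr and the braced_status enum are hypothetical, and the sketch returns a status code where the patch sets ERR77 or leaves the escape unrecognized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes; the patch itself sets ERR77 on overflow and
   otherwise just leaves the escape unrecognized. */
enum braced_status { BRACED_OK, BRACED_NOT_RECOGNIZED, BRACED_OVERFLOW };

/* Stand-in for the XDIGIT() macro: hex digit value, or 0xff if not hex. */
uint32_t hex_value(unsigned char ch)
{
  if (ch >= '0' && ch <= '9') return (uint32_t)(ch - '0');
  if (ch >= 'a' && ch <= 'f') return (uint32_t)(ch - 'a' + 10);
  if (ch >= 'A' && ch <= 'F') return (uint32_t)(ch - 'A' + 10);
  return 0xff;
}

/* p points at the '{' that follows "\u"; end is one past the last input
   character. On BRACED_OK, *value holds the code point and *consumed the
   number of characters used, including both braces. */
enum braced_status parse_braced_hex(const char *p, const char *end,
  uint32_t *value, size_t *consumed)
{
  const char *h;
  uint32_t cc = 0, xc;

  if (p >= end || *p != '{') return BRACED_NOT_RECOGNIZED;
  h = p + 1;
  while (h < end && (xc = hex_value((unsigned char)*h)) != 0xff)
  {
    /* A non-zero top nibble means the next shift would lose bits. */
    if ((cc & 0xf0000000u) != 0) return BRACED_OVERFLOW;
    cc = (cc << 4) | xc;
    h++;
  }
  if (h == p + 1 ||        /* No hex digits */
      h >= end ||          /* Hit end of input */
      *h != '}')           /* No } terminator */
    return BRACED_NOT_RECOGNIZED;

  *value = cc;
  *consumed = (size_t)(h + 1 - p);
  return BRACED_OK;
}

/* Convenience wrapper for NUL-terminated strings. */
enum braced_status parse_braced_cstr(const char *s, uint32_t *value,
  size_t *consumed)
{
  assert(s != NULL);
  return parse_braced_hex(s, s + strlen(s), value, consumed);
}
```

Because the overflow test looks only at the top nibble before each shift, leading zeros never trigger it, which is consistent with the long zero-padded \u{...10ffff} pattern added to testinput5 in this commit. The range checks (0x10ffff limit, surrogates) are applied afterwards and are not part of this sketch.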

@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge New API code Copyright (c) 2016-2019 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -1942,7 +1942,7 @@ is available. */
extern int _pcre2_auto_possessify(PCRE2_UCHAR *, BOOL, extern int _pcre2_auto_possessify(PCRE2_UCHAR *, BOOL,
const compile_block *); const compile_block *);
extern int _pcre2_check_escape(PCRE2_SPTR *, PCRE2_SPTR, uint32_t *, extern int _pcre2_check_escape(PCRE2_SPTR *, PCRE2_SPTR, uint32_t *,
int *, uint32_t, BOOL, compile_block *); int *, uint32_t, uint32_t, BOOL, compile_block *);
extern PCRE2_SPTR _pcre2_extuni(uint32_t, PCRE2_SPTR, PCRE2_SPTR, PCRE2_SPTR, extern PCRE2_SPTR _pcre2_extuni(uint32_t, PCRE2_SPTR, PCRE2_SPTR, PCRE2_SPTR,
BOOL, int *); BOOL, int *);
extern PCRE2_SPTR _pcre2_find_bracket(PCRE2_SPTR, BOOL, int); extern PCRE2_SPTR _pcre2_find_bracket(PCRE2_SPTR, BOOL, int);

@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge New API code Copyright (c) 2016-2019 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -129,7 +129,7 @@ for (; ptr < ptrend; ptr++)
ptr += 1; /* Must point after \ */ ptr += 1; /* Must point after \ */
erc = PRIV(check_escape)(&ptr, ptrend, &ch, &errorcode, erc = PRIV(check_escape)(&ptr, ptrend, &ch, &errorcode,
code->overall_options, FALSE, NULL); code->overall_options, code->extra_options, FALSE, NULL);
ptr -= 1; /* Back to last code unit of escape */ ptr -= 1; /* Back to last code unit of escape */
if (errorcode != 0) if (errorcode != 0)
{ {
@ -774,7 +774,7 @@ do
ptr++; /* Point after \ */ ptr++; /* Point after \ */
rc = PRIV(check_escape)(&ptr, repend, &ch, &errorcode, rc = PRIV(check_escape)(&ptr, repend, &ch, &errorcode,
code->overall_options, FALSE, NULL); code->overall_options, code->extra_options, FALSE, NULL);
if (errorcode != 0) goto BADESCAPE; if (errorcode != 0) goto BADESCAPE;
switch(rc) switch(rc)

@ -646,6 +646,7 @@ static modstruct modlist[] = {
{ "expand", MOD_PAT, MOD_CTL, CTL_EXPAND, PO(control) }, { "expand", MOD_PAT, MOD_CTL, CTL_EXPAND, PO(control) },
{ "extended", MOD_PATP, MOD_OPT, PCRE2_EXTENDED, PO(options) }, { "extended", MOD_PATP, MOD_OPT, PCRE2_EXTENDED, PO(options) },
{ "extended_more", MOD_PATP, MOD_OPT, PCRE2_EXTENDED_MORE, PO(options) }, { "extended_more", MOD_PATP, MOD_OPT, PCRE2_EXTENDED_MORE, PO(options) },
{ "extra_alt_bsux", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ALT_BSUX, CO(extra_options) },
{ "find_limits", MOD_DAT, MOD_CTL, CTL_FINDLIMITS, DO(control) }, { "find_limits", MOD_DAT, MOD_CTL, CTL_FINDLIMITS, DO(control) },
{ "firstline", MOD_PAT, MOD_OPT, PCRE2_FIRSTLINE, PO(options) }, { "firstline", MOD_PAT, MOD_OPT, PCRE2_FIRSTLINE, PO(options) },
{ "framesize", MOD_PAT, MOD_CTL, CTL_FRAMESIZE, PO(control) }, { "framesize", MOD_PAT, MOD_CTL, CTL_FRAMESIZE, PO(control) },
@ -4189,10 +4190,11 @@ show_compile_extra_options(uint32_t options, const char *before,
const char *after) const char *after)
{ {
if (options == 0) fprintf(outfile, "%s <none>%s", before, after); if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s%s%s%s%s", else fprintf(outfile, "%s%s%s%s%s%s%s%s",
before, before,
((options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0)? " allow_surrogate_escapes" : "", ((options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0)? " allow_surrogate_escapes" : "",
((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "", ((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "",
((options & PCRE2_EXTRA_ALT_BSUX) != 0)? " extra_alt_bsux" : "",
((options & PCRE2_EXTRA_MATCH_WORD) != 0)? " match_word" : "", ((options & PCRE2_EXTRA_MATCH_WORD) != 0)? " match_word" : "",
((options & PCRE2_EXTRA_MATCH_LINE) != 0)? " match_line" : "", ((options & PCRE2_EXTRA_MATCH_LINE) != 0)? " match_line" : "",
((options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)? " escaped_cr_is_lf" : "", ((options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)? " escaped_cr_is_lf" : "",

testdata/testinput2

@ -2408,13 +2408,13 @@
\= Expect no match \= Expect no match
cat cat
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames /(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat cat
/TA]/ /TA]/
The ACTA] comes The ACTA] comes
/TA]/alt_bsux,allow_empty_class,match_unset_backref,dupnames /TA]/allow_empty_class,match_unset_backref,dupnames
The ACTA] comes The ACTA] comes
/(?2)[]a()b](abc)/ /(?2)[]a()b](abc)/
@ -2446,25 +2446,25 @@
/a[^]b/ /a[^]b/
/a[]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[]b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match \= Expect no match
ab ab
/a[]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[]+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match \= Expect no match
ab ab
/a[]*+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[]*+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match \= Expect no match
ab ab
/a[^]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[^]b/allow_empty_class,match_unset_backref,dupnames
aXb aXb
a\nb a\nb
\= Expect no match \= Expect no match
ab ab
/a[^]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[^]+b/allow_empty_class,match_unset_backref,dupnames
aXb aXb
a\nX\nXb a\nX\nXb
\= Expect no match \= Expect no match
@ -2903,10 +2903,10 @@
xxxxabcde\=ps xxxxabcde\=ps
xxxxabcde\=ph xxxxabcde\=ph
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames /(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat cat
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames /(\3)(\1)(a)/I,allow_empty_class,match_unset_backref,dupnames
cat cat
/(\3)(\1)(a)/I /(\3)(\1)(a)/I
@ -3419,6 +3419,14 @@
\= Expect no match \= Expect no match
aAz aAz
/^\u{7a}/alt_bsux
u{7a}
\= Expect no match
zoo
/^\u{7a}/extra_alt_bsux
zoo
/(?(?=c)c|d)++Y/B /(?(?=c)c|d)++Y/B
/(?(?=c)c|d)*+Y/B /(?(?=c)c|d)*+Y/B

testdata/testinput5

@ -333,13 +333,13 @@
/[[:a\x{100}b:]]/utf /[[:a\x{100}b:]]/utf
/a[^]b/utf,alt_bsux,allow_empty_class,match_unset_backref /a[^]b/utf,allow_empty_class,match_unset_backref
a\x{1234}b a\x{1234}b
a\nb a\nb
\= Expect no match \= Expect no match
ab ab
/a[^]+b/utf,alt_bsux,allow_empty_class,match_unset_backref /a[^]+b/utf,allow_empty_class,match_unset_backref
aXb aXb
a\nX\nX\x{1234}b a\nX\nX\x{1234}b
\= Expect no match \= Expect no match
@ -814,6 +814,9 @@
/\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref /\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref
/^\u{0000000000010ffff}/utf,extra_alt_bsux
\x{10ffff}
/^a+[a\x{200}]/B,utf /^a+[a\x{200}]/B,utf
aa aa
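
The two UTF tests in this file (\ud800 and the zero-padded 10ffff) exercise the range check that check_escape() applies after a \u value has been read. A minimal sketch of that decision follows, under stated assumptions: the function name check_code_point, the cp_status enum, and passing the non-UTF limit as a parameter are all hypothetical. In the patch the limit is the MAX_NON_UTF_CHAR macro, and the two failure cases set ERR77 and ERR73 respectively.

```c
#include <stdint.h>

/* Hypothetical result codes standing in for "no error", ERR77 and ERR73. */
enum cp_status { CP_OK, CP_TOO_LARGE, CP_SURROGATE };

/* c is the value read from the escape. utf is non-zero in UTF mode.
   allow_surrogates models PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES being set.
   max_non_utf is an assumed stand-in for MAX_NON_UTF_CHAR, whose value
   depends on the code unit width of the library build. */
enum cp_status check_code_point(uint32_t c, int utf, int allow_surrogates,
  uint32_t max_non_utf)
{
  if (utf)
  {
    if (c > 0x10ffffu) return CP_TOO_LARGE;
    if (c >= 0xd800 && c <= 0xdfff && !allow_surrogates)
      return CP_SURROGATE;
    return CP_OK;
  }
  return (c > max_non_utf) ? CP_TOO_LARGE : CP_OK;
}
```

With these rules 0x10ffff is accepted in UTF mode, while 0xd800 is rejected unless surrogate escapes are explicitly allowed.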

testdata/testoutput2

@ -8774,7 +8774,7 @@ No match
cat cat
No match No match
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames /(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat cat
0: a 0: a
1: 1:
@ -8785,7 +8785,7 @@ No match
The ACTA] comes The ACTA] comes
0: TA] 0: TA]
/TA]/alt_bsux,allow_empty_class,match_unset_backref,dupnames /TA]/allow_empty_class,match_unset_backref,dupnames
The ACTA] comes The ACTA] comes
0: TA] 0: TA]
@ -8833,22 +8833,22 @@ Failed: error 106 at offset 4: missing terminating ] for character class
/a[^]b/ /a[^]b/
Failed: error 106 at offset 5: missing terminating ] for character class Failed: error 106 at offset 5: missing terminating ] for character class
/a[]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[]b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match \= Expect no match
ab ab
No match No match
/a[]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[]+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match \= Expect no match
ab ab
No match No match
/a[]*+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[]*+b/allow_empty_class,match_unset_backref,dupnames
\= Expect no match \= Expect no match
ab ab
No match No match
/a[^]b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[^]b/allow_empty_class,match_unset_backref,dupnames
aXb aXb
0: aXb 0: aXb
a\nb a\nb
@ -8857,7 +8857,7 @@ No match
ab ab
No match No match
/a[^]+b/alt_bsux,allow_empty_class,match_unset_backref,dupnames /a[^]+b/allow_empty_class,match_unset_backref,dupnames
aXb aXb
0: aXb 0: aXb
a\nX\nXb a\nX\nXb
@ -9971,17 +9971,17 @@ Partial match: abca
xxxxabcde\=ph xxxxabcde\=ph
Partial match: abcde Partial match: abcde
/(\3)(\1)(a)/alt_bsux,allow_empty_class,match_unset_backref,dupnames /(\3)(\1)(a)/allow_empty_class,match_unset_backref,dupnames
cat cat
0: a 0: a
1: 1:
2: 2:
3: a 3: a
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames /(\3)(\1)(a)/I,allow_empty_class,match_unset_backref,dupnames
Capture group count = 3 Capture group count = 3
Max back reference = 3 Max back reference = 3
Options: alt_bsux allow_empty_class dupnames match_unset_backref Options: allow_empty_class dupnames match_unset_backref
Last code unit = 'a' Last code unit = 'a'
Subject length lower bound = 1 Subject length lower bound = 1
cat cat
@ -11365,6 +11365,17 @@ No match
aAz aAz
No match No match
/^\u{7a}/alt_bsux
u{7a}
0: u{7a}
\= Expect no match
zoo
No match
/^\u{7a}/extra_alt_bsux
zoo
0: z
/(?(?=c)c|d)++Y/B /(?(?=c)c|d)++Y/B
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra

@ -798,7 +798,7 @@ No match
/[[:a\x{100}b:]]/utf /[[:a\x{100}b:]]/utf
Failed: error 130 at offset 3: unknown POSIX class name Failed: error 130 at offset 3: unknown POSIX class name
/a[^]b/utf,alt_bsux,allow_empty_class,match_unset_backref /a[^]b/utf,allow_empty_class,match_unset_backref
a\x{1234}b a\x{1234}b
0: a\x{1234}b 0: a\x{1234}b
a\nb a\nb
@ -807,7 +807,7 @@ Failed: error 130 at offset 3: unknown POSIX class name
ab ab
No match No match
/a[^]+b/utf,alt_bsux,allow_empty_class,match_unset_backref /a[^]+b/utf,allow_empty_class,match_unset_backref
aXb aXb
0: aXb 0: aXb
a\nX\nX\x{1234}b a\nX\nX\x{1234}b
@ -1734,6 +1734,10 @@ No match
/\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref /\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref
Failed: error 173 at offset 6: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) Failed: error 173 at offset 6: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
/^\u{0000000000010ffff}/utf,extra_alt_bsux
\x{10ffff}
0: \x{10ffff}
/^a+[a\x{200}]/B,utf /^a+[a\x{200}]/B,utf
------------------------------------------------------------------ ------------------------------------------------------------------
Bra Bra