Implement support for invalid UTF in the pcre2_match() interpreter.

This commit is contained in:
Philip.Hazel 2019-05-24 17:15:48 +00:00
parent 2ad4329f83
commit 16c046ce50
48 changed files with 2780 additions and 1783 deletions

View File

@ -14,6 +14,10 @@ detects invalid characters in the 0xd800-0xdfff range.
3. Fix minor typo bug in JIT compile when \X is used in a non-UTF string.
4. Add support for matching in invalid UTF strings to the pcre2_match()
interpreter, and integrate with the existing JIT support via the new
PCRE2_MATCH_INVALID_UTF compile-time option.
Version 10.33 16-April-2019
---------------------------

View File

@ -65,6 +65,7 @@ The option bits are:
PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns

View File

@ -40,8 +40,12 @@ bits:
PCRE2_JIT_COMPLETE compile code for full matching
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
</pre>
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF. The old
option is deprecated and may be removed in future.
</P>
<P>
The yield of the function is 0 for success, or a negative error code otherwise.
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
if an unknown bit is set in <i>options</i>.

View File

@ -1347,11 +1347,12 @@ and <b>pcre2_compile()</b> returns a non-NULL value.
<P>
There are nearly 100 positive error codes that <b>pcre2_compile()</b> may return
if it finds an error in the pattern. There are also some negative error codes
that are used for invalid UTF strings. These are the same as given by
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described in the
that are used for invalid UTF strings when validity checking is in force. These
are the same as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and
are described in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. There is no separate documentation for the positive error codes, because
the textual error messages that are obtained by calling the
documentation. There is no separate documentation for the positive error codes,
because the textual error messages that are obtained by calling the
<b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
message"
<a href="#geterrormessage">below)</a>
@ -1615,10 +1616,18 @@ expression engine is not the most efficient way of doing it. If you are doing a
lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
<pre>
PCRE2_MATCH_INVALID_UTF
</pre>
This option forces PCRE2_UTF (see below) and also enables support for matching
by <b>pcre2_match()</b> in subject strings that contain invalid UTF sequences.
This facility is not supported for DFA matching. For details, see the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
<pre>
PCRE2_MATCH_UNSET_BACKREF
</pre>
@ -2653,15 +2662,22 @@ of JIT; it forces matching to be done by the interpreter.
PCRE2_NO_UTF_CHECK
</pre>
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
string is checked by default when <b>pcre2_match()</b> is subsequently called.
If a non-zero starting offset is given, the check is applied only to that part
of the subject that could be inspected during matching, and there is a check
that the starting offset points to the first code unit of a character or to the
end of the subject. If there are no lookbehind assertions in the pattern, the
check starts at the starting offset. Otherwise, it starts at the length of the
longest lookbehind before the starting offset, or at the start of the subject
if there are not that many characters before the starting offset. Note that the
sequences \b and \B are one-character lookbehinds.
string is checked unless PCRE2_NO_UTF_CHECK is passed to <b>pcre2_match()</b> or
PCRE2_MATCH_INVALID_UTF was passed to <b>pcre2_compile()</b>. The latter special
case is discussed in detail in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
In the default case, if a non-zero starting offset is given, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \b and \B are
one-character lookbehinds.
</P>
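The rule in the paragraph above for where the validity check begins can be sketched for UTF-8 as follows. This is only an illustration of the described behaviour, not the library's code; the function name `utf8_check_start` and its parameters are hypothetical, and it assumes that a byte of the form 10xxxxxx is a continuation byte.

```c
#include <stddef.h>

/* Hypothetical helper: return the offset at which a UTF-8 validity check
   would begin, given the starting offset and the length (in characters)
   of the longest lookbehind in the pattern. */
size_t utf8_check_start(const unsigned char *subject, size_t start_offset,
                        size_t max_lookbehind)
{
    size_t pos = start_offset;

    /* Move back one character per unit of lookbehind, stopping at the
       start of the subject if there are not that many characters. */
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

With no lookbehind the function returns the starting offset unchanged. A pattern containing \b or \B would correspond to a `max_lookbehind` of at least 1.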
<P>
The check is carried out before any other processing takes place, and a
@ -2674,19 +2690,20 @@ and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
documentation.
</P>
<P>
If you know that your subject is valid, and you want to skip these checks for
If you know that your subject is valid, and you want to skip this check for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
<b>pcre2_match()</b>. You might want to do this for the second and subsequent
calls to <b>pcre2_match()</b> if you are making repeated calls to find other
calls to <b>pcre2_match()</b> if you are making repeated calls to find multiple
matches in the same subject string.
</P>
<P>
<b>Warning:</b> When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
<b>Warning:</b> Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
Your program may crash or loop indefinitely.
Your program may crash or loop indefinitely or give wrong results.
<pre>
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@ -3771,6 +3788,12 @@ a backreference.
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
that uses a backreference for the condition, or a test for recursion in a
specific capture group. These are not supported.
<pre>
PCRE2_ERROR_DFA_UINVALID_UTF
</pre>
This return is given if <b>pcre2_dfa_match()</b> is called for a pattern that
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
matching.
<pre>
PCRE2_ERROR_DFA_WSSIZE
</pre>
@ -3808,7 +3831,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 14 February 2019
Last updated: 23 May 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

View File

@ -147,25 +147,29 @@ pattern.
</P>
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
<P>
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
function expects its subject string to be a valid sequence of UTF code units.
If it is not, the result is undefined. This is also true by default of matching
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
UTF is compiled.
When a pattern is compiled with the PCRE2_UTF option, subject strings are
normally expected to be a valid sequence of UTF code units. By default, this is
checked at the start of matching and an error is generated if invalid UTF is
detected. The PCRE2_NO_UTF_CHECK option can be passed to <b>pcre2_match()</b> to
skip the check (for improved performance) if you are sure that a subject string
is valid. If this option is used with an invalid string, the result is
undefined.
</P>
<P>
In this mode, an invalid code unit sequence never matches any pattern item. It
does not match dot, it does not match \p{Any}, it does not even match negative
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
sequence while moving the current point backwards. In other words, an invalid
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
invalid sequence causes an immediate backtrack.
However, a way of running matches on strings that may contain invalid UTF
sequences is available. Calling <b>pcre2_compile()</b> with the
PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
<b>pcre2_match()</b> to support invalid UTF, and, if <b>pcre2_jit_compile()</b>
is called, the compiled JIT code also supports invalid UTF. Details of how this
support works, in both the JIT and the interpretive cases, are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
Using this option, an application can run matches in arbitrary data, knowing
that any matched strings that are returned will be valid UTF. This can be
useful when searching for text in executable or other binary files.
There is also an obsolete option for <b>pcre2_jit_compile()</b> called
PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
It is superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF
and should no longer be used. It may be removed in future.
</P>
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<P>
@ -461,7 +465,7 @@ Cambridge, England.
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
Last updated: 06 March 2019
Last updated: 23 May 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

View File

@ -188,6 +188,10 @@ code unit) at a time, for all active paths through the tree.
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
</P>
<P>
10. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not
supported by <b>pcre2_dfa_match()</b>.
</P>
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
Using the alternative matching algorithm provides the following advantages:
@ -219,7 +223,8 @@ because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
</P>
<P>
2. Capturing parentheses, backreferences, and script runs are not supported.
2. Capturing parentheses, backreferences, script runs, and matching within
invalid UTF strings are not supported.
</P>
<P>
3. Although atomic groups are supported, their use does not provide the
@ -236,9 +241,9 @@ Cambridge, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 10 October 2018
Last updated: 23 May 2019
<br>
Copyright &copy; 1997-2018 University of Cambridge.
Copyright &copy; 1997-2019 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -91,10 +91,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with the PCRE2_UTF option, or the
pattern must start with the special sequence (*UTF), which is equivalent to
setting the relevant option. How setting a UTF mode affects pattern matching is
mentioned in several places below. There is also a summary of features in the
you must either call the compiling function with one or both of the PCRE2_UTF
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting PCRE2_UTF. How
setting a UTF mode affects pattern matching is mentioned in several places
below. There is also a summary of features in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
@ -428,11 +429,11 @@ There may be any number of hexadecimal digits. This syntax is from ECMAScript
6.
</P>
<P>
The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
\N{name} to specify characters by Unicode name; PCRE2 does not support this.
Note that when \N is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline.
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
does not support this. Note that when \N is not followed by an opening brace
(curly bracket) it has an entirely different meaning, matching any character
that is not a newline.
</P>
<P>
There are some legacy applications where the escape sequence \r is expected to
@ -1360,7 +1361,7 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK option is used).
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
</P>
<P>
An application can lock out the use of \C by setting the
@ -3727,7 +3728,7 @@ Cambridge, England.
</P>
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
<P>
Last updated: 12 February 2019
Last updated: 23 May 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

View File

@ -613,6 +613,7 @@ for a description of the effects of these options.
firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
@ -2078,7 +2079,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 11 March 2019
Last updated: 23 May 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

View File

@ -16,22 +16,33 @@ please consult the man page, in case the conversion went wrong.
UNICODE AND UTF SUPPORT
</b><br>
<P>
When PCRE2 is built with Unicode support (which is the default), it has
knowledge of Unicode character properties and can process text strings in
UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
default, PCRE2 assumes that one code unit is one character. To process a
pattern as a UTF string, where a character may require more than one code unit,
you must call
<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
with the PCRE2_UTF option flag, or the pattern must start with the sequence
(*UTF). When either of these is the case, both the pattern and any subject
strings that are matched against it are treated as UTF strings instead of
strings of individual one-code-unit characters. There are also some other
changes to the way characters are handled, as documented below.
PCRE2 is normally built with Unicode support, though if you do not need it, you
can build it without, in which case the library will be smaller. With Unicode
support, PCRE2 has knowledge of Unicode character properties and can process
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
width), but this is not the default. Unless specifically requested, PCRE2
treats each code unit in a string as one character.
</P>
<P>
If you do not need Unicode support you can build PCRE2 without it, in which
case the library will be smaller.
There are two ways of telling PCRE2 to switch to UTF mode, where characters may
consist of more than one code unit and the range of values is constrained. The
program can call
<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
That is, the programmer can prevent the supplier of the pattern from switching
to UTF mode.
</P>
<P>
Note that the PCRE2_MATCH_INVALID_UTF option (see
<a href="#matchinvalid">below)</a>
forces PCRE2_UTF to be set.
</P>
<P>
In UTF mode, both the pattern and any subject strings that are matched against
it are treated as UTF strings instead of strings of individual one-code-unit
characters. There are also some other changes to the way characters are
handled, as documented below.
</P>
<br><b>
UNICODE PROPERTY SUPPORT
@ -63,22 +74,22 @@ also recognized; larger ones can be coded using \o{...}.
<P>
The escape sequence \N{U+&#60;hex digits&#62;} is recognized as another way of
specifying a Unicode character by code point in a UTF mode. It is not allowed
in non-UTF modes.
in non-UTF mode.
</P>
<P>
In UTF modes, repeat quantifiers apply to complete UTF characters, not to
In UTF mode, repeat quantifiers apply to complete UTF characters, not to
individual code units.
</P>
<P>
In UTF modes, the dot metacharacter matches one UTF character instead of a
In UTF mode, the dot metacharacter matches one UTF character instead of a
single code unit.
</P>
<P>
In UTF modes, capture group names are not restricted to ASCII, and may contain
In UTF mode, capture group names are not restricted to ASCII, and may contain
any Unicode letters and decimal digits, as well as underscore.
</P>
<P>
The escape sequence \C can be used to match a single code unit in a UTF mode,
The escape sequence \C can be used to match a single code unit in UTF mode,
but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \C in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
@ -93,7 +104,7 @@ may consist of more than one code unit. The use of \C in these modes provokes
a match-time error. Also, the JIT optimization does not support \C in these
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
the matching will be carried out by the normal interpretive function.
the matching will be carried out by the interpretive function.
</P>
<P>
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
@ -123,14 +134,14 @@ However, the special horizontal and vertical white space matching escapes (\h,
not PCRE2_UCP is set.
</P>
<br><b>
CASE-EQUIVALENCE IN UTF MODES
CASE-EQUIVALENCE IN UTF MODE
</b><br>
<P>
Case-insensitive matching in a UTF mode makes use of Unicode properties except
Case-insensitive matching in UTF mode makes use of Unicode properties except
for characters whose code points are less than 128 and that have at most two
case-equivalent values. For these, a direct table lookup is used for speed. A
few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such.
are case-equivalent, and these are treated specially.
<a name="scriptruns"></a></P>
<br><b>
SCRIPT RUNS
@ -248,7 +259,7 @@ VALIDITY OF UTF STRINGS
<P>
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions. If an
invalid UTF string is passed, an negative error code is returned. The code unit
invalid UTF string is passed, a negative error code is returned. The code unit
offset to the offending character can be extracted from the match data block by
calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
error.
@ -263,17 +274,16 @@ only valid UTF code unit sequences.
</P>
<P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is usually undefined and your program may crash or loop indefinitely. There is,
however, one mode of matching that can handle invalid UTF subject strings. This
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
when calling <b>pcre2_jit_compile()</b>. For details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
is undefined and your program may crash or loop indefinitely or give incorrect
results. There is, however, one mode of matching that can handle invalid UTF
subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
<b>pcre2_compile()</b> and is discussed in the next section. The rest of
this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
</P>
<P>
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this same option to
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the UTF check
for the pattern; it does not also apply to subject strings. If you want to
disable the check for a subject string you must pass this same option to
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
</P>
<P>
@ -352,7 +362,7 @@ these code points are excluded by RFC 3629.
<pre>
PCRE2_ERROR_UTF8_ERR13
</pre>
A 4-byte character has a value greater than 0x10fff; these code points are
A 4-byte character has a value greater than 0x10ffff; these code points are
excluded by RFC 3629.
<pre>
PCRE2_ERROR_UTF8_ERR14
@ -405,7 +415,59 @@ The following negative error codes are given for invalid UTF-32 strings:
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
</PRE>
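The two UTF-32 conditions listed above can be expressed as a simple range test. The sketch below is illustrative only: the enum and function names are hypothetical stand-ins, not the PCRE2 error constants.

```c
#include <stdint.h>

/* Hypothetical status values mirroring the two UTF-32 error conditions. */
enum utf32_status {
    UTF32_OK = 0,
    UTF32_SURROGATE = 1,   /* 0xd800 to 0xdfff */
    UTF32_TOO_LARGE = 2    /* greater than 0x10ffff */
};

/* Classify a single UTF-32 code unit. */
enum utf32_status classify_utf32(uint32_t cp)
{
    if (cp >= 0xd800 && cp <= 0xdfff) return UTF32_SURROGATE;
    if (cp > 0x10ffff) return UTF32_TOO_LARGE;
    return UTF32_OK;
}
```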
<a name="matchinvalid"></a></PRE>
</P>
<br><b>
MATCHING IN INVALID UTF STRINGS
</b><br>
<P>
You can run pattern matches on subject strings that may contain invalid UTF
sequences if you call <b>pcre2_compile()</b> with the PCRE2_MATCH_INVALID_UTF
option. This is supported by <b>pcre2_match()</b>, including JIT matching, but
not by <b>pcre2_dfa_match()</b>. When PCRE2_MATCH_INVALID_UTF is set, it forces
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
valid UTF string.
</P>
<P>
Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
generate different code. If JIT is not used, the option affects the behaviour
of the interpretive code in <b>pcre2_match()</b>. When PCRE2_MATCH_INVALID_UTF
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
</P>
<P>
In this mode, an invalid code unit sequence in the subject never matches any
pattern item. It does not match dot, it does not match \p{Any}, it does not
even match negative items such as [^X]. A lookbehind assertion fails if it
encounters an invalid sequence while moving the current point backwards. In
other words, an invalid UTF code unit sequence acts as a barrier which no match
can cross.
</P>
<P>
You can also think of this as the subject being split up into fragments of
valid UTF, delimited internally by invalid code unit sequences. The pattern is
matched fragment by fragment. The result of a successful match, however, is
given as code unit offsets in the entire subject string in the usual way. There
are a few points to consider:
</P>
<P>
The internal boundaries are not interpreted as the beginnings or ends of lines
and so do not match circumflex or dollar characters in the pattern.
</P>
<P>
If <b>pcre2_match()</b> is called with an offset that points to an invalid
UTF sequence, that sequence is skipped, and the match starts at the next valid
UTF character, or the end of the subject.
</P>
<P>
At internal fragment boundaries, \b and \B behave in the same way as at the
beginning and end of the subject. For example, a sequence such as \bWORD\b
would match an instance of WORD that is surrounded by invalid UTF code units.
</P>
<P>
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
data, knowing that any matched strings that are returned are valid UTF. This
can be useful when searching for UTF text in executable or other binary files.
</P>
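The fragment model described in this section can be illustrated for UTF-8 with a small stand-alone routine that counts the maximal runs of valid UTF-8 in a byte buffer. This is a sketch under assumptions, not how PCRE2 is implemented: the names `utf8_seq_len` and `count_valid_fragments` are hypothetical, validity follows RFC 3629, and each invalid byte is treated as a one-byte barrier.

```c
#include <stddef.h>

/* Length of the valid UTF-8 sequence starting at p (n bytes available),
   or 0 if the bytes at p do not start a valid sequence. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char c, lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    c = p[0];
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) {
        len = 2;
    } else if (c >= 0xE0 && c <= 0xEF) {
        len = 3;
        if (c == 0xE0) lo = 0xA0;           /* reject overlong forms */
        if (c == 0xED) hi = 0x9F;           /* reject surrogates */
    } else if (c >= 0xF0 && c <= 0xF4) {
        len = 4;
        if (c == 0xF0) lo = 0x90;           /* reject overlong forms */
        if (c == 0xF4) hi = 0x8F;           /* reject values > 0x10ffff */
    } else {
        return 0;                           /* stray continuation or bad lead */
    }
    if (n < len) return 0;                  /* truncated sequence */
    if (p[1] < lo || p[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;
    return len;
}

/* Count the maximal runs of valid UTF-8 characters in s[0..n-1]. */
size_t count_valid_fragments(const unsigned char *s, size_t n)
{
    size_t count = 0, i = 0;
    int in_fragment = 0;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len > 0) {
            if (!in_fragment) { count++; in_fragment = 1; }
            i += len;
        } else {
            in_fragment = 0;                /* barrier: ends the fragment */
            i++;
        }
    }
    return count;
}
```

For example, the bytes `a b c 0xff d e f` contain two fragments, "abc" and "def", and no match can span the 0xff byte between them.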
<br><b>
AUTHOR
@ -422,7 +484,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 06 March 2019
Last updated: 24 May 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>

File diff suppressed because it is too large

View File

@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "11 February 2019" "PCRE2 10.33"
.TH PCRE2_COMPILE 3 "23 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -53,6 +53,7 @@ The option bits are:
PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns

View File

@ -1,4 +1,4 @@
.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
.TH PCRE2_JIT_COMPILE 3 "23 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -29,8 +29,11 @@ bits:
PCRE2_JIT_COMPLETE compile code for full matching
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
.sp
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF. The old
option is deprecated and may be removed in future.
.P
The yield of the function is 0 for success, or a negative error code otherwise.
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
if an unknown bit is set in \fIoptions\fP.

View File

@ -1,4 +1,4 @@
.TH PCRE2API 3 "14 February 2019" "PCRE2 10.33"
.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -1285,13 +1285,14 @@ and \fBpcre2_compile()\fP returns a non-NULL value.
.P
There are nearly 100 positive error codes that \fBpcre2_compile()\fP may return
if it finds an error in the pattern. There are also some negative error codes
that are used for invalid UTF strings. These are the same as given by
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described in the
that are used for invalid UTF strings when validity checking is in force. These
are the same as given by \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and
are described in the
.\" HREF
\fBpcre2unicode\fP
.\"
page. There is no separate documentation for the positive error codes, because
the textual error messages that are obtained by calling the
documentation. There is no separate documentation for the positive error codes,
because the textual error messages that are obtained by calling the
\fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
message"
.\" HTML <a href="#geterrormessage">
@ -1557,10 +1558,20 @@ expression engine is not the most efficient way of doing it. If you are doing a
lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
.sp
PCRE2_MATCH_INVALID_UTF
.sp
This option forces PCRE2_UTF (see below) and also enables support for matching
by \fBpcre2_match()\fP in subject strings that contain invalid UTF sequences.
This facility is not supported for DFA matching. For details, see the
.\" HREF
\fBpcre2unicode\fP
.\"
documentation.
.sp
PCRE2_MATCH_UNSET_BACKREF
.sp
@ -2635,15 +2646,23 @@ of JIT; it forces matching to be done by the interpreter.
PCRE2_NO_UTF_CHECK
.sp
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
string is checked by default when \fBpcre2_match()\fP is subsequently called.
If a non-zero starting offset is given, the check is applied only to that part
of the subject that could be inspected during matching, and there is a check
that the starting offset points to the first code unit of a character or to the
end of the subject. If there are no lookbehind assertions in the pattern, the
check starts at the starting offset. Otherwise, it starts at the length of the
longest lookbehind before the starting offset, or at the start of the subject
if there are not that many characters before the starting offset. Note that the
sequences \eb and \eB are one-character lookbehinds.
string is checked unless PCRE2_NO_UTF_CHECK is passed to \fBpcre2_match()\fP or
PCRE2_MATCH_INVALID_UTF was passed to \fBpcre2_compile()\fP. The latter special
case is discussed in detail in the
.\" HREF
\fBpcre2unicode\fP
.\"
documentation.
.P
In the default case, if a non-zero starting offset is given, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \eb and \eB are
one-character lookbehinds.
.P
The check is carried out before any other processing takes place, and a
negative error code is returned if the check fails. There are several UTF error
@ -2666,17 +2685,18 @@ in the
.\" HREF
\fBpcre2unicode\fP
.\"
page.
documentation.
.P
If you know that your subject is valid, and you want to skip these checks for
If you know that your subject is valid, and you want to skip this check for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
\fBpcre2_match()\fP. You might want to do this for the second and subsequent
calls to \fBpcre2_match()\fP if you are making repeated calls to find other
calls to \fBpcre2_match()\fP if you are making repeated calls to find multiple
matches in the same subject string.
.P
\fBWarning:\fP When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
\fBWarning:\fP Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
string as a subject, or an invalid value of \fIstartoffset\fP, is undefined.
Your program may crash or loop indefinitely.
Your program may crash or loop indefinitely or give wrong results.
.sp
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@ -3774,6 +3794,12 @@ a backreference.
This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
that uses a backreference for the condition, or a test for recursion in a
specific capture group. These are not supported.
.sp
PCRE2_ERROR_DFA_UINVALID_UTF
.sp
This return is given if \fBpcre2_dfa_match()\fP is called for a pattern that
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
matching.
.sp
PCRE2_ERROR_DFA_WSSIZE
.sp
@ -3817,6 +3843,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 14 February 2019
Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
.TH PCRE2JIT 3 "23 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -123,23 +123,29 @@ pattern.
.SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
.rs
.sp
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
function expects its subject string to be a valid sequence of UTF code units.
If it is not, the result is undefined. This is also true by default of matching
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
UTF is compiled.
When a pattern is compiled with the PCRE2_UTF option, subject strings are
normally expected to be a valid sequence of UTF code units. By default, this is
checked at the start of matching and an error is generated if invalid UTF is
detected. The PCRE2_NO_UTF_CHECK option can be passed to \fBpcre2_match()\fP to
skip the check (for improved performance) if you are sure that a subject string
is valid. If this option is used with an invalid string, the result is
undefined.
.P
In this mode, an invalid code unit sequence never matches any pattern item. It
does not match dot, it does not match \ep{Any}, it does not even match negative
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
sequence while moving the current point backwards. In other words, an invalid
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
invalid sequence causes an immediate backtrack.
However, a way of running matches on strings that may contain invalid UTF
sequences is available. Calling \fBpcre2_compile()\fP with the
PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
\fBpcre2_match()\fP to support invalid UTF, and, if \fBpcre2_jit_compile()\fP
is called, the compiled JIT code also supports invalid UTF. Details of how this
support works, in both the JIT and the interpretive cases, are given in the
.\" HREF
\fBpcre2unicode\fP
.\"
documentation.
.P
Using this option, an application can run matches in arbitrary data, knowing
that any matched strings that are returned will be valid UTF. This can be
useful when searching for text in executable or other binary files.
There is also an obsolete option for \fBpcre2_jit_compile()\fP called
PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
It is superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF
and should no longer be used. It may be removed in future.
.
.
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
@ -438,6 +444,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 06 March 2019
Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
.TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 MATCHING ALGORITHMS"
@ -157,6 +157,9 @@ code unit) at a time, for all active paths through the tree.
.P
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
.P
10. The PCRE2_MATCH_INVALID_UTF option for \fBpcre2_compile()\fP is not
supported by \fBpcre2_dfa_match()\fP.
.
.
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
@ -191,7 +194,8 @@ The alternative algorithm suffers from a number of disadvantages:
because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
.P
2. Capturing parentheses, backreferences, and script runs are not supported.
2. Capturing parentheses, backreferences, script runs, and matching within
invalid UTF strings are not supported.
.P
3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.
@ -211,6 +215,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 10 October 2018
Copyright (c) 1997-2018 University of Cambridge.
Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "12 February 2019" "PCRE2 10.33"
.TH PCRE2PATTERN 3 "23 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -52,10 +52,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with the PCRE2_UTF option, or the
pattern must start with the special sequence (*UTF), which is equivalent to
setting the relevant option. How setting a UTF mode affects pattern matching is
mentioned in several places below. There is also a summary of features in the
you must either call the compiling function with one or both of the PCRE2_UTF
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting the PCRE2_UTF option. How
setting a UTF mode affects pattern matching is mentioned in several places
below. There is also a summary of features in the
.\" HREF
\fBpcre2unicode\fP
.\"
@ -398,11 +399,11 @@ PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
There may be any number of hexadecimal digits. This syntax is from ECMAScript
6.
.P
The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
\eN{name} to specify characters by Unicode name; PCRE2 does not support this.
Note that when \eN is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline.
The \eN{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
does not support this. Note that when \eN is not followed by an opening brace
(curly bracket) it has an entirely different meaning, matching any character
that is not a newline.
.P
There are some legacy applications where the escape sequence \er is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
@ -1352,7 +1353,7 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK option is used).
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
.P
An application can lock out the use of \eC by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
@ -3763,6 +3764,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 12 February 2019
Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "11 March 2019" "PCRE 10.33"
.TH PCRE2TEST 1 "23 May 2019" "PCRE 10.34"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -572,6 +572,7 @@ for a description of the effects of these options.
firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
@ -2059,6 +2060,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 11 March 2019
Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -551,6 +551,7 @@ PATTERN MODIFIERS
firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
@ -1890,5 +1891,5 @@ AUTHOR
REVISION
Last updated: 11 March 2019
Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge.

@ -1,26 +1,38 @@
.TH PCRE2UNICODE 3 "11 May 2019" "PCRE2 10.33"
.TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
.rs
.sp
When PCRE2 is built with Unicode support (which is the default), it has
knowledge of Unicode character properties and can process text strings in
UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
default, PCRE2 assumes that one code unit is one character. To process a
pattern as a UTF string, where a character may require more than one code unit,
you must call
PCRE2 is normally built with Unicode support, though if you do not need it, you
can build it without, in which case the library will be smaller. With Unicode
support, PCRE2 has knowledge of Unicode character properties and can process
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
width), but this is not the default. Unless specifically requested, PCRE2
treats each code unit in a string as one character.
.P
There are two ways of telling PCRE2 to switch to UTF mode, where characters may
consist of more than one code unit and the range of values is constrained. The
program can call
.\" HREF
\fBpcre2_compile()\fP
.\"
with the PCRE2_UTF option flag, or the pattern must start with the sequence
(*UTF). When either of these is the case, both the pattern and any subject
strings that are matched against it are treated as UTF strings instead of
strings of individual one-code-unit characters. There are also some other
changes to the way characters are handled, as documented below.
with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
That is, the programmer can prevent the supplier of the pattern from switching
to UTF mode.
.P
If you do not need Unicode support you can build PCRE2 without it, in which
case the library will be smaller.
Note that the PCRE2_MATCH_INVALID_UTF option (see
.\" HTML <a href="#matchinvalid">
.\" </a>
below)
.\"
forces PCRE2_UTF to be set.
.P
In UTF mode, both the pattern and any subject strings that are matched against
it are treated as UTF strings instead of strings of individual one-code-unit
characters. There are also some other changes to the way characters are
handled, as documented below.
.
.
.SH "UNICODE PROPERTY SUPPORT"
@ -55,18 +67,18 @@ also recognized; larger ones can be coded using \eo{...}.
.P
The escape sequence \eN{U+<hex digits>} is recognized as another way of
specifying a Unicode character by code point in a UTF mode. It is not allowed
in non-UTF modes.
in non-UTF mode.
.P
In UTF modes, repeat quantifiers apply to complete UTF characters, not to
In UTF mode, repeat quantifiers apply to complete UTF characters, not to
individual code units.
.P
In UTF modes, the dot metacharacter matches one UTF character instead of a
In UTF mode, the dot metacharacter matches one UTF character instead of a
single code unit.
.P
In UTF modes, capture group names are not restricted to ASCII, and may contain
In UTF mode, capture group names are not restricted to ASCII, and may contain
any Unicode letters and decimal digits, as well as underscore.
.P
The escape sequence \eC can be used to match a single code unit in a UTF mode,
The escape sequence \eC can be used to match a single code unit in UTF mode,
but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \eC in the
.\" HREF
@ -82,7 +94,7 @@ may consist of more than one code unit. The use of \eC in these modes provokes
a match-time error. Also, the JIT optimization does not support \eC in these
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
the matching will be carried out by the normal interpretive function.
the matching will be carried out by the interpretive function.
.P
The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
characters of any code value, but, by default, the characters that PCRE2
@ -114,14 +126,14 @@ However, the special horizontal and vertical white space matching escapes (\eh,
not PCRE2_UCP is set.
.
.
.SH "CASE-EQUIVALENCE IN UTF MODES"
.SH "CASE-EQUIVALENCE IN UTF MODE"
.rs
.sp
Case-insensitive matching in a UTF mode makes use of Unicode properties except
Case-insensitive matching in UTF mode makes use of Unicode properties except
for characters whose code points are less than 128 and that have at most two
case-equivalent values. For these, a direct table lookup is used for speed. A
few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such.
are case-equivalent, and these are treated specially.
.
.
.\" HTML <a name="scriptruns"></a>
@ -231,7 +243,7 @@ adjacent characters.
.sp
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions. If an
invalid UTF string is passed, an negative error code is returned. The code unit
invalid UTF string is passed, a negative error code is returned. The code unit
offset to the offending character can be extracted from the match data block by
calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
error.
@ -244,18 +256,15 @@ PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is usually undefined and your program may crash or loop indefinitely. There is,
however, one mode of matching that can handle invalid UTF subject strings. This
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
when calling \fBpcre2_jit_compile()\fP. For details, see the
.\" HREF
\fBpcre2jit\fP
.\"
documentation.
is undefined and your program may crash or loop indefinitely or give incorrect
results. There is, however, one mode of matching that can handle invalid UTF
subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
\fBpcre2_compile()\fP and is discussed below in the next section. The rest of
this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this same option to
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the UTF check
for the pattern; it does not also apply to subject strings. If you want to
disable the check for a subject string you must pass this same option to
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
.P
UTF-16 and UTF-32 strings can indicate their endianness by special code known
@ -386,6 +395,52 @@ The following negative error codes are given for invalid UTF-32 strings:
.sp
.
.
.\" HTML <a name="matchinvalid"></a>
.SH "MATCHING IN INVALID UTF STRINGS"
.rs
.sp
You can run pattern matches on subject strings that may contain invalid UTF
sequences if you call \fBpcre2_compile()\fP with the PCRE2_MATCH_INVALID_UTF
option. This is supported by \fBpcre2_match()\fP, including JIT matching, but
not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
valid UTF string.
.P
Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
generate different code. If JIT is not used, the option affects the behaviour
of the interpretive code in \fBpcre2_match()\fP. When PCRE2_MATCH_INVALID_UTF
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
.P
In this mode, an invalid code unit sequence in the subject never matches any
pattern item. It does not match dot, it does not match \ep{Any}, it does not
even match negative items such as [^X]. A lookbehind assertion fails if it
encounters an invalid sequence while moving the current point backwards. In
other words, an invalid UTF code unit sequence acts as a barrier which no match
can cross.
.P
You can also think of this as the subject being split up into fragments of
valid UTF, delimited internally by invalid code unit sequences. The pattern is
matched fragment by fragment. The result of a successful match, however, is
given as code unit offsets in the entire subject string in the usual way. There
are a few points to consider:
.P
The internal boundaries are not interpreted as the beginnings or ends of lines
and so do not match circumflex or dollar characters in the pattern.
.P
If \fBpcre2_match()\fP is called with an offset that points to an invalid
UTF sequence, that sequence is skipped, and the match starts at the next valid
UTF character, or the end of the subject.
.P
At internal fragment boundaries, \eb and \eB behave in the same way as at the
beginning and end of the subject. For example, a sequence such as \ebWORD\eb
would match an instance of WORD that is surrounded by invalid UTF code units.
.P
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
data, knowing that any matched strings that are returned are valid UTF. This
can be useful when searching for UTF text in executable or other binary files.
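
The "fragments" view described above can be sketched for UTF-8 as follows. This is only an illustration under stated assumptions: both function names are hypothetical, the validity test is a simplified strict UTF-8 check, and PCRE2's own matcher does not necessarily pre-split the subject in this way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of the valid UTF-8 character at p,
   or 0 if the bytes there do not form one (overlong forms, surrogates and
   values above U+10FFFF are rejected). */
static size_t utf8_seq_len(const uint8_t *p, size_t avail)
{
    if (avail == 0) return 0;
    uint8_t b0 = p[0];
    size_t len;
    uint32_t cp, min;
    if (b0 < 0x80) return 1;
    else if ((b0 & 0xe0) == 0xc0) { len = 2; cp = b0 & 0x1f; min = 0x80; }
    else if ((b0 & 0xf0) == 0xe0) { len = 3; cp = b0 & 0x0f; min = 0x800; }
    else if ((b0 & 0xf8) == 0xf0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;                       /* stray continuation or bad lead byte */
    if (avail < len) return 0;           /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xc0) != 0x80) return 0;
        cp = (cp << 6) | (p[i] & 0x3f);
    }
    if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return 0;
    return len;
}

/* Hypothetical helper: find the next run of valid UTF-8 at or after 'from'.
   Returns 1 and sets [*start, *end) as byte offsets in the whole subject,
   or returns 0 if no valid character remains. */
int next_valid_fragment(const uint8_t *s, size_t len, size_t from,
                        size_t *start, size_t *end)
{
    assert(s != NULL && start != NULL && end != NULL);
    size_t pos = from;
    size_t n;
    while (pos < len && utf8_seq_len(s + pos, len - pos) == 0)
        pos++;                           /* skip invalid bytes (the barrier) */
    if (pos >= len) return 0;
    *start = pos;
    while (pos < len && (n = utf8_seq_len(s + pos, len - pos)) != 0)
        pos += n;                        /* extend over valid characters */
    *end = pos;
    return 1;
}
```

The offsets are relative to the entire subject, mirroring the way match results are reported as code unit offsets in the whole string.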
.
.
.SH AUTHOR
.rs
.sp
@ -400,6 +455,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 11 May 2019
Last updated: 24 May 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi

@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions.
Copyright (c) 2016-2018 University of Cambridge
Copyright (c) 2016-2019 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */
#define PCRE2_MAJOR 10
#define PCRE2_MINOR 33
#define PCRE2_PRERELEASE
#define PCRE2_DATE 2019-04-16
#define PCRE2_MINOR 34
#define PCRE2_PRERELEASE -RC1
#define PCRE2_DATE 2019-04-22
/* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE2, the appropriate
@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */
#define PCRE2_LITERAL 0x02000000u /* C */
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
/* An additional compile options word is available in the compile context. */
@ -305,6 +306,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197
/* "Expected" matching error codes: no match and partial match. */
@ -390,6 +392,7 @@ released, the numbers must not be changed. */
#define PCRE2_ERROR_HEAPLIMIT (-63)
#define PCRE2_ERROR_CONVERT_SYNTAX (-64)
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
#define PCRE2_ERROR_DFA_UINVALID_UTF (-66)
/* Request types for pcre2_pattern_info() */

@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions.
Copyright (c) 2016-2018 University of Cambridge
Copyright (c) 2016-2019 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */
#define PCRE2_LITERAL 0x02000000u /* C */
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
/* An additional compile options word is available in the compile context. */
@ -391,6 +392,7 @@ released, the numbers must not be changed. */
#define PCRE2_ERROR_HEAPLIMIT (-63)
#define PCRE2_ERROR_CONVERT_SYNTAX (-64)
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
#define PCRE2_ERROR_DFA_UINVALID_UTF (-66)
/* Request types for pcre2_pattern_info() */

@ -746,8 +746,8 @@ are allowed. */
#define PUBLIC_LITERAL_COMPILE_OPTIONS \
(PCRE2_ANCHORED|PCRE2_AUTO_CALLOUT|PCRE2_CASELESS|PCRE2_ENDANCHORED| \
PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_NO_START_OPTIMIZE| \
PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_MATCH_INVALID_UTF| \
PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
#define PUBLIC_COMPILE_OPTIONS \
(PUBLIC_LITERAL_COMPILE_OPTIONS| \
@ -3615,7 +3615,7 @@ while (ptr < ptrend)
{
errorcode = ERR97;
goto FAILED;
}
}
cb->bracount++;
*parsed_pattern++ = META_CAPTURE | cb->bracount;
}
@ -4444,7 +4444,7 @@ while (ptr < ptrend)
{
errorcode = ERR97;
goto FAILED;
}
}
cb->bracount++;
*parsed_pattern++ = META_CAPTURE | cb->bracount;
nest_depth++;
@ -9503,6 +9503,10 @@ if (pattern == NULL)
if (ccontext == NULL)
ccontext = (pcre2_compile_context *)(&PRIV(default_compile_context));
/* PCRE2_MATCH_INVALID_UTF implies UTF */
if ((options & PCRE2_MATCH_INVALID_UTF) != 0) options |= PCRE2_UTF;
/* Check that all undefined public option bits are zero. */
@ -9682,7 +9686,7 @@ if ((options & PCRE2_LITERAL) == 0)
ptr += skipatstart;
/* Can't support UTF or UCP unless PCRE2 has been compiled with UTF support. */
/* Can't support UTF or UCP if PCRE2 was built without Unicode support. */
#ifndef SUPPORT_UNICODE
if ((cb.external_options & (PCRE2_UTF|PCRE2_UCP)) != 0)

@ -3294,6 +3294,11 @@ time. */
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
return PCRE2_ERROR_BADOPTION;
/* Invalid UTF support is not available for DFA matching. */
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
return PCRE2_ERROR_DFA_UINVALID_UTF;
/* Check that the first field in the block is the magic number. If it is not,
return with PCRE2_ERROR_BADMAGIC. */
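
The two option rules introduced by this commit (the new option implies UTF at compile time, and DFA matching rejects patterns compiled with it) can be modelled in isolation as below. This is a sketch, not the library code: the macro names are local stand-ins, the value used for the UTF bit is an assumption, and both functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define OPT_UTF               0x00080000u  /* assumed value of PCRE2_UTF */
#define OPT_MATCH_INVALID_UTF 0x04000000u  /* value shown in this diff */
#define ERR_DFA_UINVALID_UTF  (-66)        /* value shown in this diff */

/* Hypothetical model of the compile-time rule: requesting invalid UTF
   support forces UTF mode on. */
uint32_t normalize_compile_options(uint32_t options)
{
    assert((OPT_UTF & OPT_MATCH_INVALID_UTF) == 0);  /* distinct bits */
    if ((options & OPT_MATCH_INVALID_UTF) != 0) options |= OPT_UTF;
    return options;
}

/* Hypothetical model of the DFA entry check: 0 means the match may proceed,
   a negative value is the error to return. */
int dfa_precheck(uint32_t overall_options)
{
    return ((overall_options & OPT_MATCH_INVALID_UTF) != 0)
        ? ERR_DFA_UINVALID_UTF : 0;
}
```

An application that needs both invalid UTF support and DFA matching would therefore have to compile the pattern twice, once with and once without the option.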

@ -269,6 +269,7 @@ static const unsigned char match_error_texts[] =
"invalid syntax\0"
/* 65 */
"internal error - duplicate substitution match\0"
"PCRE2_MATCH_INVALID_UTF is not supported for DFA matching\0"
;

@ -866,6 +866,7 @@ typedef struct match_block {
PCRE2_SPTR name_table; /* Table of group names */
PCRE2_SPTR start_code; /* For use when recursing */
PCRE2_SPTR start_subject; /* Start of the subject string */
PCRE2_SPTR check_subject; /* Where UTF-checked from */
PCRE2_SPTR end_subject; /* End of the subject string */
PCRE2_SPTR end_match_ptr; /* Subject position at end match */
PCRE2_SPTR start_used_ptr; /* Earliest consulted character */

@ -6,8 +6,9 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
This module by Zoltan Herczeg
Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge
New API code Copyright (c) 2016-2019 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -7846,8 +7847,6 @@ if (needstype || needsscript)
if (needsscript)
{
// PH hacking
//fprintf(stderr, "~~B\n");
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7901,7 +7900,6 @@ if (needstype || needsscript)
if (!needschar)
{
// PH hacking
//fprintf(stderr, "~~C\n");
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7916,7 +7914,6 @@ if (needstype || needsscript)
else
{
// PH hacking
//fprintf(stderr, "~~D\n");
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
@ -8594,8 +8591,8 @@ uint32_t c;
/* Patch by PH */
/* GETCHARINC(c, cc); */
c = *cc++;
#if PCRE2_CODE_UNIT_WIDTH == 32
if (c >= 0x110000)
return NULL;
@ -9257,8 +9254,6 @@ if (common->utf && *cc == OP_REFI)
CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop);
// PH hacking
//fprintf(stderr, "~~E\n");
OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);
add_jump(compiler, &common->getucd, JUMP(SLJIT_FAST_CALL));
@ -14156,49 +14151,87 @@ Returns: 0: success or (*NOJIT) was used
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_jit_compile(pcre2_code *code, uint32_t options)
{
#ifndef SUPPORT_JIT
(void)code;
(void)options;
return PCRE2_ERROR_JIT_BADOPTION;
#else /* SUPPORT_JIT */
pcre2_real_code *re = (pcre2_real_code *)code;
executable_functions *functions;
uint32_t excluded_options;
int result;
if (code == NULL)
return PCRE2_ERROR_NULL;
if ((options & ~PUBLIC_JIT_COMPILE_OPTIONS) != 0)
return PCRE2_ERROR_JIT_BADOPTION;
if ((re->flags & PCRE2_NOJIT) != 0) return 0;
functions = (executable_functions *)re->executable_jit;
/* Support for invalid UTF was first introduced in JIT, with the option
PCRE2_JIT_INVALID_UTF. Later, support was added to the interpreter, and the
compile-time option PCRE2_MATCH_INVALID_UTF was created. This is now the
preferred feature, with the earlier option deprecated. For backward
compatibility, if the earlier option is set, it forces the new option so that
if JIT matching falls back to the interpreter, there is still support for
invalid UTF. However, if this function has already been successfully called
without PCRE2_JIT_INVALID_UTF and without PCRE2_MATCH_INVALID_UTF (meaning that
JIT code without invalid UTF support was compiled), an error is returned.
If in the future support for PCRE2_JIT_INVALID_UTF is withdrawn, the following
actions are needed:
1. Remove the definition from pcre2.h.in and from the list in
PUBLIC_JIT_COMPILE_OPTIONS above.
2. Replace PCRE2_JIT_INVALID_UTF with a local flag in this module.
3. Replace PCRE2_JIT_INVALID_UTF in pcre2_jit_test.c.
4. Delete the following short block of code. The setting of "re" and
"functions" can be moved into the JIT-only block below, but if that is
done, (void)re and (void)functions will be needed in the non-JIT case, to
avoid compiler warnings.
*/
if ((options & PCRE2_JIT_INVALID_UTF) != 0)
{
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) == 0)
{
if (functions != NULL) return PCRE2_ERROR_JIT_BADOPTION;
re->overall_options |= PCRE2_MATCH_INVALID_UTF;
}
}
/* The above tests are run with and without JIT support. This means that
PCRE2_JIT_INVALID_UTF propagates back into the regex options (ensuring
interpreter support) even in the absence of JIT. But now, if there is no JIT
support, give an error return. */
#ifndef SUPPORT_JIT
return PCRE2_ERROR_JIT_BADOPTION;
#else /* SUPPORT_JIT */
/* There is JIT support. Do the necessary. */
if ((re->flags & PCRE2_NOJIT) != 0) return 0;
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
options |= PCRE2_JIT_INVALID_UTF;
if ((options & PCRE2_JIT_COMPLETE) != 0 && (functions == NULL
|| functions->executable_funcs[0] == NULL)) {
excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
result = jit_compile(code, options & ~excluded_options);
uint32_t excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
int result = jit_compile(code, options & ~excluded_options);
if (result != 0)
return result;
}
if ((options & PCRE2_JIT_PARTIAL_SOFT) != 0 && (functions == NULL
|| functions->executable_funcs[1] == NULL)) {
excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
result = jit_compile(code, options & ~excluded_options);
uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
int result = jit_compile(code, options & ~excluded_options);
if (result != 0)
return result;
}
if ((options & PCRE2_JIT_PARTIAL_HARD) != 0 && (functions == NULL
|| functions->executable_funcs[2] == NULL)) {
excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
result = jit_compile(code, options & ~excluded_options);
uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
int result = jit_compile(code, options & ~excluded_options);
if (result != 0)
return result;
}

@ -5412,7 +5412,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
{
while (number-- > 0)
{
if (Feptr <= mb->start_subject) RRETURN(MATCH_NOMATCH);
if (Feptr <= mb->check_subject) RRETURN(MATCH_NOMATCH);
Feptr--;
BACKCHAR(Feptr);
}
@ -5420,7 +5420,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
else
#endif
/* No UTF-8 support, or not in UTF-8 mode: count is byte count */
/* No UTF-8 support, or not in UTF-8 mode: count is code unit count */
{
if ((ptrdiff_t)number > Feptr - mb->start_subject) RRETURN(MATCH_NOMATCH);
@ -5743,7 +5743,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_NOT_WORD_BOUNDARY:
case OP_WORD_BOUNDARY:
if (Feptr == mb->start_subject) prev_is_word = FALSE; else
if (Feptr == mb->check_subject) prev_is_word = FALSE; else
{
PCRE2_SPTR lastptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
@ -6014,7 +6014,6 @@ int was_zero_terminated = 0;
const uint8_t *start_bits = NULL;
const pcre2_real_code *re = (const pcre2_real_code *)code;
BOOL anchored;
BOOL firstline;
BOOL has_first_cu = FALSE;
@ -6029,10 +6028,23 @@ PCRE2_UCHAR req_cu2 = 0;
PCRE2_SPTR bumpalong_limit;
PCRE2_SPTR end_subject;
PCRE2_SPTR true_end_subject;
PCRE2_SPTR start_match = subject + start_offset;
PCRE2_SPTR req_cu_ptr = start_match - 1;
PCRE2_SPTR start_partial = NULL;
PCRE2_SPTR match_partial = NULL;
PCRE2_SPTR start_partial;
PCRE2_SPTR match_partial;
#ifdef SUPPORT_JIT
BOOL use_jit;
#endif
#ifdef SUPPORT_UNICODE
BOOL allow_invalid;
uint32_t fragment_options = 0;
#ifdef SUPPORT_JIT
BOOL jit_checked_utf = FALSE;
#endif
#endif
PCRE2_SIZE frame_size;
@ -6059,7 +6071,7 @@ if (length == PCRE2_ZERO_TERMINATED)
length = PRIV(strlen)(subject);
was_zero_terminated = 1;
}
end_subject = subject + length;
true_end_subject = end_subject = subject + length;
/* Plausibility checks */
@ -6095,12 +6107,24 @@ options |= (re->flags & FF) / ((FF & (~FF+1)) / (OO & (~OO+1)));
#undef FF
#undef OO
/* These two settings are used in the code for checking a UTF string that
follows immediately afterwards. Other values in the mb block are used only
during interpretive processing, not when the JIT support is in use, so they are
set up later. */
/* If the pattern was successfully studied with JIT support, we will run the
JIT executable instead of the rest of this function. Most options must be set
at compile time for the JIT code to be usable. */
#ifdef SUPPORT_JIT
use_jit = (re->executable_jit != NULL &&
(options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0);
#endif
/* Initialize UTF parameters. */
utf = (re->overall_options & PCRE2_UTF) != 0;
#ifdef SUPPORT_UNICODE
allow_invalid = (re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0;
#endif
/* Convert the partial matching flags into an integer. */
mb->partial = ((options & PCRE2_PARTIAL_HARD) != 0)? 2 :
((options & PCRE2_PARTIAL_SOFT) != 0)? 1 : 0;
@ -6111,61 +6135,6 @@ if (mb->partial != 0 &&
((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
return PCRE2_ERROR_BADOPTION;
/* Check a UTF string for validity if required. For 8-bit and 16-bit strings,
we must also check that a starting offset does not point into the middle of a
multiunit character. We check only the portion of the subject that is going to
be inspected during matching - from the offset minus the maximum back reference
to the given length. This saves time when a small part of a large subject is
being matched by the use of a starting offset. Note that the maximum lookbehind
is a number of characters, not code units. */
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{
PCRE2_SPTR check_subject = start_match; /* start_match includes offset */
if (start_offset > 0)
{
#if PCRE2_CODE_UNIT_WIDTH != 32
unsigned int i;
if (start_match < end_subject && NOT_FIRSTCU(*start_match))
return PCRE2_ERROR_BADUTFOFFSET;
for (i = re->max_lookbehind; i > 0 && check_subject > subject; i--)
{
check_subject--;
while (check_subject > subject &&
#if PCRE2_CODE_UNIT_WIDTH == 8
(*check_subject & 0xc0) == 0x80)
#else /* 16-bit */
(*check_subject & 0xfc00) == 0xdc00)
#endif /* PCRE2_CODE_UNIT_WIDTH == 8 */
check_subject--;
}
#else
/* In the 32-bit library, one code unit equals one character. However,
we cannot just subtract the lookbehind and then compare pointers, because
a very large lookbehind could create an invalid pointer. */
if (start_offset >= re->max_lookbehind)
check_subject -= re->max_lookbehind;
else
check_subject = subject;
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
}
/* Validate the relevant portion of the subject. After an error, adjust the
offset to be an absolute offset in the whole string. */
match_data->rc = PRIV(valid_utf)(check_subject,
length - (check_subject - subject), &(match_data->startchar));
if (match_data->rc != 0)
{
match_data->startchar += check_subject - subject;
return match_data->rc;
}
}
#endif /* SUPPORT_UNICODE */
/* It is an error to set an offset limit without setting the flag at compile
time. */
@ -6184,15 +6153,85 @@ if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
}
match_data->subject = NULL;
/* If the pattern was successfully studied with JIT support, run the JIT
executable instead of the rest of this function. Most options must be set at
compile time for the JIT code to be usable. Fallback to the normal code path if
an unsupported option is set or if JIT returns BADOPTION (which means that the
selected normal or partial matching mode was not compiled). */
/* ============================= JIT matching ============================== */
/* Prepare for JIT matching. Check a UTF string for validity unless no check is
requested or invalid UTF can be handled. We check only the portion of the
subject that might be inspected during matching - from the offset minus the
maximum lookbehind to the given length. This saves time when a small part of a
large subject is being matched by the use of a starting offset. Note that the
maximum lookbehind is a number of characters, not code units. */
#ifdef SUPPORT_JIT
if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
if (use_jit)
{
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0 && !allow_invalid)
{
#if PCRE2_CODE_UNIT_WIDTH != 32
unsigned int i;
#endif
/* For 8-bit and 16-bit UTF, check that the first code unit is a valid
character start. */
#if PCRE2_CODE_UNIT_WIDTH != 32
if (start_match < end_subject && NOT_FIRSTCU(*start_match))
{
if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
#if PCRE2_CODE_UNIT_WIDTH == 8
return PCRE2_ERROR_UTF8_ERR20; /* Isolated 0x80 byte */
#else
return PCRE2_ERROR_UTF16_ERR3; /* Isolated low surrogate */
#endif
}
#endif /* WIDTH != 32 */
/* Move back by the maximum lookbehind, in case a lookbehind is applied at the
very start of matching. */
#if PCRE2_CODE_UNIT_WIDTH != 32
for (i = re->max_lookbehind; i > 0 && start_match > subject; i--)
{
start_match--;
while (start_match > subject &&
#if PCRE2_CODE_UNIT_WIDTH == 8
(*start_match & 0xc0) == 0x80)
#else /* 16-bit */
(*start_match & 0xfc00) == 0xdc00)
#endif
start_match--;
}
#else /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* In the 32-bit library, one code unit equals one character. However,
we cannot just subtract the lookbehind and then compare pointers, because
a very large lookbehind could create an invalid pointer. */
if (start_offset >= re->max_lookbehind)
start_match -= re->max_lookbehind;
else
start_match = subject;
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* Validate the relevant portion of the subject. Adjust the offset of an
invalid code point to be an absolute offset in the whole string. */
match_data->rc = PRIV(valid_utf)(start_match,
length - (start_match - subject), &(match_data->startchar));
if (match_data->rc != 0)
{
match_data->startchar += start_match - subject;
return match_data->rc;
}
jit_checked_utf = TRUE;
}
#endif /* SUPPORT_UNICODE */
/* If JIT returns BADOPTION, which means that the selected complete or
partial matching mode was not compiled, fall through to the interpreter. */
rc = pcre2_jit_match(code, subject, length, start_offset, options,
match_data, mcontext);
if (rc != PCRE2_ERROR_JIT_BADOPTION)
@ -6209,10 +6248,152 @@ if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
return rc;
}
}
#endif /* SUPPORT_JIT */
/* ========================= End of JIT matching ========================== */
/* Proceed with non-JIT matching. The default is to allow lookbehinds to the
start of the subject. A UTF check when there is a non-zero offset may change
this. */
mb->check_subject = subject;
/* If a UTF subject string was not checked for validity in the JIT code above,
check it here, and handle support for invalid UTF strings. The check above
happens only when invalid UTF is not supported and PCRE2_NO_UTF_CHECK is unset.
If we get here in those circumstances, it means the subject string is valid,
but for some reason JIT matching was not successful. There is no need to check
the subject again.
We check only the portion of the subject that might be inspected during
matching - from the offset minus the maximum lookbehind to the given length.
This saves time when a small part of a large subject is being matched by the
use of a starting offset. Note that the maximum lookbehind is a number of
characters, not code units.
Note also that support for invalid UTF forces a check, overriding the setting
of PCRE2_NO_UTF_CHECK. */
#ifdef SUPPORT_UNICODE
if (utf &&
#ifdef SUPPORT_JIT
!jit_checked_utf &&
#endif
((options & PCRE2_NO_UTF_CHECK) == 0 || allow_invalid))
{
#if PCRE2_CODE_UNIT_WIDTH != 32
BOOL skipped_bad_start = FALSE;
#endif
/* Carry on with non-JIT matching. A NULL match context means "use a default
context", but we take the memory control functions from the pattern. */
/* For 8-bit and 16-bit UTF, check that the first code unit is a valid
character start. If we are handling invalid UTF, just skip over such code
units. Otherwise, give an appropriate error. */
#if PCRE2_CODE_UNIT_WIDTH != 32
if (allow_invalid)
{
while (start_match < end_subject && NOT_FIRSTCU(*start_match))
{
start_match++;
skipped_bad_start = TRUE;
}
}
else if (start_match < end_subject && NOT_FIRSTCU(*start_match))
{
if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
#if PCRE2_CODE_UNIT_WIDTH == 8
return PCRE2_ERROR_UTF8_ERR20; /* Isolated 0x80 byte */
#else
return PCRE2_ERROR_UTF16_ERR3; /* Isolated low surrogate */
#endif
}
#endif /* WIDTH != 32 */
/* The mb->check_subject field points to the start of UTF checking;
lookbehinds can go back no further than this. */
mb->check_subject = start_match;
/* Move back by the maximum lookbehind, in case a lookbehind is applied at the
very start of matching, but don't do this if we skipped bad 8-bit or 16-bit
code units above. */
#if PCRE2_CODE_UNIT_WIDTH != 32
if (!skipped_bad_start)
{
unsigned int i;
for (i = re->max_lookbehind; i > 0 && mb->check_subject > subject; i--)
{
mb->check_subject--;
while (mb->check_subject > subject &&
#if PCRE2_CODE_UNIT_WIDTH == 8
(*mb->check_subject & 0xc0) == 0x80)
#else /* 16-bit */
(*mb->check_subject & 0xfc00) == 0xdc00)
#endif
mb->check_subject--;
}
}
#else /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* In the 32-bit library, one code unit equals one character. However,
we cannot just subtract the lookbehind and then compare pointers, because
a very large lookbehind could create an invalid pointer. */
if (start_offset >= re->max_lookbehind)
mb->check_subject -= re->max_lookbehind;
else
mb->check_subject = subject;
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* Validate the relevant portion of the subject. There's a loop in case we
encounter bad UTF in the characters preceding start_match which we are
scanning because of a lookbehind. */
for (;;)
{
match_data->rc = PRIV(valid_utf)(mb->check_subject,
length - (mb->check_subject - subject), &(match_data->startchar));
if (match_data->rc == 0) break; /* Valid UTF string */
/* Invalid UTF string. Adjust the offset to be an absolute offset in the
whole string. If we are handling invalid UTF strings, set end_subject to
stop before the bad code unit, and set the options to "not end of line".
Otherwise return the error. */
match_data->startchar += mb->check_subject - subject;
if (!allow_invalid || match_data->rc > 0) return match_data->rc;
end_subject = subject + match_data->startchar;
/* If the end precedes start_match, it means there is invalid UTF in the
extra code units we reversed over because of a lookbehind. Advance past the
first bad code unit, then (in 8-bit and 16-bit modes) skip any code units
that cannot start a character, and try again. */
if (end_subject < start_match)
{
mb->check_subject = end_subject + 1;
#if PCRE2_CODE_UNIT_WIDTH != 32
while (mb->check_subject < start_match && NOT_FIRSTCU(*mb->check_subject))
mb->check_subject++;
#endif
}
/* Otherwise, set the not end of line option, and do the match. */
else
{
fragment_options = PCRE2_NOTEOL;
break;
}
}
}
#endif /* SUPPORT_UNICODE */
/* A NULL match context means "use a default context", but we take the memory
control functions from the pattern. */
if (mcontext == NULL)
{
@ -6224,8 +6405,8 @@ else mb->memctl = mcontext->memctl;
anchored = ((re->overall_options | options) & PCRE2_ANCHORED) != 0;
firstline = (re->overall_options & PCRE2_FIRSTLINE) != 0;
startline = (re->flags & PCRE2_STARTLINE) != 0;
bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
end_subject : subject + mcontext->offset_limit;
bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
true_end_subject : subject + mcontext->offset_limit;
/* Initialize and set up the fixed fields in the callout block, with a pointer
in the match block. */
@ -6236,7 +6417,8 @@ cb.subject = subject;
cb.subject_length = (PCRE2_SIZE)(end_subject - subject);
cb.callout_flags = 0;
/* Fill in the remaining fields in the match block. */
/* Fill in the remaining fields in the match block, except for moptions, which
gets set later. */
mb->callout = mcontext->callout;
mb->callout_data = mcontext->callout_data;
@ -6245,13 +6427,9 @@ mb->start_subject = subject;
mb->start_offset = start_offset;
mb->end_subject = end_subject;
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
mb->moptions = options; /* Match options */
mb->poptions = re->overall_options; /* Pattern options */
mb->ignore_skip_arg = 0;
mb->mark = mb->nomatch_mark = NULL; /* In case never set */
mb->hitend = FALSE;
/* The name table is needed for finding all the numbers associated with a
given name, for condition testing. The code follows the name table. */
@ -6404,6 +6582,13 @@ if ((re->flags & PCRE2_LASTSET) != 0)
/* Loop for handling unanchored repeated matching attempts; for anchored regexs
the loop runs just once. */
#ifdef SUPPORT_UNICODE
FRAGMENT_RESTART:
#endif
start_partial = match_partial = NULL;
mb->hitend = FALSE;
for(;;)
{
PCRE2_SPTR new_start_match;
@ -6714,6 +6899,11 @@ for(;;)
mb->start_used_ptr = start_match;
mb->last_used_ptr = start_match;
#ifdef SUPPORT_UNICODE
mb->moptions = options | fragment_options;
#else
mb->moptions = options;
#endif
mb->match_call_count = 0;
mb->end_offset_top = 0;
mb->skip_arg_count = 0;
@ -6839,6 +7029,68 @@ for(;;)
ENDLOOP:
/* If end_subject != true_end_subject, it means we are handling invalid UTF,
and have just processed a non-terminal fragment. If this resulted in no match
or a partial match we must carry on to the next fragment (a partial match is
returned to the caller only at the very end of the subject). A loop is used to
avoid trying to match against empty fragments; if the pattern can match an
empty string it would have done so already. */
#ifdef SUPPORT_UNICODE
if (utf && end_subject != true_end_subject &&
(rc == MATCH_NOMATCH || rc == PCRE2_ERROR_PARTIAL))
{
for (;;)
{
/* Advance past the first bad code unit, then (in 8-bit and 16-bit modes)
skip any code units that cannot start a character. */
start_match = end_subject + 1;
#if PCRE2_CODE_UNIT_WIDTH != 32
while (start_match < true_end_subject && NOT_FIRSTCU(*start_match))
start_match++;
#endif
/* If we have hit the end of the subject, there isn't another non-empty
fragment, so give up. */
if (start_match >= true_end_subject)
{
rc = MATCH_NOMATCH; /* In case it was partial */
break;
}
/* Check the rest of the subject */
mb->check_subject = start_match;
rc = PRIV(valid_utf)(start_match, length - (start_match - subject),
&(match_data->startchar));
/* The rest of the subject is valid UTF. */
if (rc == 0)
{
mb->end_subject = end_subject = true_end_subject;
fragment_options = PCRE2_NOTBOL;
goto FRAGMENT_RESTART;
}
/* A subsequent UTF error has been found; if the next fragment is
non-empty, set up to process it. Otherwise, let the loop advance. */
else if (rc < 0)
{
mb->end_subject = end_subject = start_match + match_data->startchar;
if (end_subject > start_match)
{
fragment_options = PCRE2_NOTBOL|PCRE2_NOTEOL;
goto FRAGMENT_RESTART;
}
}
}
}
#endif /* SUPPORT_UNICODE */
/* Release an enlarged frame vector that is on the heap. */
if (mb->match_frames != mb->stack_frames)
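
Reviewer note: the hunks above make the interpreter treat an invalid UTF subject as a series of valid fragments separated by bad code units, matching each fragment in turn. The following is a minimal standalone sketch of that idea for 8-bit subjects only. It is not the code in this commit: `utf8_seq_len` and `next_fragment` are hypothetical helpers written for illustration, and the sketch skips bad bytes one at a time instead of calling `PRIV(valid_utf)` and `NOT_FIRSTCU`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): return the length in bytes of the
well-formed UTF-8 sequence starting at p, given `avail` remaining bytes, or 0
if the bytes at p do not form a well-formed sequence. */
static size_t utf8_seq_len(const uint8_t *p, size_t avail)
{
uint32_t c, min;
size_t n, i;
if (avail == 0) return 0;
c = p[0];
if (c < 0x80) return 1;
if ((c & 0xe0) == 0xc0) { n = 2; c &= 0x1f; min = 0x80; }
else if ((c & 0xf0) == 0xe0) { n = 3; c &= 0x0f; min = 0x800; }
else if ((c & 0xf8) == 0xf0) { n = 4; c &= 0x07; min = 0x10000; }
else return 0;                    /* Continuation byte or 0xf8..0xff */
if (avail < n) return 0;          /* Truncated sequence */
for (i = 1; i < n; i++)
  {
  if ((p[i] & 0xc0) != 0x80) return 0;
  c = (c << 6) | (p[i] & 0x3f);
  }
/* Reject overlong forms, surrogates, and values above 0x10ffff. */
if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff)) return 0;
return n;
}

/* Hypothetical helper: find the next non-empty valid fragment at or after
*pos. On success, return 1 and set [*start, *end) to the fragment's byte
offsets, leaving *pos at *end. Return 0 when no fragment remains. */
static int next_fragment(const uint8_t *s, size_t len, size_t *pos,
  size_t *start, size_t *end)
{
size_t i = *pos;
assert(s != NULL || len == 0);
while (i < len)
  {
  size_t n = utf8_seq_len(s + i, len - i);
  if (n == 0) { i++; continue; }  /* Skip a bad byte */
  *start = i;
  while (i < len && (n = utf8_seq_len(s + i, len - i)) != 0) i += n;
  *end = i;
  *pos = i;
  return 1;
  }
*pos = len;
return 0;
}
```

Calling `next_fragment` repeatedly on the test subject `abcd\x80wxzy\x80pqrs` should yield the byte ranges for `abcd`, `wxzy`, and `pqrs`, which lines up with the three `/.../g` matches in the new testinput10 cases. The real code additionally sets PCRE2_NOTBOL and PCRE2_NOTEOL on non-terminal fragments, which this sketch does not model.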

View File

@ -212,6 +212,12 @@ be C99 don't support it (hence DISABLE_PERCENT_ZT). */
#define REPLACE_MODSIZE 100 /* Field for reading 8-bit replacement */
#define VERSION_SIZE 64 /* Size of buffer for the version strings */
/* Default JIT compile options */
#define JIT_DEFAULT (PCRE2_JIT_COMPLETE|\
PCRE2_JIT_PARTIAL_SOFT|\
PCRE2_JIT_PARTIAL_HARD)
/* Make sure the buffer into which replacement strings are copied is big enough
to hold them as 32-bit code units. */
@ -664,6 +670,7 @@ static modstruct modlist[] = {
{ "literal", MOD_PAT, MOD_OPT, PCRE2_LITERAL, PO(options) },
{ "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) },
{ "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) },
{ "match_invalid_utf", MOD_PAT, MOD_OPT, PCRE2_MATCH_INVALID_UTF, PO(options) },
{ "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) },
{ "match_line", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_LINE, CO(extra_options) },
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
@ -4136,7 +4143,7 @@ static void
show_compile_options(uint32_t options, const char *before, const char *after)
{
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
((options & PCRE2_ALT_CIRCUMFLEX) != 0)? " alt_circumflex" : "",
@ -4153,6 +4160,7 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%
((options & PCRE2_EXTENDED_MORE) != 0)? " extended_more" : "",
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
((options & PCRE2_LITERAL) != 0)? " literal" : "",
((options & PCRE2_MATCH_INVALID_UTF) != 0)? " match_invalid_utf" : "",
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
((options & PCRE2_NEVER_BACKSLASH_C) != 0)? " never_backslash_c" : "",
@ -4867,7 +4875,7 @@ switch(cmd)
case CMD_PATTERN:
(void)decode_modifiers(argptr, CTX_DEFPAT, &def_patctl, NULL);
if (def_patctl.jit == 0 && (def_patctl.control & CTL_JITVERIFY) != 0)
def_patctl.jit = 7;
def_patctl.jit = JIT_DEFAULT;
break;
/* Set default subject modifiers */
@ -5114,7 +5122,11 @@ patlen = p - buffer - 2;
/* Look for modifiers and options after the final delimiter. */
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
utf = (pat_patctl.options & PCRE2_UTF) != 0;
/* Note that the match_invalid_utf option also sets utf when passed to
pcre2_compile(). */
utf = (pat_patctl.options & (PCRE2_UTF|PCRE2_MATCH_INVALID_UTF)) != 0;
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
exclusive with the utf modifier. */
@ -5161,7 +5173,7 @@ specified. */
if (pat_patctl.jit == 0 &&
(pat_patctl.control & (CTL_JITVERIFY|CTL_JITFAST)) != 0)
pat_patctl.jit = 7;
pat_patctl.jit = JIT_DEFAULT;
/* Now copy the pattern to pbuffer8 for use in 8-bit testing and for reflecting
in callouts. Convert from hex if requested (literal strings in quotes may be
@ -5744,6 +5756,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
{
int i;
clock_t time_taken = 0;
for (i = 0; i < timeit; i++)
{
clock_t start_time;
@ -5752,7 +5765,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
pat_patctl.options|use_forbid_utf, &errorcode, &erroroffset,
use_pat_context);
start_time = clock();
PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
time_taken += clock() - start_time;
}
total_jit_compile_time += time_taken;
@ -8615,7 +8628,7 @@ while (argc > 1 && argv[op][0] == '-' && argv[op][1] != 0)
else if (strcmp(arg, "-jit") == 0 || strcmp(arg, "-jitverify") == 0)
{
if (arg[4] != 0) def_patctl.control |= CTL_JITVERIFY;
def_patctl.jit = 7; /* full & partial */
def_patctl.jit = JIT_DEFAULT; /* full & partial */
#ifndef SUPPORT_JIT
fprintf(stderr, "** Warning: JIT support is not available: "
"-jit[verify] calls functions that do nothing.\n");

66
testdata/testinput10 vendored
View File

@ -1,7 +1,7 @@
# This set of tests is for UTF-8 support and Unicode property support, with
# relevance only for the 8-bit library.
# The next 4 patterns have UTF-8 errors
# The next 5 patterns have UTF-8 errors
/[Ã]/utf
@ -11,6 +11,8 @@
/‚‚‚‚‚‚‚Ã/utf
/‚‚‚‚‚‚‚Ã/match_invalid_utf
# Now test subjects
/badutf/utf
@ -493,4 +495,66 @@
/(?(á/utf
# Invalid UTF-8 tests
/.../g,match_invalid_utf
abcd\x80wxzy\x80pqrs
abcd\x{80}wxzy\x80pqrs
/abc/match_invalid_utf
ab\x80ab\=ph
\= Expect no match
ab\x80cdef\=ph
/ab$/match_invalid_utf
ab\x80cdeab
\= Expect no match
ab\x80cde
/.../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
abcd\x{80}wxzy\x80xpqrs
/X$/match_invalid_utf
\= Expect no match
X\xc4
/(?<=..)X/match_invalid_utf,aftertext
AB\x80AQXYZ
AB\x80AQXYZ\=offset=5
AB\x80\x80AXYZXC\=offset=5
\= Expect no match
AB\x80XYZ
AB\x80XYZ\=offset=3
AB\xfeXYZ
AB\xffXYZ\=offset=3
AB\x80AXYZ
AB\x80AXYZ\=offset=4
AB\x80\x80AXYZ\=offset=5
/.../match_invalid_utf
AB\xc4CCC
\= Expect no match
A\x{d800}B
A\x{110000}B
A\xc4B
/\bX/match_invalid_utf
A\x80X
/\BX/match_invalid_utf
\= Expect no match
A\x80X
/(?<=...)X/match_invalid_utf
AAA\x80BBBXYZ
\= Expect no match
AAA\x80BXYZ
AAA\x80BBXYZ
# -------------------------------------
# End of testinput10
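
Reviewer note: the lookbehind cases in these tests exercise the code in pcre2_match() that moves back from the starting offset by `max_lookbehind` characters (not code units) before validating the subject. A minimal sketch of that backward step for 8-bit subjects follows. `back_up_chars` is a hypothetical name used only for illustration, and it works on offsets where the real code works on pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): move `pos` back by up to `nchars`
characters in an 8-bit UTF subject, never going before offset 0. After each
single-byte step back, continuation bytes (10xxxxxx) are skipped so that the
result lands on a byte that can start a character. */
static size_t back_up_chars(const uint8_t *s, size_t pos, unsigned int nchars)
{
assert(s != NULL);
for (; nchars > 0 && pos > 0; nchars--)
  {
  pos--;
  while (pos > 0 && (s[pos] & 0xc0) == 0x80) pos--;
  }
return pos;
}
```

The count is clamped by the start of the subject, so a large lookbehind cannot produce an offset before the subject, which is the concern the 32-bit branch of the real code addresses by comparing `start_offset` with `re->max_lookbehind`.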

View File

@ -368,6 +368,4 @@
ab˙Az
ab\x{80000041}z
/\[()]{65535}/expand
# End of testinput11

45
testdata/testinput12 vendored
View File

@ -402,4 +402,49 @@
/(?(á/utf
# Invalid UTF-16/32 tests.
/.../g,match_invalid_utf
abcd\x{df00}wxzy\x{df00}pqrs
abcd\x{80}wxzy\x{df00}pqrs
/abc/match_invalid_utf
ab\x{df00}ab\=ph
\= Expect no match
ab\x{df00}cdef\=ph
/ab$/match_invalid_utf
ab\x{df00}cdeab
\= Expect no match
ab\x{df00}cde
/.../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
abcd\x{80}wxzy\x{df00}xpqrs
/X$/match_invalid_utf
\= Expect no match
X\x{df00}
/(?<=..)X/match_invalid_utf,aftertext
AB\x{df00}AQXYZ
AB\x{df00}AQXYZ\=offset=5
AB\x{df00}\x{df00}AXYZXC\=offset=5
\= Expect no match
AB\x{df00}XYZ
AB\x{df00}XYZ\=offset=3
AB\x{df00}AXYZ
AB\x{df00}AXYZ\=offset=4
AB\x{df00}\x{df00}AXYZ\=offset=5
/.../match_invalid_utf
\= Expect no match
A\x{d800}B
A\x{110000}B
# ----------------------------------------------------
# End of testinput12

4
testdata/testinput8 vendored
View File

@ -182,4 +182,8 @@
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

2
testdata/testinput9 vendored
View File

@ -260,6 +260,4 @@
/(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
/\[()]{65535}/expand
# End of testinput9

108
testdata/testoutput10 vendored
View File

@ -1,7 +1,7 @@
# This set of tests is for UTF-8 support and Unicode property support, with
# relevance only for the 8-bit library.
# The next 4 patterns have UTF-8 errors
# The next 5 patterns have UTF-8 errors
/[Ã]/utf
Failed: error -8 at offset 1: UTF-8 error: byte 2 top bits not 0x80
@ -15,6 +15,9 @@ Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
/‚‚‚‚‚‚‚Ã/utf
Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
/‚‚‚‚‚‚‚Ã/match_invalid_utf
Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
# Now test subjects
/badutf/utf
@ -1651,4 +1654,107 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf
Failed: error 142 at offset 5: syntax error in subpattern name (missing terminator?)
# Invalid UTF-8 tests
/.../g,match_invalid_utf
abcd\x80wxzy\x80pqrs
0: abc
0: wxz
0: pqr
abcd\x{80}wxzy\x80pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/abc/match_invalid_utf
ab\x80ab\=ph
Partial match: ab
\= Expect no match
ab\x80cdef\=ph
No match
/ab$/match_invalid_utf
ab\x80cdeab
0: ab
\= Expect no match
ab\x80cde
No match
/.../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
0: zy
abcd\x{80}wxzy\x80xpqrs
0: zy
0: pq
/X$/match_invalid_utf
\= Expect no match
X\xc4
No match
/(?<=..)X/match_invalid_utf,aftertext
AB\x80AQXYZ
0: X
0+ YZ
AB\x80AQXYZ\=offset=5
0: X
0+ YZ
AB\x80\x80AXYZXC\=offset=5
0: X
0+ C
\= Expect no match
AB\x80XYZ
No match
AB\x80XYZ\=offset=3
No match
AB\xfeXYZ
No match
AB\xffXYZ\=offset=3
No match
AB\x80AXYZ
No match
AB\x80AXYZ\=offset=4
No match
AB\x80\x80AXYZ\=offset=5
No match
/.../match_invalid_utf
AB\xc4CCC
0: CCC
\= Expect no match
A\x{d800}B
No match
A\x{110000}B
No match
A\xc4B
No match
/\bX/match_invalid_utf
A\x80X
0: X
/\BX/match_invalid_utf
\= Expect no match
A\x80X
No match
/(?<=...)X/match_invalid_utf
AAA\x80BBBXYZ
0: X
\= Expect no match
AAA\x80BXYZ
No match
AAA\x80BBXYZ
No match
# -------------------------------------
# End of testinput10

View File

@ -661,7 +661,4 @@ Subject length lower bound = 1
ab˙Az
ab\x{80000041}z
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput11

View File

@ -667,6 +667,4 @@ Subject length lower bound = 1
ab\x{80000041}z
0: ab\x{80000041}z
/\[()]{65535}/expand
# End of testinput11

View File

@ -1502,4 +1502,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf
Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)
# Invalid UTF-16/32 tests.
/.../g,match_invalid_utf
abcd\x{df00}wxzy\x{df00}pqrs
0: abc
0: wxz
0: pqr
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/abc/match_invalid_utf
ab\x{df00}ab\=ph
Partial match: ab
\= Expect no match
ab\x{df00}cdef\=ph
No match
/ab$/match_invalid_utf
ab\x{df00}cdeab
0: ab
\= Expect no match
ab\x{df00}cde
No match
/.../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: zy
abcd\x{80}wxzy\x{df00}xpqrs
0: zy
0: pq
/X$/match_invalid_utf
\= Expect no match
X\x{df00}
No match
/(?<=..)X/match_invalid_utf,aftertext
AB\x{df00}AQXYZ
0: X
0+ YZ
AB\x{df00}AQXYZ\=offset=5
0: X
0+ YZ
AB\x{df00}\x{df00}AXYZXC\=offset=5
0: X
0+ C
\= Expect no match
AB\x{df00}XYZ
No match
AB\x{df00}XYZ\=offset=3
No match
AB\x{df00}AXYZ
No match
AB\x{df00}AXYZ\=offset=4
No match
AB\x{df00}\x{df00}AXYZ\=offset=5
No match
/.../match_invalid_utf
\= Expect no match
A\x{d800}B
No match
A\x{110000}B
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
# ----------------------------------------------------
# End of testinput12

View File

@ -1500,4 +1500,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf
Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)
# Invalid UTF-16/32 tests.
/.../g,match_invalid_utf
abcd\x{df00}wxzy\x{df00}pqrs
0: abc
0: wxz
0: pqr
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/abc/match_invalid_utf
ab\x{df00}ab\=ph
Partial match: ab
\= Expect no match
ab\x{df00}cdef\=ph
No match
/ab$/match_invalid_utf
ab\x{df00}cdeab
0: ab
\= Expect no match
ab\x{df00}cde
No match
/.../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: zy
abcd\x{80}wxzy\x{df00}xpqrs
0: zy
0: pq
/X$/match_invalid_utf
\= Expect no match
X\x{df00}
No match
/(?<=..)X/match_invalid_utf,aftertext
AB\x{df00}AQXYZ
0: X
0+ YZ
AB\x{df00}AQXYZ\=offset=5
0: X
0+ YZ
AB\x{df00}\x{df00}AXYZXC\=offset=5
0: X
0+ C
\= Expect no match
AB\x{df00}XYZ
No match
AB\x{df00}XYZ\=offset=3
No match
AB\x{df00}AXYZ
No match
AB\x{df00}AXYZ\=offset=4
No match
AB\x{df00}\x{df00}AXYZ\=offset=5
No match
/.../match_invalid_utf
\= Expect no match
A\x{d800}B
No match
A\x{110000}B
No match
# ----------------------------------------------------
# End of testinput12

View File

@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

View File

@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8

View File

@ -367,7 +367,4 @@ Failed: error 134 at offset 14: character code point value in \x{} or \o{} is to
/(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput9