Implement support for invalid UTF in the pcre2_match() interpreter.

Philip.Hazel 2019-05-24 17:15:48 +00:00
parent 2ad4329f83
commit 16c046ce50
48 changed files with 2780 additions and 1783 deletions

View File

@ -14,6 +14,10 @@ detects invalid characters in the 0xd800-0xdfff range.
3. Fix minor typo bug in JIT compile when \X is used in a non-UTF string. 3. Fix minor typo bug in JIT compile when \X is used in a non-UTF string.
4. Add support for matching in invalid UTF strings to the pcre2_match()
interpreter, and integrate with the existing JIT support via the new
PCRE2_MATCH_INVALID_UTF compile-time option.
Version 10.33 16-April-2019 Version 10.33 16-April-2019
--------------------------- ---------------------------

View File

@ -65,6 +65,7 @@ The option bits are:
PCRE2_EXTENDED Ignore white space and # comments PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns

View File

@ -40,8 +40,12 @@ bits:
PCRE2_JIT_COMPLETE compile code for full matching PCRE2_JIT_COMPLETE compile code for full matching
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
</pre> </pre>
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF. The old
option is deprecated and may be removed in future.
</P>
<P>
The yield of the function is 0 for success, or a negative error code otherwise. The yield of the function is 0 for success, or a negative error code otherwise.
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
if an unknown bit is set in <i>options</i>. if an unknown bit is set in <i>options</i>.

View File

@ -1347,11 +1347,12 @@ and <b>pcre2_compile()</b> returns a non-NULL value.
<P> <P>
There are nearly 100 positive error codes that <b>pcre2_compile()</b> may return There are nearly 100 positive error codes that <b>pcre2_compile()</b> may return
if it finds an error in the pattern. There are also some negative error codes if it finds an error in the pattern. There are also some negative error codes
that are used for invalid UTF strings. These are the same as given by that are used for invalid UTF strings when validity checking is in force. These
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described in the are the same as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and
are described in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a> <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. There is no separate documentation for the positive error codes, because documentation. There is no separate documentation for the positive error codes,
the textual error messages that are obtained by calling the because the textual error messages that are obtained by calling the
<b>pcre2_get_error_message()</b> function (see "Obtaining a textual error <b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
message" message"
<a href="#geterrormessage">below)</a> <a href="#geterrormessage">below)</a>
@ -1615,10 +1616,18 @@ expression engine is not the most efficient way of doing it. If you are doing a
lot of literal matching and are worried about efficiency, you should consider lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with using other approaches. The only other main options that are allowed with
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
error. PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
<pre>
PCRE2_MATCH_INVALID_UTF
</pre>
This option forces PCRE2_UTF (see below) and also enables support for matching
by <b>pcre2_match()</b> in subject strings that contain invalid UTF sequences.
This facility is not supported for DFA matching. For details, see the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
<pre> <pre>
PCRE2_MATCH_UNSET_BACKREF PCRE2_MATCH_UNSET_BACKREF
</pre> </pre>
@ -2653,15 +2662,22 @@ of JIT; it forces matching to be done by the interpreter.
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
</pre> </pre>
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
string is checked by default when <b>pcre2_match()</b> is subsequently called. string is checked unless PCRE2_NO_UTF_CHECK is passed to <b>pcre2_match()</b> or
If a non-zero starting offset is given, the check is applied only to that part PCRE2_MATCH_INVALID_UTF was passed to <b>pcre2_compile()</b>. The latter special
of the subject that could be inspected during matching, and there is a check case is discussed in detail in the
that the starting offset points to the first code unit of a character or to the <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
end of the subject. If there are no lookbehind assertions in the pattern, the documentation.
check starts at the starting offset. Otherwise, it starts at the length of the </P>
longest lookbehind before the starting offset, or at the start of the subject <P>
if there are not that many characters before the starting offset. Note that the In the default case, if a non-zero starting offset is given, the check is
sequences \b and \B are one-character lookbehinds. applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \b and \B are
one-character lookbehinds.
</P> </P>
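The rule for where the default validity check begins can be sketched for UTF-8 subjects as follows. This is an illustration only, not the library's code: the name `utf_check_start` and its parameters are hypothetical, and it assumes an 8-bit subject in which continuation bytes have the form 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the code unit offset at which a validity
   check of a UTF-8 subject would begin, given the starting offset and the
   length (in characters) of the longest lookbehind in the pattern. */
size_t utf_check_start(const uint8_t *subject, size_t start_offset,
                       size_t max_lookbehind)
{
    assert(subject != NULL);
    size_t pos = start_offset;

    /* Step back one character per unit of lookbehind, stopping at the
       start of the subject if there are not that many characters. */
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

With a `max_lookbehind` of 0 the function returns the starting offset unchanged. A pattern whose only lookbehind is \b or \B would pass 1, since those sequences are one-character lookbehinds.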
<P> <P>
The check is carried out before any other processing takes place, and a The check is carried out before any other processing takes place, and a
@ -2674,19 +2690,20 @@ and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a> <a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a> <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. documentation.
</P> </P>
<P> <P>
If you know that your subject is valid, and you want to skip these checks for If you know that your subject is valid, and you want to skip this check for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
<b>pcre2_match()</b>. You might want to do this for the second and subsequent <b>pcre2_match()</b>. You might want to do this for the second and subsequent
calls to <b>pcre2_match()</b> if you are making repeated calls to find other calls to <b>pcre2_match()</b> if you are making repeated calls to find multiple
matches in the same subject string. matches in the same subject string.
</P> </P>
<P> <P>
<b>Warning:</b> When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid <b>Warning:</b> Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
string as a subject, or an invalid value of <i>startoffset</i>, is undefined. string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
Your program may crash or loop indefinitely. Your program may crash or loop indefinitely or give wrong results.
<pre> <pre>
PCRE2_PARTIAL_HARD PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT PCRE2_PARTIAL_SOFT
@ -3771,6 +3788,12 @@ a backreference.
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
that uses a backreference for the condition, or a test for recursion in a that uses a backreference for the condition, or a test for recursion in a
specific capture group. These are not supported. specific capture group. These are not supported.
<pre>
PCRE2_ERROR_DFA_UINVALID_UTF
</pre>
This return is given if <b>pcre2_dfa_match()</b> is called for a pattern that
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
matching.
<pre> <pre>
PCRE2_ERROR_DFA_WSSIZE PCRE2_ERROR_DFA_WSSIZE
</pre> </pre>
@ -3808,7 +3831,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 14 February 2019 Last updated: 23 May 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

View File

@ -147,25 +147,29 @@ pattern.
</P> </P>
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br> <br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
<P> <P>
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching When a pattern is compiled with the PCRE2_UTF option, subject strings are
function expects its subject string to be a valid sequence of UTF code units. normally expected to be a valid sequence of UTF code units. By default, this is
If it is not, the result is undefined. This is also true by default of matching checked at the start of matching and an error is generated if invalid UTF is
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to detected. The PCRE2_NO_UTF_CHECK option can be passed to <b>pcre2_match()</b> to
<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid skip the check (for improved performance) if you are sure that a subject string
UTF is compiled. is valid. If this option is used with an invalid string, the result is
undefined.
</P> </P>
<P> <P>
In this mode, an invalid code unit sequence never matches any pattern item. It However, a way of running matches on strings that may contain invalid UTF
does not match dot, it does not match \p{Any}, it does not even match negative sequences is available. Calling <b>pcre2_compile()</b> with the
items such as [^X]. A lookbehind assertion fails if it encounters an invalid PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
sequence while moving the current point backwards. In other words, an invalid <b>pcre2_match()</b> to support invalid UTF, and, if <b>pcre2_jit_compile()</b>
UTF code unit sequence acts as a barrier which no match can cross. Reaching an is called, the compiled JIT code also supports invalid UTF. Details of how this
invalid sequence causes an immediate backtrack. support works, in both the JIT and the interpretive cases, are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P> </P>
<P> <P>
Using this option, an application can run matches in arbitrary data, knowing There is also an obsolete option for <b>pcre2_jit_compile()</b> called
that any matched strings that are returned will be valid UTF. This can be PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
useful when searching for text in executable or other binary files. It is superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF
and should no longer be used. It may be removed in future.
</P> </P>
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br> <br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<P> <P>
@ -461,7 +465,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br> <br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 06 March 2019 Last updated: 23 May 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

View File

@ -188,6 +188,10 @@ code unit) at a time, for all active paths through the tree.
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion. supported. (*FAIL) is supported, and behaves like a failing negative assertion.
</P> </P>
<P>
10. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not
supported by <b>pcre2_dfa_match()</b>.
</P>
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P> <P>
Using the alternative matching algorithm provides the following advantages: Using the alternative matching algorithm provides the following advantages:
@ -219,7 +223,8 @@ because it has to search for all possible matches, but is also because it is
less susceptible to optimization. less susceptible to optimization.
</P> </P>
<P> <P>
2. Capturing parentheses, backreferences, and script runs are not supported. 2. Capturing parentheses, backreferences, script runs, and matching within
invalid UTF strings are not supported.
</P> </P>
<P> <P>
3. Although atomic groups are supported, their use does not provide the 3. Although atomic groups are supported, their use does not provide the
@ -236,9 +241,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br> <br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 10 October 2018 Last updated: 23 May 2019
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -91,10 +91,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with the PCRE2_UTF option, or the you must either call the compiling function with one or both of the PCRE2_UTF
pattern must start with the special sequence (*UTF), which is equivalent to or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
setting the relevant option. How setting a UTF mode affects pattern matching is sequence (*UTF), which is equivalent to setting PCRE2_UTF. How
mentioned in several places below. There is also a summary of features in the setting a UTF mode affects pattern matching is mentioned in several places
below. There is also a summary of features in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a> <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. page.
</P> </P>
@ -428,11 +429,11 @@ There may be any number of hexadecimal digits. This syntax is from ECMAScript
6. 6.
</P> </P>
<P> <P>
The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
\N{name} to specify characters by Unicode name; PCRE2 does not support this. does not support this. Note that when \N is not followed by an opening brace
Note that when \N is not followed by an opening brace (curly bracket) it has (curly bracket) it has an entirely different meaning, matching any character
an entirely different meaning, matching any character that is not a newline. that is not a newline.
</P> </P>
<P> <P>
There are some legacy applications where the escape sequence \r is expected to There are some legacy applications where the escape sequence \r is expected to
@ -1360,7 +1361,7 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2 with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK option is used). unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
</P> </P>
<P> <P>
An application can lock out the use of \C by setting the An application can lock out the use of \C by setting the
@ -3727,7 +3728,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC31" href="#TOC1">REVISION</a><br> <br><a name="SEC31" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 12 February 2019 Last updated: 23 May 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

View File

@ -613,6 +613,7 @@ for a description of the effects of these options.
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE match_line set PCRE2_EXTRA_MATCH_LINE
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
@ -2078,7 +2079,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 11 March 2019 Last updated: 23 May 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

View File

@ -16,22 +16,33 @@ please consult the man page, in case the conversion went wrong.
UNICODE AND UTF SUPPORT UNICODE AND UTF SUPPORT
</b><br> </b><br>
<P> <P>
When PCRE2 is built with Unicode support (which is the default), it has PCRE2 is normally built with Unicode support, though if you do not need it, you
knowledge of Unicode character properties and can process text strings in can build it without, in which case the library will be smaller. With Unicode
UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by support, PCRE2 has knowledge of Unicode character properties and can process
default, PCRE2 assumes that one code unit is one character. To process a text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
pattern as a UTF string, where a character may require more than one code unit, width), but this is not the default. Unless specifically requested, PCRE2
you must call treats each code unit in a string as one character.
<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
with the PCRE2_UTF option flag, or the pattern must start with the sequence
(*UTF). When either of these is the case, both the pattern and any subject
strings that are matched against it are treated as UTF strings instead of
strings of individual one-code-unit characters. There are also some other
changes to the way characters are handled, as documented below.
</P> </P>
<P> <P>
If you do not need Unicode support you can build PCRE2 without it, in which There are two ways of telling PCRE2 to switch to UTF mode, where characters may
case the library will be smaller. consist of more than one code unit and the range of values is constrained. The
program can call
<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
That is, the programmer can prevent the supplier of the pattern from switching
to UTF mode.
</P>
<P>
Note that the PCRE2_MATCH_INVALID_UTF option (see
<a href="#matchinvalid">below)</a>
forces PCRE2_UTF to be set.
</P>
<P>
In UTF mode, both the pattern and any subject strings that are matched against
it are treated as UTF strings instead of strings of individual one-code-unit
characters. There are also some other changes to the way characters are
handled, as documented below.
</P> </P>
<br><b> <br><b>
UNICODE PROPERTY SUPPORT UNICODE PROPERTY SUPPORT
@ -63,22 +74,22 @@ also recognized; larger ones can be coded using \o{...}.
<P> <P>
The escape sequence \N{U+&#60;hex digits&#62;} is recognized as another way of The escape sequence \N{U+&#60;hex digits&#62;} is recognized as another way of
specifying a Unicode character by code point in a UTF mode. It is not allowed specifying a Unicode character by code point in a UTF mode. It is not allowed
in non-UTF modes. in non-UTF mode.
</P> </P>
<P> <P>
In UTF modes, repeat quantifiers apply to complete UTF characters, not to In UTF mode, repeat quantifiers apply to complete UTF characters, not to
individual code units. individual code units.
</P> </P>
<P> <P>
In UTF modes, the dot metacharacter matches one UTF character instead of a In UTF mode, the dot metacharacter matches one UTF character instead of a
single code unit. single code unit.
</P> </P>
<P> <P>
In UTF modes, capture group names are not restricted to ASCII, and may contain In UTF mode, capture group names are not restricted to ASCII, and may contain
any Unicode letters and decimal digits, as well as underscore. any Unicode letters and decimal digits, as well as underscore.
</P> </P>
<P> <P>
The escape sequence \C can be used to match a single code unit in a UTF mode, The escape sequence \C can be used to match a single code unit in UTF mode,
but its use can lead to some strange effects because it breaks up multi-unit but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \C in the characters (see the description of \C in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
@ -93,7 +104,7 @@ may consist of more than one code unit. The use of \C in these modes provokes
a match-time error. Also, the JIT optimization does not support \C in these a match-time error. Also, the JIT optimization does not support \C in these
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called, contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
the matching will be carried out by the normal interpretive function. the matching will be carried out by the interpretive function.
</P> </P>
<P> <P>
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
@ -123,14 +134,14 @@ However, the special horizontal and vertical white space matching escapes (\h,
not PCRE2_UCP is set. not PCRE2_UCP is set.
</P> </P>
<br><b> <br><b>
CASE-EQUIVALENCE IN UTF MODES CASE-EQUIVALENCE IN UTF MODE
</b><br> </b><br>
<P> <P>
Case-insensitive matching in a UTF mode makes use of Unicode properties except Case-insensitive matching in UTF mode makes use of Unicode properties except
for characters whose code points are less than 128 and that have at most two for characters whose code points are less than 128 and that have at most two
case-equivalent values. For these, a direct table lookup is used for speed. A case-equivalent values. For these, a direct table lookup is used for speed. A
few Unicode characters such as Greek sigma have more than two code points that few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such. are case-equivalent, and these are treated specially.
<a name="scriptruns"></a></P> <a name="scriptruns"></a></P>
<br><b> <br><b>
SCRIPT RUNS SCRIPT RUNS
@ -248,7 +259,7 @@ VALIDITY OF UTF STRINGS
<P> <P>
When the PCRE2_UTF option is set, the strings passed as patterns and subjects When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions. If an are (by default) checked for validity on entry to the relevant functions. If an
invalid UTF string is passed, an negative error code is returned. The code unit invalid UTF string is passed, a negative error code is returned. The code unit
offset to the offending character can be extracted from the match data block by offset to the offending character can be extracted from the match data block by
calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
error. error.
@ -263,17 +274,16 @@ only valid UTF code unit sequences.
</P> </P>
<P> <P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is usually undefined and your program may crash or loop indefinitely. There is, is undefined and your program may crash or loop indefinitely or give incorrect
however, one mode of matching that can handle invalid UTF subject strings. This results. There is, however, one mode of matching that can handle invalid UTF
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
when calling <b>pcre2_jit_compile()</b>. For details, see the <b>pcre2_compile()</b> and is discussed below in the next section. The rest of
<a href="pcre2jit.html"><b>pcre2jit</b></a> this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
documentation.
</P> </P>
<P> <P>
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the UTF check
the pattern; it does not also apply to subject strings. If you want to disable for the pattern; it does not also apply to subject strings. If you want to
the check for a subject string you must pass this same option to disable the check for a subject string you must pass this same option to
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>. <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
</P> </P>
<P> <P>
@ -352,7 +362,7 @@ these code points are excluded by RFC 3629.
<pre> <pre>
PCRE2_ERROR_UTF8_ERR13 PCRE2_ERROR_UTF8_ERR13
</pre> </pre>
A 4-byte character has a value greater than 0x10fff; these code points are A 4-byte character has a value greater than 0x10ffff; these code points are
excluded by RFC 3629. excluded by RFC 3629.
<pre> <pre>
PCRE2_ERROR_UTF8_ERR14 PCRE2_ERROR_UTF8_ERR14
@ -405,7 +415,59 @@ The following negative error codes are given for invalid UTF-32 strings:
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff) PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
</PRE> <a name="matchinvalid"></a></PRE>
</P>
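The two UTF-32 conditions listed above amount to a range test on each code unit. A minimal sketch, assuming hypothetical names throughout: the `utf32_check` result codes are invented for this example and are not the library's PCRE2_ERROR_UTF32_ERR1 and PCRE2_ERROR_UTF32_ERR2 values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch only. */
enum utf32_check { UTF32_OK = 0, UTF32_SURROGATE = 1, UTF32_TOO_BIG = 2 };

/* Classify a single UTF-32 code unit. */
enum utf32_check check_utf32_unit(uint32_t c)
{
    if (c >= 0xD800 && c <= 0xDFFF) return UTF32_SURROGATE;
    if (c > 0x10FFFF) return UTF32_TOO_BIG;
    return UTF32_OK;
}

/* Return the offset of the first invalid code unit, or length if every
   code unit in the string is a valid Unicode code point. */
size_t first_invalid_utf32(const uint32_t *s, size_t length)
{
    assert(s != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        if (check_utf32_unit(s[i]) != UTF32_OK)
            return i;
    }
    return length;
}
```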
<br><b>
MATCHING IN INVALID UTF STRINGS
</b><br>
<P>
You can run pattern matches on subject strings that may contain invalid UTF
sequences if you call <b>pcre2_compile()</b> with the PCRE2_MATCH_INVALID_UTF
option. This is supported by <b>pcre2_match()</b>, including JIT matching, but
not by <b>pcre2_dfa_match()</b>. When PCRE2_MATCH_INVALID_UTF is set, it forces
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
valid UTF string.
</P>
<P>
Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
generate different code. If JIT is not used, the option affects the behaviour
of the interpretive code in <b>pcre2_match()</b>. When PCRE2_MATCH_INVALID_UTF
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
</P>
<P>
In this mode, an invalid code unit sequence in the subject never matches any
pattern item. It does not match dot, it does not match \p{Any}, it does not
even match negative items such as [^X]. A lookbehind assertion fails if it
encounters an invalid sequence while moving the current point backwards. In
other words, an invalid UTF code unit sequence acts as a barrier which no match
can cross.
</P>
<P>
You can also think of this as the subject being split up into fragments of
valid UTF, delimited internally by invalid code unit sequences. The pattern is
matched fragment by fragment. The result of a successful match, however, is
given as code unit offsets in the entire subject string in the usual way. There
are a few points to consider:
</P>
<P>
The internal boundaries are not interpreted as the beginnings or ends of lines
and so do not match circumflex or dollar characters in the pattern.
</P>
<P>
If <b>pcre2_match()</b> is called with an offset that points to an invalid
UTF sequence, that sequence is skipped, and the match starts at the next valid
UTF character, or the end of the subject.
</P>
<P>
At internal fragment boundaries, \b and \B behave in the same way as at the
beginning and end of the subject. For example, a sequence such as \bWORD\b
would match an instance of WORD that is surrounded by invalid UTF code units.
</P>
<P>
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
data, knowing that any matched strings that are returned are valid UTF. This
can be useful when searching for UTF text in executable or other binary files.
</P> </P>
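The "fragments of valid UTF" view described in this section can be sketched for UTF-8 as follows. This is an illustration of the mental model, not of how the interpreter or JIT is implemented. The names `utf8_char_len`, `fragment`, and `split_valid_fragments` are hypothetical, validity follows the RFC 3629 rules (no overlong forms, no surrogates, nothing above 0x10ffff), and treating each offending byte as its own separator is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the valid UTF-8 character at s[0], or 0 if the bytes
   at s do not begin a valid character. */
size_t utf8_char_len(const uint8_t *s, size_t avail)
{
    if (avail == 0) return 0;
    uint8_t c = s[0];
    if (c < 0x80) return 1;

    size_t need;
    uint32_t cp, min;
    if ((c & 0xE0) == 0xC0)      { need = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { need = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { need = 4; cp = c & 0x07; min = 0x10000; }
    else return 0;               /* stray continuation byte or 0xF8..0xFF */

    if (avail < need) return 0;  /* truncated character */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min) return 0;                      /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
    if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
    return need;
}

/* A maximal run of valid UTF-8, as code unit offsets [start, end). */
typedef struct { size_t start, end; } fragment;

/* Store up to max_out fragments in out; return the number found. */
size_t split_valid_fragments(const uint8_t *subject, size_t length,
                             fragment *out, size_t max_out)
{
    assert(subject != NULL || length == 0);
    size_t count = 0, i = 0;
    while (i < length) {
        size_t n = utf8_char_len(subject + i, length - i);
        if (n == 0) { i++; continue; }  /* skip one invalid code unit */
        size_t start = i;
        while (i < length &&
               (n = utf8_char_len(subject + i, length - i)) != 0)
            i += n;
        if (count < max_out) { out[count].start = start; out[count].end = i; }
        count++;
    }
    return count;
}
```

The offsets stored in each `fragment` are relative to the whole subject, mirroring the way a successful match is reported as code unit offsets in the entire subject string.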
<br><b> <br><b>
AUTHOR AUTHOR
@ -422,7 +484,7 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 06 March 2019 Last updated: 24 May 2019
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2019 University of Cambridge.
<br> <br>

File diff suppressed because it is too large

View File

@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "11 February 2019" "PCRE2 10.33" .TH PCRE2_COMPILE 3 "23 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -53,6 +53,7 @@ The option bits are:
PCRE2_EXTENDED Ignore white space and # comments PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns

@ -1,4 +1,4 @@
.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33" .TH PCRE2_JIT_COMPILE 3 "23 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -29,8 +29,11 @@ bits:
PCRE2_JIT_COMPLETE compile code for full matching PCRE2_JIT_COMPLETE compile code for full matching
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
.sp .sp
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF. The old
option is deprecated and may be removed in future.
.P
The yield of the function is 0 for success, or a negative error code otherwise. The yield of the function is 0 for success, or a negative error code otherwise.
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
if an unknown bit is set in \fIoptions\fP. if an unknown bit is set in \fIoptions\fP.

@ -1,4 +1,4 @@
.TH PCRE2API 3 "14 February 2019" "PCRE2 10.33" .TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -1285,13 +1285,14 @@ and \fBpcre2_compile()\fP returns a non-NULL value.
.P .P
There are nearly 100 positive error codes that \fBpcre2_compile()\fP may return There are nearly 100 positive error codes that \fBpcre2_compile()\fP may return
if it finds an error in the pattern. There are also some negative error codes if it finds an error in the pattern. There are also some negative error codes
that are used for invalid UTF strings. These are the same as given by that are used for invalid UTF strings when validity checking is in force. These
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described in the are the same as given by \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and
are described in the
.\" HREF .\" HREF
\fBpcre2unicode\fP \fBpcre2unicode\fP
.\" .\"
page. There is no separate documentation for the positive error codes, because documentation. There is no separate documentation for the positive error codes,
the textual error messages that are obtained by calling the because the textual error messages that are obtained by calling the
\fBpcre2_get_error_message()\fP function (see "Obtaining a textual error \fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
message" message"
.\" HTML <a href="#geterrormessage"> .\" HTML <a href="#geterrormessage">
@ -1557,10 +1558,20 @@ expression engine is not the most efficient way of doing it. If you are doing a
lot of literal matching and are worried about efficiency, you should consider lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with using other approaches. The only other main options that are allowed with
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
error. PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
.sp
PCRE2_MATCH_INVALID_UTF
.sp
This option forces PCRE2_UTF (see below) and also enables support for matching
by \fBpcre2_match()\fP in subject strings that contain invalid UTF sequences.
This facility is not supported for DFA matching. For details, see the
.\" HREF
\fBpcre2unicode\fP
.\"
documentation.
.sp .sp
PCRE2_MATCH_UNSET_BACKREF PCRE2_MATCH_UNSET_BACKREF
.sp .sp
@ -2635,15 +2646,23 @@ of JIT; it forces matching to be done by the interpreter.
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
.sp .sp
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
string is checked by default when \fBpcre2_match()\fP is subsequently called. string is checked unless PCRE2_NO_UTF_CHECK is passed to \fBpcre2_match()\fP or
If a non-zero starting offset is given, the check is applied only to that part PCRE2_MATCH_INVALID_UTF was passed to \fBpcre2_compile()\fP. The latter special
of the subject that could be inspected during matching, and there is a check case is discussed in detail in the
that the starting offset points to the first code unit of a character or to the .\" HREF
end of the subject. If there are no lookbehind assertions in the pattern, the \fBpcre2unicode\fP
check starts at the starting offset. Otherwise, it starts at the length of the .\"
longest lookbehind before the starting offset, or at the start of the subject documentation.
if there are not that many characters before the starting offset. Note that the .P
sequences \eb and \eB are one-character lookbehinds. In the default case, if a non-zero starting offset is given, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \eb and \eB are
one-character lookbehinds.
.P .P
The check is carried out before any other processing takes place, and a The check is carried out before any other processing takes place, and a
negative error code is returned if the check fails. There are several UTF error negative error code is returned if the check fails. There are several UTF error
@ -2666,17 +2685,18 @@ in the
.\" HREF .\" HREF
\fBpcre2unicode\fP \fBpcre2unicode\fP
.\" .\"
page. documentation.
.P .P
If you know that your subject is valid, and you want to skip these checks for If you know that your subject is valid, and you want to skip this check for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
\fBpcre2_match()\fP. You might want to do this for the second and subsequent \fBpcre2_match()\fP. You might want to do this for the second and subsequent
calls to \fBpcre2_match()\fP if you are making repeated calls to find other calls to \fBpcre2_match()\fP if you are making repeated calls to find multiple
matches in the same subject string. matches in the same subject string.
.P .P
\fBWarning:\fP When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid \fBWarning:\fP Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
string as a subject, or an invalid value of \fIstartoffset\fP, is undefined. string as a subject, or an invalid value of \fIstartoffset\fP, is undefined.
Your program may crash or loop indefinitely. Your program may crash or loop indefinitely or give wrong results.
.sp .sp
PCRE2_PARTIAL_HARD PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT PCRE2_PARTIAL_SOFT
@ -3774,6 +3794,12 @@ a backreference.
This return is given if \fBpcre2_dfa_match()\fP encounters a condition item This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
that uses a backreference for the condition, or a test for recursion in a that uses a backreference for the condition, or a test for recursion in a
specific capture group. These are not supported. specific capture group. These are not supported.
.sp
PCRE2_ERROR_DFA_UINVALID_UTF
.sp
This return is given if \fBpcre2_dfa_match()\fP is called for a pattern that
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
matching.
.sp .sp
PCRE2_ERROR_DFA_WSSIZE PCRE2_ERROR_DFA_WSSIZE
.sp .sp
@ -3817,6 +3843,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 14 February 2019 Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi
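The option interactions documented in the changes above (PCRE2_MATCH_INVALID_UTF forces PCRE2_UTF, and the up-front subject check is skipped either by PCRE2_NO_UTF_CHECK at match time or by PCRE2_MATCH_INVALID_UTF at compile time) can be summarised in a small C sketch. The `SK_` macros are illustrative local copies of the option bits, and the two helper functions are hypothetical; real code should include pcre2.h and let the library apply these rules.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative local copies of option bits; use pcre2.h in real code. */
#define SK_UTF               0x00080000u
#define SK_MATCH_INVALID_UTF 0x04000000u
#define SK_NO_UTF_CHECK      0x40000000u

/* PCRE2_MATCH_INVALID_UTF forces PCRE2_UTF to be set as well. */
static uint32_t effective_compile_options(uint32_t opts)
{
    if (opts & SK_MATCH_INVALID_UTF) opts |= SK_UTF;
    /* Postcondition: invalid-UTF support never appears without UTF mode. */
    assert(!(opts & SK_MATCH_INVALID_UTF) || (opts & SK_UTF));
    return opts;
}

/* Return 1 if the subject is validated before matching starts, 0 otherwise.
   With SK_MATCH_INVALID_UTF the match-time SK_NO_UTF_CHECK bit is ignored,
   because invalid sequences are handled during matching instead. */
static int subject_is_prechecked(uint32_t compile_opts, uint32_t match_opts)
{
    compile_opts = effective_compile_options(compile_opts);
    if (!(compile_opts & SK_UTF)) return 0;              /* not in UTF mode */
    if (compile_opts & SK_MATCH_INVALID_UTF) return 0;   /* handled in matcher */
    return (match_opts & SK_NO_UTF_CHECK) == 0;
}
```

The only combination in this sketch that leaves behaviour undefined for an invalid subject is UTF mode with PCRE2_NO_UTF_CHECK and without PCRE2_MATCH_INVALID_UTF, which is the case the warning above describes.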

@ -1,4 +1,4 @@
.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33" .TH PCRE2JIT 3 "23 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT" .SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -123,23 +123,29 @@ pattern.
.SH "MATCHING SUBJECTS CONTAINING INVALID UTF" .SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
.rs .rs
.sp .sp
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching When a pattern is compiled with the PCRE2_UTF option, subject strings are
function expects its subject string to be a valid sequence of UTF code units. normally expected to be a valid sequence of UTF code units. By default, this is
If it is not, the result is undefined. This is also true by default of matching checked at the start of matching and an error is generated if invalid UTF is
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to detected. The PCRE2_NO_UTF_CHECK option can be passed to \fBpcre2_match()\fP to
\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid skip the check (for improved performance) if you are sure that a subject string
UTF is compiled. is valid. If this option is used with an invalid string, the result is
undefined.
.P .P
In this mode, an invalid code unit sequence never matches any pattern item. It However, a way of running matches on strings that may contain invalid UTF
does not match dot, it does not match \ep{Any}, it does not even match negative sequences is available. Calling \fBpcre2_compile()\fP with the
items such as [^X]. A lookbehind assertion fails if it encounters an invalid PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
sequence while moving the current point backwards. In other words, an invalid \fBpcre2_match()\fP to support invalid UTF, and, if \fBpcre2_jit_compile()\fP
UTF code unit sequence acts as a barrier which no match can cross. Reaching an is called, the compiled JIT code also supports invalid UTF. Details of how this
invalid sequence causes an immediate backtrack. support works, in both the JIT and the interpretive cases, are given in the
.\" HREF
\fBpcre2unicode\fP
.\"
documentation.
.P .P
Using this option, an application can run matches in arbitrary data, knowing There is also an obsolete option for \fBpcre2_jit_compile()\fP called
that any matched strings that are returned will be valid UTF. This can be PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
useful when searching for text in executable or other binary files. It is superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF
and should no longer be used. It may be removed in future.
. .
. .
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS" .SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
@ -438,6 +444,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 06 March 2019 Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33" .TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 MATCHING ALGORITHMS" .SH "PCRE2 MATCHING ALGORITHMS"
@ -157,6 +157,9 @@ code unit) at a time, for all active paths through the tree.
.P .P
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion. supported. (*FAIL) is supported, and behaves like a failing negative assertion.
.P
10. The PCRE2_MATCH_INVALID_UTF option for \fBpcre2_compile()\fP is not
supported by \fBpcre2_dfa_match()\fP.
. .
. .
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM" .SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
@ -191,7 +194,8 @@ The alternative algorithm suffers from a number of disadvantages:
because it has to search for all possible matches, but is also because it is because it has to search for all possible matches, but is also because it is
less susceptible to optimization. less susceptible to optimization.
.P .P
2. Capturing parentheses, backreferences, and script runs are not supported. 2. Capturing parentheses, backreferences, script runs, and matching within
invalid UTF strings are not supported.
.P .P
3. Although atomic groups are supported, their use does not provide the 3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm. performance advantage that it does for the standard algorithm.
@ -211,6 +215,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 10 October 2018 Last updated: 23 May 2019
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "12 February 2019" "PCRE2 10.33" .TH PCRE2PATTERN 3 "23 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -52,10 +52,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with the PCRE2_UTF option, or the you must either call the compiling function with one or both of the PCRE2_UTF
pattern must start with the special sequence (*UTF), which is equivalent to or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
setting the relevant option. How setting a UTF mode affects pattern matching is sequence (*UTF), which is equivalent to setting PCRE2_UTF. How
mentioned in several places below. There is also a summary of features in the setting a UTF mode affects pattern matching is mentioned in several places
below. There is also a summary of features in the
.\" HREF .\" HREF
\fBpcre2unicode\fP \fBpcre2unicode\fP
.\" .\"
@ -398,11 +399,11 @@ PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
There may be any number of hexadecimal digits. This syntax is from ECMAScript There may be any number of hexadecimal digits. This syntax is from ECMAScript
6. 6.
.P .P
The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option The \eN{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
\eN{name} to specify characters by Unicode name; PCRE2 does not support this. does not support this. Note that when \eN is not followed by an opening brace
Note that when \eN is not followed by an opening brace (curly bracket) it has (curly bracket) it has an entirely different meaning, matching any character
an entirely different meaning, matching any character that is not a newline. that is not a newline.
.P .P
There are some legacy applications where the escape sequence \er is expected to There are some legacy applications where the escape sequence \er is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
@ -1352,7 +1353,7 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2 with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK option is used). unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
.P .P
An application can lock out the use of \eC by setting the An application can lock out the use of \eC by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
@ -3763,6 +3764,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 12 February 2019 Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "11 March 2019" "PCRE 10.33" .TH PCRE2TEST 1 "23 May 2019" "PCRE 10.34"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -572,6 +572,7 @@ for a description of the effects of these options.
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE match_line set PCRE2_EXTRA_MATCH_LINE
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
@ -2059,6 +2060,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 11 March 2019 Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi

@ -551,6 +551,7 @@ PATTERN MODIFIERS
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE match_line set PCRE2_EXTRA_MATCH_LINE
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
@ -1890,5 +1891,5 @@ AUTHOR
REVISION REVISION
Last updated: 11 March 2019 Last updated: 23 May 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.

@ -1,26 +1,38 @@
.TH PCRE2UNICODE 3 "11 May 2019" "PCRE2 10.33" .TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34"
.SH NAME .SH NAME
PCRE - Perl-compatible regular expressions (revised API) PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT" .SH "UNICODE AND UTF SUPPORT"
.rs .rs
.sp .sp
When PCRE2 is built with Unicode support (which is the default), it has PCRE2 is normally built with Unicode support, though if you do not need it, you
knowledge of Unicode character properties and can process text strings in can build it without, in which case the library will be smaller. With Unicode
UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by support, PCRE2 has knowledge of Unicode character properties and can process
default, PCRE2 assumes that one code unit is one character. To process a text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
pattern as a UTF string, where a character may require more than one code unit, width), but this is not the default. Unless specifically requested, PCRE2
you must call treats each code unit in a string as one character.
.P
There are two ways of telling PCRE2 to switch to UTF mode, where characters may
consist of more than one code unit and the range of values is constrained. The
program can call
.\" HREF .\" HREF
\fBpcre2_compile()\fP \fBpcre2_compile()\fP
.\" .\"
with the PCRE2_UTF option flag, or the pattern must start with the sequence with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
(*UTF). When either of these is the case, both the pattern and any subject However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
strings that are matched against it are treated as UTF strings instead of That is, the programmer can prevent the supplier of the pattern from switching
strings of individual one-code-unit characters. There are also some other to UTF mode.
changes to the way characters are handled, as documented below.
.P .P
If you do not need Unicode support you can build PCRE2 without it, in which Note that the PCRE2_MATCH_INVALID_UTF option (see
case the library will be smaller. .\" HTML <a href="#matchinvalid">
.\" </a>
below)
.\"
forces PCRE2_UTF to be set.
.P
In UTF mode, both the pattern and any subject strings that are matched against
it are treated as UTF strings instead of strings of individual one-code-unit
characters. There are also some other changes to the way characters are
handled, as documented below.
. .
. .
.SH "UNICODE PROPERTY SUPPORT" .SH "UNICODE PROPERTY SUPPORT"
@ -55,18 +67,18 @@ also recognized; larger ones can be coded using \eo{...}.
.P .P
The escape sequence \eN{U+<hex digits>} is recognized as another way of The escape sequence \eN{U+<hex digits>} is recognized as another way of
specifying a Unicode character by code point in a UTF mode. It is not allowed specifying a Unicode character by code point in a UTF mode. It is not allowed
in non-UTF modes. in non-UTF mode.
.P .P
In UTF modes, repeat quantifiers apply to complete UTF characters, not to In UTF mode, repeat quantifiers apply to complete UTF characters, not to
individual code units. individual code units.
.P .P
In UTF modes, the dot metacharacter matches one UTF character instead of a In UTF mode, the dot metacharacter matches one UTF character instead of a
single code unit. single code unit.
.P .P
In UTF modes, capture group names are not restricted to ASCII, and may contain In UTF mode, capture group names are not restricted to ASCII, and may contain
any Unicode letters and decimal digits, as well as underscore. any Unicode letters and decimal digits, as well as underscore.
.P .P
The escape sequence \eC can be used to match a single code unit in a UTF mode, The escape sequence \eC can be used to match a single code unit in UTF mode,
but its use can lead to some strange effects because it breaks up multi-unit but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \eC in the characters (see the description of \eC in the
.\" HREF .\" HREF
@ -82,7 +94,7 @@ may consist of more than one code unit. The use of \eC in these modes provokes
a match-time error. Also, the JIT optimization does not support \eC in these a match-time error. Also, the JIT optimization does not support \eC in these
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called, contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
the matching will be carried out by the normal interpretive function. the matching will be carried out by the interpretive function.
.P .P
The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
characters of any code value, but, by default, the characters that PCRE2 characters of any code value, but, by default, the characters that PCRE2
@ -114,14 +126,14 @@ However, the special horizontal and vertical white space matching escapes (\eh,
not PCRE2_UCP is set. not PCRE2_UCP is set.
. .
. .
.SH "CASE-EQUIVALENCE IN UTF MODES" .SH "CASE-EQUIVALENCE IN UTF MODE"
.rs .rs
.sp .sp
Case-insensitive matching in a UTF mode makes use of Unicode properties except Case-insensitive matching in UTF mode makes use of Unicode properties except
for characters whose code points are less than 128 and that have at most two for characters whose code points are less than 128 and that have at most two
case-equivalent values. For these, a direct table lookup is used for speed. A case-equivalent values. For these, a direct table lookup is used for speed. A
few Unicode characters such as Greek sigma have more than two code points that few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such. are case-equivalent, and these are treated specially.
. .
. .
.\" HTML <a name="scriptruns"></a> .\" HTML <a name="scriptruns"></a>
@ -231,7 +243,7 @@ adjacent characters.
.sp .sp
When the PCRE2_UTF option is set, the strings passed as patterns and subjects When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions. If an are (by default) checked for validity on entry to the relevant functions. If an
invalid UTF string is passed, an negative error code is returned. The code unit invalid UTF string is passed, a negative error code is returned. The code unit
offset to the offending character can be extracted from the match data block by offset to the offending character can be extracted from the match data block by
calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
error. error.
@ -244,18 +256,15 @@ PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences. only valid UTF code unit sequences.
.P .P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is usually undefined and your program may crash or loop indefinitely. There is, is undefined and your program may crash or loop indefinitely or give incorrect
however, one mode of matching that can handle invalid UTF subject strings. This results. There is, however, one mode of matching that can handle invalid UTF
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
when calling \fBpcre2_jit_compile()\fP. For details, see the \fBpcre2_compile()\fP and is discussed below in the next section. The rest of
.\" HREF this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
\fBpcre2jit\fP
.\"
documentation.
.P .P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the UTF check
the pattern; it does not also apply to subject strings. If you want to disable for the pattern; it does not also apply to subject strings. If you want to
the check for a subject string you must pass this same option to disable the check for a subject string you must pass this same option to
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP. \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
.P .P
UTF-16 and UTF-32 strings can indicate their endianness by special code known UTF-16 and UTF-32 strings can indicate their endianness by special code known
@ -386,6 +395,52 @@ The following negative error codes are given for invalid UTF-32 strings:
.sp .sp
. .
. .
.\" HTML <a name="matchinvalid"></a>
.SH "MATCHING IN INVALID UTF STRINGS"
.rs
.sp
You can run pattern matches on subject strings that may contain invalid UTF
sequences if you call \fBpcre2_compile()\fP with the PCRE2_MATCH_INVALID_UTF
option. This is supported by \fBpcre2_match()\fP, including JIT matching, but
not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
valid UTF string.
.P
Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
generate different code. If JIT is not used, the option affects the behaviour
of the interpretive code in \fBpcre2_match()\fP. When PCRE2_MATCH_INVALID_UTF
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
.P
In this mode, an invalid code unit sequence in the subject never matches any
pattern item. It does not match dot, it does not match \ep{Any}, it does not
even match negative items such as [^X]. A lookbehind assertion fails if it
encounters an invalid sequence while moving the current point backwards. In
other words, an invalid UTF code unit sequence acts as a barrier which no match
can cross.
.P
You can also think of this as the subject being split up into fragments of
valid UTF, delimited internally by invalid code unit sequences. The pattern is
matched fragment by fragment. The result of a successful match, however, is
given as code unit offsets in the entire subject string in the usual way. There
are a few points to consider:
.P
The internal boundaries are not interpreted as the beginnings or ends of lines
and so do not match circumflex or dollar characters in the pattern.
.P
If \fBpcre2_match()\fP is called with an offset that points to an invalid
UTF-sequence, that sequence is skipped, and the match starts at the next valid
UTF character, or the end of the subject.
.P
At internal fragment boundaries, \eb and \eB behave in the same way as at the
beginning and end of the subject. For example, a sequence such as \ebWORD\eb
would match an instance of WORD that is surrounded by invalid UTF code units.
.P
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
data, knowing that any matched strings that are returned are valid UTF. This
can be useful when searching for UTF text in executable or other binary files.
.
.
.SH AUTHOR .SH AUTHOR
.rs .rs
.sp .sp
@ -400,6 +455,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 11 May 2019 Last updated: 24 May 2019
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2019 University of Cambridge.
.fi .fi
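The rule in the new section above for a starting offset that points to an invalid sequence (the sequence is skipped, and the match starts at the next valid UTF character or at the end of the subject) can be sketched for UTF-8 as follows. This is a hedged illustration, not PCRE2's implementation: `utf8_char_len` and `effective_start` are hypothetical names, and the sketch assumes skipping proceeds one code unit at a time until a well-formed character begins.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the well-formed UTF-8 character at s[0], or 0 if the
   bytes there do not form one. n is the number of bytes available. */
static size_t utf8_char_len(const unsigned char *s, size_t n)
{
    unsigned char c;
    if (n == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF)
        return (n >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    if (c >= 0xE0 && c <= 0xEF) {
        if (n < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        if (c == 0xE0 && s[1] < 0xA0) return 0;   /* overlong encoding */
        if (c == 0xED && s[1] > 0x9F) return 0;   /* surrogate range */
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        if (n < 4 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80 ||
            (s[3] & 0xC0) != 0x80) return 0;
        if (c == 0xF0 && s[1] < 0x90) return 0;   /* overlong encoding */
        if (c == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
        return 4;
    }
    return 0;
}

/* Return the offset at which matching would begin: the given offset if a
   valid character starts there, otherwise the offset of the next valid
   character, or n (the end of the subject) if there is none. */
static size_t effective_start(const unsigned char *s, size_t n, size_t offset)
{
    assert(offset <= n);
    while (offset < n && utf8_char_len(s + offset, n - offset) == 0)
        offset++;
    return offset;
}
```

With the subject bytes `'a',0xFF,0xFE,'b'`, a starting offset of 1 moves forward to 3 in this sketch, while a starting offset of 0 is left unchanged.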

@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be /* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions. #included by applications that call PCRE2 functions.
Copyright (c) 2016-2018 University of Cambridge Copyright (c) 2016-2019 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */ /* The current PCRE version information. */
#define PCRE2_MAJOR 10 #define PCRE2_MAJOR 10
#define PCRE2_MINOR 33 #define PCRE2_MINOR 34
#define PCRE2_PRERELEASE #define PCRE2_PRERELEASE -RC1
#define PCRE2_DATE 2019-04-16 #define PCRE2_DATE 2019-04-22
/* When an application links to a PCRE DLL in Windows, the symbols that are /* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE2, the appropriate imported have to be identified as such. When building PCRE2, the appropriate
@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */ #define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */ #define PCRE2_EXTENDED_MORE 0x01000000u /* C */
#define PCRE2_LITERAL 0x02000000u /* C */ #define PCRE2_LITERAL 0x02000000u /* C */
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
/* An additional compile options word is available in the compile context. */ /* An additional compile options word is available in the compile context. */
@ -305,6 +306,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194 #define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195 #define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196 #define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197
/* "Expected" matching error codes: no match and partial match. */ /* "Expected" matching error codes: no match and partial match. */
@ -390,6 +392,7 @@ released, the numbers must not be changed. */
#define PCRE2_ERROR_HEAPLIMIT (-63) #define PCRE2_ERROR_HEAPLIMIT (-63)
#define PCRE2_ERROR_CONVERT_SYNTAX (-64) #define PCRE2_ERROR_CONVERT_SYNTAX (-64)
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65) #define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
#define PCRE2_ERROR_DFA_UINVALID_UTF (-66)
/* Request types for pcre2_pattern_info() */ /* Request types for pcre2_pattern_info() */

View File

@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be /* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions. #included by applications that call PCRE2 functions.
Copyright (c) 2016-2018 University of Cambridge Copyright (c) 2016-2019 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */ #define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */ #define PCRE2_EXTENDED_MORE 0x01000000u /* C */
#define PCRE2_LITERAL 0x02000000u /* C */ #define PCRE2_LITERAL 0x02000000u /* C */
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
/* An additional compile options word is available in the compile context. */ /* An additional compile options word is available in the compile context. */
@ -391,6 +392,7 @@ released, the numbers must not be changed. */
#define PCRE2_ERROR_HEAPLIMIT (-63) #define PCRE2_ERROR_HEAPLIMIT (-63)
#define PCRE2_ERROR_CONVERT_SYNTAX (-64) #define PCRE2_ERROR_CONVERT_SYNTAX (-64)
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65) #define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
#define PCRE2_ERROR_DFA_UINVALID_UTF (-66)
/* Request types for pcre2_pattern_info() */ /* Request types for pcre2_pattern_info() */

View File

@ -746,8 +746,8 @@ are allowed. */
#define PUBLIC_LITERAL_COMPILE_OPTIONS \ #define PUBLIC_LITERAL_COMPILE_OPTIONS \
(PCRE2_ANCHORED|PCRE2_AUTO_CALLOUT|PCRE2_CASELESS|PCRE2_ENDANCHORED| \ (PCRE2_ANCHORED|PCRE2_AUTO_CALLOUT|PCRE2_CASELESS|PCRE2_ENDANCHORED| \
PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_NO_START_OPTIMIZE| \ PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_MATCH_INVALID_UTF| \
PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF) PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
#define PUBLIC_COMPILE_OPTIONS \ #define PUBLIC_COMPILE_OPTIONS \
(PUBLIC_LITERAL_COMPILE_OPTIONS| \ (PUBLIC_LITERAL_COMPILE_OPTIONS| \
@ -3615,7 +3615,7 @@ while (ptr < ptrend)
{ {
errorcode = ERR97; errorcode = ERR97;
goto FAILED; goto FAILED;
} }
cb->bracount++; cb->bracount++;
*parsed_pattern++ = META_CAPTURE | cb->bracount; *parsed_pattern++ = META_CAPTURE | cb->bracount;
} }
@ -4444,7 +4444,7 @@ while (ptr < ptrend)
{ {
errorcode = ERR97; errorcode = ERR97;
goto FAILED; goto FAILED;
} }
cb->bracount++; cb->bracount++;
*parsed_pattern++ = META_CAPTURE | cb->bracount; *parsed_pattern++ = META_CAPTURE | cb->bracount;
nest_depth++; nest_depth++;
@ -9503,6 +9503,10 @@ if (pattern == NULL)
if (ccontext == NULL) if (ccontext == NULL)
ccontext = (pcre2_compile_context *)(&PRIV(default_compile_context)); ccontext = (pcre2_compile_context *)(&PRIV(default_compile_context));
/* PCRE2_MATCH_INVALID_UTF implies UTF */
if ((options & PCRE2_MATCH_INVALID_UTF) != 0) options |= PCRE2_UTF;
/* Check that all undefined public option bits are zero. */ /* Check that all undefined public option bits are zero. */
@ -9682,7 +9686,7 @@ if ((options & PCRE2_LITERAL) == 0)
ptr += skipatstart; ptr += skipatstart;
/* Can't support UTF or UCP unless PCRE2 has been compiled with UTF support. */ /* Can't support UTF or UCP if PCRE2 was built without Unicode support. */
#ifndef SUPPORT_UNICODE #ifndef SUPPORT_UNICODE
if ((cb.external_options & (PCRE2_UTF|PCRE2_UCP)) != 0) if ((cb.external_options & (PCRE2_UTF|PCRE2_UCP)) != 0)

View File

@ -3294,6 +3294,11 @@ time. */
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 && if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
((re->overall_options | options) & PCRE2_ENDANCHORED) != 0) ((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
return PCRE2_ERROR_BADOPTION; return PCRE2_ERROR_BADOPTION;
/* Invalid UTF support is not available for DFA matching. */
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
return PCRE2_ERROR_DFA_UINVALID_UTF;
/* Check that the first field in the block is the magic number. If it is not, /* Check that the first field in the block is the magic number. If it is not,
return with PCRE2_ERROR_BADMAGIC. */ return with PCRE2_ERROR_BADMAGIC. */

View File

@ -269,6 +269,7 @@ static const unsigned char match_error_texts[] =
"invalid syntax\0" "invalid syntax\0"
/* 65 */ /* 65 */
"internal error - duplicate substitution match\0" "internal error - duplicate substitution match\0"
"PCRE2_MATCH_INVALID_UTF is not supported for DFA matching\0"
; ;

View File

@ -866,6 +866,7 @@ typedef struct match_block {
PCRE2_SPTR name_table; /* Table of group names */ PCRE2_SPTR name_table; /* Table of group names */
PCRE2_SPTR start_code; /* For use when recursing */ PCRE2_SPTR start_code; /* For use when recursing */
PCRE2_SPTR start_subject; /* Start of the subject string */ PCRE2_SPTR start_subject; /* Start of the subject string */
PCRE2_SPTR check_subject; /* Where UTF-checked from */
PCRE2_SPTR end_subject; /* End of the subject string */ PCRE2_SPTR end_subject; /* End of the subject string */
PCRE2_SPTR end_match_ptr; /* Subject position at end match */ PCRE2_SPTR end_match_ptr; /* Subject position at end match */
PCRE2_SPTR start_used_ptr; /* Earliest consulted character */ PCRE2_SPTR start_used_ptr; /* Earliest consulted character */

View File

@ -6,8 +6,9 @@
and semantics are as close as possible to those of the Perl 5 language. and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
This module by Zoltan Herczeg
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2018 University of Cambridge New API code Copyright (c) 2016-2019 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -7846,8 +7847,6 @@ if (needstype || needsscript)
if (needsscript) if (needsscript)
{ {
// PH hacking // PH hacking
//fprintf(stderr, "~~B\n");
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2); OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3); OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0); OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7901,7 +7900,6 @@ if (needstype || needsscript)
if (!needschar) if (!needschar)
{ {
// PH hacking // PH hacking
//fprintf(stderr, "~~C\n");
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2); OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3); OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0); OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7916,7 +7914,6 @@ if (needstype || needsscript)
else else
{ {
// PH hacking // PH hacking
//fprintf(stderr, "~~D\n");
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2); OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3); OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
@ -8594,8 +8591,8 @@ uint32_t c;
/* Patch by PH */ /* Patch by PH */
/* GETCHARINC(c, cc); */ /* GETCHARINC(c, cc); */
c = *cc++; c = *cc++;
#if PCRE2_CODE_UNIT_WIDTH == 32 #if PCRE2_CODE_UNIT_WIDTH == 32
if (c >= 0x110000) if (c >= 0x110000)
return NULL; return NULL;
@ -9257,8 +9254,6 @@ if (common->utf && *cc == OP_REFI)
CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop); CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop);
// PH hacking // PH hacking
//fprintf(stderr, "~~E\n");
OP1(SLJIT_MOV, TMP3, 0, TMP1, 0); OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);
add_jump(compiler, &common->getucd, JUMP(SLJIT_FAST_CALL)); add_jump(compiler, &common->getucd, JUMP(SLJIT_FAST_CALL));
@ -14156,49 +14151,87 @@ Returns: 0: success or (*NOJIT) was used
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_jit_compile(pcre2_code *code, uint32_t options) pcre2_jit_compile(pcre2_code *code, uint32_t options)
{ {
#ifndef SUPPORT_JIT
(void)code;
(void)options;
return PCRE2_ERROR_JIT_BADOPTION;
#else /* SUPPORT_JIT */
pcre2_real_code *re = (pcre2_real_code *)code; pcre2_real_code *re = (pcre2_real_code *)code;
executable_functions *functions; executable_functions *functions;
uint32_t excluded_options;
int result;
if (code == NULL) if (code == NULL)
return PCRE2_ERROR_NULL; return PCRE2_ERROR_NULL;
if ((options & ~PUBLIC_JIT_COMPILE_OPTIONS) != 0) if ((options & ~PUBLIC_JIT_COMPILE_OPTIONS) != 0)
return PCRE2_ERROR_JIT_BADOPTION; return PCRE2_ERROR_JIT_BADOPTION;
if ((re->flags & PCRE2_NOJIT) != 0) return 0;
functions = (executable_functions *)re->executable_jit; functions = (executable_functions *)re->executable_jit;
/* Support for invalid UTF was first introduced in JIT, with the option
PCRE2_JIT_INVALID_UTF. Later, support was added to the interpreter, and the
compile-time option PCRE2_MATCH_INVALID_UTF was created. This is now the
preferred feature, with the earlier option deprecated. However, for backward
compatibility, if the earlier option is set, it forces the new option so that
if JIT matching falls back to the interpreter, there is still support for
invalid UTF. However, if this function has already been successfully called
without PCRE2_JIT_INVALID_UTF and without PCRE2_MATCH_INVALID_UTF (meaning that
non-invalid-supporting JIT code was compiled), give an error.
If in the future support for PCRE2_JIT_INVALID_UTF is withdrawn, the following
actions are needed:
1. Remove the definition from pcre2.h.in and from the list in
PUBLIC_JIT_COMPILE_OPTIONS above.
2. Replace PCRE2_JIT_INVALID_UTF with a local flag in this module.
3. Replace PCRE2_JIT_INVALID_UTF in pcre2_jit_test.c.
4. Delete the following short block of code. The setting of "re" and
"functions" can be moved into the JIT-only block below, but if that is
done, (void)re and (void)functions will be needed in the non-JIT case, to
avoid compiler warnings.
*/
if ((options & PCRE2_JIT_INVALID_UTF) != 0)
{
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) == 0)
{
if (functions != NULL) return PCRE2_ERROR_JIT_BADOPTION;
re->overall_options |= PCRE2_MATCH_INVALID_UTF;
}
}
/* The above tests are run with and without JIT support. This means that
PCRE2_JIT_INVALID_UTF propagates back into the regex options (ensuring
interpreter support) even in the absence of JIT. But now, if there is no JIT
support, give an error return. */
#ifndef SUPPORT_JIT
return PCRE2_ERROR_JIT_BADOPTION;
#else /* SUPPORT_JIT */
/* There is JIT support. Do the necessary. */
if ((re->flags & PCRE2_NOJIT) != 0) return 0;
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
options |= PCRE2_JIT_INVALID_UTF;
if ((options & PCRE2_JIT_COMPLETE) != 0 && (functions == NULL if ((options & PCRE2_JIT_COMPLETE) != 0 && (functions == NULL
|| functions->executable_funcs[0] == NULL)) { || functions->executable_funcs[0] == NULL)) {
excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD); uint32_t excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
result = jit_compile(code, options & ~excluded_options); int result = jit_compile(code, options & ~excluded_options);
if (result != 0) if (result != 0)
return result; return result;
} }
if ((options & PCRE2_JIT_PARTIAL_SOFT) != 0 && (functions == NULL if ((options & PCRE2_JIT_PARTIAL_SOFT) != 0 && (functions == NULL
|| functions->executable_funcs[1] == NULL)) { || functions->executable_funcs[1] == NULL)) {
excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD); uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
result = jit_compile(code, options & ~excluded_options); int result = jit_compile(code, options & ~excluded_options);
if (result != 0) if (result != 0)
return result; return result;
} }
if ((options & PCRE2_JIT_PARTIAL_HARD) != 0 && (functions == NULL if ((options & PCRE2_JIT_PARTIAL_HARD) != 0 && (functions == NULL
|| functions->executable_funcs[2] == NULL)) { || functions->executable_funcs[2] == NULL)) {
excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT); uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
result = jit_compile(code, options & ~excluded_options); int result = jit_compile(code, options & ~excluded_options);
if (result != 0) if (result != 0)
return result; return result;
} }
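
Review note: the backward-compatibility rule in the long comment above (the legacy JIT flag forces the new compile-time option, unless JIT code without invalid-UTF support already exists) can be modelled in isolation. The sketch below is a hedged approximation, not the function itself. demo_code, demo_apply_legacy_flag() and all DEMO_ constants, including their values, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_JIT_INVALID_UTF   0x00000100u  /* legacy JIT flag (illustrative) */
#define DEMO_MATCH_INVALID_UTF 0x04000000u  /* compile-time option */
#define DEMO_ERR_BADOPTION     (-1)         /* illustrative error code */

typedef struct {
  uint32_t overall_options;  /* options recorded in the compiled pattern */
  int has_jit_code;          /* nonzero once JIT code has been generated */
} demo_code;

/* If the legacy flag is given and the pattern lacks the new option, either
   promote the flag into the pattern options or, when JIT code was already
   built without invalid-UTF support, report an error. */
static int demo_apply_legacy_flag(demo_code *re, uint32_t jit_options)
{
  assert(re != NULL);
  if ((jit_options & DEMO_JIT_INVALID_UTF) != 0 &&
      (re->overall_options & DEMO_MATCH_INVALID_UTF) == 0)
  {
    if (re->has_jit_code) return DEMO_ERR_BADOPTION;
    re->overall_options |= DEMO_MATCH_INVALID_UTF;
  }
  return 0;
}
```

Because the promotion writes to the pattern's own options, a later fallback to the interpreter sees the same setting as the JIT code, which appears to be the point of running this step even in builds without JIT.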

View File

@ -5412,7 +5412,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
{ {
while (number-- > 0) while (number-- > 0)
{ {
if (Feptr <= mb->start_subject) RRETURN(MATCH_NOMATCH); if (Feptr <= mb->check_subject) RRETURN(MATCH_NOMATCH);
Feptr--; Feptr--;
BACKCHAR(Feptr); BACKCHAR(Feptr);
} }
@ -5420,7 +5420,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
else else
#endif #endif
/* No UTF-8 support, or not in UTF-8 mode: count is byte count */ /* No UTF-8 support, or not in UTF-8 mode: count is code unit count */
{ {
if ((ptrdiff_t)number > Feptr - mb->start_subject) RRETURN(MATCH_NOMATCH); if ((ptrdiff_t)number > Feptr - mb->start_subject) RRETURN(MATCH_NOMATCH);
@ -5743,7 +5743,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
case OP_NOT_WORD_BOUNDARY: case OP_NOT_WORD_BOUNDARY:
case OP_WORD_BOUNDARY: case OP_WORD_BOUNDARY:
if (Feptr == mb->start_subject) prev_is_word = FALSE; else if (Feptr == mb->check_subject) prev_is_word = FALSE; else
{ {
PCRE2_SPTR lastptr = Feptr - 1; PCRE2_SPTR lastptr = Feptr - 1;
#ifdef SUPPORT_UNICODE #ifdef SUPPORT_UNICODE
@ -6014,7 +6014,6 @@ int was_zero_terminated = 0;
const uint8_t *start_bits = NULL; const uint8_t *start_bits = NULL;
const pcre2_real_code *re = (const pcre2_real_code *)code; const pcre2_real_code *re = (const pcre2_real_code *)code;
BOOL anchored; BOOL anchored;
BOOL firstline; BOOL firstline;
BOOL has_first_cu = FALSE; BOOL has_first_cu = FALSE;
@ -6029,10 +6028,23 @@ PCRE2_UCHAR req_cu2 = 0;
PCRE2_SPTR bumpalong_limit; PCRE2_SPTR bumpalong_limit;
PCRE2_SPTR end_subject; PCRE2_SPTR end_subject;
PCRE2_SPTR true_end_subject;
PCRE2_SPTR start_match = subject + start_offset; PCRE2_SPTR start_match = subject + start_offset;
PCRE2_SPTR req_cu_ptr = start_match - 1; PCRE2_SPTR req_cu_ptr = start_match - 1;
PCRE2_SPTR start_partial = NULL; PCRE2_SPTR start_partial;
PCRE2_SPTR match_partial = NULL; PCRE2_SPTR match_partial;
#ifdef SUPPORT_JIT
BOOL use_jit;
#endif
#ifdef SUPPORT_UNICODE
BOOL allow_invalid;
uint32_t fragment_options = 0;
#ifdef SUPPORT_JIT
BOOL jit_checked_utf = FALSE;
#endif
#endif
PCRE2_SIZE frame_size; PCRE2_SIZE frame_size;
@ -6059,7 +6071,7 @@ if (length == PCRE2_ZERO_TERMINATED)
length = PRIV(strlen)(subject); length = PRIV(strlen)(subject);
was_zero_terminated = 1; was_zero_terminated = 1;
} }
end_subject = subject + length; true_end_subject = end_subject = subject + length;
/* Plausibility checks */ /* Plausibility checks */
@ -6095,12 +6107,24 @@ options |= (re->flags & FF) / ((FF & (~FF+1)) / (OO & (~OO+1)));
#undef FF #undef FF
#undef OO #undef OO
/* These two settings are used in the code for checking a UTF string that /* If the pattern was successfully studied with JIT support, we will run the
follows immediately afterwards. Other values in the mb block are used only JIT executable instead of the rest of this function. Most options must be set
during interpretive processing, not when the JIT support is in use, so they are at compile time for the JIT code to be usable. */
set up later. */
#ifdef SUPPORT_JIT
use_jit = (re->executable_jit != NULL &&
(options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0);
#endif
/* Initialize UTF parameters. */
utf = (re->overall_options & PCRE2_UTF) != 0; utf = (re->overall_options & PCRE2_UTF) != 0;
#ifdef SUPPORT_UNICODE
allow_invalid = (re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0;
#endif
/* Convert the partial matching flags into an integer. */
mb->partial = ((options & PCRE2_PARTIAL_HARD) != 0)? 2 : mb->partial = ((options & PCRE2_PARTIAL_HARD) != 0)? 2 :
((options & PCRE2_PARTIAL_SOFT) != 0)? 1 : 0; ((options & PCRE2_PARTIAL_SOFT) != 0)? 1 : 0;
@ -6111,61 +6135,6 @@ if (mb->partial != 0 &&
((re->overall_options | options) & PCRE2_ENDANCHORED) != 0) ((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
return PCRE2_ERROR_BADOPTION; return PCRE2_ERROR_BADOPTION;
/* Check a UTF string for validity if required. For 8-bit and 16-bit strings,
we must also check that a starting offset does not point into the middle of a
multiunit character. We check only the portion of the subject that is going to
be inspected during matching - from the offset minus the maximum back reference
to the given length. This saves time when a small part of a large subject is
being matched by the use of a starting offset. Note that the maximum lookbehind
is a number of characters, not code units. */
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{
PCRE2_SPTR check_subject = start_match; /* start_match includes offset */
if (start_offset > 0)
{
#if PCRE2_CODE_UNIT_WIDTH != 32
unsigned int i;
if (start_match < end_subject && NOT_FIRSTCU(*start_match))
return PCRE2_ERROR_BADUTFOFFSET;
for (i = re->max_lookbehind; i > 0 && check_subject > subject; i--)
{
check_subject--;
while (check_subject > subject &&
#if PCRE2_CODE_UNIT_WIDTH == 8
(*check_subject & 0xc0) == 0x80)
#else /* 16-bit */
(*check_subject & 0xfc00) == 0xdc00)
#endif /* PCRE2_CODE_UNIT_WIDTH == 8 */
check_subject--;
}
#else
/* In the 32-bit library, one code unit equals one character. However,
we cannot just subtract the lookbehind and then compare pointers, because
a very large lookbehind could create an invalid pointer. */
if (start_offset >= re->max_lookbehind)
check_subject -= re->max_lookbehind;
else
check_subject = subject;
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
}
/* Validate the relevant portion of the subject. After an error, adjust the
offset to be an absolute offset in the whole string. */
match_data->rc = PRIV(valid_utf)(check_subject,
length - (check_subject - subject), &(match_data->startchar));
if (match_data->rc != 0)
{
match_data->startchar += check_subject - subject;
return match_data->rc;
}
}
#endif /* SUPPORT_UNICODE */
/* It is an error to set an offset limit without setting the flag at compile /* It is an error to set an offset limit without setting the flag at compile
time. */ time. */
@ -6184,15 +6153,85 @@ if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
} }
match_data->subject = NULL; match_data->subject = NULL;
/* If the pattern was successfully studied with JIT support, run the JIT
executable instead of the rest of this function. Most options must be set at /* ============================= JIT matching ============================== */
compile time for the JIT code to be usable. Fallback to the normal code path if
an unsupported option is set or if JIT returns BADOPTION (which means that the /* Prepare for JIT matching. Check a UTF string for validity unless no check is
selected normal or partial matching mode was not compiled). */ requested or invalid UTF can be handled. We check only the portion of the
subject that might be inspected during matching - from the offset minus the
maximum lookbehind to the given length. This saves time when a small part of a
large subject is being matched by the use of a starting offset. Note that the
maximum lookbehind is a number of characters, not code units. */
#ifdef SUPPORT_JIT #ifdef SUPPORT_JIT
if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0) if (use_jit)
{ {
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0 && !allow_invalid)
{
#if PCRE2_CODE_UNIT_WIDTH != 32
unsigned int i;
#endif
/* For 8-bit and 16-bit UTF, check that the first code unit is a valid
character start. */
#if PCRE2_CODE_UNIT_WIDTH != 32
if (start_match < end_subject && NOT_FIRSTCU(*start_match))
{
if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
#if PCRE2_CODE_UNIT_WIDTH == 8
return PCRE2_ERROR_UTF8_ERR20; /* Isolated 0x80 byte */
#else
return PCRE2_ERROR_UTF16_ERR3; /* Isolated low surrogate */
#endif
}
#endif /* WIDTH != 32 */
/* Move back by the maximum lookbehind, just in case it happens at the very
start of matching. */
#if PCRE2_CODE_UNIT_WIDTH != 32
for (i = re->max_lookbehind; i > 0 && start_match > subject; i--)
{
start_match--;
while (start_match > subject &&
#if PCRE2_CODE_UNIT_WIDTH == 8
(*start_match & 0xc0) == 0x80)
#else /* 16-bit */
(*start_match & 0xfc00) == 0xdc00)
#endif
start_match--;
}
#else /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* In the 32-bit library, one code unit equals one character. However,
we cannot just subtract the lookbehind and then compare pointers, because
a very large lookbehind could create an invalid pointer. */
if (start_offset >= re->max_lookbehind)
start_match -= re->max_lookbehind;
else
start_match = subject;
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* Validate the relevant portion of the subject. Adjust the offset of an
invalid code point to be an absolute offset in the whole string. */
match_data->rc = PRIV(valid_utf)(start_match,
length - (start_match - subject), &(match_data->startchar));
if (match_data->rc != 0)
{
match_data->startchar += start_match - subject;
return match_data->rc;
}
jit_checked_utf = TRUE;
}
#endif /* SUPPORT_UNICODE */
/* If JIT returns BADOPTION, which means that the selected complete or
partial matching mode was not compiled, fall through to the interpreter. */
rc = pcre2_jit_match(code, subject, length, start_offset, options, rc = pcre2_jit_match(code, subject, length, start_offset, options,
match_data, mcontext); match_data, mcontext);
if (rc != PCRE2_ERROR_JIT_BADOPTION) if (rc != PCRE2_ERROR_JIT_BADOPTION)
@ -6209,10 +6248,152 @@ if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
return rc; return rc;
} }
} }
#endif /* SUPPORT_JIT */
/* ========================= End of JIT matching ========================== */
/* Proceed with non-JIT matching. The default is to allow lookbehinds to the
start of the subject. A UTF check when there is a non-zero offset may change
this. */
mb->check_subject = subject;
/* If a UTF subject string was not checked for validity in the JIT code above,
check it here, and handle support for invalid UTF strings. The check above
happens only when invalid UTF is not supported and PCRE2_NO_UTF_CHECK is unset.
If we get here in those circumstances, it means the subject string is valid,
but for some reason JIT matching was not successful. There is no need to check
the subject again.
We check only the portion of the subject that might be inspected during
matching - from the offset minus the maximum lookbehind to the given length.
This saves time when a small part of a large subject is being matched by the
use of a starting offset. Note that the maximum lookbehind is a number of
characters, not code units.
Note also that support for invalid UTF forces a check, overriding the setting
of PCRE2_NO_UTF_CHECK. */
#ifdef SUPPORT_UNICODE
if (utf &&
#ifdef SUPPORT_JIT
!jit_checked_utf &&
#endif
((options & PCRE2_NO_UTF_CHECK) == 0 || allow_invalid))
{
#if PCRE2_CODE_UNIT_WIDTH != 32
BOOL skipped_bad_start = FALSE;
#endif #endif
/* Carry on with non-JIT matching. A NULL match context means "use a default /* For 8-bit and 16-bit UTF, check that the first code unit is a valid
context", but we take the memory control functions from the pattern. */ character start. If we are handling invalid UTF, just skip over such code
units. Otherwise, give an appropriate error. */
#if PCRE2_CODE_UNIT_WIDTH != 32
if (allow_invalid)
{
while (start_match < end_subject && NOT_FIRSTCU(*start_match))
{
start_match++;
skipped_bad_start = TRUE;
}
}
else if (start_match < end_subject && NOT_FIRSTCU(*start_match))
{
if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
#if PCRE2_CODE_UNIT_WIDTH == 8
return PCRE2_ERROR_UTF8_ERR20; /* Isolated 0x80 byte */
#else
return PCRE2_ERROR_UTF16_ERR3; /* Isolated low surrogate */
#endif
}
#endif /* WIDTH != 32 */
/* The mb->check_subject field points to the start of UTF checking;
lookbehinds can go back no further than this. */
mb->check_subject = start_match;
/* Move back by the maximum lookbehind, just in case it happens at the very
start of matching, but don't do this if we skipped bad 8-bit or 16-bit code
units above. */
#if PCRE2_CODE_UNIT_WIDTH != 32
if (!skipped_bad_start)
{
unsigned int i;
for (i = re->max_lookbehind; i > 0 && mb->check_subject > subject; i--)
{
mb->check_subject--;
while (mb->check_subject > subject &&
#if PCRE2_CODE_UNIT_WIDTH == 8
(*mb->check_subject & 0xc0) == 0x80)
#else /* 16-bit */
(*mb->check_subject & 0xfc00) == 0xdc00)
#endif
mb->check_subject--;
}
}
#else /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* In the 32-bit library, one code unit equals one character. However,
we cannot just subtract the lookbehind and then compare pointers, because
a very large lookbehind could create an invalid pointer. */
if (start_offset >= re->max_lookbehind)
mb->check_subject -= re->max_lookbehind;
else
mb->check_subject = subject;
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
/* Validate the relevant portion of the subject. There's a loop in case we
encounter bad UTF in the characters preceding start_match which we are
scanning because of a lookbehind. */
for (;;)
{
match_data->rc = PRIV(valid_utf)(mb->check_subject,
length - (mb->check_subject - subject), &(match_data->startchar));
if (match_data->rc == 0) break; /* Valid UTF string */
/* Invalid UTF string. Adjust the offset to be an absolute offset in the
whole string. If we are handling invalid UTF strings, set end_subject to
stop before the bad code unit, and set the options to "not end of line".
Otherwise return the error. */
match_data->startchar += mb->check_subject - subject;
if (!allow_invalid || match_data->rc > 0) return match_data->rc;
end_subject = subject + match_data->startchar;
/* If the end precedes start_match, it means there is invalid UTF in the
extra code units we reversed over because of a lookbehind. Advance past the
first bad code unit, and then skip invalid character starting code units in
8-bit and 16-bit modes, and try again. */
if (end_subject < start_match)
{
mb->check_subject = end_subject + 1;
#if PCRE2_CODE_UNIT_WIDTH != 32
while (mb->check_subject < start_match && NOT_FIRSTCU(*mb->check_subject))
mb->check_subject++;
#endif
}
/* Otherwise, set the not end of line option, and do the match. */
else
{
fragment_options = PCRE2_NOTEOL;
break;
}
}
}
#endif /* SUPPORT_UNICODE */
/* A NULL match context means "use a default context", but we take the memory
control functions from the pattern. */
if (mcontext == NULL) if (mcontext == NULL)
{ {
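
Review note: two pointer adjustments in the hunk above are easy to misread, so here is a small standalone sketch of both for the 8-bit case, using offsets instead of pointers. One skips code units that cannot start a character; the other moves back by a number of characters (not code units) for the lookbehind. The function names are hypothetical and the code is an approximation of the idea rather than the library's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx and can never be
   the first code unit of a character. */
static int demo_is_continuation(uint8_t c)
{
  return (c & 0xc0) == 0x80;
}

/* Advance pos past any code units that cannot start a character. */
static size_t demo_skip_bad_start(const uint8_t *s, size_t len, size_t pos)
{
  assert(s != NULL && pos <= len);
  while (pos < len && demo_is_continuation(s[pos])) pos++;
  return pos;
}

/* Move pos back by up to nchars characters, stopping at offset 0. Each step
   moves back one code unit and then over any continuation bytes. */
static size_t demo_back_up_chars(const uint8_t *s, size_t pos,
  unsigned int nchars)
{
  unsigned int i;
  assert(s != NULL);
  for (i = nchars; i > 0 && pos > 0; i--)
  {
    pos--;
    while (pos > 0 && demo_is_continuation(s[pos])) pos--;
  }
  return pos;
}
```

In the hunk, the backward step is skipped when bad code units were skipped at the start, since the data immediately before the new start point is already known to be invalid.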
@ -6224,8 +6405,8 @@ else mb->memctl = mcontext->memctl;
anchored = ((re->overall_options | options) & PCRE2_ANCHORED) != 0; anchored = ((re->overall_options | options) & PCRE2_ANCHORED) != 0;
firstline = (re->overall_options & PCRE2_FIRSTLINE) != 0; firstline = (re->overall_options & PCRE2_FIRSTLINE) != 0;
startline = (re->flags & PCRE2_STARTLINE) != 0; startline = (re->flags & PCRE2_STARTLINE) != 0;
bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)? bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
end_subject : subject + mcontext->offset_limit; true_end_subject : subject + mcontext->offset_limit;
/* Initialize and set up the fixed fields in the callout block, with a pointer /* Initialize and set up the fixed fields in the callout block, with a pointer
in the match block. */ in the match block. */
@ -6236,7 +6417,8 @@ cb.subject = subject;
cb.subject_length = (PCRE2_SIZE)(end_subject - subject); cb.subject_length = (PCRE2_SIZE)(end_subject - subject);
cb.callout_flags = 0; cb.callout_flags = 0;
/* Fill in the remaining fields in the match block. */ /* Fill in the remaining fields in the match block, except for moptions, which
gets set later. */
mb->callout = mcontext->callout; mb->callout = mcontext->callout;
mb->callout_data = mcontext->callout_data; mb->callout_data = mcontext->callout_data;
@ -6245,13 +6427,9 @@ mb->start_subject = subject;
mb->start_offset = start_offset; mb->start_offset = start_offset;
mb->end_subject = end_subject; mb->end_subject = end_subject;
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0; mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
mb->moptions = options; /* Match options */
mb->poptions = re->overall_options; /* Pattern options */ mb->poptions = re->overall_options; /* Pattern options */
mb->ignore_skip_arg = 0; mb->ignore_skip_arg = 0;
mb->mark = mb->nomatch_mark = NULL; /* In case never set */ mb->mark = mb->nomatch_mark = NULL; /* In case never set */
mb->hitend = FALSE;
/* The name table is needed for finding all the numbers associated with a /* The name table is needed for finding all the numbers associated with a
given name, for condition testing. The code follows the name table. */ given name, for condition testing. The code follows the name table. */
@ -6404,6 +6582,13 @@ if ((re->flags & PCRE2_LASTSET) != 0)
/* Loop for handling unanchored repeated matching attempts; for anchored regexs /* Loop for handling unanchored repeated matching attempts; for anchored regexs
the loop runs just once. */ the loop runs just once. */
#ifdef SUPPORT_UNICODE
FRAGMENT_RESTART:
#endif
start_partial = match_partial = NULL;
mb->hitend = FALSE;
for(;;) for(;;)
{ {
PCRE2_SPTR new_start_match; PCRE2_SPTR new_start_match;
@ -6714,6 +6899,11 @@ for(;;)
mb->start_used_ptr = start_match; mb->start_used_ptr = start_match;
mb->last_used_ptr = start_match; mb->last_used_ptr = start_match;
#ifdef SUPPORT_UNICODE
mb->moptions = options | fragment_options;
#else
mb->moptions = options;
#endif
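Review note: moptions is now set per match attempt because fragment_options can change between fragments; it is assigned in the fragment-advance code after ENDLOOP in this same file. The sketch below only restates the two assignments visible there. The SK_ names and both function names are hypothetical, and the numeric values are assumed to match PCRE2_NOTBOL and PCRE2_NOTEOL in pcre2.h. How the first fragment's options are chosen is not part of this hunk and is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SK_NOTBOL 0x00000001u   /* assumed value of PCRE2_NOTBOL */
#define SK_NOTEOL 0x00000002u   /* assumed value of PCRE2_NOTEOL */

/* Options for a fragment that starts after an invalid code unit.
   more_bad_data is nonzero when another invalid sequence follows the
   fragment, so the end of the fragment is not the end of the subject. */
uint32_t sk_restart_fragment_options(int more_bad_data)
{
  uint32_t f = more_bad_data ? (SK_NOTBOL | SK_NOTEOL) : SK_NOTBOL;
  /* Every restarted fragment begins mid-subject, so NOTBOL is always set. */
  assert((f & SK_NOTBOL) != 0);
  return f;
}

/* Mirrors "mb->moptions = options | fragment_options" in the hunk. */
uint32_t sk_moptions(uint32_t options, uint32_t fragment_options)
{
  return options | fragment_options;
}
```

This is consistent with the new /ab$/ test in testinput10, where "ab\x80cde" is expected not to match: the fragment "ab" ends at an invalid byte, not at the end of the subject.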
mb->match_call_count = 0; mb->match_call_count = 0;
mb->end_offset_top = 0; mb->end_offset_top = 0;
mb->skip_arg_count = 0; mb->skip_arg_count = 0;
@ -6839,6 +7029,68 @@ for(;;)
ENDLOOP: ENDLOOP:
/* If end_subject != true_end_subject, it means we are handling invalid UTF,
and have just processed a non-terminal fragment. If this resulted in no match
or a partial match we must carry on to the next fragment (a partial match is
returned to the caller only at the very end of the subject). A loop is used to
avoid trying to match against empty fragments; if the pattern can match an
empty string it would have done so already. */
#ifdef SUPPORT_UNICODE
if (utf && end_subject != true_end_subject &&
(rc == MATCH_NOMATCH || rc == PCRE2_ERROR_PARTIAL))
{
for (;;)
{
/* Advance past the first bad code unit, and then skip invalid character
starting code units in 8-bit and 16-bit modes. */
start_match = end_subject + 1;
#if PCRE2_CODE_UNIT_WIDTH != 32
while (start_match < true_end_subject && NOT_FIRSTCU(*start_match))
start_match++;
#endif
/* If we have hit the end of the subject, there isn't another non-empty
fragment, so give up. */
if (start_match >= true_end_subject)
{
rc = MATCH_NOMATCH; /* In case it was partial */
break;
}
/* Check the rest of the subject */
mb->check_subject = start_match;
rc = PRIV(valid_utf)(start_match, length - (start_match - subject),
&(match_data->startchar));
/* The rest of the subject is valid UTF. */
if (rc == 0)
{
mb->end_subject = end_subject = true_end_subject;
fragment_options = PCRE2_NOTBOL;
goto FRAGMENT_RESTART;
}
/* A subsequent UTF error has been found; if the next fragment is
non-empty, set up to process it. Otherwise, let the loop advance. */
else if (rc < 0)
{
mb->end_subject = end_subject = start_match + match_data->startchar;
if (end_subject > start_match)
{
fragment_options = PCRE2_NOTBOL|PCRE2_NOTEOL;
goto FRAGMENT_RESTART;
}
}
}
}
#endif /* SUPPORT_UNICODE */
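Review note: the fragment-advance loop above is easier to follow outside the diff. What follows is a minimal standalone sketch of the same idea for 8-bit mode only; it is not the code from this commit. valid_prefix() is a hypothetical stand-in for PRIV(valid_utf) that returns the length of the longest well-formed UTF-8 prefix instead of an error code plus offset, and next_fragment() is a hypothetical name for the loop body. The continuation-byte test assumes that NOT_FIRSTCU in 8-bit mode checks for the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PRIV(valid_utf): returns the length of the
   longest prefix of s[0..len) that is well-formed UTF-8. */
size_t valid_prefix(const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
    unsigned char c = s[i];
    size_t n, k;                       /* n = continuation bytes expected */
    if (c < 0x80) { i++; continue; }   /* ASCII */
    if (c >= 0xc2 && c <= 0xdf) n = 1;
    else if (c >= 0xe0 && c <= 0xef) n = 2;
    else if (c >= 0xf0 && c <= 0xf4) n = 3;
    else break;              /* stray continuation or invalid lead byte */
    if (len - i <= n) break;           /* truncated sequence */
    for (k = 1; k <= n; k++)
      if ((s[i + k] & 0xc0) != 0x80) break;
    if (k <= n) break;                 /* missing continuation byte */
    if ((c == 0xe0 && s[i + 1] < 0xa0) ||   /* overlong 3-byte form */
        (c == 0xed && s[i + 1] > 0x9f) ||   /* surrogate 0xd800-0xdfff */
        (c == 0xf0 && s[i + 1] < 0x90) ||   /* overlong 4-byte form */
        (c == 0xf4 && s[i + 1] > 0x8f))     /* above 0x10ffff */
      break;
    i += n + 1;
    }
  return i;
}

/* Given the offset bad of an invalid code unit, find the next non-empty
   well-formed fragment [*start, *end). Returns 1 if one exists, else 0. */
int next_fragment(const unsigned char *s, size_t len, size_t bad,
  size_t *start, size_t *end)
{
  size_t p = bad;
  assert(bad < len);
  for (;;)
    {
    size_t valid;
    p++;                               /* step past the bad code unit */
    while (p < len && (s[p] & 0xc0) == 0x80) p++;  /* skip continuations */
    if (p >= len) return 0;            /* no further fragment */
    valid = valid_prefix(s + p, len - p);
    if (valid > 0) { *start = p; *end = p + valid; return 1; }
    /* Empty fragment: s[p] is itself invalid, so let the loop advance. */
    }
}
```

For the subject abcd\x80wxzy\x80pqrs this yields the fragments abcd, wxzy and pqrs, which lines up with the matches abc, wxz and pqr recorded for /.../g,match_invalid_utf in testoutput10.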
/* Release an enlarged frame vector that is on the heap. */ /* Release an enlarged frame vector that is on the heap. */
if (mb->match_frames != mb->stack_frames) if (mb->match_frames != mb->stack_frames)

View File

@ -212,6 +212,12 @@ be C99 don't support it (hence DISABLE_PERCENT_ZT). */
#define REPLACE_MODSIZE 100 /* Field for reading 8-bit replacement */ #define REPLACE_MODSIZE 100 /* Field for reading 8-bit replacement */
#define VERSION_SIZE 64 /* Size of buffer for the version strings */ #define VERSION_SIZE 64 /* Size of buffer for the version strings */
/* Default JIT compile options */
#define JIT_DEFAULT (PCRE2_JIT_COMPLETE|\
PCRE2_JIT_PARTIAL_SOFT|\
PCRE2_JIT_PARTIAL_HARD)
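Review note: JIT_DEFAULT replaces the literal 7 that pcre2test.c used in three places further down in this diff. A small sketch that checks the equivalence, assuming the option values published in pcre2.h (0x1, 0x2 and 0x4, with 0x100 for PCRE2_JIT_INVALID_UTF). The SK_ names and sk_jit_default() are hypothetical local copies so that the block compiles without the PCRE2 headers.

```c
#include <assert.h>
#include <stdint.h>

#define SK_JIT_COMPLETE      0x00000001u  /* assumed PCRE2_JIT_COMPLETE */
#define SK_JIT_PARTIAL_SOFT  0x00000002u  /* assumed PCRE2_JIT_PARTIAL_SOFT */
#define SK_JIT_PARTIAL_HARD  0x00000004u  /* assumed PCRE2_JIT_PARTIAL_HARD */
#define SK_JIT_INVALID_UTF   0x00000100u  /* assumed PCRE2_JIT_INVALID_UTF */

#define SK_JIT_DEFAULT (SK_JIT_COMPLETE|\
                        SK_JIT_PARTIAL_SOFT|\
                        SK_JIT_PARTIAL_HARD)

/* Returns the default JIT option set, i.e. full plus both partial modes. */
uint32_t sk_jit_default(void)
{
  uint32_t opts = SK_JIT_DEFAULT;
  /* The default deliberately leaves the invalid-UTF JIT bit clear. */
  assert((opts & SK_JIT_INVALID_UTF) == 0);
  return opts;
}
```

Naming the mask keeps the "full & partial" intent visible at each use site instead of relying on the reader to decode 7.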
/* Make sure the buffer into which replacement strings are copied is big enough /* Make sure the buffer into which replacement strings are copied is big enough
to hold them as 32-bit code units. */ to hold them as 32-bit code units. */
@ -664,6 +670,7 @@ static modstruct modlist[] = {
{ "literal", MOD_PAT, MOD_OPT, PCRE2_LITERAL, PO(options) }, { "literal", MOD_PAT, MOD_OPT, PCRE2_LITERAL, PO(options) },
{ "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) }, { "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) },
{ "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) }, { "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) },
{ "match_invalid_utf", MOD_PAT, MOD_OPT, PCRE2_MATCH_INVALID_UTF, PO(options) },
{ "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) }, { "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) },
{ "match_line", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_LINE, CO(extra_options) }, { "match_line", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_LINE, CO(extra_options) },
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) }, { "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
@ -4136,7 +4143,7 @@ static void
show_compile_options(uint32_t options, const char *before, const char *after) show_compile_options(uint32_t options, const char *before, const char *after)
{ {
if (options == 0) fprintf(outfile, "%s <none>%s", before, after); if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before, before,
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "", ((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
((options & PCRE2_ALT_CIRCUMFLEX) != 0)? " alt_circumflex" : "", ((options & PCRE2_ALT_CIRCUMFLEX) != 0)? " alt_circumflex" : "",
@ -4153,6 +4160,7 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%
((options & PCRE2_EXTENDED_MORE) != 0)? " extended_more" : "", ((options & PCRE2_EXTENDED_MORE) != 0)? " extended_more" : "",
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "", ((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
((options & PCRE2_LITERAL) != 0)? " literal" : "", ((options & PCRE2_LITERAL) != 0)? " literal" : "",
((options & PCRE2_MATCH_INVALID_UTF) != 0)? " match_invalid_utf" : "",
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "", ((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
((options & PCRE2_MULTILINE) != 0)? " multiline" : "", ((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
((options & PCRE2_NEVER_BACKSLASH_C) != 0)? " never_backslash_c" : "", ((options & PCRE2_NEVER_BACKSLASH_C) != 0)? " never_backslash_c" : "",
@ -4867,7 +4875,7 @@ switch(cmd)
case CMD_PATTERN: case CMD_PATTERN:
(void)decode_modifiers(argptr, CTX_DEFPAT, &def_patctl, NULL); (void)decode_modifiers(argptr, CTX_DEFPAT, &def_patctl, NULL);
if (def_patctl.jit == 0 && (def_patctl.control & CTL_JITVERIFY) != 0) if (def_patctl.jit == 0 && (def_patctl.control & CTL_JITVERIFY) != 0)
def_patctl.jit = 7; def_patctl.jit = JIT_DEFAULT;
break; break;
/* Set default subject modifiers */ /* Set default subject modifiers */
@ -5114,7 +5122,11 @@ patlen = p - buffer - 2;
/* Look for modifiers and options after the final delimiter. */ /* Look for modifiers and options after the final delimiter. */
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP; if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
utf = (pat_patctl.options & PCRE2_UTF) != 0;
/* Note that the match_invalid_utf option also sets utf when passed to
pcre2_compile(). */
utf = (pat_patctl.options & (PCRE2_UTF|PCRE2_MATCH_INVALID_UTF)) != 0;
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually /* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
exclusive with the utf modifier. */ exclusive with the utf modifier. */
@ -5161,7 +5173,7 @@ specified. */
if (pat_patctl.jit == 0 && if (pat_patctl.jit == 0 &&
(pat_patctl.control & (CTL_JITVERIFY|CTL_JITFAST)) != 0) (pat_patctl.control & (CTL_JITVERIFY|CTL_JITFAST)) != 0)
pat_patctl.jit = 7; pat_patctl.jit = JIT_DEFAULT;
/* Now copy the pattern to pbuffer8 for use in 8-bit testing and for reflecting /* Now copy the pattern to pbuffer8 for use in 8-bit testing and for reflecting
in callouts. Convert from hex if requested (literal strings in quotes may be in callouts. Convert from hex if requested (literal strings in quotes may be
@ -5744,6 +5756,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
{ {
int i; int i;
clock_t time_taken = 0; clock_t time_taken = 0;
for (i = 0; i < timeit; i++) for (i = 0; i < timeit; i++)
{ {
clock_t start_time; clock_t start_time;
@ -5752,7 +5765,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
pat_patctl.options|use_forbid_utf, &errorcode, &erroroffset, pat_patctl.options|use_forbid_utf, &errorcode, &erroroffset,
use_pat_context); use_pat_context);
start_time = clock(); start_time = clock();
PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit); PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
time_taken += clock() - start_time; time_taken += clock() - start_time;
} }
total_jit_compile_time += time_taken; total_jit_compile_time += time_taken;
@ -8615,7 +8628,7 @@ while (argc > 1 && argv[op][0] == '-' && argv[op][1] != 0)
else if (strcmp(arg, "-jit") == 0 || strcmp(arg, "-jitverify") == 0) else if (strcmp(arg, "-jit") == 0 || strcmp(arg, "-jitverify") == 0)
{ {
if (arg[4] != 0) def_patctl.control |= CTL_JITVERIFY; if (arg[4] != 0) def_patctl.control |= CTL_JITVERIFY;
def_patctl.jit = 7; /* full & partial */ def_patctl.jit = JIT_DEFAULT; /* full & partial */
#ifndef SUPPORT_JIT #ifndef SUPPORT_JIT
fprintf(stderr, "** Warning: JIT support is not available: " fprintf(stderr, "** Warning: JIT support is not available: "
"-jit[verify] calls functions that do nothing.\n"); "-jit[verify] calls functions that do nothing.\n");

66
testdata/testinput10 vendored
View File

@ -1,7 +1,7 @@
# This set of tests is for UTF-8 support and Unicode property support, with # This set of tests is for UTF-8 support and Unicode property support, with
# relevance only for the 8-bit library. # relevance only for the 8-bit library.
# The next 4 patterns have UTF-8 errors # The next 5 patterns have UTF-8 errors
/[Ã]/utf /[Ã]/utf
@ -11,6 +11,8 @@
/‚‚‚‚‚‚‚Ã/utf /‚‚‚‚‚‚‚Ã/utf
/‚‚‚‚‚‚‚Ã/match_invalid_utf
# Now test subjects # Now test subjects
/badutf/utf /badutf/utf
@ -493,4 +495,66 @@
/(?(á/utf /(?(á/utf
# Invalid UTF-8 tests
/.../g,match_invalid_utf
abcd\x80wxzy\x80pqrs
abcd\x{80}wxzy\x80pqrs
/abc/match_invalid_utf
ab\x80ab\=ph
\= Expect no match
ab\x80cdef\=ph
/ab$/match_invalid_utf
ab\x80cdeab
\= Expect no match
ab\x80cde
/.../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
abcd\x{80}wxzy\x80xpqrs
/X$/match_invalid_utf
\= Expect no match
X\xc4
/(?<=..)X/match_invalid_utf,aftertext
AB\x80AQXYZ
AB\x80AQXYZ\=offset=5
AB\x80\x80AXYZXC\=offset=5
\= Expect no match
AB\x80XYZ
AB\x80XYZ\=offset=3
AB\xfeXYZ
AB\xffXYZ\=offset=3
AB\x80AXYZ
AB\x80AXYZ\=offset=4
AB\x80\x80AXYZ\=offset=5
/.../match_invalid_utf
AB\xc4CCC
\= Expect no match
A\x{d800}B
A\x{110000}B
A\xc4B
/\bX/match_invalid_utf
A\x80X
/\BX/match_invalid_utf
\= Expect no match
A\x80X
/(?<=...)X/match_invalid_utf
AAA\x80BBBXYZ
\= Expect no match
AAA\x80BXYZ
AAA\x80BBXYZ
# -------------------------------------
# End of testinput10 # End of testinput10

View File

@ -368,6 +368,4 @@
ab˙Az ab˙Az
ab\x{80000041}z ab\x{80000041}z
/\[()]{65535}/expand
# End of testinput11 # End of testinput11

45
testdata/testinput12 vendored
View File

@ -402,4 +402,49 @@
/(?(á/utf /(?(á/utf
# Invalid UTF-16/32 tests.
/.../g,match_invalid_utf
abcd\x{df00}wxzy\x{df00}pqrs
abcd\x{80}wxzy\x{df00}pqrs
/abc/match_invalid_utf
ab\x{df00}ab\=ph
\= Expect no match
ab\x{df00}cdef\=ph
/ab$/match_invalid_utf
ab\x{df00}cdeab
\= Expect no match
ab\x{df00}cde
/.../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
abcd\x{80}wxzy\x{df00}xpqrs
/X$/match_invalid_utf
\= Expect no match
X\x{df00}
/(?<=..)X/match_invalid_utf,aftertext
AB\x{df00}AQXYZ
AB\x{df00}AQXYZ\=offset=5
AB\x{df00}\x{df00}AXYZXC\=offset=5
\= Expect no match
AB\x{df00}XYZ
AB\x{df00}XYZ\=offset=3
AB\x{df00}AXYZ
AB\x{df00}AXYZ\=offset=4
AB\x{df00}\x{df00}AXYZ\=offset=5
/.../match_invalid_utf
\= Expect no match
A\x{d800}B
A\x{110000}B
# ----------------------------------------------------
# End of testinput12 # End of testinput12

4
testdata/testinput8 vendored
View File

@ -182,4 +182,8 @@
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

2
testdata/testinput9 vendored
View File

@ -260,6 +260,4 @@
/(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/ /(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
/\[()]{65535}/expand
# End of testinput9 # End of testinput9

108
testdata/testoutput10 vendored
View File

@ -1,7 +1,7 @@
# This set of tests is for UTF-8 support and Unicode property support, with # This set of tests is for UTF-8 support and Unicode property support, with
# relevance only for the 8-bit library. # relevance only for the 8-bit library.
# The next 4 patterns have UTF-8 errors # The next 5 patterns have UTF-8 errors
/[Ã]/utf /[Ã]/utf
Failed: error -8 at offset 1: UTF-8 error: byte 2 top bits not 0x80 Failed: error -8 at offset 1: UTF-8 error: byte 2 top bits not 0x80
@ -15,6 +15,9 @@ Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
/‚‚‚‚‚‚‚Ã/utf /‚‚‚‚‚‚‚Ã/utf
Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
/‚‚‚‚‚‚‚Ã/match_invalid_utf
Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
# Now test subjects # Now test subjects
/badutf/utf /badutf/utf
@ -1651,4 +1654,107 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf /(?(á/utf
Failed: error 142 at offset 5: syntax error in subpattern name (missing terminator?) Failed: error 142 at offset 5: syntax error in subpattern name (missing terminator?)
# Invalid UTF-8 tests
/.../g,match_invalid_utf
abcd\x80wxzy\x80pqrs
0: abc
0: wxz
0: pqr
abcd\x{80}wxzy\x80pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/abc/match_invalid_utf
ab\x80ab\=ph
Partial match: ab
\= Expect no match
ab\x80cdef\=ph
No match
/ab$/match_invalid_utf
ab\x80cdeab
0: ab
\= Expect no match
ab\x80cde
No match
/.../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x80pqrs
0: zy
abcd\x{80}wxzy\x80xpqrs
0: zy
0: pq
/X$/match_invalid_utf
\= Expect no match
X\xc4
No match
/(?<=..)X/match_invalid_utf,aftertext
AB\x80AQXYZ
0: X
0+ YZ
AB\x80AQXYZ\=offset=5
0: X
0+ YZ
AB\x80\x80AXYZXC\=offset=5
0: X
0+ C
\= Expect no match
AB\x80XYZ
No match
AB\x80XYZ\=offset=3
No match
AB\xfeXYZ
No match
AB\xffXYZ\=offset=3
No match
AB\x80AXYZ
No match
AB\x80AXYZ\=offset=4
No match
AB\x80\x80AXYZ\=offset=5
No match
/.../match_invalid_utf
AB\xc4CCC
0: CCC
\= Expect no match
A\x{d800}B
No match
A\x{110000}B
No match
A\xc4B
No match
/\bX/match_invalid_utf
A\x80X
0: X
/\BX/match_invalid_utf
\= Expect no match
A\x80X
No match
/(?<=...)X/match_invalid_utf
AAA\x80BBBXYZ
0: X
\= Expect no match
AAA\x80BXYZ
No match
AAA\x80BBXYZ
No match
# -------------------------------------
# End of testinput10 # End of testinput10

View File

@ -661,7 +661,4 @@ Subject length lower bound = 1
ab˙Az ab˙Az
ab\x{80000041}z ab\x{80000041}z
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput11 # End of testinput11

View File

@ -667,6 +667,4 @@ Subject length lower bound = 1
ab\x{80000041}z ab\x{80000041}z
0: ab\x{80000041}z 0: ab\x{80000041}z
/\[()]{65535}/expand
# End of testinput11 # End of testinput11

View File

@ -1502,4 +1502,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf /(?(á/utf
Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?) Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)
# Invalid UTF-16/32 tests.
/.../g,match_invalid_utf
abcd\x{df00}wxzy\x{df00}pqrs
0: abc
0: wxz
0: pqr
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/abc/match_invalid_utf
ab\x{df00}ab\=ph
Partial match: ab
\= Expect no match
ab\x{df00}cdef\=ph
No match
/ab$/match_invalid_utf
ab\x{df00}cdeab
0: ab
\= Expect no match
ab\x{df00}cde
No match
/.../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: zy
abcd\x{80}wxzy\x{df00}xpqrs
0: zy
0: pq
/X$/match_invalid_utf
\= Expect no match
X\x{df00}
No match
/(?<=..)X/match_invalid_utf,aftertext
AB\x{df00}AQXYZ
0: X
0+ YZ
AB\x{df00}AQXYZ\=offset=5
0: X
0+ YZ
AB\x{df00}\x{df00}AXYZXC\=offset=5
0: X
0+ C
\= Expect no match
AB\x{df00}XYZ
No match
AB\x{df00}XYZ\=offset=3
No match
AB\x{df00}AXYZ
No match
AB\x{df00}AXYZ\=offset=4
No match
AB\x{df00}\x{df00}AXYZ\=offset=5
No match
/.../match_invalid_utf
\= Expect no match
A\x{d800}B
No match
A\x{110000}B
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
# ----------------------------------------------------
# End of testinput12 # End of testinput12

View File

@ -1500,4 +1500,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf /(?(á/utf
Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?) Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)
# Invalid UTF-16/32 tests.
/.../g,match_invalid_utf
abcd\x{df00}wxzy\x{df00}pqrs
0: abc
0: wxz
0: pqr
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/abc/match_invalid_utf
ab\x{df00}ab\=ph
Partial match: ab
\= Expect no match
ab\x{df00}cdef\=ph
No match
/ab$/match_invalid_utf
ab\x{df00}cdeab
0: ab
\= Expect no match
ab\x{df00}cde
No match
/.../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: abc
0: d\x{80}w
0: xzy
0: pqr
/(?<=x)../g,match_invalid_utf
abcd\x{80}wxzy\x{df00}pqrs
0: zy
abcd\x{80}wxzy\x{df00}xpqrs
0: zy
0: pq
/X$/match_invalid_utf
\= Expect no match
X\x{df00}
No match
/(?<=..)X/match_invalid_utf,aftertext
AB\x{df00}AQXYZ
0: X
0+ YZ
AB\x{df00}AQXYZ\=offset=5
0: X
0+ YZ
AB\x{df00}\x{df00}AXYZXC\=offset=5
0: X
0+ C
\= Expect no match
AB\x{df00}XYZ
No match
AB\x{df00}XYZ\=offset=3
No match
AB\x{df00}AXYZ
No match
AB\x{df00}AXYZ\=offset=4
No match
AB\x{df00}\x{df00}AXYZ\=offset=5
No match
/.../match_invalid_utf
\= Expect no match
A\x{d800}B
No match
A\x{110000}B
No match
# ----------------------------------------------------
# End of testinput12 # End of testinput12

View File

@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput8 # End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

View File

@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput8 # End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

View File

@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
#pattern -fullbincode
/\[()]{65535}/expand
# End of testinput8 # End of testinput8

View File

@ -367,7 +367,4 @@ Failed: error 134 at offset 14: character code point value in \x{} or \o{} is to
/(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/ /(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN) Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
/\[()]{65535}/expand
Failed: error 120 at offset 131070: regular expression is too large
# End of testinput9 # End of testinput9