Implement support for invalid UTF in the pcre2_match() interpreter.
parent 2ad4329f83
commit 16c046ce50
|
@ -14,6 +14,10 @@ detects invalid characters in the 0xd800-0xdfff range.
|
|||
|
||||
3. Fix minor typo bug in JIT compile when \X is used in a non-UTF string.
|
||||
|
||||
4. Add support for matching in invalid UTF strings to the pcre2_match()
|
||||
interpreter, and integrate with the existing JIT support via the new
|
||||
PCRE2_MATCH_INVALID_UTF compile-time option.
|
||||
|
||||
|
||||
Version 10.33 16-April-2019
|
||||
---------------------------
|
||||
|
|
|
@ -65,6 +65,7 @@ The option bits are:
|
|||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
|
|
|
@ -40,8 +40,12 @@ bits:
|
|||
PCRE2_JIT_COMPLETE compile code for full matching
|
||||
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
||||
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
||||
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
|
||||
</pre>
|
||||
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
|
||||
superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF. The old
|
||||
option is deprecated and may be removed in future.
|
||||
</P>
|
||||
<P>
|
||||
The yield of the function is 0 for success, or a negative error code otherwise.
|
||||
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
|
||||
if an unknown bit is set in <i>options</i>.
|
||||
|
|
|
@ -1347,11 +1347,12 @@ and <b>pcre2_compile()</b> returns a non-NULL value.
|
|||
<P>
|
||||
There are nearly 100 positive error codes that <b>pcre2_compile()</b> may return
|
||||
if it finds an error in the pattern. There are also some negative error codes
|
||||
that are used for invalid UTF strings. These are the same as given by
|
||||
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described in the
|
||||
that are used for invalid UTF strings when validity checking is in force. These
|
||||
are the same as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and
|
||||
are described in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page. There is no separate documentation for the positive error codes, because
|
||||
the textual error messages that are obtained by calling the
|
||||
documentation. There is no separate documentation for the positive error codes,
|
||||
because the textual error messages that are obtained by calling the
|
||||
<b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
|
||||
message"
|
||||
<a href="#geterrormessage">below)</a>
|
||||
|
@ -1615,10 +1616,18 @@ expression engine is not the most efficient way of doing it. If you are doing a
|
|||
lot of literal matching and are worried about efficiency, you should consider
|
||||
using other approaches. The only other main options that are allowed with
|
||||
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
|
||||
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
|
||||
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
|
||||
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||
error.
|
||||
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
|
||||
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
|
||||
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
|
||||
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
|
||||
<pre>
|
||||
PCRE2_MATCH_INVALID_UTF
|
||||
</pre>
|
||||
This option forces PCRE2_UTF (see below) and also enables support for matching
|
||||
by <b>pcre2_match()</b> in subject strings that contain invalid UTF sequences.
|
||||
This facility is not supported for DFA matching. For details, see the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
<pre>
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
</pre>
|
||||
|
@ -2653,15 +2662,22 @@ of JIT; it forces matching to be done by the interpreter.
|
|||
PCRE2_NO_UTF_CHECK
|
||||
</pre>
|
||||
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
|
||||
string is checked by default when <b>pcre2_match()</b> is subsequently called.
|
||||
If a non-zero starting offset is given, the check is applied only to that part
|
||||
of the subject that could be inspected during matching, and there is a check
|
||||
that the starting offset points to the first code unit of a character or to the
|
||||
end of the subject. If there are no lookbehind assertions in the pattern, the
|
||||
check starts at the starting offset. Otherwise, it starts at the length of the
|
||||
longest lookbehind before the starting offset, or at the start of the subject
|
||||
if there are not that many characters before the starting offset. Note that the
|
||||
sequences \b and \B are one-character lookbehinds.
|
||||
string is checked unless PCRE2_NO_UTF_CHECK is passed to <b>pcre2_match()</b> or
|
||||
PCRE2_MATCH_INVALID_UTF was passed to <b>pcre2_compile()</b>. The latter special
|
||||
case is discussed in detail in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
In the default case, if a non-zero starting offset is given, the check is
|
||||
applied only to that part of the subject that could be inspected during
|
||||
matching, and there is a check that the starting offset points to the first
|
||||
code unit of a character or to the end of the subject. If there are no
|
||||
lookbehind assertions in the pattern, the check starts at the starting offset.
|
||||
Otherwise, it starts at the length of the longest lookbehind before the
|
||||
starting offset, or at the start of the subject if there are not that many
|
||||
characters before the starting offset. Note that the sequences \b and \B are
|
||||
one-character lookbehinds.
|
||||
</P>
|
||||
<P>
|
||||
The check is carried out before any other processing takes place, and a
|
||||
|
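The default check range described in the hunk above can be illustrated with a small C sketch. This is not PCRE2's implementation: the function names `utf8_check_start` and `utf8_offset_is_char_start` and their parameters are hypothetical, and the sketch assumes UTF-8 (8-bit code units) only. It shows one way to back up a given number of characters from the starting offset, stopping at the start of the subject, and to test that an offset is not in the middle of a character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API.
 * Returns the code unit offset at which a UTF-8 validity check would begin:
 * max_lookbehind characters before start_offset, or 0 if there are not that
 * many characters before start_offset. */
size_t utf8_check_start(const unsigned char *subject,
                        size_t start_offset,
                        size_t max_lookbehind)
{
    size_t pos = start_offset;

    assert(subject != NULL || start_offset == 0);

    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                  /* step back one code unit */
        /* keep stepping back while on a continuation byte (10xxxxxx) */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}

/* Hypothetical helper: nonzero if offset is the end of the subject or the
 * first code unit of a character (that is, not a continuation byte). */
int utf8_offset_is_char_start(const unsigned char *subject,
                              size_t length, size_t offset)
{
    if (offset > length) return 0;
    if (offset == length) return 1;
    return (subject[offset] & 0xC0) != 0x80;
}
```

With a pattern that contains only \b or \B, `max_lookbehind` would be 1, so the check would begin one character before the starting offset.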
@ -2674,19 +2690,20 @@ and
|
|||
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
|
||||
in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
If you know that your subject is valid, and you want to skip these checks for
|
||||
If you know that your subject is valid, and you want to skip this check for
|
||||
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
|
||||
<b>pcre2_match()</b>. You might want to do this for the second and subsequent
|
||||
calls to <b>pcre2_match()</b> if you are making repeated calls to find other
|
||||
calls to <b>pcre2_match()</b> if you are making repeated calls to find multiple
|
||||
matches in the same subject string.
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
|
||||
<b>Warning:</b> Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
|
||||
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
|
||||
string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
|
||||
Your program may crash or loop indefinitely.
|
||||
Your program may crash or loop indefinitely or give wrong results.
|
||||
<pre>
|
||||
PCRE2_PARTIAL_HARD
|
||||
PCRE2_PARTIAL_SOFT
|
||||
|
@ -3771,6 +3788,12 @@ a backreference.
|
|||
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
|
||||
that uses a backreference for the condition, or a test for recursion in a
|
||||
specific capture group. These are not supported.
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_UINVALID_UTF
|
||||
</pre>
|
||||
This return is given if <b>pcre2_dfa_match()</b> is called for a pattern that
|
||||
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
|
||||
matching.
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_WSSIZE
|
||||
</pre>
|
||||
|
@ -3808,7 +3831,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 14 February 2019
|
||||
Last updated: 23 May 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -147,25 +147,29 @@ pattern.
|
|||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
|
||||
<P>
|
||||
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
|
||||
function expects its subject string to be a valid sequence of UTF code units.
|
||||
If it is not, the result is undefined. This is also true by default of matching
|
||||
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
|
||||
<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
|
||||
UTF is compiled.
|
||||
When a pattern is compiled with the PCRE2_UTF option, subject strings are
|
||||
normally expected to be a valid sequence of UTF code units. By default, this is
|
||||
checked at the start of matching and an error is generated if invalid UTF is
|
||||
detected. The PCRE2_NO_UTF_CHECK option can be passed to <b>pcre2_match()</b> to
|
||||
skip the check (for improved performance) if you are sure that a subject string
|
||||
is valid. If this option is used with an invalid string, the result is
|
||||
undefined.
|
||||
</P>
|
||||
<P>
|
||||
In this mode, an invalid code unit sequence never matches any pattern item. It
|
||||
does not match dot, it does not match \p{Any}, it does not even match negative
|
||||
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
|
||||
sequence while moving the current point backwards. In other words, an invalid
|
||||
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
|
||||
invalid sequence causes an immediate backtrack.
|
||||
However, a way of running matches on strings that may contain invalid UTF
|
||||
sequences is available. Calling <b>pcre2_compile()</b> with the
|
||||
PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
|
||||
<b>pcre2_match()</b> to support invalid UTF, and, if <b>pcre2_jit_compile()</b>
|
||||
is called, the compiled JIT code also supports invalid UTF. Details of how this
|
||||
support works, in both the JIT and the interpretive cases, are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
Using this option, an application can run matches in arbitrary data, knowing
|
||||
that any matched strings that are returned will be valid UTF. This can be
|
||||
useful when searching for text in executable or other binary files.
|
||||
There is also an obsolete option for <b>pcre2_jit_compile()</b> called
|
||||
PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
|
||||
It is superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF
|
||||
and should no longer be used. It may be removed in future.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
|
||||
<P>
|
||||
|
@ -461,7 +465,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 06 March 2019
|
||||
Last updated: 23 May 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -188,6 +188,10 @@ code unit) at a time, for all active paths through the tree.
|
|||
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
||||
</P>
|
||||
<P>
|
||||
10. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not
|
||||
supported by <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
||||
<P>
|
||||
Using the alternative matching algorithm provides the following advantages:
|
||||
|
@ -219,7 +223,8 @@ because it has to search for all possible matches, but is also because it is
|
|||
less susceptible to optimization.
|
||||
</P>
|
||||
<P>
|
||||
2. Capturing parentheses, backreferences, and script runs are not supported.
|
||||
2. Capturing parentheses, backreferences, script runs, and matching within
|
||||
invalid UTF strings are not supported.
|
||||
</P>
|
||||
<P>
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
|
@ -236,9 +241,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 10 October 2018
|
||||
Last updated: 23 May 2019
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -91,10 +91,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
|
|||
specified for the 32-bit library, in which case it constrains the character
|
||||
values to valid Unicode code points. To process UTF strings, PCRE2 must be
|
||||
built to include Unicode support (which is the default). When using UTF strings
|
||||
you must either call the compiling function with the PCRE2_UTF option, or the
|
||||
pattern must start with the special sequence (*UTF), which is equivalent to
|
||||
setting the relevant option. How setting a UTF mode affects pattern matching is
|
||||
mentioned in several places below. There is also a summary of features in the
|
||||
you must either call the compiling function with one or both of the PCRE2_UTF
|
||||
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
|
||||
sequence (*UTF), which is equivalent to setting PCRE2_UTF. How
|
||||
setting a UTF mode affects pattern matching is mentioned in several places
|
||||
below. There is also a summary of features in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
</P>
|
||||
|
@ -428,11 +429,11 @@ There may be any number of hexadecimal digits. This syntax is from ECMAScript
|
|||
6.
|
||||
</P>
|
||||
<P>
|
||||
The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
|
||||
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
|
||||
\N{name} to specify characters by Unicode name; PCRE2 does not support this.
|
||||
Note that when \N is not followed by an opening brace (curly bracket) it has
|
||||
an entirely different meaning, matching any character that is not a newline.
|
||||
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
|
||||
UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
|
||||
does not support this. Note that when \N is not followed by an opening brace
|
||||
(curly bracket) it has an entirely different meaning, matching any character
|
||||
that is not a newline.
|
||||
</P>
|
||||
<P>
|
||||
There are some legacy applications where the escape sequence \r is expected to
|
||||
|
@ -1360,7 +1361,7 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
|
|||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used).
|
||||
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
|
||||
</P>
|
||||
<P>
|
||||
An application can lock out the use of \C by setting the
|
||||
|
@ -3727,7 +3728,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 12 February 2019
|
||||
Last updated: 23 May 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -613,6 +613,7 @@ for a description of the effects of these options.
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
literal set PCRE2_LITERAL
|
||||
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
|
@ -2078,7 +2079,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 11 March 2019
|
||||
Last updated: 23 May 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -16,22 +16,33 @@ please consult the man page, in case the conversion went wrong.
|
|||
UNICODE AND UTF SUPPORT
|
||||
</b><br>
|
||||
<P>
|
||||
When PCRE2 is built with Unicode support (which is the default), it has
|
||||
knowledge of Unicode character properties and can process text strings in
|
||||
UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
|
||||
default, PCRE2 assumes that one code unit is one character. To process a
|
||||
pattern as a UTF string, where a character may require more than one code unit,
|
||||
you must call
|
||||
<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
|
||||
with the PCRE2_UTF option flag, or the pattern must start with the sequence
|
||||
(*UTF). When either of these is the case, both the pattern and any subject
|
||||
strings that are matched against it are treated as UTF strings instead of
|
||||
strings of individual one-code-unit characters. There are also some other
|
||||
changes to the way characters are handled, as documented below.
|
||||
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
||||
can build it without, in which case the library will be smaller. With Unicode
|
||||
support, PCRE2 has knowledge of Unicode character properties and can process
|
||||
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
|
||||
width), but this is not the default. Unless specifically requested, PCRE2
|
||||
treats each code unit in a string as one character.
|
||||
</P>
|
||||
<P>
|
||||
If you do not need Unicode support you can build PCRE2 without it, in which
|
||||
case the library will be smaller.
|
||||
There are two ways of telling PCRE2 to switch to UTF mode, where characters may
|
||||
consist of more than one code unit and the range of values is constrained. The
|
||||
program can call
|
||||
<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
|
||||
with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
|
||||
However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
|
||||
That is, the programmer can prevent the supplier of the pattern from switching
|
||||
to UTF mode.
|
||||
</P>
|
||||
<P>
|
||||
Note that the PCRE2_MATCH_INVALID_UTF option (see
|
||||
<a href="#matchinvalid">below)</a>
|
||||
forces PCRE2_UTF to be set.
|
||||
</P>
|
||||
<P>
|
||||
In UTF mode, both the pattern and any subject strings that are matched against
|
||||
it are treated as UTF strings instead of strings of individual one-code-unit
|
||||
characters. There are also some other changes to the way characters are
|
||||
handled, as documented below.
|
||||
</P>
|
||||
<br><b>
|
||||
UNICODE PROPERTY SUPPORT
|
||||
|
@ -63,22 +74,22 @@ also recognized; larger ones can be coded using \o{...}.
|
|||
<P>
|
||||
The escape sequence \N{U+<hex digits>} is recognized as another way of
|
||||
specifying a Unicode character by code point in a UTF mode. It is not allowed
|
||||
in non-UTF modes.
|
||||
in non-UTF mode.
|
||||
</P>
|
||||
<P>
|
||||
In UTF modes, repeat quantifiers apply to complete UTF characters, not to
|
||||
In UTF mode, repeat quantifiers apply to complete UTF characters, not to
|
||||
individual code units.
|
||||
</P>
|
||||
<P>
|
||||
In UTF modes, the dot metacharacter matches one UTF character instead of a
|
||||
In UTF mode, the dot metacharacter matches one UTF character instead of a
|
||||
single code unit.
|
||||
</P>
|
||||
<P>
|
||||
In UTF modes, capture group names are not restricted to ASCII, and may contain
|
||||
In UTF mode, capture group names are not restricted to ASCII, and may contain
|
||||
any Unicode letters and decimal digits, as well as underscore.
|
||||
</P>
|
||||
<P>
|
||||
The escape sequence \C can be used to match a single code unit in a UTF mode,
|
||||
The escape sequence \C can be used to match a single code unit in UTF mode,
|
||||
but its use can lead to some strange effects because it breaks up multi-unit
|
||||
characters (see the description of \C in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
|
@ -93,7 +104,7 @@ may consist of more than one code unit. The use of \C in these modes provokes
|
|||
a match-time error. Also, the JIT optimization does not support \C in these
|
||||
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
|
||||
contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
|
||||
the matching will be carried out by the normal interpretive function.
|
||||
the matching will be carried out by the interpretive function.
|
||||
</P>
|
||||
<P>
|
||||
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
|
||||
|
@ -123,14 +134,14 @@ However, the special horizontal and vertical white space matching escapes (\h,
|
|||
not PCRE2_UCP is set.
|
||||
</P>
|
||||
<br><b>
|
||||
CASE-EQUIVALENCE IN UTF MODES
|
||||
CASE-EQUIVALENCE IN UTF MODE
|
||||
</b><br>
|
||||
<P>
|
||||
Case-insensitive matching in a UTF mode makes use of Unicode properties except
|
||||
Case-insensitive matching in UTF mode makes use of Unicode properties except
|
||||
for characters whose code points are less than 128 and that have at most two
|
||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||
few Unicode characters such as Greek sigma have more than two code points that
|
||||
are case-equivalent, and these are treated as such.
|
||||
are case-equivalent, and these are treated specially.
|
||||
<a name="scriptruns"></a></P>
|
||||
<br><b>
|
||||
SCRIPT RUNS
|
||||
|
@ -248,7 +259,7 @@ VALIDITY OF UTF STRINGS
|
|||
<P>
|
||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||
are (by default) checked for validity on entry to the relevant functions. If an
|
||||
invalid UTF string is passed, an negative error code is returned. The code unit
|
||||
invalid UTF string is passed, a negative error code is returned. The code unit
|
||||
offset to the offending character can be extracted from the match data block by
|
||||
calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
|
||||
error.
|
||||
|
@ -263,17 +274,16 @@ only valid UTF code unit sequences.
|
|||
</P>
|
||||
<P>
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||
is usually undefined and your program may crash or loop indefinitely. There is,
|
||||
however, one mode of matching that can handle invalid UTF subject strings. This
|
||||
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
|
||||
when calling <b>pcre2_jit_compile()</b>. For details, see the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation.
|
||||
is undefined and your program may crash or loop indefinitely or give incorrect
|
||||
results. There is, however, one mode of matching that can handle invalid UTF
|
||||
subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
|
||||
<b>pcre2_compile()</b> and is discussed below in the next section. The rest of
|
||||
this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
|
||||
</P>
|
||||
<P>
|
||||
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
||||
the pattern; it does not also apply to subject strings. If you want to disable
|
||||
the check for a subject string you must pass this same option to
|
||||
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the UTF check
|
||||
for the pattern; it does not also apply to subject strings. If you want to
|
||||
disable the check for a subject string you must pass this same option to
|
||||
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -352,7 +362,7 @@ these code points are excluded by RFC 3629.
|
|||
<pre>
|
||||
PCRE2_ERROR_UTF8_ERR13
|
||||
</pre>
|
||||
A 4-byte character has a value greater than 0x10fff; these code points are
|
||||
A 4-byte character has a value greater than 0x10ffff; these code points are
|
||||
excluded by RFC 3629.
|
||||
<pre>
|
||||
PCRE2_ERROR_UTF8_ERR14
|
||||
|
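The corrected limit in the hunk above (0x10ffff rather than 0x10fff) can be checked with a short C sketch. The helper names `utf8_decode4` and `utf8_4byte_out_of_range` are hypothetical and are not PCRE2 functions. The sketch assumes the caller has already confirmed that the four bytes have the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx bit patterns, and only illustrates the range test that PCRE2_ERROR_UTF8_ERR13 describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API.
 * Combines the payload bits of a 4-byte UTF-8 sequence into a code point.
 * The caller is assumed to have validated the continuation bytes. */
uint32_t utf8_decode4(const unsigned char *p)
{
    assert(p != 0 && (p[0] & 0xF8) == 0xF0);    /* lead byte is 11110xxx */
    return ((uint32_t)(p[0] & 0x07) << 18) |
           ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6)  |
            (uint32_t)(p[3] & 0x3F);
}

/* Nonzero if the decoded value is greater than 0x10ffff, the largest
 * Unicode code point; such values are excluded by RFC 3629. */
int utf8_4byte_out_of_range(const unsigned char *p)
{
    return utf8_decode4(p) > 0x10FFFFu;
}
```

The sequence F4 8F BF BF decodes to exactly 0x10ffff and is in range, whereas F4 90 80 80 decodes to 0x110000 and is rejected.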
@ -405,7 +415,59 @@ The following negative error codes are given for invalid UTF-32 strings:
|
|||
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
|
||||
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
|
||||
|
||||
</PRE>
|
||||
<a name="matchinvalid"></a></PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
MATCHING IN INVALID UTF STRINGS
|
||||
</b><br>
|
||||
<P>
|
||||
You can run pattern matches on subject strings that may contain invalid UTF
|
||||
sequences if you call <b>pcre2_compile()</b> with the PCRE2_MATCH_INVALID_UTF
|
||||
option. This is supported by <b>pcre2_match()</b>, including JIT matching, but
|
||||
not by <b>pcre2_dfa_match()</b>. When PCRE2_MATCH_INVALID_UTF is set, it forces
|
||||
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
|
||||
valid UTF string.
|
||||
</P>
|
||||
<P>
|
||||
Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
|
||||
generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
|
||||
generate different code. If JIT is not used, the option affects the behaviour
|
||||
of the interpretive code in <b>pcre2_match()</b>. When PCRE2_MATCH_INVALID_UTF
|
||||
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
|
||||
</P>
|
||||
<P>
|
||||
In this mode, an invalid code unit sequence in the subject never matches any
|
||||
pattern item. It does not match dot, it does not match \p{Any}, it does not
|
||||
even match negative items such as [^X]. A lookbehind assertion fails if it
|
||||
encounters an invalid sequence while moving the current point backwards. In
|
||||
other words, an invalid UTF code unit sequence acts as a barrier which no match
|
||||
can cross.
|
||||
</P>
|
||||
<P>
|
||||
You can also think of this as the subject being split up into fragments of
|
||||
valid UTF, delimited internally by invalid code unit sequences. The pattern is
|
||||
matched fragment by fragment. The result of a successful match, however, is
|
||||
given as code unit offsets in the entire subject string in the usual way. There
|
||||
are a few points to consider:
|
||||
</P>
|
||||
<P>
|
||||
The internal boundaries are not interpreted as the beginnings or ends of lines
|
||||
and so do not match circumflex or dollar characters in the pattern.
|
||||
</P>
|
||||
<P>
|
||||
If <b>pcre2_match()</b> is called with an offset that points to an invalid
|
||||
UTF-sequence, that sequence is skipped, and the match starts at the next valid
|
||||
UTF character, or the end of the subject.
|
||||
</P>
|
||||
<P>
|
||||
At internal fragment boundaries, \b and \B behave in the same way as at the
|
||||
beginning and end of the subject. For example, a sequence such as \bWORD\b
|
||||
would match an instance of WORD that is surrounded by invalid UTF code units.
|
||||
</P>
|
||||
<P>
|
||||
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
|
||||
data, knowing that any matched strings that are returned are valid UTF. This
|
||||
can be useful when searching for UTF text in executable or other binary files.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
|
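The "fragments of valid UTF" model in the hunk above can be sketched in C as follows. This is an illustration under stated assumptions, not how PCRE2 is implemented: it handles UTF-8 only, the names `utf8_valid_char_len`, `fragment`, and `next_valid_fragment` are hypothetical, and it skips invalid code units one at a time, which may differ from how PCRE2 delimits an invalid sequence. The validity rules follow RFC 3629 (no overlong forms, no surrogates, nothing above 0x10ffff).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API.
 * Returns the length (1 to 4) of the valid UTF-8 character starting at s[0],
 * given n remaining code units, or 0 if no valid character starts there. */
size_t utf8_valid_char_len(const unsigned char *s, size_t n)
{
    unsigned char c, lo = 0x80, hi = 0xBF;  /* allowed range for 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;                 /* ASCII */
    if (c >= 0xC2 && c <= 0xDF) {
        len = 2;
    } else if (c >= 0xE0 && c <= 0xEF) {
        len = 3;
        if (c == 0xE0) lo = 0xA0;           /* reject overlong 3-byte forms */
        if (c == 0xED) hi = 0x9F;           /* reject surrogates d800-dfff */
    } else if (c >= 0xF0 && c <= 0xF4) {
        len = 4;
        if (c == 0xF0) lo = 0x90;           /* reject overlong 4-byte forms */
        if (c == 0xF4) hi = 0x8F;           /* reject values above 0x10ffff */
    } else {
        return 0;                           /* stray continuation or bad lead */
    }

    if (n < len) return 0;                  /* truncated character */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    return len;
}

/* A run of valid UTF-8, as code unit offsets [start, end) in the subject. */
typedef struct { size_t start, end; } fragment;

/* Finds the next maximal run of valid UTF-8 at or after *pos.
 * Returns 1 and fills *out, or returns 0 when no valid characters remain. */
int next_valid_fragment(const unsigned char *s, size_t n,
                        size_t *pos, fragment *out)
{
    size_t i, len;

    assert(s != NULL && pos != NULL && out != NULL);
    i = *pos;

    /* skip invalid code units, which act as a barrier between fragments */
    while (i < n && utf8_valid_char_len(s + i, n - i) == 0)
        i++;
    if (i >= n) { *pos = n; return 0; }

    out->start = i;
    while (i < n && (len = utf8_valid_char_len(s + i, n - i)) != 0)
        i += len;
    out->end = i;
    *pos = i;
    return 1;
}
```

Offsets are reported relative to the whole subject, matching the statement above that match results are given as code unit offsets in the entire subject string.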
@ -422,7 +484,7 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 06 March 2019
|
||||
Last updated: 24 May 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
2931
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_COMPILE 3 "11 February 2019" "PCRE2 10.33"
|
||||
.TH PCRE2_COMPILE 3 "23 May 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -53,6 +53,7 @@ The option bits are:
|
|||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
|
||||
.TH PCRE2_JIT_COMPILE 3 "23 May 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -29,8 +29,11 @@ bits:
|
|||
PCRE2_JIT_COMPLETE compile code for full matching
|
||||
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
||||
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
||||
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
|
||||
.sp
|
||||
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
|
||||
superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF. The old
|
||||
option is deprecated and may be removed in future.
|
||||
.P
|
||||
The yield of the function is 0 for success, or a negative error code otherwise.
|
||||
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
|
||||
if an unknown bit is set in \fIoptions\fP.
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "14 February 2019" "PCRE2 10.33"
|
||||
.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -1285,13 +1285,14 @@ and \fBpcre2_compile()\fP returns a non-NULL value.
|
|||
.P
|
||||
There are nearly 100 positive error codes that \fBpcre2_compile()\fP may return
|
||||
if it finds an error in the pattern. There are also some negative error codes
|
||||
that are used for invalid UTF strings. These are the same as given by
|
||||
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described in the
|
||||
that are used for invalid UTF strings when validity checking is in force. These
|
||||
are the same as given by \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and
|
||||
are described in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
page. There is no separate documentation for the positive error codes, because
|
||||
the textual error messages that are obtained by calling the
|
||||
documentation. There is no separate documentation for the positive error codes,
|
||||
because the textual error messages that are obtained by calling the
|
||||
\fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
|
||||
message"
|
||||
.\" HTML <a href="#geterrormessage">
|
||||
|
@ -1557,10 +1558,20 @@ expression engine is not the most efficient way of doing it. If you are doing a
|
|||
lot of literal matching and are worried about efficiency, you should consider
|
||||
using other approaches. The only other main options that are allowed with
|
||||
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
|
||||
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
|
||||
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
|
||||
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||
error.
|
||||
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
|
||||
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
|
||||
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
|
||||
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
|
||||
.sp
|
||||
PCRE2_MATCH_INVALID_UTF
|
||||
.sp
|
||||
This option forces PCRE2_UTF (see below) and also enables support for matching
|
||||
by \fBpcre2_match()\fP in subject strings that contain invalid UTF sequences.
|
||||
This facility is not supported for DFA matching. For details, see the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
documentation.
|
||||
.sp
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
.sp
|
||||
|
@ -2635,15 +2646,23 @@ of JIT; it forces matching to be done by the interpreter.
|
|||
PCRE2_NO_UTF_CHECK
|
||||
.sp
|
||||
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
|
||||
string is checked by default when \fBpcre2_match()\fP is subsequently called.
|
||||
If a non-zero starting offset is given, the check is applied only to that part
|
||||
of the subject that could be inspected during matching, and there is a check
|
||||
that the starting offset points to the first code unit of a character or to the
|
||||
end of the subject. If there are no lookbehind assertions in the pattern, the
|
||||
check starts at the starting offset. Otherwise, it starts at the length of the
|
||||
longest lookbehind before the starting offset, or at the start of the subject
|
||||
if there are not that many characters before the starting offset. Note that the
|
||||
sequences \eb and \eB are one-character lookbehinds.
|
||||
string is checked unless PCRE2_NO_UTF_CHECK is passed to \fBpcre2_match()\fP or
|
||||
PCRE2_MATCH_INVALID_UTF was passed to \fBpcre2_compile()\fP. The latter special
|
||||
case is discussed in detail in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
documentation.
|
||||
.P
|
||||
In the default case, if a non-zero starting offset is given, the check is
|
||||
applied only to that part of the subject that could be inspected during
|
||||
matching, and there is a check that the starting offset points to the first
|
||||
code unit of a character or to the end of the subject. If there are no
|
||||
lookbehind assertions in the pattern, the check starts at the starting offset.
|
||||
Otherwise, it starts at the length of the longest lookbehind before the
|
||||
starting offset, or at the start of the subject if there are not that many
|
||||
characters before the starting offset. Note that the sequences \eb and \eB are
|
||||
one-character lookbehinds.
|
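The default-case rule above (back up from the starting offset by the longest lookbehind, but never past the start of the subject) can be sketched for the 8-bit library as follows. This is only an illustration: `utf8_check_start` and its parameters are hypothetical names, not part of the PCRE2 API, and the sketch assumes that a character boundary is found by skipping UTF-8 continuation bytes (those of the form 10xxxxxx).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function: returns the code unit offset at
   which a UTF-8 validity check would begin, given the starting offset and the
   length (in characters) of the longest lookbehind in the pattern. */
size_t utf8_check_start(const uint8_t *subject, size_t start_offset,
                        size_t max_lookbehind)
{
    size_t pos = start_offset;
    size_t i;

    /* Step back one character per iteration, stopping at the subject start. */
    for (i = 0; i < max_lookbehind && pos > 0; i++) {
        pos--;
        /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xC0u) == 0x80u)
            pos--;
    }
    return pos;
}
```

With no lookbehind the function returns the starting offset unchanged. A pattern that uses \b or \B counts as having a one-character lookbehind, so the check would begin one character earlier.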
||||
.P
|
||||
The check is carried out before any other processing takes place, and a
|
||||
negative error code is returned if the check fails. There are several UTF error
|
||||
|
@ -2666,17 +2685,18 @@ in the
|
|||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
page.
|
||||
documentation.
|
||||
.P
|
||||
If you know that your subject is valid, and you want to skip these checks for
|
||||
If you know that your subject is valid, and you want to skip this check for
|
||||
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
|
||||
\fBpcre2_match()\fP. You might want to do this for the second and subsequent
|
||||
calls to \fBpcre2_match()\fP if you are making repeated calls to find other
|
||||
calls to \fBpcre2_match()\fP if you are making repeated calls to find multiple
|
||||
matches in the same subject string.
|
||||
.P
|
||||
\fBWarning:\fP When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
|
||||
\fBWarning:\fP Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
|
||||
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
|
||||
string as a subject, or an invalid value of \fIstartoffset\fP, is undefined.
|
||||
Your program may crash or loop indefinitely.
|
||||
Your program may crash or loop indefinitely or give wrong results.
|
||||
.sp
|
||||
PCRE2_PARTIAL_HARD
|
||||
PCRE2_PARTIAL_SOFT
|
||||
|
@ -3774,6 +3794,12 @@ a backreference.
|
|||
This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
|
||||
that uses a backreference for the condition, or a test for recursion in a
|
||||
specific capture group. These are not supported.
|
||||
.sp
|
||||
PCRE2_ERROR_DFA_UINVALID_UTF
|
||||
.sp
|
||||
This return is given if \fBpcre2_dfa_match()\fP is called for a pattern that
|
||||
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
|
||||
matching.
|
||||
.sp
|
||||
PCRE2_ERROR_DFA_WSSIZE
|
||||
.sp
|
||||
|
@ -3817,6 +3843,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 14 February 2019
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
|
||||
.TH PCRE2JIT 3 "23 May 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
||||
|
@ -123,23 +123,29 @@ pattern.
|
|||
.SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
|
||||
.rs
|
||||
.sp
|
||||
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
|
||||
function expects its subject string to be a valid sequence of UTF code units.
|
||||
If it is not, the result is undefined. This is also true by default of matching
|
||||
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
|
||||
\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
|
||||
UTF is compiled.
|
||||
When a pattern is compiled with the PCRE2_UTF option, subject strings are
|
||||
normally expected to be a valid sequence of UTF code units. By default, this is
|
||||
checked at the start of matching and an error is generated if invalid UTF is
|
||||
detected. The PCRE2_NO_UTF_CHECK option can be passed to \fBpcre2_match()\fP to
|
||||
skip the check (for improved performance) if you are sure that a subject string
|
||||
is valid. If this option is used with an invalid string, the result is
|
||||
undefined.
|
||||
.P
|
||||
In this mode, an invalid code unit sequence never matches any pattern item. It
|
||||
does not match dot, it does not match \ep{Any}, it does not even match negative
|
||||
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
|
||||
sequence while moving the current point backwards. In other words, an invalid
|
||||
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
|
||||
invalid sequence causes an immediate backtrack.
|
||||
However, a way of running matches on strings that may contain invalid UTF
|
||||
sequences is available. Calling \fBpcre2_compile()\fP with the
|
||||
PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
|
||||
\fBpcre2_match()\fP to support invalid UTF, and, if \fBpcre2_jit_compile()\fP
|
||||
is called, the compiled JIT code also supports invalid UTF. Details of how this
|
||||
support works, in both the JIT and the interpretive cases, are given in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
documentation.
|
||||
.P
|
||||
Using this option, an application can run matches in arbitrary data, knowing
|
||||
that any matched strings that are returned will be valid UTF. This can be
|
||||
useful when searching for text in executable or other binary files.
|
||||
There is also an obsolete option for \fBpcre2_jit_compile()\fP called
|
||||
PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
|
||||
It is superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF
|
||||
and should no longer be used. It may be removed in future.
|
||||
.
|
||||
.
|
||||
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
|
||||
|
@ -438,6 +444,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 06 March 2019
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
|
||||
.TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 MATCHING ALGORITHMS"
|
||||
|
@ -157,6 +157,9 @@ code unit) at a time, for all active paths through the tree.
|
|||
.P
|
||||
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
||||
.P
|
||||
10. The PCRE2_MATCH_INVALID_UTF option for \fBpcre2_compile()\fP is not
|
||||
supported by \fBpcre2_dfa_match()\fP.
|
||||
.
|
||||
.
|
||||
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
|
||||
|
@ -191,7 +194,8 @@ The alternative algorithm suffers from a number of disadvantages:
|
|||
because it has to search for all possible matches, but is also because it is
|
||||
less susceptible to optimization.
|
||||
.P
|
||||
2. Capturing parentheses, backreferences, and script runs are not supported.
|
||||
2. Capturing parentheses, backreferences, script runs, and matching within
|
||||
invalid UTF strings are not supported.
|
||||
.P
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
|
@ -211,6 +215,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 10 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "12 February 2019" "PCRE2 10.33"
|
||||
.TH PCRE2PATTERN 3 "23 May 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -52,10 +52,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
|
|||
specified for the 32-bit library, in which case it constrains the character
|
||||
values to valid Unicode code points. To process UTF strings, PCRE2 must be
|
||||
built to include Unicode support (which is the default). When using UTF strings
|
||||
you must either call the compiling function with the PCRE2_UTF option, or the
|
||||
pattern must start with the special sequence (*UTF), which is equivalent to
|
||||
setting the relevant option. How setting a UTF mode affects pattern matching is
|
||||
mentioned in several places below. There is also a summary of features in the
|
||||
you must either call the compiling function with one or both of the PCRE2_UTF
|
||||
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
|
||||
sequence (*UTF), which is equivalent to setting PCRE2_UTF. How
|
||||
setting a UTF mode affects pattern matching is mentioned in several places
|
||||
below. There is also a summary of features in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
|
@ -398,11 +399,11 @@ PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
|
|||
There may be any number of hexadecimal digits. This syntax is from ECMAScript
|
||||
6.
|
||||
.P
|
||||
The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
|
||||
is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
|
||||
\eN{name} to specify characters by Unicode name; PCRE2 does not support this.
|
||||
Note that when \eN is not followed by an opening brace (curly bracket) it has
|
||||
an entirely different meaning, matching any character that is not a newline.
|
||||
The \eN{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
|
||||
UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
|
||||
does not support this. Note that when \eN is not followed by an opening brace
|
||||
(curly bracket) it has an entirely different meaning, matching any character
|
||||
that is not a newline.
|
||||
.P
|
||||
There are some legacy applications where the escape sequence \er is expected to
|
||||
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
|
||||
|
@ -1352,7 +1353,7 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
|
|||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used).
|
||||
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
|
||||
.P
|
||||
An application can lock out the use of \eC by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
|
||||
|
@ -3763,6 +3764,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 12 February 2019
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2TEST 1 "11 March 2019" "PCRE 10.33"
|
||||
.TH PCRE2TEST 1 "23 May 2019" "PCRE 10.34"
|
||||
.SH NAME
|
||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
|
@ -572,6 +572,7 @@ for a description of the effects of these options.
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
literal set PCRE2_LITERAL
|
||||
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
|
@ -2059,6 +2060,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 11 March 2019
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -551,6 +551,7 @@ PATTERN MODIFIERS
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
literal set PCRE2_LITERAL
|
||||
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||
match_invalid_utf set PCRE2_MATCH_INVALID_UTF
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
|
@ -1890,5 +1891,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 11 March 2019
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
|
|
|
@ -1,26 +1,38 @@
|
|||
.TH PCRE2UNICODE 3 "11 May 2019" "PCRE2 10.33"
|
||||
.TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE - Perl-compatible regular expressions (revised API)
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
.rs
|
||||
.sp
|
||||
When PCRE2 is built with Unicode support (which is the default), it has
|
||||
knowledge of Unicode character properties and can process text strings in
|
||||
UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
|
||||
default, PCRE2 assumes that one code unit is one character. To process a
|
||||
pattern as a UTF string, where a character may require more than one code unit,
|
||||
you must call
|
||||
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
||||
can build it without, in which case the library will be smaller. With Unicode
|
||||
support, PCRE2 has knowledge of Unicode character properties and can process
|
||||
text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
|
||||
width), but this is not the default. Unless specifically requested, PCRE2
|
||||
treats each code unit in a string as one character.
|
||||
.P
|
||||
There are two ways of telling PCRE2 to switch to UTF mode, where characters may
|
||||
consist of more than one code unit and the range of values is constrained. The
|
||||
program can call
|
||||
.\" HREF
|
||||
\fBpcre2_compile()\fP
|
||||
.\"
|
||||
with the PCRE2_UTF option flag, or the pattern must start with the sequence
|
||||
(*UTF). When either of these is the case, both the pattern and any subject
|
||||
strings that are matched against it are treated as UTF strings instead of
|
||||
strings of individual one-code-unit characters. There are also some other
|
||||
changes to the way characters are handled, as documented below.
|
||||
with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
|
||||
However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
|
||||
That is, the programmer can prevent the supplier of the pattern from switching
|
||||
to UTF mode.
|
||||
.P
|
||||
If you do not need Unicode support you can build PCRE2 without it, in which
|
||||
case the library will be smaller.
|
||||
Note that the PCRE2_MATCH_INVALID_UTF option (see
|
||||
.\" HTML <a href="#matchinvalid">
|
||||
.\" </a>
|
||||
below)
|
||||
.\"
|
||||
forces PCRE2_UTF to be set.
|
||||
.P
|
||||
In UTF mode, both the pattern and any subject strings that are matched against
|
||||
it are treated as UTF strings instead of strings of individual one-code-unit
|
||||
characters. There are also some other changes to the way characters are
|
||||
handled, as documented below.
|
||||
.
|
||||
.
|
||||
.SH "UNICODE PROPERTY SUPPORT"
|
||||
|
@ -55,18 +67,18 @@ also recognized; larger ones can be coded using \eo{...}.
|
|||
.P
|
||||
The escape sequence \eN{U+<hex digits>} is recognized as another way of
|
||||
specifying a Unicode character by code point in a UTF mode. It is not allowed
|
||||
in non-UTF modes.
|
||||
in non-UTF mode.
|
||||
.P
|
||||
In UTF modes, repeat quantifiers apply to complete UTF characters, not to
|
||||
In UTF mode, repeat quantifiers apply to complete UTF characters, not to
|
||||
individual code units.
|
||||
.P
|
||||
In UTF modes, the dot metacharacter matches one UTF character instead of a
|
||||
In UTF mode, the dot metacharacter matches one UTF character instead of a
|
||||
single code unit.
|
||||
.P
|
||||
In UTF modes, capture group names are not restricted to ASCII, and may contain
|
||||
In UTF mode, capture group names are not restricted to ASCII, and may contain
|
||||
any Unicode letters and decimal digits, as well as underscore.
|
||||
.P
|
||||
The escape sequence \eC can be used to match a single code unit in a UTF mode,
|
||||
The escape sequence \eC can be used to match a single code unit in UTF mode,
|
||||
but its use can lead to some strange effects because it breaks up multi-unit
|
||||
characters (see the description of \eC in the
|
||||
.\" HREF
|
||||
|
@ -82,7 +94,7 @@ may consist of more than one code unit. The use of \eC in these modes provokes
|
|||
a match-time error. Also, the JIT optimization does not support \eC in these
|
||||
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
|
||||
contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
|
||||
the matching will be carried out by the normal interpretive function.
|
||||
the matching will be carried out by the interpretive function.
|
||||
.P
|
||||
The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
|
||||
characters of any code value, but, by default, the characters that PCRE2
|
||||
|
@ -114,14 +126,14 @@ However, the special horizontal and vertical white space matching escapes (\eh,
|
|||
not PCRE2_UCP is set.
|
||||
.
|
||||
.
|
||||
.SH "CASE-EQUIVALENCE IN UTF MODES"
|
||||
.SH "CASE-EQUIVALENCE IN UTF MODE"
|
||||
.rs
|
||||
.sp
|
||||
Case-insensitive matching in a UTF mode makes use of Unicode properties except
|
||||
Case-insensitive matching in UTF mode makes use of Unicode properties except
|
||||
for characters whose code points are less than 128 and that have at most two
|
||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||
few Unicode characters such as Greek sigma have more than two code points that
|
||||
are case-equivalent, and these are treated as such.
|
||||
are case-equivalent, and these are treated specially.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="scriptruns"></a>
|
||||
|
@ -231,7 +243,7 @@ adjacent characters.
|
|||
.sp
|
||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||
are (by default) checked for validity on entry to the relevant functions. If an
|
||||
invalid UTF string is passed, an negative error code is returned. The code unit
|
||||
invalid UTF string is passed, a negative error code is returned. The code unit
|
||||
offset to the offending character can be extracted from the match data block by
|
||||
calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
|
||||
error.
|
||||
|
@ -244,18 +256,15 @@ PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
|||
only valid UTF code unit sequences.
|
||||
.P
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||
is usually undefined and your program may crash or loop indefinitely. There is,
|
||||
however, one mode of matching that can handle invalid UTF subject strings. This
|
||||
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
|
||||
when calling \fBpcre2_jit_compile()\fP. For details, see the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
.\"
|
||||
documentation.
|
||||
is undefined and your program may crash or loop indefinitely or give incorrect
|
||||
results. There is, however, one mode of matching that can handle invalid UTF
|
||||
subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
|
||||
\fBpcre2_compile()\fP and is discussed below in the next section. The rest of
|
||||
this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
|
||||
.P
|
||||
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
|
||||
the pattern; it does not also apply to subject strings. If you want to disable
|
||||
the check for a subject string you must pass this same option to
|
||||
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the UTF check
|
||||
for the pattern; it does not also apply to subject strings. If you want to
|
||||
disable the check for a subject string you must pass this same option to
|
||||
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
|
||||
.P
|
||||
UTF-16 and UTF-32 strings can indicate their endianness by special code known
|
||||
|
@ -386,6 +395,52 @@ The following negative error codes are given for invalid UTF-32 strings:
|
|||
.sp
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="matchinvalid"></a>
|
||||
.SH "MATCHING IN INVALID UTF STRINGS"
|
||||
.rs
|
||||
.sp
|
||||
You can run pattern matches on subject strings that may contain invalid UTF
|
||||
sequences if you call \fBpcre2_compile()\fP with the PCRE2_MATCH_INVALID_UTF
|
||||
option. This is supported by \fBpcre2_match()\fP, including JIT matching, but
|
||||
not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
|
||||
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
|
||||
valid UTF string.
|
||||
.P
|
||||
Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
|
||||
generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
|
||||
generate different code. If JIT is not used, the option affects the behaviour
|
||||
of the interpretive code in \fBpcre2_match()\fP. When PCRE2_MATCH_INVALID_UTF
|
||||
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
|
||||
.P
|
||||
In this mode, an invalid code unit sequence in the subject never matches any
|
||||
pattern item. It does not match dot, it does not match \ep{Any}, it does not
|
||||
even match negative items such as [^X]. A lookbehind assertion fails if it
|
||||
encounters an invalid sequence while moving the current point backwards. In
|
||||
other words, an invalid UTF code unit sequence acts as a barrier which no match
|
||||
can cross.
|
||||
.P
|
||||
You can also think of this as the subject being split up into fragments of
|
||||
valid UTF, delimited internally by invalid code unit sequences. The pattern is
|
||||
matched fragment by fragment. The result of a successful match, however, is
|
||||
given as code unit offsets in the entire subject string in the usual way. There
|
||||
are a few points to consider:
|
||||
.P
|
||||
The internal boundaries are not interpreted as the beginnings or ends of lines
|
||||
and so do not match circumflex or dollar characters in the pattern.
|
||||
.P
|
||||
If \fBpcre2_match()\fP is called with an offset that points to an invalid
|
||||
UTF sequence, that sequence is skipped, and the match starts at the next valid
|
||||
UTF character, or the end of the subject.
|
||||
.P
|
||||
At internal fragment boundaries, \eb and \eB behave in the same way as at the
|
||||
beginning and end of the subject. For example, a sequence such as \ebWORD\eb
|
||||
would match an instance of WORD that is surrounded by invalid UTF code units.
|
||||
.P
|
||||
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
|
||||
data, knowing that any matched strings that are returned are valid UTF. This
|
||||
can be useful when searching for UTF text in executable or other binary files.
|
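The "fragments of valid UTF" view described in this section can be illustrated with a small stand-alone sketch for UTF-8. It is not PCRE2's implementation: `utf8_char_len`, `utf8_valid_fragments`, and the `fragment` type are hypothetical names, and the sketch assumes that each byte which cannot start a valid character is treated as a one-byte barrier. How the library itself delimits an invalid sequence may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the valid UTF-8 character starting at s[0], or 0 if the
   bytes there do not form one. n is the number of bytes available. */
static size_t utf8_char_len(const uint8_t *s, size_t n)
{
    uint32_t c, min;
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                        /* stray continuation or bad lead */
    if (n < len) return 0;                /* truncated at end of subject */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min) return 0;                    /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogate code point */
    if (c > 0x10FFFF) return 0;               /* beyond the Unicode range */
    return len;
}

/* A half-open range [start, end) of code unit offsets in the subject. */
typedef struct { size_t start; size_t end; } fragment;

/* Stores up to max non-empty valid fragments in out[] and returns the number
   of fragments found in the whole subject. */
size_t utf8_valid_fragments(const uint8_t *s, size_t n,
                            fragment *out, size_t max)
{
    size_t count = 0, pos = 0;

    while (pos < n) {
        size_t len = utf8_char_len(s + pos, n - pos);
        size_t start;

        if (len == 0) { pos++; continue; }    /* invalid byte: a barrier */
        start = pos;
        while (pos < n && (len = utf8_char_len(s + pos, n - pos)) != 0)
            pos += len;
        if (count < max) { out[count].start = start; out[count].end = pos; }
        count++;
    }
    return count;
}
```

The offsets are relative to the entire subject, mirroring the way match results are reported. A real application would not pre-split its data like this; it would simply pass PCRE2_MATCH_INVALID_UTF to pcre2_compile() and let pcre2_match() treat the invalid sequences as barriers.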
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
.rs
|
||||
.sp
|
||||
|
@ -400,6 +455,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 11 May 2019
|
||||
Last updated: 24 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -5,7 +5,7 @@
|
|||
/* This is the public header file for the PCRE library, second API, to be
|
||||
#included by applications that call PCRE2 functions.
|
||||
|
||||
Copyright (c) 2016-2018 University of Cambridge
|
||||
Copyright (c) 2016-2019 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
|
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
|
|||
/* The current PCRE version information. */
|
||||
|
||||
#define PCRE2_MAJOR 10
|
||||
#define PCRE2_MINOR 33
|
||||
#define PCRE2_PRERELEASE
|
||||
#define PCRE2_DATE 2019-04-16
|
||||
#define PCRE2_MINOR 34
|
||||
#define PCRE2_PRERELEASE -RC1
|
||||
#define PCRE2_DATE 2019-04-22
|
||||
|
||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||
imported have to be identified as such. When building PCRE2, the appropriate
|
||||
|
@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
|
|||
#define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */
|
||||
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */
|
||||
#define PCRE2_LITERAL 0x02000000u /* C */
|
||||
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
|
||||
|
||||
/* An additional compile options word is available in the compile context. */
|
||||
|
||||
|
@ -305,6 +306,7 @@ pcre2_pattern_convert(). */
|
|||
#define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194
|
||||
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
|
||||
#define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE 196
|
||||
#define PCRE2_ERROR_TOO_MANY_CAPTURES 197
|
||||
|
||||
|
||||
/* "Expected" matching error codes: no match and partial match. */
|
||||
|
@ -390,6 +392,7 @@ released, the numbers must not be changed. */
|
|||
#define PCRE2_ERROR_HEAPLIMIT (-63)
|
||||
#define PCRE2_ERROR_CONVERT_SYNTAX (-64)
|
||||
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
|
||||
#define PCRE2_ERROR_DFA_UINVALID_UTF (-66)
|
||||
|
||||
|
||||
/* Request types for pcre2_pattern_info() */
|
||||
|
|
|
@ -5,7 +5,7 @@
|
|||
/* This is the public header file for the PCRE library, second API, to be
|
||||
#included by applications that call PCRE2 functions.
|
||||
|
||||
Copyright (c) 2016-2018 University of Cambridge
|
||||
Copyright (c) 2016-2019 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
|
@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
|
|||
#define PCRE2_USE_OFFSET_LIMIT 0x00800000u /* J M D */
|
||||
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */
|
||||
#define PCRE2_LITERAL 0x02000000u /* C */
|
||||
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
|
||||
|
||||
/* An additional compile options word is available in the compile context. */
|
||||
|
||||
|
@ -391,6 +392,7 @@ released, the numbers must not be changed. */
|
|||
#define PCRE2_ERROR_HEAPLIMIT (-63)
|
||||
#define PCRE2_ERROR_CONVERT_SYNTAX (-64)
|
||||
#define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
|
||||
#define PCRE2_ERROR_DFA_UINVALID_UTF (-66)
|
||||
|
||||
|
||||
/* Request types for pcre2_pattern_info() */
|
||||
|
|
|
@ -746,8 +746,8 @@ are allowed. */
|
|||
|
||||
#define PUBLIC_LITERAL_COMPILE_OPTIONS \
|
||||
(PCRE2_ANCHORED|PCRE2_AUTO_CALLOUT|PCRE2_CASELESS|PCRE2_ENDANCHORED| \
|
||||
PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_NO_START_OPTIMIZE| \
|
||||
PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
|
||||
PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_MATCH_INVALID_UTF| \
|
||||
PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
|
||||
|
||||
#define PUBLIC_COMPILE_OPTIONS \
|
||||
(PUBLIC_LITERAL_COMPILE_OPTIONS| \
|
||||
|
@ -9504,6 +9504,10 @@ if (pattern == NULL)
|
|||
if (ccontext == NULL)
|
||||
ccontext = (pcre2_compile_context *)(&PRIV(default_compile_context));
|
||||
|
||||
/* PCRE2_MATCH_INVALID_UTF implies UTF */
|
||||
|
||||
if ((options & PCRE2_MATCH_INVALID_UTF) != 0) options |= PCRE2_UTF;
|
||||
|
||||
/* Check that all undefined public option bits are zero. */
|
||||
|
||||
if ((options & ~PUBLIC_COMPILE_OPTIONS) != 0 ||
|
||||
|
@ -9682,7 +9686,7 @@ if ((options & PCRE2_LITERAL) == 0)
|
|||
|
||||
ptr += skipatstart;
|
||||
|
||||
/* Can't support UTF or UCP unless PCRE2 has been compiled with UTF support. */
|
||||
/* Can't support UTF or UCP if PCRE2 was built without Unicode support. */
|
||||
|
||||
#ifndef SUPPORT_UNICODE
|
||||
if ((cb.external_options & (PCRE2_UTF|PCRE2_UCP)) != 0)
|
||||
|
|
|
@ -3295,6 +3295,11 @@ if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
|
|||
((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
|
||||
return PCRE2_ERROR_BADOPTION;
|
||||
|
||||
/* Invalid UTF support is not available for DFA matching. */
|
||||
|
||||
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
|
||||
return PCRE2_ERROR_DFA_UINVALID_UTF;
|
||||
|
||||
/* Check that the first field in the block is the magic number. If it is not,
|
||||
return with PCRE2_ERROR_BADMAGIC. */
|
||||
|
||||
|
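The option handling added in the two C hunks above (PCRE2_MATCH_INVALID_UTF implies PCRE2_UTF at compile time, and DFA matching refuses patterns compiled with it) can be modelled in isolation as below. The `SK_` macros and both function names are stand-ins invented for this sketch, so that it builds without pcre2.h. The PCRE2_MATCH_INVALID_UTF bit and the error number are the values shown in this diff; the value used for the UTF bit is an assumption.

```c
#include <stdint.h>

/* Local stand-ins for the real macros in pcre2.h. */
#define SK_UTF                     0x00080000u  /* assumed value */
#define SK_MATCH_INVALID_UTF       0x04000000u  /* value from this diff */
#define SK_ERROR_DFA_UINVALID_UTF  (-66)        /* value from this diff */

/* Mirrors the compile-time rule: requesting invalid-UTF matching forces
   UTF mode on. */
uint32_t sk_normalize_compile_options(uint32_t options)
{
    if ((options & SK_MATCH_INVALID_UTF) != 0) options |= SK_UTF;
    return options;
}

/* Mirrors the DFA entry check: returns 0 if matching may proceed, or the
   negative error code if the pattern was compiled for invalid-UTF matching. */
int sk_dfa_precheck(uint32_t overall_options)
{
    if ((overall_options & SK_MATCH_INVALID_UTF) != 0)
        return SK_ERROR_DFA_UINVALID_UTF;
    return 0;
}
```

Because the implication is applied before the unknown-option check, a caller who sets only PCRE2_MATCH_INVALID_UTF gets the same behaviour as one who sets both bits.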
|
|
@ -269,6 +269,7 @@ static const unsigned char match_error_texts[] =
|
|||
"invalid syntax\0"
|
||||
/* 65 */
|
||||
"internal error - duplicate substitution match\0"
|
||||
"PCRE2_MATCH_INVALID_UTF is not supported for DFA matching\0"
|
||||
;
|
||||
|
||||
|
||||
|
|
|
@ -866,6 +866,7 @@ typedef struct match_block {
|
|||
PCRE2_SPTR name_table; /* Table of group names */
|
||||
PCRE2_SPTR start_code; /* For use when recursing */
|
||||
PCRE2_SPTR start_subject; /* Start of the subject string */
|
||||
PCRE2_SPTR check_subject; /* Where UTF-checked from */
|
||||
PCRE2_SPTR end_subject; /* End of the subject string */
|
||||
PCRE2_SPTR end_match_ptr; /* Subject position at end match */
|
||||
PCRE2_SPTR start_used_ptr; /* Earliest consulted character */
|
||||
|
|
|
@ -6,8 +6,9 @@
|
|||
and semantics are as close as possible to those of the Perl 5 language.
|
||||
|
||||
Written by Philip Hazel
|
||||
This module by Zoltan Herczeg
|
||||
Original API code Copyright (c) 1997-2012 University of Cambridge
|
||||
New API code Copyright (c) 2016-2018 University of Cambridge
|
||||
New API code Copyright (c) 2016-2019 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
|
@ -7846,8 +7847,6 @@ if (needstype || needsscript)
|
|||
if (needsscript)
|
||||
{
|
||||
// PH hacking
|
||||
//fprintf(stderr, "~~B\n");
|
||||
|
||||
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
|
||||
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
|
||||
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
|
||||
|
@ -7901,7 +7900,6 @@ if (needstype || needsscript)
|
|||
if (!needschar)
|
||||
{
|
||||
// PH hacking
|
||||
//fprintf(stderr, "~~C\n");
|
||||
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
|
||||
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
|
||||
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
|
||||
|
@ -7916,7 +7914,6 @@ if (needstype || needsscript)
|
|||
else
|
||||
{
|
||||
// PH hacking
|
||||
//fprintf(stderr, "~~D\n");
|
||||
OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
|
||||
|
||||
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
|
||||
|
@ -8594,8 +8591,8 @@ uint32_t c;
|
|||
|
||||
/* Patch by PH */
|
||||
/* GETCHARINC(c, cc); */
|
||||
|
||||
c = *cc++;
|
||||
|
||||
#if PCRE2_CODE_UNIT_WIDTH == 32
|
||||
if (c >= 0x110000)
|
||||
return NULL;
|
||||
|
@ -9257,8 +9254,6 @@ if (common->utf && *cc == OP_REFI)
|
|||
CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop);
|
||||
|
||||
// PH hacking
|
||||
//fprintf(stderr, "~~E\n");
|
||||
|
||||
OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);
|
||||
|
||||
add_jump(compiler, &common->getucd, JUMP(SLJIT_FAST_CALL));
|
||||
|
@ -14156,18 +14151,8 @@ Returns: 0: success or (*NOJIT) was used
|
|||
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
|
||||
pcre2_jit_compile(pcre2_code *code, uint32_t options)
|
||||
{
|
||||
#ifndef SUPPORT_JIT
|
||||
|
||||
(void)code;
|
||||
(void)options;
|
||||
return PCRE2_ERROR_JIT_BADOPTION;
|
||||
|
||||
#else /* SUPPORT_JIT */
|
||||
|
||||
pcre2_real_code *re = (pcre2_real_code *)code;
|
||||
executable_functions *functions;
|
||||
uint32_t excluded_options;
|
||||
int result;
|
||||
|
||||
if (code == NULL)
|
||||
return PCRE2_ERROR_NULL;
|
||||
|
@ -14175,30 +14160,78 @@ if (code == NULL)
|
|||
if ((options & ~PUBLIC_JIT_COMPILE_OPTIONS) != 0)
|
||||
return PCRE2_ERROR_JIT_BADOPTION;
|
||||
|
||||
if ((re->flags & PCRE2_NOJIT) != 0) return 0;
|
||||
|
||||
functions = (executable_functions *)re->executable_jit;
|
||||
|
||||
/* Support for invalid UTF was first introduced in JIT, with the option
|
||||
PCRE2_JIT_INVALID_UTF. Later, support was added to the interpreter, and the
|
||||
compile-time option PCRE2_MATCH_INVALID_UTF was created. This is now the
|
||||
preferred feature, with the earlier option deprecated. However, for backward
|
||||
compatibility, if the earlier option is set, it forces the new option so that
|
||||
if JIT matching falls back to the interpreter, there is still support for
|
||||
invalid UTF. However, if this function has already been successfully called
|
||||
without PCRE2_JIT_INVALID_UTF and without PCRE2_MATCH_INVALID_UTF (meaning that
|
||||
non-invalid-supporting JIT code was compiled), give an error.
|
||||
|
||||
If in the future support for PCRE2_JIT_INVALID_UTF is withdrawn, the following
|
||||
actions are needed:
|
||||
|
||||
1. Remove the definition from pcre2.h.in and from the list in
|
||||
PUBLIC_JIT_COMPILE_OPTIONS above.
|
||||
|
||||
2. Replace PCRE2_JIT_INVALID_UTF with a local flag in this module.
|
||||
|
||||
3. Replace PCRE2_JIT_INVALID_UTF in pcre2_jit_test.c.
|
||||
|
||||
4. Delete the following short block of code. The setting of "re" and
|
||||
"functions" can be moved into the JIT-only block below, but if that is
|
||||
done, (void)re and (void)functions will be needed in the non-JIT case, to
|
||||
avoid compiler warnings.
|
||||
*/
|
||||
|
||||
if ((options & PCRE2_JIT_INVALID_UTF) != 0)
|
||||
{
|
||||
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) == 0)
|
||||
{
|
||||
if (functions != NULL) return PCRE2_ERROR_JIT_BADOPTION;
|
||||
re->overall_options |= PCRE2_MATCH_INVALID_UTF;
|
||||
}
|
||||
}
|
||||
|
||||
/* The above tests are run with and without JIT support. This means that
|
||||
PCRE2_JIT_INVALID_UTF propagates back into the regex options (ensuring
|
||||
interpreter support) even in the absence of JIT. But now, if there is no JIT
|
||||
support, give an error return. */
|
||||
|
||||
#ifndef SUPPORT_JIT
|
||||
return PCRE2_ERROR_JIT_BADOPTION;
|
||||
#else /* SUPPORT_JIT */
|
||||
|
||||
/* There is JIT support. Do the necessary. */
|
||||
|
||||
if ((re->flags & PCRE2_NOJIT) != 0) return 0;
|
||||
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
|
||||
options |= PCRE2_JIT_INVALID_UTF;
|
||||
|
||||
if ((options & PCRE2_JIT_COMPLETE) != 0 && (functions == NULL
|
||||
|| functions->executable_funcs[0] == NULL)) {
|
||||
excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
|
||||
result = jit_compile(code, options & ~excluded_options);
|
||||
uint32_t excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
|
||||
int result = jit_compile(code, options & ~excluded_options);
|
||||
if (result != 0)
|
||||
return result;
|
||||
}
|
||||
|
||||
if ((options & PCRE2_JIT_PARTIAL_SOFT) != 0 && (functions == NULL
|
||||
|| functions->executable_funcs[1] == NULL)) {
|
||||
excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
|
||||
result = jit_compile(code, options & ~excluded_options);
|
||||
uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
|
||||
int result = jit_compile(code, options & ~excluded_options);
|
||||
if (result != 0)
|
||||
return result;
|
||||
}
|
||||
|
||||
if ((options & PCRE2_JIT_PARTIAL_HARD) != 0 && (functions == NULL
|
||||
|| functions->executable_funcs[2] == NULL)) {
|
||||
excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
|
||||
result = jit_compile(code, options & ~excluded_options);
|
||||
uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
|
||||
int result = jit_compile(code, options & ~excluded_options);
|
||||
if (result != 0)
|
||||
return result;
|
||||
}
|
||||
|
|
|
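The backward-compatibility rule described in the pcre2_jit_compile() comment above can be sketched in isolation. This is only an illustration under stated assumptions: the `sk_` names, the struct, and the numeric flag values are hypothetical stand-ins, not the definitions from pcre2.h or pcre2_intmodedep.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins with arbitrary illustrative values. */
#define SK_JIT_INVALID_UTF    0x0100u
#define SK_MATCH_INVALID_UTF  0x0200u
#define SK_ERROR_BADOPTION    (-1)

typedef struct {
  uint32_t overall_options;  /* compile-time options of the pattern */
  void *executable_jit;      /* non-NULL once JIT code has been built */
} sk_code;

/* Mirror the two-way propagation: the deprecated JIT flag forces the
compile-time option (unless JIT code without invalid UTF support already
exists), and the compile-time option forces the JIT flag. */
int sk_propagate_invalid_utf(sk_code *re, uint32_t *jit_options)
{
assert(re != NULL && jit_options != NULL);

if ((*jit_options & SK_JIT_INVALID_UTF) != 0 &&
    (re->overall_options & SK_MATCH_INVALID_UTF) == 0)
  {
  if (re->executable_jit != NULL) return SK_ERROR_BADOPTION;
  re->overall_options |= SK_MATCH_INVALID_UTF;
  }

if ((re->overall_options & SK_MATCH_INVALID_UTF) != 0)
  *jit_options |= SK_JIT_INVALID_UTF;

return 0;
}
```

In this sketch the first half would run regardless of JIT availability and the second half only when JIT is built in, which is the ordering the comment in the diff describes.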
@ -5412,7 +5412,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
{
|
||||
while (number-- > 0)
|
||||
{
|
||||
if (Feptr <= mb->start_subject) RRETURN(MATCH_NOMATCH);
|
||||
if (Feptr <= mb->check_subject) RRETURN(MATCH_NOMATCH);
|
||||
Feptr--;
|
||||
BACKCHAR(Feptr);
|
||||
}
|
||||
|
@ -5420,7 +5420,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
else
|
||||
#endif
|
||||
|
||||
/* No UTF-8 support, or not in UTF-8 mode: count is byte count */
|
||||
/* No UTF-8 support, or not in UTF-8 mode: count is code unit count */
|
||||
|
||||
{
|
||||
if ((ptrdiff_t)number > Feptr - mb->start_subject) RRETURN(MATCH_NOMATCH);
|
||||
|
@ -5743,7 +5743,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
|
||||
case OP_NOT_WORD_BOUNDARY:
|
||||
case OP_WORD_BOUNDARY:
|
||||
if (Feptr == mb->start_subject) prev_is_word = FALSE; else
|
||||
if (Feptr == mb->check_subject) prev_is_word = FALSE; else
|
||||
{
|
||||
PCRE2_SPTR lastptr = Feptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
|
@ -6014,7 +6014,6 @@ int was_zero_terminated = 0;
|
|||
const uint8_t *start_bits = NULL;
|
||||
const pcre2_real_code *re = (const pcre2_real_code *)code;
|
||||
|
||||
|
||||
BOOL anchored;
|
||||
BOOL firstline;
|
||||
BOOL has_first_cu = FALSE;
|
||||
|
@ -6029,10 +6028,23 @@ PCRE2_UCHAR req_cu2 = 0;
|
|||
|
||||
PCRE2_SPTR bumpalong_limit;
|
||||
PCRE2_SPTR end_subject;
|
||||
PCRE2_SPTR true_end_subject;
|
||||
PCRE2_SPTR start_match = subject + start_offset;
|
||||
PCRE2_SPTR req_cu_ptr = start_match - 1;
|
||||
PCRE2_SPTR start_partial = NULL;
|
||||
PCRE2_SPTR match_partial = NULL;
|
||||
PCRE2_SPTR start_partial;
|
||||
PCRE2_SPTR match_partial;
|
||||
|
||||
#ifdef SUPPORT_JIT
|
||||
BOOL use_jit;
|
||||
#endif
|
||||
|
||||
#ifdef SUPPORT_UNICODE
|
||||
BOOL allow_invalid;
|
||||
uint32_t fragment_options = 0;
|
||||
#ifdef SUPPORT_JIT
|
||||
BOOL jit_checked_utf = FALSE;
|
||||
#endif
|
||||
#endif
|
||||
|
||||
PCRE2_SIZE frame_size;
|
||||
|
||||
|
@ -6059,7 +6071,7 @@ if (length == PCRE2_ZERO_TERMINATED)
|
|||
length = PRIV(strlen)(subject);
|
||||
was_zero_terminated = 1;
|
||||
}
|
||||
end_subject = subject + length;
|
||||
true_end_subject = end_subject = subject + length;
|
||||
|
||||
/* Plausibility checks */
|
||||
|
||||
|
@ -6095,12 +6107,24 @@ options |= (re->flags & FF) / ((FF & (~FF+1)) / (OO & (~OO+1)));
|
|||
#undef FF
|
||||
#undef OO
|
||||
|
||||
/* These two settings are used in the code for checking a UTF string that
|
||||
follows immediately afterwards. Other values in the mb block are used only
|
||||
during interpretive processing, not when the JIT support is in use, so they are
|
||||
set up later. */
|
||||
/* If the pattern was successfully studied with JIT support, we will run the
|
||||
JIT executable instead of the rest of this function. Most options must be set
|
||||
at compile time for the JIT code to be usable. */
|
||||
|
||||
#ifdef SUPPORT_JIT
|
||||
use_jit = (re->executable_jit != NULL &&
|
||||
(options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0);
|
||||
#endif
|
||||
|
||||
/* Initialize UTF parameters. */
|
||||
|
||||
utf = (re->overall_options & PCRE2_UTF) != 0;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
allow_invalid = (re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0;
|
||||
#endif
|
||||
|
||||
/* Convert the partial matching flags into an integer. */
|
||||
|
||||
mb->partial = ((options & PCRE2_PARTIAL_HARD) != 0)? 2 :
|
||||
((options & PCRE2_PARTIAL_SOFT) != 0)? 1 : 0;
|
||||
|
||||
|
@ -6111,61 +6135,6 @@ if (mb->partial != 0 &&
|
|||
((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
|
||||
return PCRE2_ERROR_BADOPTION;
|
||||
|
||||
/* Check a UTF string for validity if required. For 8-bit and 16-bit strings,
|
||||
we must also check that a starting offset does not point into the middle of a
|
||||
multiunit character. We check only the portion of the subject that is going to
|
||||
be inspected during matching - from the offset minus the maximum back reference
|
||||
to the given length. This saves time when a small part of a large subject is
|
||||
being matched by the use of a starting offset. Note that the maximum lookbehind
|
||||
is a number of characters, not code units. */
|
||||
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
||||
{
|
||||
PCRE2_SPTR check_subject = start_match; /* start_match includes offset */
|
||||
|
||||
if (start_offset > 0)
|
||||
{
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
unsigned int i;
|
||||
if (start_match < end_subject && NOT_FIRSTCU(*start_match))
|
||||
return PCRE2_ERROR_BADUTFOFFSET;
|
||||
for (i = re->max_lookbehind; i > 0 && check_subject > subject; i--)
|
||||
{
|
||||
check_subject--;
|
||||
while (check_subject > subject &&
|
||||
#if PCRE2_CODE_UNIT_WIDTH == 8
|
||||
(*check_subject & 0xc0) == 0x80)
|
||||
#else /* 16-bit */
|
||||
(*check_subject & 0xfc00) == 0xdc00)
|
||||
#endif /* PCRE2_CODE_UNIT_WIDTH == 8 */
|
||||
check_subject--;
|
||||
}
|
||||
#else
|
||||
/* In the 32-bit library, one code unit equals one character. However,
|
||||
we cannot just subtract the lookbehind and then compare pointers, because
|
||||
a very large lookbehind could create an invalid pointer. */
|
||||
|
||||
if (start_offset >= re->max_lookbehind)
|
||||
check_subject -= re->max_lookbehind;
|
||||
else
|
||||
check_subject = subject;
|
||||
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
|
||||
}
|
||||
|
||||
/* Validate the relevant portion of the subject. After an error, adjust the
|
||||
offset to be an absolute offset in the whole string. */
|
||||
|
||||
match_data->rc = PRIV(valid_utf)(check_subject,
|
||||
length - (check_subject - subject), &(match_data->startchar));
|
||||
if (match_data->rc != 0)
|
||||
{
|
||||
match_data->startchar += check_subject - subject;
|
||||
return match_data->rc;
|
||||
}
|
||||
}
|
||||
#endif /* SUPPORT_UNICODE */
|
||||
|
||||
/* It is an error to set an offset limit without setting the flag at compile
|
||||
time. */
|
||||
|
||||
|
@ -6184,15 +6153,85 @@ if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
|
|||
}
|
||||
match_data->subject = NULL;
|
||||
|
||||
/* If the pattern was successfully studied with JIT support, run the JIT
|
||||
executable instead of the rest of this function. Most options must be set at
|
||||
compile time for the JIT code to be usable. Fallback to the normal code path if
|
||||
an unsupported option is set or if JIT returns BADOPTION (which means that the
|
||||
selected normal or partial matching mode was not compiled). */
|
||||
|
||||
/* ============================= JIT matching ============================== */
|
||||
|
||||
/* Prepare for JIT matching. Check a UTF string for validity unless no check is
|
||||
requested or invalid UTF can be handled. We check only the portion of the
|
||||
subject that might be inspected during matching - from the offset minus the
|
||||
maximum lookbehind to the given length. This saves time when a small part of a
|
||||
large subject is being matched by the use of a starting offset. Note that the
|
||||
maximum lookbehind is a number of characters, not code units. */
|
||||
|
||||
#ifdef SUPPORT_JIT
|
||||
if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
|
||||
if (use_jit)
|
||||
{
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0 && !allow_invalid)
|
||||
{
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
unsigned int i;
|
||||
#endif
|
||||
|
||||
/* For 8-bit and 16-bit UTF, check that the first code unit is a valid
|
||||
character start. */
|
||||
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
if (start_match < end_subject && NOT_FIRSTCU(*start_match))
|
||||
{
|
||||
if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
|
||||
#if PCRE2_CODE_UNIT_WIDTH == 8
|
||||
return PCRE2_ERROR_UTF8_ERR20; /* Isolated 0x80 byte */
|
||||
#else
|
||||
return PCRE2_ERROR_UTF16_ERR3; /* Isolated low surrogate */
|
||||
#endif
|
||||
}
|
||||
#endif /* WIDTH != 32 */
|
||||
|
||||
/* Move back by the maximum lookbehind, just in case it happens at the very
|
||||
start of matching. */
|
||||
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
for (i = re->max_lookbehind; i > 0 && start_match > subject; i--)
|
||||
{
|
||||
start_match--;
|
||||
while (start_match > subject &&
|
||||
#if PCRE2_CODE_UNIT_WIDTH == 8
|
||||
(*start_match & 0xc0) == 0x80)
|
||||
#else /* 16-bit */
|
||||
(*start_match & 0xfc00) == 0xdc00)
|
||||
#endif
|
||||
start_match--;
|
||||
}
|
||||
#else /* PCRE2_CODE_UNIT_WIDTH != 32 */
|
||||
|
||||
/* In the 32-bit library, one code unit equals one character. However,
|
||||
we cannot just subtract the lookbehind and then compare pointers, because
|
||||
a very large lookbehind could create an invalid pointer. */
|
||||
|
||||
if (start_offset >= re->max_lookbehind)
|
||||
start_match -= re->max_lookbehind;
|
||||
else
|
||||
start_match = subject;
|
||||
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
|
||||
|
||||
/* Validate the relevant portion of the subject. Adjust the offset of an
|
||||
invalid code point to be an absolute offset in the whole string. */
|
||||
|
||||
match_data->rc = PRIV(valid_utf)(start_match,
|
||||
length - (start_match - subject), &(match_data->startchar));
|
||||
if (match_data->rc != 0)
|
||||
{
|
||||
match_data->startchar += start_match - subject;
|
||||
return match_data->rc;
|
||||
}
|
||||
jit_checked_utf = TRUE;
|
||||
}
|
||||
#endif /* SUPPORT_UNICODE */
|
||||
|
||||
/* If JIT returns BADOPTION, which means that the selected complete or
|
||||
partial matching mode was not compiled, fall through to the interpreter. */
|
||||
|
||||
rc = pcre2_jit_match(code, subject, length, start_offset, options,
|
||||
match_data, mcontext);
|
||||
if (rc != PCRE2_ERROR_JIT_BADOPTION)
|
||||
|
@ -6209,10 +6248,152 @@ if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
|
|||
return rc;
|
||||
}
|
||||
}
|
||||
#endif /* SUPPORT_JIT */
|
||||
|
||||
/* ========================= End of JIT matching ========================== */
|
||||
|
||||
|
||||
/* Proceed with non-JIT matching. The default is to allow lookbehinds to the
|
||||
start of the subject. A UTF check when there is a non-zero offset may change
|
||||
this. */
|
||||
|
||||
mb->check_subject = subject;
|
||||
|
||||
/* If a UTF subject string was not checked for validity in the JIT code above,
|
||||
check it here, and handle support for invalid UTF strings. The check above
|
||||
happens only when invalid UTF is not supported and PCRE2_NO_UTF_CHECK is unset.
|
||||
If we get here in those circumstances, it means the subject string is valid,
|
||||
but for some reason JIT matching was not successful. There is no need to check
|
||||
the subject again.
|
||||
|
||||
We check only the portion of the subject that might be inspected during
|
||||
matching - from the offset minus the maximum lookbehind to the given length.
|
||||
This saves time when a small part of a large subject is being matched by the
|
||||
use of a starting offset. Note that the maximum lookbehind is a number of
|
||||
characters, not code units.
|
||||
|
||||
Note also that support for invalid UTF forces a check, overriding the setting
|
||||
of PCRE2_NO_UTF_CHECK. */
|
||||
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf &&
|
||||
#ifdef SUPPORT_JIT
|
||||
!jit_checked_utf &&
|
||||
#endif
|
||||
((options & PCRE2_NO_UTF_CHECK) == 0 || allow_invalid))
|
||||
{
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
BOOL skipped_bad_start = FALSE;
|
||||
#endif
|
||||
|
||||
/* Carry on with non-JIT matching. A NULL match context means "use a default
|
||||
context", but we take the memory control functions from the pattern. */
|
||||
/* For 8-bit and 16-bit UTF, check that the first code unit is a valid
|
||||
character start. If we are handling invalid UTF, just skip over such code
|
||||
units. Otherwise, give an appropriate error. */
|
||||
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
if (allow_invalid)
|
||||
{
|
||||
while (start_match < end_subject && NOT_FIRSTCU(*start_match))
|
||||
{
|
||||
start_match++;
|
||||
skipped_bad_start = TRUE;
|
||||
}
|
||||
}
|
||||
else if (start_match < end_subject && NOT_FIRSTCU(*start_match))
|
||||
{
|
||||
if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
|
||||
#if PCRE2_CODE_UNIT_WIDTH == 8
|
||||
return PCRE2_ERROR_UTF8_ERR20; /* Isolated 0x80 byte */
|
||||
#else
|
||||
return PCRE2_ERROR_UTF16_ERR3; /* Isolated low surrogate */
|
||||
#endif
|
||||
}
|
||||
#endif /* WIDTH != 32 */
|
||||
|
||||
/* The mb->check_subject field points to the start of UTF checking;
|
||||
lookbehinds can go back no further than this. */
|
||||
|
||||
mb->check_subject = start_match;
|
||||
|
||||
/* Move back by the maximum lookbehind, just in case it happens at the very
|
||||
start of matching, but don't do this if we skipped bad 8-bit or 16-bit code
|
||||
units above. */
|
||||
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
if (!skipped_bad_start)
|
||||
{
|
||||
unsigned int i;
|
||||
for (i = re->max_lookbehind; i > 0 && mb->check_subject > subject; i--)
|
||||
{
|
||||
mb->check_subject--;
|
||||
while (mb->check_subject > subject &&
|
||||
#if PCRE2_CODE_UNIT_WIDTH == 8
|
||||
(*mb->check_subject & 0xc0) == 0x80)
|
||||
#else /* 16-bit */
|
||||
(*mb->check_subject & 0xfc00) == 0xdc00)
|
||||
#endif
|
||||
mb->check_subject--;
|
||||
}
|
||||
}
|
||||
#else /* PCRE2_CODE_UNIT_WIDTH != 32 */
|
||||
|
||||
/* In the 32-bit library, one code unit equals one character. However,
|
||||
we cannot just subtract the lookbehind and then compare pointers, because
|
||||
a very large lookbehind could create an invalid pointer. */
|
||||
|
||||
if (start_offset >= re->max_lookbehind)
|
||||
mb->check_subject -= re->max_lookbehind;
|
||||
else
|
||||
mb->check_subject = subject;
|
||||
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
|
||||
|
||||
/* Validate the relevant portion of the subject. There's a loop in case we
|
||||
encounter bad UTF in the characters preceding start_match which we are
|
||||
scanning because of a lookbehind. */
|
||||
|
||||
for (;;)
|
||||
{
|
||||
match_data->rc = PRIV(valid_utf)(mb->check_subject,
|
||||
length - (mb->check_subject - subject), &(match_data->startchar));
|
||||
|
||||
if (match_data->rc == 0) break; /* Valid UTF string */
|
||||
|
||||
/* Invalid UTF string. Adjust the offset to be an absolute offset in the
|
||||
whole string. If we are handling invalid UTF strings, set end_subject to
|
||||
stop before the bad code unit, and set the options to "not end of line".
|
||||
Otherwise return the error. */
|
||||
|
||||
match_data->startchar += mb->check_subject - subject;
|
||||
if (!allow_invalid || match_data->rc > 0) return match_data->rc;
|
||||
end_subject = subject + match_data->startchar;
|
||||
|
||||
/* If the end precedes start_match, it means there is invalid UTF in the
|
||||
extra code units we reversed over because of a lookbehind. Advance past the
|
||||
first bad code unit, and then skip invalid character starting code units in
|
||||
8-bit and 16-bit modes, and try again. */
|
||||
|
||||
if (end_subject < start_match)
|
||||
{
|
||||
mb->check_subject = end_subject + 1;
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
while (mb->check_subject < start_match && NOT_FIRSTCU(*mb->check_subject))
|
||||
mb->check_subject++;
|
||||
#endif
|
||||
}
|
||||
|
||||
/* Otherwise, set the not end of line option, and do the match. */
|
||||
|
||||
else
|
||||
{
|
||||
fragment_options = PCRE2_NOTEOL;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
#endif /* SUPPORT_UNICODE */
|
||||
|
||||
/* A NULL match context means "use a default context", but we take the memory
|
||||
control functions from the pattern. */
|
||||
|
||||
if (mcontext == NULL)
|
||||
{
|
||||
|
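The "move back by the maximum lookbehind" step in the hunk above counts characters, not code units. A minimal standalone sketch for the 8-bit case follows; the function name `sk_retreat_utf8` is hypothetical, and the sketch assumes the bytes it walks over are well-formed UTF-8 (the real code validates them afterwards).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step back by up to max_lookbehind characters from start, never going
before subject. After each single-byte step, any UTF-8 continuation bytes
(bit pattern 10xxxxxx) are skipped so that the pointer rests on the first
byte of a character. */
const uint8_t *sk_retreat_utf8(const uint8_t *subject, const uint8_t *start,
  unsigned int max_lookbehind)
{
const uint8_t *p = start;
unsigned int i;

assert(subject != NULL && start >= subject);

for (i = max_lookbehind; i > 0 && p > subject; i--)
  {
  p--;
  while (p > subject && (*p & 0xc0) == 0x80) p--;
  }

return p;
}
```

The loop decrements one character at a time rather than subtracting a count, which is why a very large lookbehind cannot produce a pointer before the start of the subject.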
@ -6224,8 +6405,8 @@ else mb->memctl = mcontext->memctl;
|
|||
anchored = ((re->overall_options | options) & PCRE2_ANCHORED) != 0;
|
||||
firstline = (re->overall_options & PCRE2_FIRSTLINE) != 0;
|
||||
startline = (re->flags & PCRE2_STARTLINE) != 0;
|
||||
bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
|
||||
end_subject : subject + mcontext->offset_limit;
|
||||
bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
|
||||
true_end_subject : subject + mcontext->offset_limit;
|
||||
|
||||
/* Initialize and set up the fixed fields in the callout block, with a pointer
|
||||
in the match block. */
|
||||
|
@ -6236,7 +6417,8 @@ cb.subject = subject;
|
|||
cb.subject_length = (PCRE2_SIZE)(end_subject - subject);
|
||||
cb.callout_flags = 0;
|
||||
|
||||
/* Fill in the remaining fields in the match block. */
|
||||
/* Fill in the remaining fields in the match block, except for moptions, which
|
||||
gets set later. */
|
||||
|
||||
mb->callout = mcontext->callout;
|
||||
mb->callout_data = mcontext->callout_data;
|
||||
|
@ -6245,13 +6427,9 @@ mb->start_subject = subject;
|
|||
mb->start_offset = start_offset;
|
||||
mb->end_subject = end_subject;
|
||||
mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
|
||||
|
||||
mb->moptions = options; /* Match options */
|
||||
mb->poptions = re->overall_options; /* Pattern options */
|
||||
|
||||
mb->ignore_skip_arg = 0;
|
||||
mb->mark = mb->nomatch_mark = NULL; /* In case never set */
|
||||
mb->hitend = FALSE;
|
||||
|
||||
/* The name table is needed for finding all the numbers associated with a
|
||||
given name, for condition testing. The code follows the name table. */
|
||||
|
@ -6404,6 +6582,13 @@ if ((re->flags & PCRE2_LASTSET) != 0)
|
|||
/* Loop for handling unanchored repeated matching attempts; for anchored regexs
|
||||
the loop runs just once. */
|
||||
|
||||
#ifdef SUPPORT_UNICODE
|
||||
FRAGMENT_RESTART:
|
||||
#endif
|
||||
|
||||
start_partial = match_partial = NULL;
|
||||
mb->hitend = FALSE;
|
||||
|
||||
for(;;)
|
||||
{
|
||||
PCRE2_SPTR new_start_match;
|
||||
|
@ -6714,6 +6899,11 @@ for(;;)
|
|||
|
||||
mb->start_used_ptr = start_match;
|
||||
mb->last_used_ptr = start_match;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
mb->moptions = options | fragment_options;
|
||||
#else
|
||||
mb->moptions = options;
|
||||
#endif
|
||||
mb->match_call_count = 0;
|
||||
mb->end_offset_top = 0;
|
||||
mb->skip_arg_count = 0;
|
||||
|
@ -6839,6 +7029,68 @@ for(;;)
|
|||
|
||||
ENDLOOP:
|
||||
|
||||
/* If end_subject != true_end_subject, it means we are handling invalid UTF,
|
||||
and have just processed a non-terminal fragment. If this resulted in no match
|
||||
or a partial match we must carry on to the next fragment (a partial match is
|
||||
returned to the caller only at the very end of the subject). A loop is used to
|
||||
avoid trying to match against empty fragments; if the pattern can match an
|
||||
empty string it would have done so already. */
|
||||
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf && end_subject != true_end_subject &&
|
||||
(rc == MATCH_NOMATCH || rc == PCRE2_ERROR_PARTIAL))
|
||||
{
|
||||
for (;;)
|
||||
{
|
||||
/* Advance past the first bad code unit, and then skip invalid character
|
||||
starting code units in 8-bit and 16-bit modes. */
|
||||
|
||||
start_match = end_subject + 1;
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
while (start_match < true_end_subject && NOT_FIRSTCU(*start_match))
|
||||
start_match++;
|
||||
#endif
|
||||
|
||||
/* If we have hit the end of the subject, there isn't another non-empty
|
||||
fragment, so give up. */
|
||||
|
||||
if (start_match >= true_end_subject)
|
||||
{
|
||||
rc = MATCH_NOMATCH; /* In case it was partial */
|
||||
break;
|
||||
}
|
||||
|
||||
/* Check the rest of the subject */
|
||||
|
||||
mb->check_subject = start_match;
|
||||
rc = PRIV(valid_utf)(start_match, length - (start_match - subject),
|
||||
&(match_data->startchar));
|
||||
|
||||
/* The rest of the subject is valid UTF. */
|
||||
|
||||
if (rc == 0)
|
||||
{
|
||||
mb->end_subject = end_subject = true_end_subject;
|
||||
fragment_options = PCRE2_NOTBOL;
|
||||
goto FRAGMENT_RESTART;
|
||||
}
|
||||
|
||||
/* A subsequent UTF error has been found; if the next fragment is
|
||||
non-empty, set up to process it. Otherwise, let the loop advance. */
|
||||
|
||||
else if (rc < 0)
|
||||
{
|
||||
mb->end_subject = end_subject = start_match + match_data->startchar;
|
||||
if (end_subject > start_match)
|
||||
{
|
||||
fragment_options = PCRE2_NOTBOL|PCRE2_NOTEOL;
|
||||
goto FRAGMENT_RESTART;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
#endif /* SUPPORT_UNICODE */
|
||||
|
||||
/* Release an enlarged frame vector that is on the heap. */
|
||||
|
||||
if (mb->match_frames != mb->stack_frames)
|
||||
|
|
|
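The fragment-advance step after ENDLOOP (skip the bad code unit, then skip any code units that cannot start a character) can be sketched for the 8-bit case as below. The name `sk_next_fragment_start` is hypothetical, and the sketch covers only the pointer arithmetic, not the revalidation of the rest of the subject that the real code performs next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given a pointer to the first bad code unit and the true end of the
subject, return the start of the next candidate fragment, or NULL when the
end of the subject is reached and no further fragment can exist. */
const uint8_t *sk_next_fragment_start(const uint8_t *bad,
  const uint8_t *true_end)
{
const uint8_t *p;

assert(bad != NULL && bad < true_end);

p = bad + 1;                                        /* Past the bad unit */
while (p < true_end && (*p & 0xc0) == 0x80) p++;    /* Skip non-starters */

return (p >= true_end)? NULL : p;
}
```

A NULL result corresponds to the branch in the diff that sets rc to MATCH_NOMATCH and leaves the loop.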
@ -212,6 +212,12 @@ be C99 don't support it (hence DISABLE_PERCENT_ZT). */
|
|||
#define REPLACE_MODSIZE 100 /* Field for reading 8-bit replacement */
|
||||
#define VERSION_SIZE 64 /* Size of buffer for the version strings */
|
||||
|
||||
/* Default JIT compile options */
|
||||
|
||||
#define JIT_DEFAULT (PCRE2_JIT_COMPLETE|\
|
||||
PCRE2_JIT_PARTIAL_SOFT|\
|
||||
PCRE2_JIT_PARTIAL_HARD)
|
||||
|
||||
/* Make sure the buffer into which replacement strings are copied is big enough
|
||||
to hold them as 32-bit code units. */
|
||||
|
||||
|
@ -664,6 +670,7 @@ static modstruct modlist[] = {
|
|||
{ "literal", MOD_PAT, MOD_OPT, PCRE2_LITERAL, PO(options) },
|
||||
{ "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) },
|
||||
{ "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) },
|
||||
{ "match_invalid_utf", MOD_PAT, MOD_OPT, PCRE2_MATCH_INVALID_UTF, PO(options) },
|
||||
{ "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) },
|
||||
{ "match_line", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_LINE, CO(extra_options) },
|
||||
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
|
||||
|
@ -4136,7 +4143,7 @@ static void
|
|||
show_compile_options(uint32_t options, const char *before, const char *after)
|
||||
{
|
||||
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
|
||||
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||
before,
|
||||
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
|
||||
((options & PCRE2_ALT_CIRCUMFLEX) != 0)? " alt_circumflex" : "",
|
||||
|
@ -4153,6 +4160,7 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%
|
|||
((options & PCRE2_EXTENDED_MORE) != 0)? " extended_more" : "",
|
||||
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
|
||||
((options & PCRE2_LITERAL) != 0)? " literal" : "",
|
||||
((options & PCRE2_MATCH_INVALID_UTF) != 0)? " match_invalid_utf" : "",
|
||||
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
|
||||
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
|
||||
((options & PCRE2_NEVER_BACKSLASH_C) != 0)? " never_backslash_c" : "",
|
||||
|
@ -4867,7 +4875,7 @@ switch(cmd)
|
|||
case CMD_PATTERN:
|
||||
(void)decode_modifiers(argptr, CTX_DEFPAT, &def_patctl, NULL);
|
||||
if (def_patctl.jit == 0 && (def_patctl.control & CTL_JITVERIFY) != 0)
|
||||
def_patctl.jit = 7;
|
||||
def_patctl.jit = JIT_DEFAULT;
|
||||
break;
|
||||
|
||||
/* Set default subject modifiers */
|
||||
|
@ -5114,7 +5122,11 @@ patlen = p - buffer - 2;
|
|||
/* Look for modifiers and options after the final delimiter. */
|
||||
|
||||
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
|
||||
utf = (pat_patctl.options & PCRE2_UTF) != 0;
|
||||
|
||||
/* Note that the match_invalid_utf option also sets utf when passed to
|
||||
pcre2_compile(). */
|
||||
|
||||
utf = (pat_patctl.options & (PCRE2_UTF|PCRE2_MATCH_INVALID_UTF)) != 0;
|
||||
|
||||
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
|
||||
exclusive with the utf modifier. */
|
||||
|
@ -5161,7 +5173,7 @@ specified. */
|
|||
|
||||
if (pat_patctl.jit == 0 &&
|
||||
(pat_patctl.control & (CTL_JITVERIFY|CTL_JITFAST)) != 0)
|
||||
pat_patctl.jit = 7;
|
||||
pat_patctl.jit = JIT_DEFAULT;
|
||||
|
||||
/* Now copy the pattern to pbuffer8 for use in 8-bit testing and for reflecting
|
||||
in callouts. Convert from hex if requested (literal strings in quotes may be
|
||||
|
@ -5744,6 +5756,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
|
|||
{
|
||||
int i;
|
||||
clock_t time_taken = 0;
|
||||
|
||||
for (i = 0; i < timeit; i++)
|
||||
{
|
||||
clock_t start_time;
|
||||
|
@ -5752,7 +5765,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
|
|||
pat_patctl.options|use_forbid_utf, &errorcode, &erroroffset,
|
||||
use_pat_context);
|
||||
start_time = clock();
|
||||
PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
|
||||
PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
|
||||
time_taken += clock() - start_time;
|
||||
}
|
||||
total_jit_compile_time += time_taken;
|
||||
|
@ -8615,7 +8628,7 @@ while (argc > 1 && argv[op][0] == '-' && argv[op][1] != 0)
|
|||
else if (strcmp(arg, "-jit") == 0 || strcmp(arg, "-jitverify") == 0)
|
||||
{
|
||||
if (arg[4] != 0) def_patctl.control |= CTL_JITVERIFY;
|
||||
def_patctl.jit = 7; /* full & partial */
|
||||
def_patctl.jit = JIT_DEFAULT; /* full & partial */
|
||||
#ifndef SUPPORT_JIT
|
||||
fprintf(stderr, "** Warning: JIT support is not available: "
|
||||
"-jit[verify] calls functions that do nothing.\n");
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
# This set of tests is for UTF-8 support and Unicode property support, with
|
||||
# relevance only for the 8-bit library.
|
||||
|
||||
# The next 4 patterns have UTF-8 errors
|
||||
# The next 5 patterns have UTF-8 errors
|
||||
|
||||
/[Ã]/utf
|
||||
|
||||
|
@@ -11,6 +11,8 @@

/‚‚‚‚‚‚‚Ã/utf

+/‚‚‚‚‚‚‚Ã/match_invalid_utf
+
# Now test subjects

/badutf/utf
@@ -493,4 +495,66 @@

/(?(á/utf

+# Invalid UTF-8 tests
+
+/.../g,match_invalid_utf
+abcd\x80wxzy\x80pqrs
+abcd\x{80}wxzy\x80pqrs
+
+/abc/match_invalid_utf
+ab\x80ab\=ph
+\= Expect no match
+ab\x80cdef\=ph
+
+/ab$/match_invalid_utf
+ab\x80cdeab
+\= Expect no match
+ab\x80cde
+
+/.../g,match_invalid_utf
+abcd\x{80}wxzy\x80pqrs
+
+/(?<=x)../g,match_invalid_utf
+abcd\x{80}wxzy\x80pqrs
+abcd\x{80}wxzy\x80xpqrs
+
+/X$/match_invalid_utf
+\= Expect no match
+X\xc4
+
+/(?<=..)X/match_invalid_utf,aftertext
+AB\x80AQXYZ
+AB\x80AQXYZ\=offset=5
+AB\x80\x80AXYZXC\=offset=5
+\= Expect no match
+AB\x80XYZ
+AB\x80XYZ\=offset=3
+AB\xfeXYZ
+AB\xffXYZ\=offset=3
+AB\x80AXYZ
+AB\x80AXYZ\=offset=4
+AB\x80\x80AXYZ\=offset=5
+
+/.../match_invalid_utf
+AB\xc4CCC
+\= Expect no match
+A\x{d800}B
+A\x{110000}B
+A\xc4B
+
+/\bX/match_invalid_utf
+A\x80X
+
+/\BX/match_invalid_utf
+\= Expect no match
+A\x80X
+
+/(?<=...)X/match_invalid_utf
+AAA\x80BBBXYZ
+\= Expect no match
+AAA\x80BXYZ
+AAA\x80BBXYZ
+
+# -------------------------------------
+
# End of testinput10
@@ -368,6 +368,4 @@
ab˙Az
ab\x{80000041}z

-/\[()]{65535}/expand
-
# End of testinput11
@@ -402,4 +402,49 @@

/(?(á/utf

+# Invalid UTF-16/32 tests.
+
+/.../g,match_invalid_utf
+abcd\x{df00}wxzy\x{df00}pqrs
+abcd\x{80}wxzy\x{df00}pqrs
+
+/abc/match_invalid_utf
+ab\x{df00}ab\=ph
+\= Expect no match
+ab\x{df00}cdef\=ph
+
+/ab$/match_invalid_utf
+ab\x{df00}cdeab
+\= Expect no match
+ab\x{df00}cde
+
+/.../g,match_invalid_utf
+abcd\x{80}wxzy\x{df00}pqrs
+
+/(?<=x)../g,match_invalid_utf
+abcd\x{80}wxzy\x{df00}pqrs
+abcd\x{80}wxzy\x{df00}xpqrs
+
+/X$/match_invalid_utf
+\= Expect no match
+X\x{df00}
+
+/(?<=..)X/match_invalid_utf,aftertext
+AB\x{df00}AQXYZ
+AB\x{df00}AQXYZ\=offset=5
+AB\x{df00}\x{df00}AXYZXC\=offset=5
+\= Expect no match
+AB\x{df00}XYZ
+AB\x{df00}XYZ\=offset=3
+AB\x{df00}AXYZ
+AB\x{df00}AXYZ\=offset=4
+AB\x{df00}\x{df00}AXYZ\=offset=5
+
+/.../match_invalid_utf
+\= Expect no match
+A\x{d800}B
+A\x{110000}B
+
+# ----------------------------------------------------
+
# End of testinput12
@@ -182,4 +182,8 @@

/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -260,6 +260,4 @@

/(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/

-/\[()]{65535}/expand
-
# End of testinput9
@@ -1,7 +1,7 @@
# This set of tests is for UTF-8 support and Unicode property support, with
# relevance only for the 8-bit library.

-# The next 4 patterns have UTF-8 errors
+# The next 5 patterns have UTF-8 errors

/[Ã]/utf
Failed: error -8 at offset 1: UTF-8 error: byte 2 top bits not 0x80
@@ -15,6 +15,9 @@ Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
/‚‚‚‚‚‚‚Ã/utf
Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set

+/‚‚‚‚‚‚‚Ã/match_invalid_utf
+Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
+
# Now test subjects

/badutf/utf
@@ -1651,4 +1654,107 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf
Failed: error 142 at offset 5: syntax error in subpattern name (missing terminator?)

+# Invalid UTF-8 tests
+
+/.../g,match_invalid_utf
+abcd\x80wxzy\x80pqrs
+0: abc
+0: wxz
+0: pqr
+abcd\x{80}wxzy\x80pqrs
+0: abc
+0: d\x{80}w
+0: xzy
+0: pqr
+
+/abc/match_invalid_utf
+ab\x80ab\=ph
+Partial match: ab
+\= Expect no match
+ab\x80cdef\=ph
+No match
+
+/ab$/match_invalid_utf
+ab\x80cdeab
+0: ab
+\= Expect no match
+ab\x80cde
+No match
+
+/.../g,match_invalid_utf
+abcd\x{80}wxzy\x80pqrs
+0: abc
+0: d\x{80}w
+0: xzy
+0: pqr
+
+/(?<=x)../g,match_invalid_utf
+abcd\x{80}wxzy\x80pqrs
+0: zy
+abcd\x{80}wxzy\x80xpqrs
+0: zy
+0: pq
+
+/X$/match_invalid_utf
+\= Expect no match
+X\xc4
+No match
+
+/(?<=..)X/match_invalid_utf,aftertext
+AB\x80AQXYZ
+0: X
+0+ YZ
+AB\x80AQXYZ\=offset=5
+0: X
+0+ YZ
+AB\x80\x80AXYZXC\=offset=5
+0: X
+0+ C
+\= Expect no match
+AB\x80XYZ
+No match
+AB\x80XYZ\=offset=3
+No match
+AB\xfeXYZ
+No match
+AB\xffXYZ\=offset=3
+No match
+AB\x80AXYZ
+No match
+AB\x80AXYZ\=offset=4
+No match
+AB\x80\x80AXYZ\=offset=5
+No match
+
+/.../match_invalid_utf
+AB\xc4CCC
+0: CCC
+\= Expect no match
+A\x{d800}B
+No match
+A\x{110000}B
+No match
+A\xc4B
+No match
+
+/\bX/match_invalid_utf
+A\x80X
+0: X
+
+/\BX/match_invalid_utf
+\= Expect no match
+A\x80X
+No match
+
+/(?<=...)X/match_invalid_utf
+AAA\x80BBBXYZ
+0: X
+\= Expect no match
+AAA\x80BXYZ
+No match
+AAA\x80BBXYZ
+No match
+
+# -------------------------------------
+
# End of testinput10
@@ -661,7 +661,4 @@ Subject length lower bound = 1
ab˙Az
ab\x{80000041}z

-/\[()]{65535}/expand
-Failed: error 120 at offset 131070: regular expression is too large
-
# End of testinput11
@@ -667,6 +667,4 @@ Subject length lower bound = 1
ab\x{80000041}z
0: ab\x{80000041}z

-/\[()]{65535}/expand
-
# End of testinput11
@@ -1502,4 +1502,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf
Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)

+# Invalid UTF-16/32 tests.
+
+/.../g,match_invalid_utf
+abcd\x{df00}wxzy\x{df00}pqrs
+0: abc
+0: wxz
+0: pqr
+abcd\x{80}wxzy\x{df00}pqrs
+0: abc
+0: d\x{80}w
+0: xzy
+0: pqr
+
+/abc/match_invalid_utf
+ab\x{df00}ab\=ph
+Partial match: ab
+\= Expect no match
+ab\x{df00}cdef\=ph
+No match
+
+/ab$/match_invalid_utf
+ab\x{df00}cdeab
+0: ab
+\= Expect no match
+ab\x{df00}cde
+No match
+
+/.../g,match_invalid_utf
+abcd\x{80}wxzy\x{df00}pqrs
+0: abc
+0: d\x{80}w
+0: xzy
+0: pqr
+
+/(?<=x)../g,match_invalid_utf
+abcd\x{80}wxzy\x{df00}pqrs
+0: zy
+abcd\x{80}wxzy\x{df00}xpqrs
+0: zy
+0: pq
+
+/X$/match_invalid_utf
+\= Expect no match
+X\x{df00}
+No match
+
+/(?<=..)X/match_invalid_utf,aftertext
+AB\x{df00}AQXYZ
+0: X
+0+ YZ
+AB\x{df00}AQXYZ\=offset=5
+0: X
+0+ YZ
+AB\x{df00}\x{df00}AXYZXC\=offset=5
+0: X
+0+ C
+\= Expect no match
+AB\x{df00}XYZ
+No match
+AB\x{df00}XYZ\=offset=3
+No match
+AB\x{df00}AXYZ
+No match
+AB\x{df00}AXYZ\=offset=4
+No match
+AB\x{df00}\x{df00}AXYZ\=offset=5
+No match
+
+/.../match_invalid_utf
+\= Expect no match
+A\x{d800}B
+No match
+A\x{110000}B
+** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
+
+# ----------------------------------------------------
+
# End of testinput12
@@ -1500,4 +1500,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
/(?(á/utf
Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)

+# Invalid UTF-16/32 tests.
+
+/.../g,match_invalid_utf
+abcd\x{df00}wxzy\x{df00}pqrs
+0: abc
+0: wxz
+0: pqr
+abcd\x{80}wxzy\x{df00}pqrs
+0: abc
+0: d\x{80}w
+0: xzy
+0: pqr
+
+/abc/match_invalid_utf
+ab\x{df00}ab\=ph
+Partial match: ab
+\= Expect no match
+ab\x{df00}cdef\=ph
+No match
+
+/ab$/match_invalid_utf
+ab\x{df00}cdeab
+0: ab
+\= Expect no match
+ab\x{df00}cde
+No match
+
+/.../g,match_invalid_utf
+abcd\x{80}wxzy\x{df00}pqrs
+0: abc
+0: d\x{80}w
+0: xzy
+0: pqr
+
+/(?<=x)../g,match_invalid_utf
+abcd\x{80}wxzy\x{df00}pqrs
+0: zy
+abcd\x{80}wxzy\x{df00}xpqrs
+0: zy
+0: pq
+
+/X$/match_invalid_utf
+\= Expect no match
+X\x{df00}
+No match
+
+/(?<=..)X/match_invalid_utf,aftertext
+AB\x{df00}AQXYZ
+0: X
+0+ YZ
+AB\x{df00}AQXYZ\=offset=5
+0: X
+0+ YZ
+AB\x{df00}\x{df00}AXYZXC\=offset=5
+0: X
+0+ C
+\= Expect no match
+AB\x{df00}XYZ
+No match
+AB\x{df00}XYZ\=offset=3
+No match
+AB\x{df00}AXYZ
+No match
+AB\x{df00}AXYZ\=offset=4
+No match
+AB\x{df00}\x{df00}AXYZ\=offset=5
+No match
+
+/.../match_invalid_utf
+\= Expect no match
+A\x{d800}B
+No match
+A\x{110000}B
+No match
+
+# ----------------------------------------------------
+
# End of testinput12
@@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+Failed: error 120 at offset 131070: regular expression is too large
+
# End of testinput8
@@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+Failed: error 120 at offset 131070: regular expression is too large
+
# End of testinput8
@@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
# End of testinput8
@@ -367,7 +367,4 @@ Failed: error 134 at offset 14: character code point value in \x{} or \o{} is to
/(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)

-/\[()]{65535}/expand
-Failed: error 120 at offset 131070: regular expression is too large
-
# End of testinput9