Implement support for invalid UTF in the pcre2_match() interpreter.

2019-05-24 17:15:48 +00:00 · 16c046ce50
parent 2ad4329f83
commit 16c046ce50
48 changed files with 2780 additions and 1783 deletions
--- a/4
+++ b/4
@ -14,6 +14,10 @@ detects invalid characters in the 0xd800-0xdfff range.

 3. Fix minor typo bug in JIT compile when \X is used in a non-UTF string.

+4. Add support for matching in invalid UTF strings to the pcre2_match()
+interpreter, and integrate with the existing JIT support via the new
+PCRE2_MATCH_INVALID_UTF compile-time option.
+

 Version 10.33 16-April-2019
 ---------------------------
--- a/doc/html/pcre2_compile.html
+++ b/doc/html/pcre2_compile.html
@ -65,6 +65,7 @@ The option bits are:
  PCRE2_EXTENDED           Ignore white space and # comments
  PCRE2_FIRSTLINE          Force matching to be before newline
  PCRE2_LITERAL            Pattern characters are all literal
+  PCRE2_MATCH_INVALID_UTF  Enable support for matching invalid UTF 
  PCRE2_MATCH_UNSET_BACKREF  Match unset backreferences
  PCRE2_MULTILINE          ^ and $ match newlines within data
  PCRE2_NEVER_BACKSLASH_C  Lock out the use of \C in patterns
--- a/doc/html/pcre2_jit_compile.html
+++ b/doc/html/pcre2_jit_compile.html
@ -40,8 +40,12 @@ bits:
  PCRE2_JIT_COMPLETE      compile code for full matching
  PCRE2_JIT_PARTIAL_SOFT  compile code for soft partial matching
  PCRE2_JIT_PARTIAL_HARD  compile code for hard partial matching
-  PCRE2_JIT_INVALID_UTF   compile code to handle invalid UTF
 </pre>
+There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been 
+superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF. The old 
+option is deprecated and may be removed in future.
+</P>
+<P>
 The yield of the function is 0 for success, or a negative error code otherwise.
 In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
 if an unknown bit is set in <i>options</i>.
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@ -1347,11 +1347,12 @@ and <b>pcre2_compile()</b> returns a non-NULL value.
 <P>
 There are nearly 100 positive error codes that <b>pcre2_compile()</b> may return
 if it finds an error in the pattern. There are also some negative error codes
-that are used for invalid UTF strings. These are the same as given by
-<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described in the
+that are used for invalid UTF strings when validity checking is in force. These
+are the same as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and
+are described in the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
-page. There is no separate documentation for the positive error codes, because
-the textual error messages that are obtained by calling the
+documentation. There is no separate documentation for the positive error codes,
+because the textual error messages that are obtained by calling the
 <b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
 message"
 <a href="#geterrormessage">below)</a>
@ -1615,10 +1616,18 @@ expression engine is not the most efficient way of doing it. If you are doing a
 lot of literal matching and are worried about efficiency, you should consider
 using other approaches. The only other main options that are allowed with
 PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
-PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
-PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
-and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
-error.
+PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
+PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
+PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
+PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
+<pre>
+  PCRE2_MATCH_INVALID_UTF
+</pre>
+This option forces PCRE2_UTF (see below) and also enables support for matching
+by <b>pcre2_match()</b> in subject strings that contain invalid UTF sequences.
+This facility is not supported for DFA matching. For details, see the
+<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
+documentation.
 <pre>
  PCRE2_MATCH_UNSET_BACKREF
 </pre>
@ -2653,15 +2662,22 @@ of JIT; it forces matching to be done by the interpreter.
  PCRE2_NO_UTF_CHECK
 </pre>
 When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
-string is checked by default when <b>pcre2_match()</b> is subsequently called.
-If a non-zero starting offset is given, the check is applied only to that part
-of the subject that could be inspected during matching, and there is a check
-that the starting offset points to the first code unit of a character or to the
-end of the subject. If there are no lookbehind assertions in the pattern, the
-check starts at the starting offset. Otherwise, it starts at the length of the
-longest lookbehind before the starting offset, or at the start of the subject
-if there are not that many characters before the starting offset. Note that the
-sequences \b and \B are one-character lookbehinds.
+string is checked unless PCRE2_NO_UTF_CHECK is passed to <b>pcre2_match()</b> or
+PCRE2_MATCH_INVALID_UTF was passed to <b>pcre2_compile()</b>. The latter special
+case is discussed in detail in the
+<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
+documentation.
+</P>
+<P>
+In the default case, if a non-zero starting offset is given, the check is
+applied only to that part of the subject that could be inspected during
+matching, and there is a check that the starting offset points to the first
+code unit of a character or to the end of the subject. If there are no
+lookbehind assertions in the pattern, the check starts at the starting offset.
+Otherwise, it starts at the length of the longest lookbehind before the
+starting offset, or at the start of the subject if there are not that many
+characters before the starting offset. Note that the sequences \b and \B are
+one-character lookbehinds.
 </P>
 <P>
 The check is carried out before any other processing takes place, and a
@ -2674,19 +2690,20 @@ and
 <a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
 in the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
-page.
+documentation.
 </P>
 <P>
-If you know that your subject is valid, and you want to skip these checks for
+If you know that your subject is valid, and you want to skip this check for
 performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
 <b>pcre2_match()</b>. You might want to do this for the second and subsequent
-calls to <b>pcre2_match()</b> if you are making repeated calls to find other
+calls to <b>pcre2_match()</b> if you are making repeated calls to find multiple
 matches in the same subject string.
 </P>
 <P>
-<b>Warning:</b> When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
+<b>Warning:</b> Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
+PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
 string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
-Your program may crash or loop indefinitely.
+Your program may crash or loop indefinitely or give wrong results.
 <pre>
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
@ -3771,6 +3788,12 @@ a backreference.
 This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
 that uses a backreference for the condition, or a test for recursion in a
 specific capture group. These are not supported.
+<pre>
+  PCRE2_ERROR_DFA_UINVALID_UTF
+</pre>
+This return is given if <b>pcre2_dfa_match()</b> is called for a pattern that
+was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
+matching.
 <pre>
  PCRE2_ERROR_DFA_WSSIZE
 </pre>
@ -3808,7 +3831,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 14 February 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
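
The hunk above documents where the default UTF validity check begins when a non-zero starting offset is given. For the UTF-8 case, that rule can be sketched roughly as follows. This is only an illustration of the documented behaviour, not code taken from PCRE2: the function name utf_check_start and its parameters are hypothetical, and the sketch assumes that start_offset already points to the first code unit of a character (or to the end of the subject).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): return the code unit offset at
   which a validity check would begin for a UTF-8 subject, by stepping back
   at most max_lookbehind characters from start_offset. With no lookbehind
   (max_lookbehind == 0) the check starts at start_offset itself. */
static size_t utf_check_start(const uint8_t *subject, size_t length,
                              size_t start_offset, size_t max_lookbehind)
{
    assert(start_offset <= length);

    size_t pos = start_offset;
    for (size_t i = 0; i < max_lookbehind && pos > 0; i++) {
        /* Move back one code unit, then skip any continuation bytes
           (10xxxxxx) so that pos lands on the start of a character. */
        pos--;
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80)
            pos--;
    }
    return pos;   /* 0 if there are fewer characters than max_lookbehind */
}
```

Because \b and \B count as one-character lookbehinds, a pattern that uses them would correspond to calling this sketch with max_lookbehind of at least 1.
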
--- a/doc/html/pcre2jit.html
+++ b/doc/html/pcre2jit.html
@ -147,25 +147,29 @@ pattern.
 </P>
 <br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
 <P>
-When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
-function expects its subject string to be a valid sequence of UTF code units.
-If it is not, the result is undefined. This is also true by default of matching
-via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
-<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
-UTF is compiled.
+When a pattern is compiled with the PCRE2_UTF option, subject strings are
+normally expected to be a valid sequence of UTF code units. By default, this is
+checked at the start of matching and an error is generated if invalid UTF is
+detected. The PCRE2_NO_UTF_CHECK option can be passed to <b>pcre2_match()</b> to
+skip the check (for improved performance) if you are sure that a subject string
+is valid. If this option is used with an invalid string, the result is
+undefined.
 </P>
 <P>
-In this mode, an invalid code unit sequence never matches any pattern item. It
-does not match dot, it does not match \p{Any}, it does not even match negative
-items such as [^X]. A lookbehind assertion fails if it encounters an invalid
-sequence while moving the current point backwards. In other words, an invalid
-UTF code unit sequence acts as a barrier which no match can cross. Reaching an
-invalid sequence causes an immediate backtrack.
+However, a way of running matches on strings that may contain invalid UTF
+sequences is available. Calling <b>pcre2_compile()</b> with the
+PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
+<b>pcre2_match()</b> to support invalid UTF, and, if <b>pcre2_jit_compile()</b>
+is called, the compiled JIT code also supports invalid UTF. Details of how this
+support works, in both the JIT and the interpretive cases, are given in the
+<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
+documentation.
 </P>
 <P>
-Using this option, an application can run matches in arbitrary data, knowing
-that any matched strings that are returned will be valid UTF. This can be
-useful when searching for text in executable or other binary files.
+There is also an obsolete option for <b>pcre2_jit_compile()</b> called
+PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
+It is superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF
+and should no longer be used. It may be removed in future.
 </P>
 <br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
 <P>
@ -461,7 +465,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC14" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 06 March 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
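
The interplay between PCRE2_UTF, PCRE2_MATCH_INVALID_UTF, and PCRE2_NO_UTF_CHECK described in these hunks can be summarised as a small decision function. The sketch below is an assumption-laden model, not PCRE2 code: the OPT_* flags are hypothetical stand-ins whose values are made up for the example (the real values live in pcre2.h), and subject_mode and both function names are invented here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real option bits; values are arbitrary. */
enum {
    OPT_UTF               = 1u << 0,  /* models PCRE2_UTF */
    OPT_MATCH_INVALID_UTF = 1u << 1,  /* models PCRE2_MATCH_INVALID_UTF */
    OPT_NO_UTF_CHECK      = 1u << 2   /* models PCRE2_NO_UTF_CHECK */
};

typedef enum {
    MODE_NON_UTF,         /* one code unit is one character */
    MODE_UTF_CHECKED,     /* subject validated; invalid UTF gives an error */
    MODE_UTF_UNCHECKED,   /* check skipped; invalid UTF is undefined */
    MODE_UTF_INVALID_OK   /* invalid sequences act as barriers */
} subject_mode;

/* The documentation says MATCH_INVALID_UTF forces UTF to be set. */
static uint32_t effective_compile_options(uint32_t opts)
{
    if (opts & OPT_MATCH_INVALID_UTF)
        opts |= OPT_UTF;
    return opts;
}

/* Decide how a subject is treated for given compile and match options. */
static subject_mode subject_handling(uint32_t compile_opts,
                                     uint32_t match_opts)
{
    compile_opts = effective_compile_options(compile_opts);

    if (!(compile_opts & OPT_UTF))
        return MODE_NON_UTF;
    if (compile_opts & OPT_MATCH_INVALID_UTF)
        return MODE_UTF_INVALID_OK;   /* NO_UTF_CHECK is ignored here */
    if (match_opts & OPT_NO_UTF_CHECK)
        return MODE_UTF_UNCHECKED;
    return MODE_UTF_CHECKED;
}
```

The ordering of the tests mirrors the documented precedence: the compile-time option wins over the match-time one, so passing the no-check flag cannot reintroduce undefined behaviour once invalid-UTF support has been requested.
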
--- a/doc/html/pcre2matching.html
+++ b/doc/html/pcre2matching.html
@ -188,6 +188,10 @@ code unit) at a time, for all active paths through the tree.
 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
 supported. (*FAIL) is supported, and behaves like a failing negative assertion.
 </P>
+<P>
+10. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not 
+supported by <b>pcre2_dfa_match()</b>.
+</P>
 <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
 <P>
 Using the alternative matching algorithm provides the following advantages:
@ -219,7 +223,8 @@ because it has to search for all possible matches, but is also because it is
 less susceptible to optimization.
 </P>
 <P>
-2. Capturing parentheses, backreferences, and script runs are not supported.
+2. Capturing parentheses, backreferences, script runs, and matching within 
+invalid UTF strings are not supported.
 </P>
 <P>
 3. Although atomic groups are supported, their use does not provide the
@ -236,9 +241,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 10 October 2018
+Last updated: 23 May 2019
 <br>
-Copyright &copy; 1997-2018 University of Cambridge.
+Copyright &copy; 1997-2019 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@ -91,10 +91,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
 specified for the 32-bit library, in which case it constrains the character
 values to valid Unicode code points. To process UTF strings, PCRE2 must be
 built to include Unicode support (which is the default). When using UTF strings
-you must either call the compiling function with the PCRE2_UTF option, or the
-pattern must start with the special sequence (*UTF), which is equivalent to
-setting the relevant option. How setting a UTF mode affects pattern matching is
-mentioned in several places below. There is also a summary of features in the
+you must either call the compiling function with one or both of the PCRE2_UTF
+or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
+sequence (*UTF), which is equivalent to setting the PCRE2_UTF option. How
+setting a UTF mode affects pattern matching is mentioned in several places
+below. There is also a summary of features in the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
 page.
 </P>
@ -428,11 +429,11 @@ There may be any number of hexadecimal digits. This syntax is from ECMAScript
 6.
 </P>
 <P>
-The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
-is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
-\N{name} to specify characters by Unicode name; PCRE2 does not support this.
-Note that when \N is not followed by an opening brace (curly bracket) it has
-an entirely different meaning, matching any character that is not a newline.
+The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
+UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
+does not support this. Note that when \N is not followed by an opening brace
+(curly bracket) it has an entirely different meaning, matching any character
+that is not a newline.
 </P>
 <P>
 There are some legacy applications where the escape sequence \r is expected to
@ -1360,7 +1361,7 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
 with a malformed UTF character. This has undefined results, because PCRE2
 assumes that it is matching character by character in a valid UTF string (by
 default it checks the subject string's validity at the start of processing
-unless the PCRE2_NO_UTF_CHECK option is used).
+unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
 </P>
 <P>
 An application can lock out the use of \C by setting the
@ -3727,7 +3728,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC31" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 12 February 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@ -613,6 +613,7 @@ for a description of the effects of these options.
      firstline                 set PCRE2_FIRSTLINE
      literal                   set PCRE2_LITERAL
      match_line                set PCRE2_EXTRA_MATCH_LINE
+      match_invalid_utf         set PCRE2_MATCH_INVALID_UTF 
      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
      match_word                set PCRE2_EXTRA_MATCH_WORD
  /m  multiline                 set PCRE2_MULTILINE
@ -2078,7 +2079,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 11 March 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@ -16,22 +16,33 @@ please consult the man page, in case the conversion went wrong.
 UNICODE AND UTF SUPPORT
 </b><br>
 <P>
-When PCRE2 is built with Unicode support (which is the default), it has
-knowledge of Unicode character properties and can process text strings in
-UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
-default, PCRE2 assumes that one code unit is one character. To process a
-pattern as a UTF string, where a character may require more than one code unit,
-you must call
-<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
-with the PCRE2_UTF option flag, or the pattern must start with the sequence
-(*UTF). When either of these is the case, both the pattern and any subject
-strings that are matched against it are treated as UTF strings instead of
-strings of individual one-code-unit characters. There are also some other
-changes to the way characters are handled, as documented below.
+PCRE2 is normally built with Unicode support, though if you do not need it, you
+can build it without, in which case the library will be smaller. With Unicode
+support, PCRE2 has knowledge of Unicode character properties and can process
+text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
+width), but this is not the default. Unless specifically requested, PCRE2
+treats each code unit in a string as one character.
 </P>
 <P>
-If you do not need Unicode support you can build PCRE2 without it, in which
-case the library will be smaller.
+There are two ways of telling PCRE2 to switch to UTF mode, where characters may 
+consist of more than one code unit and the range of values is constrained. The 
+program can call
+<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
+with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
+However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
+That is, the programmer can prevent the supplier of the pattern from switching 
+to UTF mode.
+</P>
+<P>
+Note that the PCRE2_MATCH_INVALID_UTF option (see
+<a href="#matchinvalid">below)</a>
+forces PCRE2_UTF to be set.
+</P>
+<P>
+In UTF mode, both the pattern and any subject strings that are matched against
+it are treated as UTF strings instead of strings of individual one-code-unit
+characters. There are also some other changes to the way characters are
+handled, as documented below.
 </P>
 <br><b>
 UNICODE PROPERTY SUPPORT
@ -63,22 +74,22 @@ also recognized; larger ones can be coded using \o{...}.
 <P>
 The escape sequence \N{U+&#60;hex digits&#62;} is recognized as another way of
 specifying a Unicode character by code point in a UTF mode. It is not allowed
-in non-UTF modes.
+in non-UTF mode.
 </P>
 <P>
-In UTF modes, repeat quantifiers apply to complete UTF characters, not to
+In UTF mode, repeat quantifiers apply to complete UTF characters, not to
 individual code units.
 </P>
 <P>
-In UTF modes, the dot metacharacter matches one UTF character instead of a
+In UTF mode, the dot metacharacter matches one UTF character instead of a
 single code unit.
 </P>
 <P>
-In UTF modes, capture group names are not restricted to ASCII, and may contain
+In UTF mode, capture group names are not restricted to ASCII, and may contain
 any Unicode letters and decimal digits, as well as underscore.
 </P>
 <P>
-The escape sequence \C can be used to match a single code unit in a UTF mode,
+The escape sequence \C can be used to match a single code unit in UTF mode,
 but its use can lead to some strange effects because it breaks up multi-unit
 characters (see the description of \C in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
@ -93,7 +104,7 @@ may consist of more than one code unit. The use of \C in these modes provokes
 a match-time error. Also, the JIT optimization does not support \C in these
 modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
 contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
-the matching will be carried out by the normal interpretive function.
+the matching will be carried out by the interpretive function.
 </P>
 <P>
 The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
@ -123,14 +134,14 @@ However, the special horizontal and vertical white space matching escapes (\h,
 not PCRE2_UCP is set.
 </P>
 <br><b>
-CASE-EQUIVALENCE IN UTF MODES
+CASE-EQUIVALENCE IN UTF MODE
 </b><br>
 <P>
-Case-insensitive matching in a UTF mode makes use of Unicode properties except
+Case-insensitive matching in UTF mode makes use of Unicode properties except
 for characters whose code points are less than 128 and that have at most two
 case-equivalent values. For these, a direct table lookup is used for speed. A
 few Unicode characters such as Greek sigma have more than two code points that
-are case-equivalent, and these are treated as such.
+are case-equivalent, and these are treated specially.
 <a name="scriptruns"></a></P>
 <br><b>
 SCRIPT RUNS
@ -248,7 +259,7 @@ VALIDITY OF UTF STRINGS
 <P>
 When the PCRE2_UTF option is set, the strings passed as patterns and subjects
 are (by default) checked for validity on entry to the relevant functions. If an
-invalid UTF string is passed, an negative error code is returned. The code unit
+invalid UTF string is passed, a negative error code is returned. The code unit
 offset to the offending character can be extracted from the match data block by
 calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
 error.
@ -263,17 +274,16 @@ only valid UTF code unit sequences.
 </P>
 <P>
 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
-is usually undefined and your program may crash or loop indefinitely. There is,
-however, one mode of matching that can handle invalid UTF subject strings. This
-is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
-when calling <b>pcre2_jit_compile()</b>. For details, see the
-<a href="pcre2jit.html"><b>pcre2jit</b></a>
-documentation.
+is undefined and your program may crash or loop indefinitely or give incorrect
+results. There is, however, one mode of matching that can handle invalid UTF
+subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
+<b>pcre2_compile()</b> and is discussed in the next section. The rest of
+this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
 </P>
 <P>
-Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
-the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this same option to
+Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the UTF check
+for the pattern; it does not also apply to subject strings. If you want to
+disable the check for a subject string you must pass this same option to
 <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
 </P>
 <P>
@ -352,7 +362,7 @@ these code points are excluded by RFC 3629.
 <pre>
  PCRE2_ERROR_UTF8_ERR13
 </pre>
-A 4-byte character has a value greater than 0x10fff; these code points are
+A 4-byte character has a value greater than 0x10ffff; these code points are
 excluded by RFC 3629.
 <pre>
  PCRE2_ERROR_UTF8_ERR14
@ -405,7 +415,59 @@ The following negative error codes are given for invalid UTF-32 strings:
  PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
  PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff

-</PRE>
+<a name="matchinvalid"></a></PRE>
+</P>
+<br><b>
+MATCHING IN INVALID UTF STRINGS
+</b><br>
+<P>
+You can run pattern matches on subject strings that may contain invalid UTF
+sequences if you call <b>pcre2_compile()</b> with the PCRE2_MATCH_INVALID_UTF
+option. This is supported by <b>pcre2_match()</b>, including JIT matching, but
+not by <b>pcre2_dfa_match()</b>. When PCRE2_MATCH_INVALID_UTF is set, it forces
+PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a 
+valid UTF string.
+</P>
+<P>
+Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
+generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
+generate different code. If JIT is not used, the option affects the behaviour
+of the interpretive code in <b>pcre2_match()</b>. When PCRE2_MATCH_INVALID_UTF
+is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
+</P>
+<P>
+In this mode, an invalid code unit sequence in the subject never matches any
+pattern item. It does not match dot, it does not match \p{Any}, it does not
+even match negative items such as [^X]. A lookbehind assertion fails if it
+encounters an invalid sequence while moving the current point backwards. In
+other words, an invalid UTF code unit sequence acts as a barrier which no match
+can cross.
+</P>
+<P>
+You can also think of this as the subject being split up into fragments of
+valid UTF, delimited internally by invalid code unit sequences. The pattern is
+matched fragment by fragment. The result of a successful match, however, is
+given as code unit offsets in the entire subject string in the usual way. There
+are a few points to consider:
+</P>
+<P>
+The internal boundaries are not interpreted as the beginnings or ends of lines
+and so do not match circumflex or dollar characters in the pattern.
+</P>
+<P>
+If <b>pcre2_match()</b> is called with an offset that points to an invalid
+UTF sequence, that sequence is skipped, and the match starts at the next valid
+UTF character, or the end of the subject.
+</P>
+<P>
+At internal fragment boundaries, \b and \B behave in the same way as at the
+beginning and end of the subject. For example, a sequence such as \bWORD\b 
+would match an instance of WORD that is surrounded by invalid UTF code units.
+</P>
+<P>
+Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
+data, knowing that any matched strings that are returned are valid UTF. This
+can be useful when searching for UTF text in executable or other binary files.
 </P>
 <br><b>
 AUTHOR
@ -422,7 +484,7 @@ Cambridge, England.
 REVISION
 </b><br>
 <P>
-Last updated: 06 March 2019
+Last updated: 24 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
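
The "fragments of valid UTF" picture used in the new MATCHING IN INVALID UTF STRINGS section can be made concrete for UTF-8 with a short sketch. It is a mental model only and does not claim to reproduce how the PCRE2 interpreter or JIT actually handles invalid sequences; the names utf8_char_len, fragment, and next_valid_fragment are hypothetical. Validity follows the RFC 3629 rules referred to by the UTF-8 error codes (no overlong forms, no surrogates, nothing above 0x10ffff).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the well-formed UTF-8 character at s[0], or 0 if the
   bytes there do not form one. avail is the number of bytes remaining. */
static size_t utf8_char_len(const uint8_t *s, size_t avail)
{
    if (avail == 0) return 0;

    uint8_t c = s[0];
    size_t need;
    uint32_t cp, min;

    if (c < 0x80) return 1;
    else if (c >= 0xc2 && c <= 0xdf) { need = 2; cp = c & 0x1f; min = 0x80; }
    else if ((c & 0xf0) == 0xe0)     { need = 3; cp = c & 0x0f; min = 0x800; }
    else if (c >= 0xf0 && c <= 0xf4) { need = 4; cp = c & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte, 0xc0, 0xc1, or 0xf5..0xff */

    if (avail < need) return 0;               /* truncated character */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;  /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3f);
    }
    if (cp < min) return 0;                       /* overlong encoding */
    if (cp >= 0xd800 && cp <= 0xdfff) return 0;   /* surrogate */
    if (cp > 0x10ffff) return 0;                  /* beyond Unicode range */
    return need;
}

typedef struct { size_t start; size_t end; } fragment;  /* [start, end) */

/* Find the next maximal run of valid UTF-8 at or after offset from.
   Returns 1 and fills *out, or 0 if no valid character remains. */
static int next_valid_fragment(const uint8_t *s, size_t len, size_t from,
                               fragment *out)
{
    assert(from <= len);

    size_t i = from;
    while (i < len && utf8_char_len(s + i, len - i) == 0)
        i++;                       /* skip the invalid barrier bytes */
    if (i == len) return 0;

    out->start = i;
    size_t n;
    while (i < len && (n = utf8_char_len(s + i, len - i)) != 0)
        i += n;                    /* extend over whole valid characters */
    out->end = i;
    return 1;
}
```

Under this model, a match reported for the whole subject would always lie inside a single fragment, while its offsets are still expressed relative to the start of the complete subject, as the documentation states.
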
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
--- a/doc/pcre2_compile.3
+++ b/doc/pcre2_compile.3
@ -1,4 +1,4 @@
-.TH PCRE2_COMPILE 3 "11 February 2019" "PCRE2 10.33"
+.TH PCRE2_COMPILE 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH SYNOPSIS
@ -53,6 +53,7 @@ The option bits are:
  PCRE2_EXTENDED           Ignore white space and # comments
  PCRE2_FIRSTLINE          Force matching to be before newline
  PCRE2_LITERAL            Pattern characters are all literal
+  PCRE2_MATCH_INVALID_UTF  Enable support for matching invalid UTF 
  PCRE2_MATCH_UNSET_BACKREF  Match unset backreferences
  PCRE2_MULTILINE          ^ and $ match newlines within data
  PCRE2_NEVER_BACKSLASH_C  Lock out the use of \eC in patterns
--- a/doc/pcre2_jit_compile.3
+++ b/doc/pcre2_jit_compile.3
@ -1,4 +1,4 @@
-.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
+.TH PCRE2_JIT_COMPILE 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH SYNOPSIS
@ -29,8 +29,11 @@ bits:
  PCRE2_JIT_COMPLETE      compile code for full matching
  PCRE2_JIT_PARTIAL_SOFT  compile code for soft partial matching
  PCRE2_JIT_PARTIAL_HARD  compile code for hard partial matching
-  PCRE2_JIT_INVALID_UTF   compile code to handle invalid UTF
 .sp
+There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been 
+superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF. The old 
+option is deprecated and may be removed in future.
+.P
 The yield of the function is 0 for success, or a negative error code otherwise.
 In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
 if an unknown bit is set in \fIoptions\fP.
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@ -1,4 +1,4 @@
-.TH PCRE2API 3 "14 February 2019" "PCRE2 10.33"
+.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@ -1285,13 +1285,14 @@ and \fBpcre2_compile()\fP returns a non-NULL value.
 .P
 There are nearly 100 positive error codes that \fBpcre2_compile()\fP may return
 if it finds an error in the pattern. There are also some negative error codes
-that are used for invalid UTF strings. These are the same as given by
-\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described in the
+that are used for invalid UTF strings when validity checking is in force. These
+are the same as given by \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and
+are described in the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
-page. There is no separate documentation for the positive error codes, because
-the textual error messages that are obtained by calling the
+documentation. There is no separate documentation for the positive error codes,
+because the textual error messages that are obtained by calling the
 \fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
 message"
 .\" HTML <a href="#geterrormessage">
@ -1557,10 +1558,20 @@ expression engine is not the most efficient way of doing it. If you are doing a
 lot of literal matching and are worried about efficiency, you should consider
 using other approaches. The only other main options that are allowed with
 PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
-PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
-PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
-and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
-error.
+PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
+PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
+PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
+PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
+.sp
+  PCRE2_MATCH_INVALID_UTF
+.sp
+This option forces PCRE2_UTF (see below) and also enables support for matching
+by \fBpcre2_match()\fP in subject strings that contain invalid UTF sequences.
+This facility is not supported for DFA matching. For details, see the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+documentation.
 .sp
  PCRE2_MATCH_UNSET_BACKREF
 .sp
@ -2635,15 +2646,23 @@ of JIT; it forces matching to be done by the interpreter.
  PCRE2_NO_UTF_CHECK
 .sp
 When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
-string is checked by default when \fBpcre2_match()\fP is subsequently called.
-If a non-zero starting offset is given, the check is applied only to that part
-of the subject that could be inspected during matching, and there is a check
-that the starting offset points to the first code unit of a character or to the
-end of the subject. If there are no lookbehind assertions in the pattern, the
-check starts at the starting offset. Otherwise, it starts at the length of the
-longest lookbehind before the starting offset, or at the start of the subject
-if there are not that many characters before the starting offset. Note that the
-sequences \eb and \eB are one-character lookbehinds.
+string is checked unless PCRE2_NO_UTF_CHECK is passed to \fBpcre2_match()\fP or
+PCRE2_MATCH_INVALID_UTF was passed to \fBpcre2_compile()\fP. The latter special
+case is discussed in detail in the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+documentation.
+.P
+In the default case, if a non-zero starting offset is given, the check is
+applied only to that part of the subject that could be inspected during
+matching, and there is a check that the starting offset points to the first
+code unit of a character or to the end of the subject. If there are no
+lookbehind assertions in the pattern, the check starts at the starting offset.
+Otherwise, it starts at the length of the longest lookbehind before the
+starting offset, or at the start of the subject if there are not that many
+characters before the starting offset. Note that the sequences \eb and \eB are
+one-character lookbehinds.
 .P
 The check is carried out before any other processing takes place, and a
 negative error code is returned if the check fails. There are several UTF error
@ -2666,17 +2685,18 @@ in the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
-page.
+documentation.
 .P
-If you know that your subject is valid, and you want to skip these checks for
+If you know that your subject is valid, and you want to skip this check for
 performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
 \fBpcre2_match()\fP. You might want to do this for the second and subsequent
-calls to \fBpcre2_match()\fP if you are making repeated calls to find other
+calls to \fBpcre2_match()\fP if you are making repeated calls to find multiple
 matches in the same subject string.
 .P
-\fBWarning:\fP When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
+\fBWarning:\fP Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
+PCRE2_NO_UTF_CHECK is set at match time, the effect of passing an invalid
 string as a subject, or an invalid value of \fIstartoffset\fP, is undefined.
-Your program may crash or loop indefinitely.
+Your program may crash, loop indefinitely, or give wrong results.
 .sp
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
@ -3774,6 +3794,12 @@ a backreference.
 This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
 that uses a backreference for the condition, or a test for recursion in a
 specific capture group. These are not supported.
+.sp
+  PCRE2_ERROR_DFA_UINVALID_UTF
+.sp
+This return is given if \fBpcre2_dfa_match()\fP is called for a pattern that
+was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
+matching.
 .sp
  PCRE2_ERROR_DFA_WSSIZE
 .sp
@ -3817,6 +3843,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 14 February 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
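The PCRE2_NO_UTF_CHECK text in pcre2api.3 above describes where the default validity check begins when a non-zero starting offset is given. The following is a minimal sketch of that rule for UTF-8 only. The function name utf8_check_start and its parameters are hypothetical and are not part of the PCRE2 API; it assumes the starting offset already points at the first code unit of a character or at the end of the subject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Given a UTF-8 subject, a starting
offset, and the pattern's longest lookbehind measured in characters, return
the code unit offset from which a validity check would need to begin. */

static size_t utf8_check_start(const uint8_t *subject, size_t length,
  size_t start_offset, size_t max_lookbehind)
{
size_t pos = start_offset;
assert(start_offset <= length);

/* Step back one character per lookbehind position, stopping early at the
start of the subject if there are not that many characters available. */

while (max_lookbehind-- > 0 && pos > 0)
  {
  pos--;                                        /* back one code unit */
  while (pos > 0 && (subject[pos] & 0xc0) == 0x80)
    pos--;                                      /* skip continuation bytes */
  }
return pos;
}
```

With no lookbehind (max_lookbehind of 0) the result is the starting offset itself. Because \eb and \eB count as one-character lookbehinds, a pattern that uses them would be modelled here with max_lookbehind of at least 1.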
--- a/doc/pcre2jit.3
+++ b/doc/pcre2jit.3
@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
+.TH PCRE2JIT 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -123,23 +123,29 @@ pattern.
 .SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
 .rs
 .sp
-When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
-function expects its subject string to be a valid sequence of UTF code units.
-If it is not, the result is undefined. This is also true by default of matching
-via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
-\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
-UTF is compiled.
+When a pattern is compiled with the PCRE2_UTF option, subject strings are
+normally expected to be a valid sequence of UTF code units. By default, this is
+checked at the start of matching and an error is generated if invalid UTF is
+detected. The PCRE2_NO_UTF_CHECK option can be passed to \fBpcre2_match()\fP to
+skip the check (for improved performance) if you are sure that a subject string
+is valid. If this option is used with an invalid string, the result is
+undefined.
 .P
-In this mode, an invalid code unit sequence never matches any pattern item. It
-does not match dot, it does not match \ep{Any}, it does not even match negative
-items such as [^X]. A lookbehind assertion fails if it encounters an invalid
-sequence while moving the current point backwards. In other words, an invalid
-UTF code unit sequence acts as a barrier which no match can cross. Reaching an
-invalid sequence causes an immediate backtrack.
+However, a way of running matches on strings that may contain invalid UTF
+sequences is available. Calling \fBpcre2_compile()\fP with the
+PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
+\fBpcre2_match()\fP to support invalid UTF, and, if \fBpcre2_jit_compile()\fP
+is called, the compiled JIT code also supports invalid UTF. Details of how this
+support works, in both the JIT and the interpretive cases, are given in the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+documentation.
 .P
-Using this option, an application can run matches in arbitrary data, knowing
-that any matched strings that are returned will be valid UTF. This can be
-useful when searching for text in executable or other binary files.
+There is also an obsolete option for \fBpcre2_jit_compile()\fP called
+PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
+It is superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF
+and should no longer be used. It may be removed in future.
 .
 .
 .SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
@ -438,6 +444,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 06 March 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2matching.3
+++ b/doc/pcre2matching.3
@ -1,4 +1,4 @@
-.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
+.TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 MATCHING ALGORITHMS"
@ -157,6 +157,9 @@ code unit) at a time, for all active paths through the tree.
 .P
 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
 supported. (*FAIL) is supported, and behaves like a failing negative assertion.
+.P
+10. The PCRE2_MATCH_INVALID_UTF option for \fBpcre2_compile()\fP is not 
+supported by \fBpcre2_dfa_match()\fP.
 .
 .
 .SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
@ -191,7 +194,8 @@ The alternative algorithm suffers from a number of disadvantages:
 because it has to search for all possible matches, but is also because it is
 less susceptible to optimization.
 .P
-2. Capturing parentheses, backreferences, and script runs are not supported.
+2. Capturing parentheses, backreferences, script runs, and matching within
+invalid UTF strings are not supported.
 .P
 3. Although atomic groups are supported, their use does not provide the
 performance advantage that it does for the standard algorithm.
@ -211,6 +215,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 October 2018
-Copyright (c) 1997-2018 University of Cambridge.
+Last updated: 23 May 2019
+Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "12 February 2019" "PCRE2 10.33"
+.TH PCRE2PATTERN 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -52,10 +52,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
 specified for the 32-bit library, in which case it constrains the character
 values to valid Unicode code points. To process UTF strings, PCRE2 must be
 built to include Unicode support (which is the default). When using UTF strings
-you must either call the compiling function with the PCRE2_UTF option, or the
-pattern must start with the special sequence (*UTF), which is equivalent to
-setting the relevant option. How setting a UTF mode affects pattern matching is
-mentioned in several places below. There is also a summary of features in the
+you must either call the compiling function with one or both of the PCRE2_UTF
+or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
+sequence (*UTF), which is equivalent to setting PCRE2_UTF. How setting a UTF
+mode affects pattern matching is mentioned in several places below. There is
+also a summary of features in the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
@ -398,11 +399,11 @@ PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
 There may be any number of hexadecimal digits. This syntax is from ECMAScript
 6.
 .P
-The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
-is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
-\eN{name} to specify characters by Unicode name; PCRE2 does not support this.
-Note that when \eN is not followed by an opening brace (curly bracket) it has
-an entirely different meaning, matching any character that is not a newline.
+The \eN{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
+UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
+does not support this. Note that when \eN is not followed by an opening brace
+(curly bracket) it has an entirely different meaning, matching any character
+that is not a newline.
 .P
 There are some legacy applications where the escape sequence \er is expected to
 match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
@ -1352,7 +1353,7 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
 with a malformed UTF character. This has undefined results, because PCRE2
 assumes that it is matching character by character in a valid UTF string (by
 default it checks the subject string's validity at the start of processing
-unless the PCRE2_NO_UTF_CHECK option is used).
+unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
 .P
 An application can lock out the use of \eC by setting the
 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
@ -3763,6 +3764,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 12 February 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "11 March 2019" "PCRE 10.33"
+.TH PCRE2TEST 1 "23 May 2019" "PCRE 10.34"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@ -572,6 +572,7 @@ for a description of the effects of these options.
      firstline                 set PCRE2_FIRSTLINE
      literal                   set PCRE2_LITERAL
      match_line                set PCRE2_EXTRA_MATCH_LINE
+      match_invalid_utf         set PCRE2_MATCH_INVALID_UTF 
      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
      match_word                set PCRE2_EXTRA_MATCH_WORD
  /m  multiline                 set PCRE2_MULTILINE
@ -2059,6 +2060,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 11 March 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@ -551,6 +551,7 @@ PATTERN MODIFIERS
             firstline                 set PCRE2_FIRSTLINE
             literal                   set PCRE2_LITERAL
             match_line                set PCRE2_EXTRA_MATCH_LINE
+             match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
             match_word                set PCRE2_EXTRA_MATCH_WORD
         /m  multiline                 set PCRE2_MULTILINE
@ -1890,5 +1891,5 @@ AUTHOR

 REVISION

-       Last updated: 11 March 2019
+       Last updated: 23 May 2019
       Copyright (c) 1997-2019 University of Cambridge.
--- a/doc/pcre2unicode.3
+++ b/doc/pcre2unicode.3
@ -1,26 +1,38 @@
-.TH PCRE2UNICODE 3 "11 May 2019" "PCRE2 10.33"
+.TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions (revised API)
 .SH "UNICODE AND UTF SUPPORT"
 .rs
 .sp
-When PCRE2 is built with Unicode support (which is the default), it has
-knowledge of Unicode character properties and can process text strings in
-UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
-default, PCRE2 assumes that one code unit is one character. To process a
-pattern as a UTF string, where a character may require more than one code unit,
-you must call
+PCRE2 is normally built with Unicode support, though if you do not need it, you
+can build it without, in which case the library will be smaller. With Unicode
+support, PCRE2 has knowledge of Unicode character properties and can process
+text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
+width), but this is not the default. Unless specifically requested, PCRE2
+treats each code unit in a string as one character.
+.P
+There are two ways of telling PCRE2 to switch to UTF mode, where characters may 
+consist of more than one code unit and the range of values is constrained. The 
+program can call
 .\" HREF
 \fBpcre2_compile()\fP
 .\"
-with the PCRE2_UTF option flag, or the pattern must start with the sequence
-(*UTF). When either of these is the case, both the pattern and any subject
-strings that are matched against it are treated as UTF strings instead of
-strings of individual one-code-unit characters. There are also some other
-changes to the way characters are handled, as documented below.
+with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
+However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
+That is, the programmer can prevent the supplier of the pattern from switching 
+to UTF mode.
 .P
-If you do not need Unicode support you can build PCRE2 without it, in which
-case the library will be smaller.
+Note that the PCRE2_MATCH_INVALID_UTF option (see
+.\" HTML <a href="#matchinvalid">
+.\" </a>
+below)
+.\"
+forces PCRE2_UTF to be set.
+.P
+In UTF mode, both the pattern and any subject strings that are matched against
+it are treated as UTF strings instead of strings of individual one-code-unit
+characters. There are also some other changes to the way characters are
+handled, as documented below.
 .
 .
 .SH "UNICODE PROPERTY SUPPORT"
@ -55,18 +67,18 @@ also recognized; larger ones can be coded using \eo{...}.
 .P
 The escape sequence \eN{U+<hex digits>} is recognized as another way of
 specifying a Unicode character by code point in a UTF mode. It is not allowed
-in non-UTF modes.
+in non-UTF mode.
 .P
-In UTF modes, repeat quantifiers apply to complete UTF characters, not to
+In UTF mode, repeat quantifiers apply to complete UTF characters, not to
 individual code units.
 .P
-In UTF modes, the dot metacharacter matches one UTF character instead of a
+In UTF mode, the dot metacharacter matches one UTF character instead of a
 single code unit.
 .P
-In UTF modes, capture group names are not restricted to ASCII, and may contain
+In UTF mode, capture group names are not restricted to ASCII, and may contain
 any Unicode letters and decimal digits, as well as underscore.
 .P
-The escape sequence \eC can be used to match a single code unit in a UTF mode,
+The escape sequence \eC can be used to match a single code unit in UTF mode,
 but its use can lead to some strange effects because it breaks up multi-unit
 characters (see the description of \eC in the
 .\" HREF
@ -82,7 +94,7 @@ may consist of more than one code unit. The use of \eC in these modes provokes
 a match-time error. Also, the JIT optimization does not support \eC in these
 modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
 contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
-the matching will be carried out by the normal interpretive function.
+the matching will be carried out by the interpretive function.
 .P
 The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
 characters of any code value, but, by default, the characters that PCRE2
@ -114,14 +126,14 @@ However, the special horizontal and vertical white space matching escapes (\eh,
 not PCRE2_UCP is set.
 .
 .
-.SH "CASE-EQUIVALENCE IN UTF MODES"
+.SH "CASE-EQUIVALENCE IN UTF MODE"
 .rs
 .sp
-Case-insensitive matching in a UTF mode makes use of Unicode properties except
+Case-insensitive matching in UTF mode makes use of Unicode properties except
 for characters whose code points are less than 128 and that have at most two
 case-equivalent values. For these, a direct table lookup is used for speed. A
 few Unicode characters such as Greek sigma have more than two code points that
-are case-equivalent, and these are treated as such.
+are case-equivalent, and these are treated specially.
 .
 .
 .\" HTML <a name="scriptruns"></a>
@ -231,7 +243,7 @@ adjacent characters.
 .sp
 When the PCRE2_UTF option is set, the strings passed as patterns and subjects
 are (by default) checked for validity on entry to the relevant functions. If an
-invalid UTF string is passed, an negative error code is returned. The code unit
+invalid UTF string is passed, a negative error code is returned. The code unit
 offset to the offending character can be extracted from the match data block by
 calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
 error.
@ -244,18 +256,15 @@ PCRE2 assumes that the pattern or subject it is given (respectively) contains
 only valid UTF code unit sequences.
 .P
 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
-is usually undefined and your program may crash or loop indefinitely. There is,
-however, one mode of matching that can handle invalid UTF subject strings. This
-is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
-when calling \fBpcre2_jit_compile()\fP. For details, see the
-.\" HREF
-\fBpcre2jit\fP
-.\"
-documentation.
+is undefined and your program may crash, loop indefinitely, or give incorrect
+results. There is, however, one mode of matching that can handle invalid UTF
+subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
+\fBpcre2_compile()\fP and is discussed in the next section. The rest of this
+section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
 .P
-Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
-the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this same option to
+Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the UTF check
+for the pattern; it does not also apply to subject strings. If you want to
+disable the check for a subject string you must pass this same option to
 \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
 .P
 UTF-16 and UTF-32 strings can indicate their endianness by special code knows
@ -386,6 +395,52 @@ The following negative error codes are given for invalid UTF-32 strings:
 .sp
 .
 .
+.\" HTML <a name="matchinvalid"></a>
+.SH "MATCHING IN INVALID UTF STRINGS"
+.rs
+.sp
+You can run pattern matches on subject strings that may contain invalid UTF
+sequences if you call \fBpcre2_compile()\fP with the PCRE2_MATCH_INVALID_UTF
+option. This is supported by \fBpcre2_match()\fP, including JIT matching, but
+not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
+PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a 
+valid UTF string.
+.P
+Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
+generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
+generate different code. If JIT is not used, the option affects the behaviour
+of the interpretive code in \fBpcre2_match()\fP. When PCRE2_MATCH_INVALID_UTF
+is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
+.P
+In this mode, an invalid code unit sequence in the subject never matches any
+pattern item. It does not match dot, it does not match \ep{Any}, it does not
+even match negative items such as [^X]. A lookbehind assertion fails if it
+encounters an invalid sequence while moving the current point backwards. In
+other words, an invalid UTF code unit sequence acts as a barrier which no match
+can cross.
+.P
+You can also think of this as the subject being split up into fragments of
+valid UTF, delimited internally by invalid code unit sequences. The pattern is
+matched fragment by fragment. The result of a successful match, however, is
+given as code unit offsets in the entire subject string in the usual way. There
+are a few points to consider:
+.P
+The internal boundaries are not interpreted as the beginnings or ends of lines
+and so do not match circumflex or dollar characters in the pattern.
+.P
+If \fBpcre2_match()\fP is called with an offset that points to an invalid
+UTF sequence, that sequence is skipped, and the match starts at the next valid
+UTF character, or the end of the subject.
+.P
+At internal fragment boundaries, \eb and \eB behave in the same way as at the
+beginning and end of the subject. For example, a sequence such as \ebWORD\eb 
+would match an instance of WORD that is surrounded by invalid UTF code units.
+.P
+Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
+data, knowing that any matched strings that are returned are valid UTF. This
+can be useful when searching for UTF text in executable or other binary files.
+.
+.
 .SH AUTHOR
 .rs
 .sp
@ -400,6 +455,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 11 May 2019
+Last updated: 24 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
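The "MATCHING IN INVALID UTF STRINGS" section in pcre2unicode.3 above suggests thinking of the subject as fragments of valid UTF separated by invalid code unit sequences. The sketch below illustrates that mental model for UTF-8 only. It is not how the interpreter or JIT is implemented; the names utf8_char_len and next_valid_fragment are hypothetical, and skipping invalid input one byte at a time is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the length in bytes (1-4) of a valid UTF-8 character starting at p,
or 0 if the bytes there do not form one. Overlong forms, surrogates, and
values above U+10FFFF are rejected. */

static size_t utf8_char_len(const uint8_t *p, size_t avail)
{
uint32_t c, min;
size_t i, len;

if (avail == 0) return 0;
c = p[0];
if (c < 0x80) return 1;
if ((c & 0xe0) == 0xc0) { len = 2; c &= 0x1f; min = 0x80; }
else if ((c & 0xf0) == 0xe0) { len = 3; c &= 0x0f; min = 0x800; }
else if ((c & 0xf8) == 0xf0) { len = 4; c &= 0x07; min = 0x10000; }
else return 0;                        /* stray continuation or bad lead byte */
if (avail < len) return 0;            /* truncated at end of subject */
for (i = 1; i < len; i++)
  {
  if ((p[i] & 0xc0) != 0x80) return 0;
  c = (c << 6) | (p[i] & 0x3f);
  }
if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff)) return 0;
return len;
}

/* Find the next fragment of valid UTF-8 at or after *pos. On success, set
*frag_start and *frag_end (code unit offsets into the whole subject), move
*pos to the end of the fragment, and return 1. Return 0 when none remains. */

static int next_valid_fragment(const uint8_t *s, size_t length, size_t *pos,
  size_t *frag_start, size_t *frag_end)
{
size_t i = *pos, n;
assert(i <= length);

while (i < length && utf8_char_len(s + i, length - i) == 0) i++;
if (i >= length) { *pos = length; return 0; }
*frag_start = i;
while (i < length && (n = utf8_char_len(s + i, length - i)) != 0) i += n;
*frag_end = i;
*pos = i;
return 1;
}
```

Calling next_valid_fragment() repeatedly with the same pos variable walks the fragments in order. The offsets it reports are relative to the whole subject, which mirrors the statement above that match results are given as code unit offsets in the entire subject string.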
--- a/src/pcre2.h.generic
+++ b/src/pcre2.h.generic
@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, second API, to be
 #included by applications that call PCRE2 functions.

-           Copyright (c) 2016-2018 University of Cambridge
+           Copyright (c) 2016-2019 University of Cambridge

 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
 /* The current PCRE version information. */

 #define PCRE2_MAJOR           10
-#define PCRE2_MINOR           33
-#define PCRE2_PRERELEASE      
-#define PCRE2_DATE            2019-04-16
+#define PCRE2_MINOR           34
+#define PCRE2_PRERELEASE      -RC1
+#define PCRE2_DATE            2019-04-22

 /* When an application links to a PCRE DLL in Windows, the symbols that are
 imported have to be identified as such. When building PCRE2, the appropriate
@ -142,6 +142,7 @@ D   is inspected during pcre2_dfa_match() execution
 #define PCRE2_USE_OFFSET_LIMIT    0x00800000u  /*   J M D */
 #define PCRE2_EXTENDED_MORE       0x01000000u  /* C       */
 #define PCRE2_LITERAL             0x02000000u  /* C       */
+#define PCRE2_MATCH_INVALID_UTF   0x04000000u  /*   J M D */

 /* An additional compile options word is available in the compile context. */

@ -305,6 +306,7 @@ pcre2_pattern_convert(). */
 #define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS      194
 #define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN        195
 #define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE       196
+#define PCRE2_ERROR_TOO_MANY_CAPTURES              197


 /* "Expected" matching error codes: no match and partial match. */
@ -390,6 +392,7 @@ released, the numbers must not be changed. */
 #define PCRE2_ERROR_HEAPLIMIT         (-63)
 #define PCRE2_ERROR_CONVERT_SYNTAX    (-64)
 #define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
+#define PCRE2_ERROR_DFA_UINVALID_UTF  (-66)


 /* Request types for pcre2_pattern_info() */
--- a/src/pcre2.h.in
+++ b/src/pcre2.h.in
@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, second API, to be
 #included by applications that call PCRE2 functions.

-           Copyright (c) 2016-2018 University of Cambridge
+           Copyright (c) 2016-2019 University of Cambridge

 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@ -142,6 +142,7 @@ D   is inspected during pcre2_dfa_match() execution
 #define PCRE2_USE_OFFSET_LIMIT    0x00800000u  /*   J M D */
 #define PCRE2_EXTENDED_MORE       0x01000000u  /* C       */
 #define PCRE2_LITERAL             0x02000000u  /* C       */
+#define PCRE2_MATCH_INVALID_UTF   0x04000000u  /*   J M D */

 /* An additional compile options word is available in the compile context. */

@ -391,6 +392,7 @@ released, the numbers must not be changed. */
 #define PCRE2_ERROR_HEAPLIMIT         (-63)
 #define PCRE2_ERROR_CONVERT_SYNTAX    (-64)
 #define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
+#define PCRE2_ERROR_DFA_UINVALID_UTF  (-66)


 /* Request types for pcre2_pattern_info() */
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@ -746,8 +746,8 @@ are allowed. */

 #define PUBLIC_LITERAL_COMPILE_OPTIONS \
  (PCRE2_ANCHORED|PCRE2_AUTO_CALLOUT|PCRE2_CASELESS|PCRE2_ENDANCHORED| \
-   PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_NO_START_OPTIMIZE| \
-   PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
+   PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_MATCH_INVALID_UTF| \
+   PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)

 #define PUBLIC_COMPILE_OPTIONS \
  (PUBLIC_LITERAL_COMPILE_OPTIONS| \
@ -3615,7 +3615,7 @@ while (ptr < ptrend)
            {
            errorcode = ERR97;
            goto FAILED;
-            }    
+            }
          cb->bracount++;
          *parsed_pattern++ = META_CAPTURE | cb->bracount;
          }
@ -4444,7 +4444,7 @@ while (ptr < ptrend)
        {
        errorcode = ERR97;
        goto FAILED;
-        }    
+        }
      cb->bracount++;
      *parsed_pattern++ = META_CAPTURE | cb->bracount;
      nest_depth++;
@ -9503,6 +9503,10 @@ if (pattern == NULL)

 if (ccontext == NULL)
  ccontext = (pcre2_compile_context *)(&PRIV(default_compile_context));
+  
+/* PCRE2_MATCH_INVALID_UTF implies UTF */
+
+if ((options & PCRE2_MATCH_INVALID_UTF) != 0) options |= PCRE2_UTF; 

 /* Check that all undefined public option bits are zero. */

@ -9682,7 +9686,7 @@ if ((options & PCRE2_LITERAL) == 0)

 ptr += skipatstart;

-/* Can't support UTF or UCP unless PCRE2 has been compiled with UTF support. */
+/* Can't support UTF or UCP if PCRE2 was built without Unicode support. */

 #ifndef SUPPORT_UNICODE
 if ((cb.external_options & (PCRE2_UTF|PCRE2_UCP)) != 0)
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@ -3294,6 +3294,11 @@ time. */
 if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
   ((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
  return PCRE2_ERROR_BADOPTION;
+  
+/* Invalid UTF support is not available for DFA matching. */
+
+if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0) 
+  return PCRE2_ERROR_DFA_UINVALID_UTF;

 /* Check that the first field in the block is the magic number. If it is not,
 return with PCRE2_ERROR_BADMAGIC. */
--- a/src/pcre2_error.c
+++ b/src/pcre2_error.c
@ -269,6 +269,7 @@ static const unsigned char match_error_texts[] =
  "invalid syntax\0"
  /* 65 */
  "internal error - duplicate substitution match\0"
+  "PCRE2_MATCH_INVALID_UTF is not supported for DFA matching\0" 
  ;


--- a/src/pcre2_intmodedep.h
+++ b/src/pcre2_intmodedep.h
@ -866,6 +866,7 @@ typedef struct match_block {
  PCRE2_SPTR name_table;          /* Table of group names */
  PCRE2_SPTR start_code;          /* For use when recursing */
  PCRE2_SPTR start_subject;       /* Start of the subject string */
+  PCRE2_SPTR check_subject;       /* Where UTF-checked from */
  PCRE2_SPTR end_subject;         /* End of the subject string */
  PCRE2_SPTR end_match_ptr;       /* Subject position at end match */
  PCRE2_SPTR start_used_ptr;      /* Earliest consulted character */
--- a/src/pcre2_jit_compile.c
+++ b/src/pcre2_jit_compile.c
@ -6,8 +6,9 @@
 and semantics are as close as possible to those of the Perl 5 language.

                       Written by Philip Hazel
+                    This module by Zoltan Herczeg 
     Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016-2018 University of Cambridge
+          New API code Copyright (c) 2016-2019 University of Cambridge

 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@ -7846,8 +7847,6 @@ if (needstype || needsscript)
  if (needsscript)
    {
 // PH hacking
-//fprintf(stderr, "~~B\n");
-
      OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
      OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
      OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7901,7 +7900,6 @@ if (needstype || needsscript)
    if (!needschar)
      {
 // PH hacking
-//fprintf(stderr, "~~C\n");
  OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
  OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
  OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7916,7 +7914,6 @@ if (needstype || needsscript)
    else
      {
 // PH hacking
-//fprintf(stderr, "~~D\n");
  OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);

      OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
@ -8594,8 +8591,8 @@ uint32_t c;

 /* Patch by PH */
 /* GETCHARINC(c, cc); */
-
 c = *cc++;
+
 #if PCRE2_CODE_UNIT_WIDTH == 32
 if (c >= 0x110000)
  return NULL;
@ -9257,8 +9254,6 @@ if (common->utf && *cc == OP_REFI)
  CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop);

 // PH hacking
-//fprintf(stderr, "~~E\n");
-
  OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);

  add_jump(compiler, &common->getucd, JUMP(SLJIT_FAST_CALL));
@ -14156,49 +14151,87 @@ Returns:        0: success or (*NOJIT) was used
 PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
 pcre2_jit_compile(pcre2_code *code, uint32_t options)
 {
-#ifndef SUPPORT_JIT
-
-(void)code;
-(void)options;
-return PCRE2_ERROR_JIT_BADOPTION;
-
-#else  /* SUPPORT_JIT */
-
 pcre2_real_code *re = (pcre2_real_code *)code;
 executable_functions *functions;
-uint32_t excluded_options;
-int result;

 if (code == NULL)
  return PCRE2_ERROR_NULL;

 if ((options & ~PUBLIC_JIT_COMPILE_OPTIONS) != 0)
  return PCRE2_ERROR_JIT_BADOPTION;
-
-if ((re->flags & PCRE2_NOJIT) != 0) return 0;
-
+  
 functions = (executable_functions *)re->executable_jit;

+/* Support for invalid UTF was first introduced in JIT, with the option 
+PCRE2_JIT_INVALID_UTF. Later, support was added to the interpreter, and the 
+compile-time option PCRE2_MATCH_INVALID_UTF was created. This is now the 
+preferred feature, with the earlier option deprecated. However, for backward 
+compatibility, if the earlier option is set, it forces the new option so that 
+if JIT matching falls back to the interpreter, there is still support for 
+invalid UTF. However, if this function has already been successfully called
+without PCRE2_JIT_INVALID_UTF and without PCRE2_MATCH_INVALID_UTF (meaning that 
+non-invalid-supporting JIT code was compiled), give an error. 
+
+If in the future support for PCRE2_JIT_INVALID_UTF is withdrawn, the following 
+actions are needed:
+
+  1. Remove the definition from pcre2.h.in and from the list in
+     PUBLIC_JIT_COMPILE_OPTIONS above.
+     
+  2. Replace PCRE2_JIT_INVALID_UTF with a local flag in this module.
+  
+  3. Replace PCRE2_JIT_INVALID_UTF in pcre2_jit_test.c.
+  
+  4. Delete the following short block of code. The setting of "re" and 
+     "functions" can be moved into the JIT-only block below, but if that is 
+     done, (void)re and (void)functions will be needed in the non-JIT case, to 
+     avoid compiler warnings.
+*/
+
+if ((options & PCRE2_JIT_INVALID_UTF) != 0)
+  {
+  if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) == 0)
+    {
+    if (functions != NULL) return PCRE2_ERROR_JIT_BADOPTION;
+    re->overall_options |= PCRE2_MATCH_INVALID_UTF; 
+    }  
+  }
+  
+/* The above tests are run with and without JIT support. This means that 
+PCRE2_JIT_INVALID_UTF propagates back into the regex options (ensuring 
+interpreter support) even in the absence of JIT. But now, if there is no JIT
+support, give an error return. */
+
+#ifndef SUPPORT_JIT
+return PCRE2_ERROR_JIT_BADOPTION;
+#else  /* SUPPORT_JIT */
+
+/* There is JIT support. Do the necessary. */
+
+if ((re->flags & PCRE2_NOJIT) != 0) return 0;
+if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
+  options |= PCRE2_JIT_INVALID_UTF;  
+  
 if ((options & PCRE2_JIT_COMPLETE) != 0 && (functions == NULL
    || functions->executable_funcs[0] == NULL)) {
-  excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
-  result = jit_compile(code, options & ~excluded_options);
+  uint32_t excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
+  int result = jit_compile(code, options & ~excluded_options);
  if (result != 0)
    return result;
  }

 if ((options & PCRE2_JIT_PARTIAL_SOFT) != 0 && (functions == NULL
    || functions->executable_funcs[1] == NULL)) {
-  excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
-  result = jit_compile(code, options & ~excluded_options);
+  uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
+  int result = jit_compile(code, options & ~excluded_options);
  if (result != 0)
    return result;
  }

 if ((options & PCRE2_JIT_PARTIAL_HARD) != 0 && (functions == NULL
    || functions->executable_funcs[2] == NULL)) {
-  excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
-  result = jit_compile(code, options & ~excluded_options);
+  uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
+  int result = jit_compile(code, options & ~excluded_options);
  if (result != 0)
    return result;
  }
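
Review note: the option handling added to pcre2_jit_compile() above can be modelled in isolation. The following is a minimal sketch, not the library code. The `sk_` names, the `sk_code` struct, and the numeric flag values are all hypothetical stand-ins chosen for illustration, and `has_jit_code` plays the role of the `functions != NULL` test in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative placeholder values only; the real constants live in pcre2.h. */
#define SK_JIT_INVALID_UTF      0x00000100u
#define SK_MATCH_INVALID_UTF    0x04000000u
#define SK_ERROR_JIT_BADOPTION  (-45)

/* Hypothetical, cut-down model of a compiled pattern. */
typedef struct {
  uint32_t overall_options;  /* options fixed when the pattern was compiled */
  int has_jit_code;          /* non-zero once JIT code has been generated */
} sk_code;

/* Model of the flag propagation: the obsolete JIT option is folded back into
   the pattern options, and the pattern option is then pushed forward into the
   JIT options. Returns 0 on success or SK_ERROR_JIT_BADOPTION. */
static int sk_propagate_invalid_utf(sk_code *re, uint32_t *jit_options)
{
  assert(re != NULL && jit_options != NULL);

  if ((*jit_options & SK_JIT_INVALID_UTF) != 0 &&
      (re->overall_options & SK_MATCH_INVALID_UTF) == 0)
    {
    /* JIT code without invalid-UTF support already exists: refuse. */
    if (re->has_jit_code) return SK_ERROR_JIT_BADOPTION;
    re->overall_options |= SK_MATCH_INVALID_UTF;
    }

  /* A pattern compiled for invalid UTF always needs matching JIT code. */
  if ((re->overall_options & SK_MATCH_INVALID_UTF) != 0)
    *jit_options |= SK_JIT_INVALID_UTF;

  return 0;
}
```

The sketch leaves out the SUPPORT_JIT and PCRE2_NOJIT checks that sit between the two steps in the patch; it only shows which flag ends up set in each of the four combinations.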
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@ -5412,7 +5412,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
      {
      while (number-- > 0)
        {
-        if (Feptr <= mb->start_subject) RRETURN(MATCH_NOMATCH);
+        if (Feptr <= mb->check_subject) RRETURN(MATCH_NOMATCH);
        Feptr--;
        BACKCHAR(Feptr);
        }
@ -5420,7 +5420,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
    else
 #endif

-    /* No UTF-8 support, or not in UTF-8 mode: count is byte count */
+    /* No UTF-8 support, or not in UTF-8 mode: count is code unit count */

      {
      if ((ptrdiff_t)number > Feptr - mb->start_subject) RRETURN(MATCH_NOMATCH);
@ -5743,7 +5743,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);

    case OP_NOT_WORD_BOUNDARY:
    case OP_WORD_BOUNDARY:
-    if (Feptr == mb->start_subject) prev_is_word = FALSE; else
+    if (Feptr == mb->check_subject) prev_is_word = FALSE; else
      {
      PCRE2_SPTR lastptr = Feptr - 1;
 #ifdef SUPPORT_UNICODE
@ -6014,7 +6014,6 @@ int was_zero_terminated = 0;
 const uint8_t *start_bits = NULL;
 const pcre2_real_code *re = (const pcre2_real_code *)code;

-
 BOOL anchored;
 BOOL firstline;
 BOOL has_first_cu = FALSE;
@ -6029,10 +6028,23 @@ PCRE2_UCHAR req_cu2 = 0;

 PCRE2_SPTR bumpalong_limit;
 PCRE2_SPTR end_subject;
+PCRE2_SPTR true_end_subject;
 PCRE2_SPTR start_match = subject + start_offset;
 PCRE2_SPTR req_cu_ptr = start_match - 1;
-PCRE2_SPTR start_partial = NULL;
-PCRE2_SPTR match_partial = NULL;
+PCRE2_SPTR start_partial;
+PCRE2_SPTR match_partial;
+
+#ifdef SUPPORT_JIT
+BOOL use_jit;
+#endif
+
+#ifdef SUPPORT_UNICODE
+BOOL allow_invalid;
+uint32_t fragment_options = 0;
+#ifdef SUPPORT_JIT
+BOOL jit_checked_utf = FALSE;
+#endif
+#endif

 PCRE2_SIZE frame_size;

@ -6059,7 +6071,7 @@ if (length == PCRE2_ZERO_TERMINATED)
  length = PRIV(strlen)(subject);
  was_zero_terminated = 1;
  }
-end_subject = subject + length;
+true_end_subject = end_subject = subject + length;

 /* Plausibility checks */

@ -6095,12 +6107,24 @@ options |= (re->flags & FF) / ((FF & (~FF+1)) / (OO & (~OO+1)));
 #undef FF
 #undef OO

-/* These two settings are used in the code for checking a UTF string that
-follows immediately afterwards. Other values in the mb block are used only
-during interpretive processing, not when the JIT support is in use, so they are
-set up later. */
+/* If the pattern was successfully studied with JIT support, we will run the
+JIT executable instead of the rest of this function. Most options must be set
+at compile time for the JIT code to be usable. */
+
+#ifdef SUPPORT_JIT
+use_jit = (re->executable_jit != NULL &&
+          (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0);
+#endif
+
+/* Initialize UTF parameters. */

 utf = (re->overall_options & PCRE2_UTF) != 0;
+#ifdef SUPPORT_UNICODE
+allow_invalid = (re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0;
+#endif
+
+/* Convert the partial matching flags into an integer. */
+
 mb->partial = ((options & PCRE2_PARTIAL_HARD) != 0)? 2 :
              ((options & PCRE2_PARTIAL_SOFT) != 0)? 1 : 0;

@ -6111,61 +6135,6 @@ if (mb->partial != 0 &&
   ((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
  return PCRE2_ERROR_BADOPTION;

-/* Check a UTF string for validity if required. For 8-bit and 16-bit strings,
-we must also check that a starting offset does not point into the middle of a
-multiunit character. We check only the portion of the subject that is going to
-be inspected during matching - from the offset minus the maximum back reference
-to the given length. This saves time when a small part of a large subject is
-being matched by the use of a starting offset. Note that the maximum lookbehind
-is a number of characters, not code units. */
-
-#ifdef SUPPORT_UNICODE
-if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
-  {
-  PCRE2_SPTR check_subject = start_match;  /* start_match includes offset */
-
-  if (start_offset > 0)
-    {
-#if PCRE2_CODE_UNIT_WIDTH != 32
-    unsigned int i;
-    if (start_match < end_subject && NOT_FIRSTCU(*start_match))
-      return PCRE2_ERROR_BADUTFOFFSET;
-    for (i = re->max_lookbehind; i > 0 && check_subject > subject; i--)
-      {
-      check_subject--;
-      while (check_subject > subject &&
-#if PCRE2_CODE_UNIT_WIDTH == 8
-      (*check_subject & 0xc0) == 0x80)
-#else  /* 16-bit */
-      (*check_subject & 0xfc00) == 0xdc00)
-#endif /* PCRE2_CODE_UNIT_WIDTH == 8 */
-        check_subject--;
-      }
-#else
-    /* In the 32-bit library, one code unit equals one character. However,
-    we cannot just subtract the lookbehind and then compare pointers, because
-    a very large lookbehind could create an invalid pointer. */
-
-    if (start_offset >= re->max_lookbehind)
-      check_subject -= re->max_lookbehind;
-    else
-      check_subject = subject;
-#endif  /* PCRE2_CODE_UNIT_WIDTH != 32 */
-    }
-
-  /* Validate the relevant portion of the subject. After an error, adjust the
-  offset to be an absolute offset in the whole string. */
-
-  match_data->rc = PRIV(valid_utf)(check_subject,
-    length - (check_subject - subject), &(match_data->startchar));
-  if (match_data->rc != 0)
-    {
-    match_data->startchar += check_subject - subject;
-    return match_data->rc;
-    }
-  }
-#endif  /* SUPPORT_UNICODE */
-
 /* It is an error to set an offset limit without setting the flag at compile
 time. */

@ -6184,15 +6153,85 @@ if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
  }
 match_data->subject = NULL;

-/* If the pattern was successfully studied with JIT support, run the JIT
-executable instead of the rest of this function. Most options must be set at
-compile time for the JIT code to be usable. Fallback to the normal code path if
-an unsupported option is set or if JIT returns BADOPTION (which means that the
-selected normal or partial matching mode was not compiled). */
+
+/* ============================= JIT matching ============================== */
+
+/* Prepare for JIT matching. Check a UTF string for validity unless no check is
+requested or invalid UTF can be handled. We check only the portion of the
+subject that might be inspected during matching - from the offset minus the
+maximum lookbehind to the given length. This saves time when a small part of a
+large subject is being matched by the use of a starting offset. Note that the
+maximum lookbehind is a number of characters, not code units. */

 #ifdef SUPPORT_JIT
-if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
+if (use_jit)
  {
+#ifdef SUPPORT_UNICODE
+  if (utf && (options & PCRE2_NO_UTF_CHECK) == 0 && !allow_invalid)
+    {
+#if PCRE2_CODE_UNIT_WIDTH != 32
+    unsigned int i;
+#endif
+
+    /* For 8-bit and 16-bit UTF, check that the first code unit is a valid
+    character start. */
+
+#if PCRE2_CODE_UNIT_WIDTH != 32
+    if (start_match < end_subject && NOT_FIRSTCU(*start_match))
+      {
+      if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
+#if PCRE2_CODE_UNIT_WIDTH == 8
+      return PCRE2_ERROR_UTF8_ERR20;  /* Isolated byte with 0x80 bit set */
+#else
+      return PCRE2_ERROR_UTF16_ERR3;  /* Isolated low surrogate */
+#endif
+      }
+#endif  /* WIDTH != 32 */
+
+    /* Move back by the maximum lookbehind, in case a lookbehind is evaluated
+    at the very start of matching. */
+
+#if PCRE2_CODE_UNIT_WIDTH != 32
+    for (i = re->max_lookbehind; i > 0 && start_match > subject; i--)
+      {
+      start_match--;
+      while (start_match > subject &&
+#if PCRE2_CODE_UNIT_WIDTH == 8
+      (*start_match & 0xc0) == 0x80)
+#else  /* 16-bit */
+      (*start_match & 0xfc00) == 0xdc00)
+#endif
+        start_match--;
+      }
+#else  /* PCRE2_CODE_UNIT_WIDTH != 32 */
+
+    /* In the 32-bit library, one code unit equals one character. However,
+    we cannot just subtract the lookbehind and then compare pointers, because
+    a very large lookbehind could create an invalid pointer. */
+
+    if (start_offset >= re->max_lookbehind)
+      start_match -= re->max_lookbehind;
+    else
+      start_match = subject;
+#endif  /* PCRE2_CODE_UNIT_WIDTH != 32 */
+
+    /* Validate the relevant portion of the subject. Adjust the offset of an
+    invalid code point to be an absolute offset in the whole string. */
+
+    match_data->rc = PRIV(valid_utf)(start_match,
+      length - (start_match - subject), &(match_data->startchar));
+    if (match_data->rc != 0)
+      {
+      match_data->startchar += start_match - subject;
+      return match_data->rc;
+      }
+    jit_checked_utf = TRUE;
+    }
+#endif  /* SUPPORT_UNICODE */
+
+  /* If JIT returns BADOPTION, which means that the selected complete or
+  partial matching mode was not compiled, fall through to the interpreter. */
+
  rc = pcre2_jit_match(code, subject, length, start_offset, options,
    match_data, mcontext);
  if (rc != PCRE2_ERROR_JIT_BADOPTION)
@ -6209,10 +6248,152 @@ if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
    return rc;
    }
  }
+#endif  /* SUPPORT_JIT */
+
+/* ========================= End of JIT matching ========================== */
+
+
+/* Proceed with non-JIT matching. The default is to allow lookbehinds to the
+start of the subject. A UTF check when there is a non-zero offset may change
+this. */
+
+mb->check_subject = subject;
+
+/* If a UTF subject string was not checked for validity in the JIT code above,
+check it here, and handle support for invalid UTF strings. The check above
+happens only when invalid UTF is not supported and PCRE2_NO_UTF_CHECK is unset.
+If we get here in those circumstances, it means the subject string is valid,
+but for some reason JIT matching was not successful. There is no need to check
+the subject again.
+
+We check only the portion of the subject that might be inspected during
+matching - from the offset minus the maximum lookbehind to the given length.
+This saves time when a small part of a large subject is being matched by the
+use of a starting offset. Note that the maximum lookbehind is a number of
+characters, not code units.
+
+Note also that support for invalid UTF forces a check, overriding the setting
+of PCRE2_NO_UTF_CHECK. */
+
+#ifdef SUPPORT_UNICODE
+if (utf &&
+#ifdef SUPPORT_JIT
+    !jit_checked_utf &&
+#endif
+    ((options & PCRE2_NO_UTF_CHECK) == 0 || allow_invalid))
+  {
+#if PCRE2_CODE_UNIT_WIDTH != 32
+  BOOL skipped_bad_start = FALSE;
 #endif

-/* Carry on with non-JIT matching. A NULL match context means "use a default
-context", but we take the memory control functions from the pattern. */
+  /* For 8-bit and 16-bit UTF, check that the first code unit is a valid
+  character start. If we are handling invalid UTF, just skip over such code
+  units. Otherwise, give an appropriate error. */
+
+#if PCRE2_CODE_UNIT_WIDTH != 32
+  if (allow_invalid)
+    {
+    while (start_match < end_subject && NOT_FIRSTCU(*start_match))
+      {
+      start_match++;
+      skipped_bad_start = TRUE;
+      }
+    }
+  else if (start_match < end_subject && NOT_FIRSTCU(*start_match))
+    {
+    if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
+#if PCRE2_CODE_UNIT_WIDTH == 8
+    return PCRE2_ERROR_UTF8_ERR20;  /* Isolated byte with 0x80 bit set */
+#else
+    return PCRE2_ERROR_UTF16_ERR3;  /* Isolated low surrogate */
+#endif
+    }
+#endif  /* WIDTH != 32 */
+
+  /* The mb->check_subject field points to the start of UTF checking;
+  lookbehinds can go back no further than this. */
+
+  mb->check_subject = start_match;
+
+  /* Move back by the maximum lookbehind, in case a lookbehind is evaluated at
+  the very start of matching, but don't do this if we skipped bad 8-bit or
+  16-bit code units above. */
+
+#if PCRE2_CODE_UNIT_WIDTH != 32
+  if (!skipped_bad_start)
+    {
+    unsigned int i;
+    for (i = re->max_lookbehind; i > 0 && mb->check_subject > subject; i--)
+      {
+      mb->check_subject--;
+      while (mb->check_subject > subject &&
+#if PCRE2_CODE_UNIT_WIDTH == 8
+      (*mb->check_subject & 0xc0) == 0x80)
+#else  /* 16-bit */
+      (*mb->check_subject & 0xfc00) == 0xdc00)
+#endif
+        mb->check_subject--;
+      }
+    }
+#else  /* PCRE2_CODE_UNIT_WIDTH != 32 */
+
+  /* In the 32-bit library, one code unit equals one character. However,
+  we cannot just subtract the lookbehind and then compare pointers, because
+  a very large lookbehind could create an invalid pointer. */
+
+  if (start_offset >= re->max_lookbehind)
+    mb->check_subject -= re->max_lookbehind;
+  else
+    mb->check_subject = subject;
+#endif  /* PCRE2_CODE_UNIT_WIDTH != 32 */
+
+  /* Validate the relevant portion of the subject. There's a loop in case we
+  encounter bad UTF in the characters preceding start_match which we are
+  scanning because of a lookbehind. */
+
+  for (;;)
+    {
+    match_data->rc = PRIV(valid_utf)(mb->check_subject,
+      length - (mb->check_subject - subject), &(match_data->startchar));
+
+    if (match_data->rc == 0) break;   /* Valid UTF string */
+
+    /* Invalid UTF string. Adjust the offset to be an absolute offset in the
+    whole string. If we are handling invalid UTF strings, set end_subject to
+    stop before the bad code unit, and set the options to "not end of line".
+    Otherwise return the error. */
+
+    match_data->startchar += mb->check_subject - subject;
+    if (!allow_invalid || match_data->rc > 0) return match_data->rc;
+    end_subject = subject + match_data->startchar;
+
+    /* If the end precedes start_match, it means there is invalid UTF in the
+    extra code units we reversed over because of a lookbehind. Advance past the
+    first bad code unit, and then skip invalid character starting code units in
+    8-bit and 16-bit modes, and try again. */
+
+    if (end_subject < start_match)
+      {
+      mb->check_subject = end_subject + 1;
+#if PCRE2_CODE_UNIT_WIDTH != 32
+      while (mb->check_subject < start_match && NOT_FIRSTCU(*mb->check_subject))
+        mb->check_subject++;
+#endif
+      }
+
+    /* Otherwise, set the not end of line option, and do the match. */
+
+    else
+      {
+      fragment_options = PCRE2_NOTEOL;
+      break;
+      }
+    }
+  }
+#endif  /* SUPPORT_UNICODE */
+
+/* A NULL match context means "use a default context", but we take the memory
+control functions from the pattern. */

 if (mcontext == NULL)
  {
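
Review note: the "move back by the maximum lookbehind" step in the hunk above counts characters, not code units. A minimal sketch of that step for the 8-bit case follows, using offsets instead of pointers. The function name `back_up_chars` is hypothetical and does not exist in PCRE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step back from offset `start` by up to `max_lookbehind` characters in an
   8-bit UTF subject. After each one-byte step, any UTF-8 continuation bytes
   (10xxxxxx) are also stepped over, so the result is either offset 0 or the
   offset of a byte that can start a character. */
static size_t back_up_chars(const uint8_t *subject, size_t start,
  unsigned int max_lookbehind)
{
  size_t check = start;
  unsigned int i;

  assert(subject != NULL);

  for (i = max_lookbehind; i > 0 && check > 0; i--)
    {
    check--;
    while (check > 0 && (subject[check] & 0xc0) == 0x80) check--;
    }

  return check;
}
```

For the subject bytes `'a', 0xc3, 0xa9, 'b'` and a start offset of 3, a lookbehind of one character lands on offset 1 (the 0xc3 lead byte) rather than offset 2, which is why validation has to begin there.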
@ -6224,8 +6405,8 @@ else mb->memctl = mcontext->memctl;
 anchored = ((re->overall_options | options) & PCRE2_ANCHORED) != 0;
 firstline = (re->overall_options & PCRE2_FIRSTLINE) != 0;
 startline = (re->flags & PCRE2_STARTLINE) != 0;
-bumpalong_limit =  (mcontext->offset_limit == PCRE2_UNSET)?
-  end_subject : subject + mcontext->offset_limit;
+bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
+  true_end_subject : subject + mcontext->offset_limit;

 /* Initialize and set up the fixed fields in the callout block, with a pointer
 in the match block. */
@ -6236,7 +6417,8 @@ cb.subject = subject;
 cb.subject_length = (PCRE2_SIZE)(end_subject - subject);
 cb.callout_flags = 0;

-/* Fill in the remaining fields in the match block. */
+/* Fill in the remaining fields in the match block, except for moptions, which
+gets set later. */

 mb->callout = mcontext->callout;
 mb->callout_data = mcontext->callout_data;
@ -6245,13 +6427,9 @@ mb->start_subject = subject;
 mb->start_offset = start_offset;
 mb->end_subject = end_subject;
 mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
-
-mb->moptions = options;                 /* Match options */
 mb->poptions = re->overall_options;     /* Pattern options */
-
 mb->ignore_skip_arg = 0;
 mb->mark = mb->nomatch_mark = NULL;     /* In case never set */
-mb->hitend = FALSE;

 /* The name table is needed for finding all the numbers associated with a
 given name, for condition testing. The code follows the name table. */
@ -6404,6 +6582,13 @@ if ((re->flags & PCRE2_LASTSET) != 0)
 /* Loop for handling unanchored repeated matching attempts; for anchored regexs
 the loop runs just once. */

+#ifdef SUPPORT_UNICODE
+FRAGMENT_RESTART:
+#endif
+
+start_partial = match_partial = NULL;
+mb->hitend = FALSE;
+
 for(;;)
  {
  PCRE2_SPTR new_start_match;
@ -6714,6 +6899,11 @@ for(;;)

  mb->start_used_ptr = start_match;
  mb->last_used_ptr = start_match;
+#ifdef SUPPORT_UNICODE
+  mb->moptions = options | fragment_options;
+#else
+  mb->moptions = options;
+#endif
  mb->match_call_count = 0;
  mb->end_offset_top = 0;
  mb->skip_arg_count = 0;
@ -6839,6 +7029,68 @@ for(;;)

 ENDLOOP:

+/* If end_subject != true_end_subject, it means we are handling invalid UTF,
+and have just processed a non-terminal fragment. If this resulted in no match
+or a partial match we must carry on to the next fragment (a partial match is
+returned to the caller only at the very end of the subject). A loop is used to
+avoid trying to match against empty fragments; if the pattern can match an
+empty string it would have done so already. */
+
+#ifdef SUPPORT_UNICODE
+if (utf && end_subject != true_end_subject &&
+    (rc == MATCH_NOMATCH || rc == PCRE2_ERROR_PARTIAL))
+  {
+  for (;;)
+    {
+    /* Advance past the first bad code unit, and then skip invalid character
+    starting code units in 8-bit and 16-bit modes. */
+
+    start_match = end_subject + 1;
+#if PCRE2_CODE_UNIT_WIDTH != 32
+    while (start_match < true_end_subject && NOT_FIRSTCU(*start_match))
+      start_match++;
+#endif
+
+    /* If we have hit the end of the subject, there isn't another non-empty
+    fragment, so give up. */
+
+    if (start_match >= true_end_subject)
+      {
+      rc = MATCH_NOMATCH;  /* In case it was partial */
+      break;
+      }
+
+    /* Check the rest of the subject */
+
+    mb->check_subject = start_match;
+    rc = PRIV(valid_utf)(start_match, length - (start_match - subject),
+      &(match_data->startchar));
+
+    /* The rest of the subject is valid UTF. */
+
+    if (rc == 0)
+      {
+      mb->end_subject = end_subject = true_end_subject;
+      fragment_options = PCRE2_NOTBOL;
+      goto FRAGMENT_RESTART;
+      }
+
+    /* A subsequent UTF error has been found; if the next fragment is
+    non-empty, set up to process it. Otherwise, let the loop advance. */
+
+    else if (rc < 0)
+      {
+      mb->end_subject = end_subject = start_match + match_data->startchar;
+      if (end_subject > start_match)
+        {
+        fragment_options = PCRE2_NOTBOL|PCRE2_NOTEOL;
+        goto FRAGMENT_RESTART;
+        }
+      }
+    }
+  }
+#endif  /* SUPPORT_UNICODE */
+
 /* Release an enlarged frame vector that is on the heap. */

 if (mb->match_frames != mb->stack_frames)
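
Review note: the fragment handling after ENDLOOP treats an invalid subject as a sequence of valid fragments separated by bad code units. The sketch below models only that splitting for 8-bit UTF, under stated assumptions: `first_invalid` is a simplified validity scan written for this note and is not PRIV(valid_utf), `is_continuation` stands in for NOT_FIRSTCU, and `next_fragment` is a hypothetical helper. No pattern matching is done here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for NOT_FIRSTCU() in 8-bit mode: a UTF-8 continuation byte. */
static int is_continuation(uint8_t b) { return (b & 0xc0) == 0x80; }

/* Simplified scan (not PCRE2's validator): returns the offset of the first
   byte that starts an invalid sequence, or len if the whole string is valid.
   Rejects stray continuation bytes, bad lead bytes, truncated or overlong
   sequences, surrogates, and code points above 0x10ffff. */
static size_t first_invalid(const uint8_t *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
    uint8_t b = s[i];
    size_t need, k;
    uint32_t cp, min;
    if (b < 0x80) { i++; continue; }
    if ((b & 0xe0) == 0xc0)      { need = 1; cp = b & 0x1f; min = 0x80; }
    else if ((b & 0xf0) == 0xe0) { need = 2; cp = b & 0x0f; min = 0x800; }
    else if ((b & 0xf8) == 0xf0) { need = 3; cp = b & 0x07; min = 0x10000; }
    else return i;
    if (need > len - i - 1) return i;          /* truncated at end */
    for (k = 1; k <= need; k++)
      {
      if (!is_continuation(s[i + k])) return i;
      cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3f);
      }
    if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return i;
    i += need + 1;
    }
  return len;
}

/* Find the next non-empty valid fragment at or after *pos. On success returns
   1 and sets [*start, *end); returns 0 when no fragment remains. Empty
   fragments are skipped, as in the loop after ENDLOOP. */
static int next_fragment(const uint8_t *s, size_t len, size_t *pos,
  size_t *start, size_t *end)
{
  size_t p;
  assert(s != NULL && pos != NULL && start != NULL && end != NULL);
  p = *pos;
  while (p < len)
    {
    size_t bad;
    while (p < len && is_continuation(s[p])) p++;  /* skip bad start units */
    if (p >= len) break;
    bad = p + first_invalid(s + p, len - p);
    if (bad > p)
      {
      *start = p;
      *end = bad;
      *pos = (bad < len)? bad + 1 : len;           /* step past the bad unit */
      return 1;
      }
    p = bad + 1;                                   /* empty fragment */
    }
  *pos = len;
  return 0;
}
```

Applied to the test subject `abcd\x80wxzy\x80pqrs`, this yields the three fragments `abcd`, `wxzy` and `pqrs`, which is consistent with the `abc`, `wxz`, `pqr` results that testoutput10 records for `/.../g,match_invalid_utf`. In the patch, a non-final fragment is additionally matched with PCRE2_NOTEOL and a later fragment with PCRE2_NOTBOL; the sketch does not model those options.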
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@ -212,6 +212,12 @@ be C99 don't support it (hence DISABLE_PERCENT_ZT). */
 #define REPLACE_MODSIZE 100       /* Field for reading 8-bit replacement */
 #define VERSION_SIZE 64           /* Size of buffer for the version strings */

+/* Default JIT compile options */
+
+#define JIT_DEFAULT (PCRE2_JIT_COMPLETE|\
+                     PCRE2_JIT_PARTIAL_SOFT|\
+                     PCRE2_JIT_PARTIAL_HARD)
+
 /* Make sure the buffer into which replacement strings are copied is big enough
 to hold them as 32-bit code units. */

@ -664,6 +670,7 @@ static modstruct modlist[] = {
  { "literal",                    MOD_PAT,  MOD_OPT, PCRE2_LITERAL,              PO(options) },
  { "locale",                     MOD_PAT,  MOD_STR, LOCALESIZE,                 PO(locale) },
  { "mark",                       MOD_PNDP, MOD_CTL, CTL_MARK,                   PO(control) },
+  { "match_invalid_utf",          MOD_PAT,  MOD_OPT, PCRE2_MATCH_INVALID_UTF,    PO(options) },
  { "match_limit",                MOD_CTM,  MOD_INT, 0,                          MO(match_limit) },
  { "match_line",                 MOD_CTC,  MOD_OPT, PCRE2_EXTRA_MATCH_LINE,     CO(extra_options) },
  { "match_unset_backref",        MOD_PAT,  MOD_OPT, PCRE2_MATCH_UNSET_BACKREF,  PO(options) },
@ -4136,7 +4143,7 @@ static void
 show_compile_options(uint32_t options, const char *before, const char *after)
 {
 if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
-else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
  before,
  ((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
  ((options & PCRE2_ALT_CIRCUMFLEX) != 0)? " alt_circumflex" : "",
@ -4153,6 +4160,7 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%
  ((options & PCRE2_EXTENDED_MORE) != 0)? " extended_more" : "",
  ((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
  ((options & PCRE2_LITERAL) != 0)? " literal" : "",
+  ((options & PCRE2_MATCH_INVALID_UTF) != 0)? " match_invalid_utf" : "",
  ((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
  ((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
  ((options & PCRE2_NEVER_BACKSLASH_C) != 0)? " never_backslash_c" : "",
@ -4867,7 +4875,7 @@ switch(cmd)
  case CMD_PATTERN:
  (void)decode_modifiers(argptr, CTX_DEFPAT, &def_patctl, NULL);
  if (def_patctl.jit == 0 && (def_patctl.control & CTL_JITVERIFY) != 0)
-    def_patctl.jit = 7;
+    def_patctl.jit = JIT_DEFAULT;
  break;

  /* Set default subject modifiers */
@ -5114,7 +5122,11 @@ patlen = p - buffer - 2;
 /* Look for modifiers and options after the final delimiter. */

 if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
-utf = (pat_patctl.options & PCRE2_UTF) != 0;
+
+/* Note that the match_invalid_utf option also sets utf when passed to 
+pcre2_compile(). */
+
+utf = (pat_patctl.options & (PCRE2_UTF|PCRE2_MATCH_INVALID_UTF)) != 0;

 /* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
 exclusive with the utf modifier. */
@ -5161,7 +5173,7 @@ specified. */

 if (pat_patctl.jit == 0 &&
    (pat_patctl.control & (CTL_JITVERIFY|CTL_JITFAST)) != 0)
-  pat_patctl.jit = 7;
+  pat_patctl.jit = JIT_DEFAULT;

 /* Now copy the pattern to pbuffer8 for use in 8-bit testing and for reflecting
 in callouts. Convert from hex if requested (literal strings in quotes may be
@ -5744,6 +5756,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
    {
    int i;
    clock_t time_taken = 0;
+
    for (i = 0; i < timeit; i++)
      {
      clock_t start_time;
@ -5752,7 +5765,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
        pat_patctl.options|use_forbid_utf, &errorcode, &erroroffset,
        use_pat_context);
      start_time = clock();
-      PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
+      PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
      time_taken += clock() - start_time;
      }
    total_jit_compile_time += time_taken;
@ -8615,7 +8628,7 @@ while (argc > 1 && argv[op][0] == '-' && argv[op][1] != 0)
  else if (strcmp(arg, "-jit") == 0 || strcmp(arg, "-jitverify") == 0)
    {
    if (arg[4] != 0) def_patctl.control |= CTL_JITVERIFY;
-    def_patctl.jit = 7;  /* full & partial */
+    def_patctl.jit = JIT_DEFAULT;  /* full & partial */
 #ifndef SUPPORT_JIT
    fprintf(stderr, "** Warning: JIT support is not available: "
                    "-jit[verify] calls functions that do nothing.\n");
--- a/testdata/testinput10
+++ b/testdata/testinput10
@ -1,7 +1,7 @@
 # This set of tests is for UTF-8 support and Unicode property support, with
 # relevance only for the 8-bit library.

-# The next 4 patterns have UTF-8 errors
+# The next 5 patterns have UTF-8 errors

 /[Ã]/utf

@ -11,6 +11,8 @@

 /Ã‚‚‚‚‚‚‚‚Ã/utf

+/Ã‚‚‚‚‚‚‚‚Ã/match_invalid_utf
+
 # Now test subjects

 /badutf/utf
@ -493,4 +495,66 @@

 /(?(Ã¡/utf

+# Invalid UTF-8 tests
+
+/.../g,match_invalid_utf
+    abcd\x80wxzy\x80pqrs
+    abcd\x{80}wxzy\x80pqrs
+
+/abc/match_invalid_utf
+    ab\x80ab\=ph
+\= Expect no match
+    ab\x80cdef\=ph
+
+/ab$/match_invalid_utf
+    ab\x80cdeab
+\= Expect no match
+    ab\x80cde
+
+/.../g,match_invalid_utf
+    abcd\x{80}wxzy\x80pqrs
+
+/(?<=x)../g,match_invalid_utf
+    abcd\x{80}wxzy\x80pqrs
+    abcd\x{80}wxzy\x80xpqrs
+    
+/X$/match_invalid_utf
+\= Expect no match
+    X\xc4
+    
+/(?<=..)X/match_invalid_utf,aftertext
+    AB\x80AQXYZ
+    AB\x80AQXYZ\=offset=5
+    AB\x80\x80AXYZXC\=offset=5
+\= Expect no match
+    AB\x80XYZ
+    AB\x80XYZ\=offset=3 
+    AB\xfeXYZ
+    AB\xffXYZ\=offset=3 
+    AB\x80AXYZ
+    AB\x80AXYZ\=offset=4
+    AB\x80\x80AXYZ\=offset=5
+
+/.../match_invalid_utf
+    AB\xc4CCC
+\= Expect no match
+    A\x{d800}B
+    A\x{110000}B
+    A\xc4B  
+
+/\bX/match_invalid_utf
+    A\x80X
+
+/\BX/match_invalid_utf
+\= Expect no match
+    A\x80X
+    
+/(?<=...)X/match_invalid_utf
+    AAA\x80BBBXYZ 
+\= Expect no match
+    AAA\x80BXYZ 
+    AAA\x80BBXYZ 
+
+# -------------------------------------
+
 # End of testinput10
--- a/testdata/testinput11
+++ b/testdata/testinput11
@ -368,6 +368,4 @@
    ab˙Az
    ab\x{80000041}z 

-/\[()]{65535}/expand
-
 # End of testinput11
--- a/testdata/testinput12
+++ b/testdata/testinput12
@ -402,4 +402,49 @@

 /(?(á/utf

+# Invalid UTF-16/32 tests.
+
+/.../g,match_invalid_utf
+    abcd\x{df00}wxzy\x{df00}pqrs
+    abcd\x{80}wxzy\x{df00}pqrs
+
+/abc/match_invalid_utf
+    ab\x{df00}ab\=ph
+\= Expect no match
+    ab\x{df00}cdef\=ph
+
+/ab$/match_invalid_utf
+    ab\x{df00}cdeab
+\= Expect no match
+    ab\x{df00}cde
+
+/.../g,match_invalid_utf
+    abcd\x{80}wxzy\x{df00}pqrs
+
+/(?<=x)../g,match_invalid_utf
+    abcd\x{80}wxzy\x{df00}pqrs
+    abcd\x{80}wxzy\x{df00}xpqrs
+
+/X$/match_invalid_utf
+\= Expect no match
+    X\x{df00}
+    
+/(?<=..)X/match_invalid_utf,aftertext
+    AB\x{df00}AQXYZ
+    AB\x{df00}AQXYZ\=offset=5
+    AB\x{df00}\x{df00}AXYZXC\=offset=5
+\= Expect no match
+    AB\x{df00}XYZ
+    AB\x{df00}XYZ\=offset=3 
+    AB\x{df00}AXYZ
+    AB\x{df00}AXYZ\=offset=4
+    AB\x{df00}\x{df00}AXYZ\=offset=5
+
+/.../match_invalid_utf
+\= Expect no match
+    A\x{d800}B
+    A\x{110000}B 
+
+# ---------------------------------------------------- 
+
 # End of testinput12
--- a/testdata/testinput8
+++ b/testdata/testinput8
@ -182,4 +182,8 @@

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testinput9
+++ b/testdata/testinput9
@ -260,6 +260,4 @@

 /(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/

-/\[()]{65535}/expand
-
 # End of testinput9
--- a/testdata/testoutput10
+++ b/testdata/testoutput10
@ -1,7 +1,7 @@
 # This set of tests is for UTF-8 support and Unicode property support, with
 # relevance only for the 8-bit library.

-# The next 4 patterns have UTF-8 errors
+# The next 5 patterns have UTF-8 errors

 /[Ã]/utf
 Failed: error -8 at offset 1: UTF-8 error: byte 2 top bits not 0x80
@ -15,6 +15,9 @@ Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
 /Ã‚‚‚‚‚‚‚‚Ã/utf
 Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set

+/Ã‚‚‚‚‚‚‚‚Ã/match_invalid_utf
+Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
+
 # Now test subjects

 /badutf/utf
@ -1651,4 +1654,107 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
 /(?(Ã¡/utf
 Failed: error 142 at offset 5: syntax error in subpattern name (missing terminator?)

+# Invalid UTF-8 tests
+
+/.../g,match_invalid_utf
+    abcd\x80wxzy\x80pqrs
+ 0: abc
+ 0: wxz
+ 0: pqr
+    abcd\x{80}wxzy\x80pqrs
+ 0: abc
+ 0: d\x{80}w
+ 0: xzy
+ 0: pqr
+
+/abc/match_invalid_utf
+    ab\x80ab\=ph
+Partial match: ab
+\= Expect no match
+    ab\x80cdef\=ph
+No match
+
+/ab$/match_invalid_utf
+    ab\x80cdeab
+ 0: ab
+\= Expect no match
+    ab\x80cde
+No match
+
+/.../g,match_invalid_utf
+    abcd\x{80}wxzy\x80pqrs
+ 0: abc
+ 0: d\x{80}w
+ 0: xzy
+ 0: pqr
+
+/(?<=x)../g,match_invalid_utf
+    abcd\x{80}wxzy\x80pqrs
+ 0: zy
+    abcd\x{80}wxzy\x80xpqrs
+ 0: zy
+ 0: pq
+    
+/X$/match_invalid_utf
+\= Expect no match
+    X\xc4
+No match
+    
+/(?<=..)X/match_invalid_utf,aftertext
+    AB\x80AQXYZ
+ 0: X
+ 0+ YZ
+    AB\x80AQXYZ\=offset=5
+ 0: X
+ 0+ YZ
+    AB\x80\x80AXYZXC\=offset=5
+ 0: X
+ 0+ C
+\= Expect no match
+    AB\x80XYZ
+No match
+    AB\x80XYZ\=offset=3 
+No match
+    AB\xfeXYZ
+No match
+    AB\xffXYZ\=offset=3 
+No match
+    AB\x80AXYZ
+No match
+    AB\x80AXYZ\=offset=4
+No match
+    AB\x80\x80AXYZ\=offset=5
+No match
+
+/.../match_invalid_utf
+    AB\xc4CCC
+ 0: CCC
+\= Expect no match
+    A\x{d800}B
+No match
+    A\x{110000}B
+No match
+    A\xc4B  
+No match
+
+/\bX/match_invalid_utf
+    A\x80X
+ 0: X
+
+/\BX/match_invalid_utf
+\= Expect no match
+    A\x80X
+No match
+    
+/(?<=...)X/match_invalid_utf
+    AAA\x80BBBXYZ 
+ 0: X
+\= Expect no match
+    AAA\x80BXYZ 
+No match
+    AAA\x80BBXYZ 
+No match
+
+# -------------------------------------
+
 # End of testinput10
--- a/testdata/testoutput11-16
+++ b/testdata/testoutput11-16
@ -661,7 +661,4 @@ Subject length lower bound = 1
    ab˙Az
    ab\x{80000041}z 

-/\[()]{65535}/expand
-Failed: error 120 at offset 131070: regular expression is too large
-
 # End of testinput11
--- a/testdata/testoutput11-32
+++ b/testdata/testoutput11-32
@ -667,6 +667,4 @@ Subject length lower bound = 1
    ab\x{80000041}z 
 0: ab\x{80000041}z

-/\[()]{65535}/expand
-
 # End of testinput11
--- a/testdata/testoutput12-16
+++ b/testdata/testoutput12-16
@ -1502,4 +1502,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
 /(?(á/utf
 Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)

+# Invalid UTF-16/32 tests.
+
+/.../g,match_invalid_utf
+    abcd\x{df00}wxzy\x{df00}pqrs
+ 0: abc
+ 0: wxz
+ 0: pqr
+    abcd\x{80}wxzy\x{df00}pqrs
+ 0: abc
+ 0: d\x{80}w
+ 0: xzy
+ 0: pqr
+
+/abc/match_invalid_utf
+    ab\x{df00}ab\=ph
+Partial match: ab
+\= Expect no match
+    ab\x{df00}cdef\=ph
+No match
+
+/ab$/match_invalid_utf
+    ab\x{df00}cdeab
+ 0: ab
+\= Expect no match
+    ab\x{df00}cde
+No match
+
+/.../g,match_invalid_utf
+    abcd\x{80}wxzy\x{df00}pqrs
+ 0: abc
+ 0: d\x{80}w
+ 0: xzy
+ 0: pqr
+
+/(?<=x)../g,match_invalid_utf
+    abcd\x{80}wxzy\x{df00}pqrs
+ 0: zy
+    abcd\x{80}wxzy\x{df00}xpqrs
+ 0: zy
+ 0: pq
+
+/X$/match_invalid_utf
+\= Expect no match
+    X\x{df00}
+No match
+    
+/(?<=..)X/match_invalid_utf,aftertext
+    AB\x{df00}AQXYZ
+ 0: X
+ 0+ YZ
+    AB\x{df00}AQXYZ\=offset=5
+ 0: X
+ 0+ YZ
+    AB\x{df00}\x{df00}AXYZXC\=offset=5
+ 0: X
+ 0+ C
+\= Expect no match
+    AB\x{df00}XYZ
+No match
+    AB\x{df00}XYZ\=offset=3 
+No match
+    AB\x{df00}AXYZ
+No match
+    AB\x{df00}AXYZ\=offset=4
+No match
+    AB\x{df00}\x{df00}AXYZ\=offset=5
+No match
+
+/.../match_invalid_utf
+\= Expect no match
+    A\x{d800}B
+No match
+    A\x{110000}B 
+** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
+
+# ---------------------------------------------------- 
+
 # End of testinput12
--- a/testdata/testoutput12-32
+++ b/testdata/testoutput12-32
@ -1500,4 +1500,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
 /(?(á/utf
 Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)

+# Invalid UTF-16/32 tests.
+
+/.../g,match_invalid_utf
+    abcd\x{df00}wxzy\x{df00}pqrs
+ 0: abc
+ 0: wxz
+ 0: pqr
+    abcd\x{80}wxzy\x{df00}pqrs
+ 0: abc
+ 0: d\x{80}w
+ 0: xzy
+ 0: pqr
+
+/abc/match_invalid_utf
+    ab\x{df00}ab\=ph
+Partial match: ab
+\= Expect no match
+    ab\x{df00}cdef\=ph
+No match
+
+/ab$/match_invalid_utf
+    ab\x{df00}cdeab
+ 0: ab
+\= Expect no match
+    ab\x{df00}cde
+No match
+
+/.../g,match_invalid_utf
+    abcd\x{80}wxzy\x{df00}pqrs
+ 0: abc
+ 0: d\x{80}w
+ 0: xzy
+ 0: pqr
+
+/(?<=x)../g,match_invalid_utf
+    abcd\x{80}wxzy\x{df00}pqrs
+ 0: zy
+    abcd\x{80}wxzy\x{df00}xpqrs
+ 0: zy
+ 0: pq
+
+/X$/match_invalid_utf
+\= Expect no match
+    X\x{df00}
+No match
+    
+/(?<=..)X/match_invalid_utf,aftertext
+    AB\x{df00}AQXYZ
+ 0: X
+ 0+ YZ
+    AB\x{df00}AQXYZ\=offset=5
+ 0: X
+ 0+ YZ
+    AB\x{df00}\x{df00}AXYZXC\=offset=5
+ 0: X
+ 0+ C
+\= Expect no match
+    AB\x{df00}XYZ
+No match
+    AB\x{df00}XYZ\=offset=3 
+No match
+    AB\x{df00}AXYZ
+No match
+    AB\x{df00}AXYZ\=offset=4
+No match
+    AB\x{df00}\x{df00}AXYZ\=offset=5
+No match
+
+/.../match_invalid_utf
+\= Expect no match
+    A\x{d800}B
+No match
+    A\x{110000}B 
+No match
+
+# ---------------------------------------------------- 
+
 # End of testinput12
--- a/testdata/testoutput8-16-2
+++ b/testdata/testoutput8-16-2
@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+Failed: error 120 at offset 131070: regular expression is too large
+
 # End of testinput8
--- a/testdata/testoutput8-16-3
+++ b/testdata/testoutput8-16-3
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testoutput8-16-4
+++ b/testdata/testoutput8-16-4
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testoutput8-32-2
+++ b/testdata/testoutput8-32-2
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testoutput8-32-3
+++ b/testdata/testoutput8-32-3
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testoutput8-32-4
+++ b/testdata/testoutput8-32-4
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testoutput8-8-2
+++ b/testdata/testoutput8-8-2
@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+Failed: error 120 at offset 131070: regular expression is too large
+
 # End of testinput8
--- a/testdata/testoutput8-8-3
+++ b/testdata/testoutput8-8-3
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testoutput8-8-4
+++ b/testdata/testoutput8-8-4
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis

 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode

+#pattern -fullbincode
+
+/\[()]{65535}/expand
+
 # End of testinput8
--- a/testdata/testoutput9
+++ b/testdata/testoutput9
@ -367,7 +367,4 @@ Failed: error 134 at offset 14: character code point value in \x{} or \o{} is to
 /(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
 Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)

-/\[()]{65535}/expand
-Failed: error 120 at offset 131070: regular expression is too large
-
 # End of testinput9