Implement support for invalid UTF in the pcre2_match() interpreter.

2019-05-24 17:15:48 +00:00 · 2019-05-24 17:15:48 +00:00 · 16c046ce50
parent 2ad4329f83
commit 16c046ce50
48 changed files with 2780 additions and 1783 deletions
--- a/4
+++ b/4
@ -14,6 +14,10 @@ detects invalid characters in the 0xd800-0xdfff range.
 3. Fix minor typo bug in JIT compile when \X is used in a non-UTF string.
 4. Add support for matching in invalid UTF strings to the pcre2_match()
 interpreter, and integrate with the existing JIT support via the new
 PCRE2_MATCH_INVALID_UTF compile-time option.
 Version 10.33 16-April-2019
 ---------------------------
--- a/doc/html/pcre2_compile.html
+++ b/doc/html/pcre2_compile.html
@ -65,6 +65,7 @@ The option bits are:
  PCRE2_EXTENDED           Ignore white space and # comments
  PCRE2_FIRSTLINE          Force matching to be before newline
  PCRE2_LITERAL            Pattern characters are all literal
  PCRE2_MATCH_INVALID_UTF  Enable support for matching invalid UTF 
  PCRE2_MATCH_UNSET_BACKREF  Match unset backreferences
  PCRE2_MULTILINE          ^ and $ match newlines within data
  PCRE2_NEVER_BACKSLASH_C  Lock out the use of \C in patterns
--- a/doc/html/pcre2_jit_compile.html
+++ b/doc/html/pcre2_jit_compile.html
@ -40,8 +40,12 @@ bits:
  PCRE2_JIT_COMPLETE      compile code for full matching
  PCRE2_JIT_PARTIAL_SOFT  compile code for soft partial matching
  PCRE2_JIT_PARTIAL_HARD  compile code for hard partial matching
  PCRE2_JIT_INVALID_UTF   compile code to handle invalid UTF
 </pre>
 There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been 
 superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF. The old 
 option is deprecated and may be removed in future.
 </P>
 <P>
 The yield of the function is 0 for success, or a negative error code otherwise.
 In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
 if an unknown bit is set in <i>options</i>.
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@ -1347,11 +1347,12 @@ and <b>pcre2_compile()</b> returns a non-NULL value.
 <P>
 There are nearly 100 positive error codes that <b>pcre2_compile()</b> may return
 if it finds an error in the pattern. There are also some negative error codes
-that are used for invalid UTF strings. These are the same as given by
+that are used for invalid UTF strings when validity checking is in force. These
-<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described in the
+are the same as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and
 are described in the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
-page. There is no separate documentation for the positive error codes, because
+documentation. There is no separate documentation for the positive error codes,
-the textual error messages that are obtained by calling the
+because the textual error messages that are obtained by calling the
 <b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
 message"
 <a href="#geterrormessage">below)</a>
@ -1615,10 +1616,18 @@ expression engine is not the most efficient way of doing it. If you are doing a
 lot of literal matching and are worried about efficiency, you should consider
 using other approaches. The only other main options that are allowed with
 PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
-PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
+PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
-PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
+PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
-and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
-error.
+PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
 <pre>
  PCRE2_MATCH_INVALID_UTF
 </pre>
 This option forces PCRE2_UTF (see below) and also enables support for matching
 by <b>pcre2_match()</b> in subject strings that contain invalid UTF sequences.
 This facility is not supported for DFA matching. For details, see the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
 documentation.
 <pre>
  PCRE2_MATCH_UNSET_BACKREF
 </pre>
@ -2653,15 +2662,22 @@ of JIT; it forces matching to be done by the interpreter.
  PCRE2_NO_UTF_CHECK
 </pre>
 When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
-string is checked by default when <b>pcre2_match()</b> is subsequently called.
+string is checked unless PCRE2_NO_UTF_CHECK is passed to <b>pcre2_match()</b> or
-If a non-zero starting offset is given, the check is applied only to that part
+PCRE2_MATCH_INVALID_UTF was passed to <b>pcre2_compile()</b>. The latter special
-of the subject that could be inspected during matching, and there is a check
+case is discussed in detail in the
-that the starting offset points to the first code unit of a character or to the
+<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
-end of the subject. If there are no lookbehind assertions in the pattern, the
+documentation.
-check starts at the starting offset. Otherwise, it starts at the length of the
+</P>
-longest lookbehind before the starting offset, or at the start of the subject
+<P>
-if there are not that many characters before the starting offset. Note that the
+In the default case, if a non-zero starting offset is given, the check is
-sequences \b and \B are one-character lookbehinds.
+applied only to that part of the subject that could be inspected during
 matching, and there is a check that the starting offset points to the first
 code unit of a character or to the end of the subject. If there are no
 lookbehind assertions in the pattern, the check starts at the starting offset.
 Otherwise, it starts at the length of the longest lookbehind before the
 starting offset, or at the start of the subject if there are not that many
 characters before the starting offset. Note that the sequences \b and \B are
 one-character lookbehinds.
 </P>
 <P>
 The check is carried out before any other processing takes place, and a
@ -2674,19 +2690,20 @@ and
 <a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
 in the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
-page.
+documentation.
 </P>
 <P>
-If you know that your subject is valid, and you want to skip these checks for
+If you know that your subject is valid, and you want to skip this check for
 performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
 <b>pcre2_match()</b>. You might want to do this for the second and subsequent
-calls to <b>pcre2_match()</b> if you are making repeated calls to find other
+calls to <b>pcre2_match()</b> if you are making repeated calls to find multiple
 matches in the same subject string.
 </P>
 <P>
-<b>Warning:</b> When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
+<b>Warning:</b> Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
 PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
 string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
-Your program may crash or loop indefinitely.
+Your program may crash or loop indefinitely or give wrong results.
 <pre>
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
@ -3771,6 +3788,12 @@ a backreference.
 This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
 that uses a backreference for the condition, or a test for recursion in a
 specific capture group. These are not supported.
 <pre>
  PCRE2_ERROR_DFA_UINVALID_UTF
 </pre>
 This return is given if <b>pcre2_dfa_match()</b> is called for a pattern that
 was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
 matching.
 <pre>
  PCRE2_ERROR_DFA_WSSIZE
 </pre>
@ -3808,7 +3831,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 14 February 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
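
The pcre2api changes above describe when the up-front UTF validity check runs and where it starts when a non-zero starting offset is given. The rules can be modelled in a few lines. This is a minimal sketch in Python, not PCRE2 code: the function names `utf_check_applies` and `check_start` are hypothetical, the flags are plain booleans rather than the real option bits, and the subject is assumed to be UTF-8 with a starting offset that already points at the first code unit of a character.

```python
def utf_check_applies(utf: bool, match_invalid_utf: bool, no_utf_check: bool) -> bool:
    """Model of the documented rule: is the subject validated before matching?"""
    if match_invalid_utf:
        # PCRE2_MATCH_INVALID_UTF forces UTF mode and makes PCRE2_NO_UTF_CHECK
        # irrelevant; invalid sequences are handled during matching instead.
        return False
    return utf and not no_utf_check


def check_start(subject: bytes, start_offset: int, max_lookbehind: int) -> int:
    """Byte offset where the validity check would begin (illustrative only).

    Steps back max_lookbehind characters from start_offset, stopping at the
    start of the subject if there are not that many characters.
    """
    pos = start_offset
    for _ in range(max_lookbehind):
        if pos == 0:
            break
        pos -= 1
        # Skip UTF-8 continuation bytes (10xxxxxx) to reach the lead byte.
        while pos > 0 and (subject[pos] & 0xC0) == 0x80:
            pos -= 1
    return pos


subject = b"h\xc3\xa9llo w\xc3\xb6rld"      # "héllo wörld" in UTF-8
print(utf_check_applies(True, False, False))  # default UTF mode
print(check_start(subject, 10, 2))            # two characters before offset 10
```

A pattern with no lookbehind corresponds to `max_lookbehind == 0`, in which case the check begins at the starting offset itself. Since \b and \B count as one-character lookbehinds, a pattern using them would correspond to `max_lookbehind >= 1`.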
--- a/doc/html/pcre2jit.html
+++ b/doc/html/pcre2jit.html
@ -147,25 +147,29 @@ pattern.
 </P>
 <br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
 <P>
-When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
+When a pattern is compiled with the PCRE2_UTF option, subject strings are
-function expects its subject string to be a valid sequence of UTF code units.
+normally expected to be a valid sequence of UTF code units. By default, this is
-If it is not, the result is undefined. This is also true by default of matching
+checked at the start of matching and an error is generated if invalid UTF is
-via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
+detected. The PCRE2_NO_UTF_CHECK option can be passed to <b>pcre2_match()</b> to
-<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
+skip the check (for improved performance) if you are sure that a subject string
-UTF is compiled.
+is valid. If this option is used with an invalid string, the result is
 undefined.
 </P>
 <P>
-In this mode, an invalid code unit sequence never matches any pattern item. It
+However, a way of running matches on strings that may contain invalid UTF
-does not match dot, it does not match \p{Any}, it does not even match negative
+sequences is available. Calling <b>pcre2_compile()</b> with the
-items such as [^X]. A lookbehind assertion fails if it encounters an invalid
+PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
-sequence while moving the current point backwards. In other words, an invalid
+<b>pcre2_match()</b> to support invalid UTF, and, if <b>pcre2_jit_compile()</b>
-UTF code unit sequence acts as a barrier which no match can cross. Reaching an
+is called, the compiled JIT code also supports invalid UTF. Details of how this
-invalid sequence causes an immediate backtrack.
+support works, in both the JIT and the interpretive cases, are given in the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
 documentation.
 </P>
 <P>
-Using this option, an application can run matches in arbitrary data, knowing
+There is also an obsolete option for <b>pcre2_jit_compile()</b> called
-that any matched strings that are returned will be valid UTF. This can be
+PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
-useful when searching for text in executable or other binary files.
+It is superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF
 and should no longer be used. It may be removed in future.
 </P>
 <br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
 <P>
@ -461,7 +465,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC14" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 06 March 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
--- a/doc/html/pcre2matching.html
+++ b/doc/html/pcre2matching.html
@ -188,6 +188,10 @@ code unit) at a time, for all active paths through the tree.
 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
 supported. (*FAIL) is supported, and behaves like a failing negative assertion.
 </P>
 <P>
 10. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not 
 supported by <b>pcre2_dfa_match()</b>.
 </P>
 <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
 <P>
 Using the alternative matching algorithm provides the following advantages:
@ -219,7 +223,8 @@ because it has to search for all possible matches, but is also because it is
 less susceptible to optimization.
 </P>
 <P>
-2. Capturing parentheses, backreferences, and script runs are not supported.
+2. Capturing parentheses, backreferences, script runs, and matching within 
 invalid UTF strings are not supported.
 </P>
 <P>
 3. Although atomic groups are supported, their use does not provide the
@ -236,9 +241,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 10 October 2018
+Last updated: 23 May 2019
 <br>
-Copyright &copy; 1997-2018 University of Cambridge.
+Copyright &copy; 1997-2019 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@ -91,10 +91,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
 specified for the 32-bit library, in which case it constrains the character
 values to valid Unicode code points. To process UTF strings, PCRE2 must be
 built to include Unicode support (which is the default). When using UTF strings
-you must either call the compiling function with the PCRE2_UTF option, or the
+you must either call the compiling function with one or both of the PCRE2_UTF
-pattern must start with the special sequence (*UTF), which is equivalent to
+or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
-setting the relevant option. How setting a UTF mode affects pattern matching is
+sequence (*UTF), which is equivalent to setting the PCRE2_UTF option. How
-mentioned in several places below. There is also a summary of features in the
+setting a UTF mode affects pattern matching is mentioned in several places
 below. There is also a summary of features in the
 <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
 page.
 </P>
@ -428,11 +429,11 @@ There may be any number of hexadecimal digits. This syntax is from ECMAScript
 6.
 </P>
 <P>
-The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
+The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
-is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
+UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
-\N{name} to specify characters by Unicode name; PCRE2 does not support this.
+does not support this. Note that when \N is not followed by an opening brace
-Note that when \N is not followed by an opening brace (curly bracket) it has
+(curly bracket) it has an entirely different meaning, matching any character
-an entirely different meaning, matching any character that is not a newline.
+that is not a newline.
 </P>
 <P>
 There are some legacy applications where the escape sequence \r is expected to
@ -1360,7 +1361,7 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
 with a malformed UTF character. This has undefined results, because PCRE2
 assumes that it is matching character by character in a valid UTF string (by
 default it checks the subject string's validity at the start of processing
-unless the PCRE2_NO_UTF_CHECK option is used).
+unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
 </P>
 <P>
 An application can lock out the use of \C by setting the
@ -3727,7 +3728,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC31" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 12 February 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@ -613,6 +613,7 @@ for a description of the effects of these options.
      firstline                 set PCRE2_FIRSTLINE
      literal                   set PCRE2_LITERAL
      match_line                set PCRE2_EXTRA_MATCH_LINE
      match_invalid_utf         set PCRE2_MATCH_INVALID_UTF 
      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
      match_word                set PCRE2_EXTRA_MATCH_WORD
  /m  multiline                 set PCRE2_MULTILINE
@ -2078,7 +2079,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 11 March 2019
+Last updated: 23 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@ -16,22 +16,33 @@ please consult the man page, in case the conversion went wrong.
 UNICODE AND UTF SUPPORT
 </b><br>
 <P>
-When PCRE2 is built with Unicode support (which is the default), it has
+PCRE2 is normally built with Unicode support, though if you do not need it, you
-knowledge of Unicode character properties and can process text strings in
+can build it without, in which case the library will be smaller. With Unicode
-UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
+support, PCRE2 has knowledge of Unicode character properties and can process
-default, PCRE2 assumes that one code unit is one character. To process a
+text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
-pattern as a UTF string, where a character may require more than one code unit,
+width), but this is not the default. Unless specifically requested, PCRE2
-you must call
+treats each code unit in a string as one character.
 <a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
 with the PCRE2_UTF option flag, or the pattern must start with the sequence
 (*UTF). When either of these is the case, both the pattern and any subject
 strings that are matched against it are treated as UTF strings instead of
 strings of individual one-code-unit characters. There are also some other
 changes to the way characters are handled, as documented below.
 </P>
 <P>
-If you do not need Unicode support you can build PCRE2 without it, in which
+There are two ways of telling PCRE2 to switch to UTF mode, where characters may 
-case the library will be smaller.
+consist of more than one code unit and the range of values is constrained. The 
 program can call
 <a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
 with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
 However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
 That is, the programmer can prevent the supplier of the pattern from switching 
 to UTF mode.
 </P>
 <P>
 Note that the PCRE2_MATCH_INVALID_UTF option (see
 <a href="#matchinvalid">below)</a>
 forces PCRE2_UTF to be set.
 </P>
 <P>
 In UTF mode, both the pattern and any subject strings that are matched against
 it are treated as UTF strings instead of strings of individual one-code-unit
 characters. There are also some other changes to the way characters are
 handled, as documented below.
 </P>
 <br><b>
 UNICODE PROPERTY SUPPORT
@ -63,22 +74,22 @@ also recognized; larger ones can be coded using \o{...}.
 <P>
 The escape sequence \N{U+&#60;hex digits&#62;} is recognized as another way of
 specifying a Unicode character by code point in a UTF mode. It is not allowed
-in non-UTF modes.
+in non-UTF mode.
 </P>
 <P>
-In UTF modes, repeat quantifiers apply to complete UTF characters, not to
+In UTF mode, repeat quantifiers apply to complete UTF characters, not to
 individual code units.
 </P>
 <P>
-In UTF modes, the dot metacharacter matches one UTF character instead of a
+In UTF mode, the dot metacharacter matches one UTF character instead of a
 single code unit.
 </P>
 <P>
-In UTF modes, capture group names are not restricted to ASCII, and may contain
+In UTF mode, capture group names are not restricted to ASCII, and may contain
 any Unicode letters and decimal digits, as well as underscore.
 </P>
 <P>
-The escape sequence \C can be used to match a single code unit in a UTF mode,
+The escape sequence \C can be used to match a single code unit in UTF mode,
 but its use can lead to some strange effects because it breaks up multi-unit
 characters (see the description of \C in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
@ -93,7 +104,7 @@ may consist of more than one code unit. The use of \C in these modes provokes
 a match-time error. Also, the JIT optimization does not support \C in these
 modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
 contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
-the matching will be carried out by the normal interpretive function.
+the matching will be carried out by the interpretive function.
 </P>
 <P>
 The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
@ -123,14 +134,14 @@ However, the special horizontal and vertical white space matching escapes (\h,
 not PCRE2_UCP is set.
 </P>
 <br><b>
-CASE-EQUIVALENCE IN UTF MODES
+CASE-EQUIVALENCE IN UTF MODE
 </b><br>
 <P>
-Case-insensitive matching in a UTF mode makes use of Unicode properties except
+Case-insensitive matching in UTF mode makes use of Unicode properties except
 for characters whose code points are less than 128 and that have at most two
 case-equivalent values. For these, a direct table lookup is used for speed. A
 few Unicode characters such as Greek sigma have more than two code points that
-are case-equivalent, and these are treated as such.
+are case-equivalent, and these are treated specially.
 <a name="scriptruns"></a></P>
 <br><b>
 SCRIPT RUNS
@ -248,7 +259,7 @@ VALIDITY OF UTF STRINGS
 <P>
 When the PCRE2_UTF option is set, the strings passed as patterns and subjects
 are (by default) checked for validity on entry to the relevant functions. If an
-invalid UTF string is passed, an negative error code is returned. The code unit
+invalid UTF string is passed, a negative error code is returned. The code unit
 offset to the offending character can be extracted from the match data block by
 calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
 error.
@ -263,17 +274,16 @@ only valid UTF code unit sequences.
 </P>
 <P>
 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
-is usually undefined and your program may crash or loop indefinitely. There is,
+is undefined and your program may crash or loop indefinitely or give incorrect
-however, one mode of matching that can handle invalid UTF subject strings. This
+results. There is, however, one mode of matching that can handle invalid UTF
-is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
+subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
-when calling <b>pcre2_jit_compile()</b>. For details, see the
+<b>pcre2_compile()</b> and is discussed in the next section. The rest of
-<a href="pcre2jit.html"><b>pcre2jit</b></a>
+this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
 documentation.
 </P>
 <P>
-Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
+Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the UTF check
-the pattern; it does not also apply to subject strings. If you want to disable
+for the pattern; it does not also apply to subject strings. If you want to
-the check for a subject string you must pass this same option to
+disable the check for a subject string you must pass this same option to
 <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
 </P>
 <P>
@ -352,7 +362,7 @@ these code points are excluded by RFC 3629.
 <pre>
  PCRE2_ERROR_UTF8_ERR13
 </pre>
-A 4-byte character has a value greater than 0x10fff; these code points are
+A 4-byte character has a value greater than 0x10ffff; these code points are
 excluded by RFC 3629.
 <pre>
  PCRE2_ERROR_UTF8_ERR14
@ -405,7 +415,59 @@ The following negative error codes are given for invalid UTF-32 strings:
  PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
  PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
-</PRE>
+<a name="matchinvalid"></a></PRE>
 </P>
 <br><b>
 MATCHING IN INVALID UTF STRINGS
 </b><br>
 <P>
 You can run pattern matches on subject strings that may contain invalid UTF
 sequences if you call <b>pcre2_compile()</b> with the PCRE2_MATCH_INVALID_UTF
 option. This is supported by <b>pcre2_match()</b>, including JIT matching, but
 not by <b>pcre2_dfa_match()</b>. When PCRE2_MATCH_INVALID_UTF is set, it forces
 PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a 
 valid UTF string.
 </P>
 <P>
 Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
 generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
 generate different code. If JIT is not used, the option affects the behaviour
 of the interpretive code in <b>pcre2_match()</b>. When PCRE2_MATCH_INVALID_UTF
 is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
 </P>
 <P>
 In this mode, an invalid code unit sequence in the subject never matches any
 pattern item. It does not match dot, it does not match \p{Any}, it does not
 even match negative items such as [^X]. A lookbehind assertion fails if it
 encounters an invalid sequence while moving the current point backwards. In
 other words, an invalid UTF code unit sequence acts as a barrier which no match
 can cross.
 </P>
 <P>
 You can also think of this as the subject being split up into fragments of
 valid UTF, delimited internally by invalid code unit sequences. The pattern is
 matched fragment by fragment. The result of a successful match, however, is
 given as code unit offsets in the entire subject string in the usual way. There
 are a few points to consider:
 </P>
 <P>
 The internal boundaries are not interpreted as the beginnings or ends of lines
 and so do not match circumflex or dollar characters in the pattern.
 </P>
 <P>
 If <b>pcre2_match()</b> is called with an offset that points to an invalid
 UTF sequence, that sequence is skipped, and the match starts at the next valid
 UTF character, or the end of the subject.
 </P>
 <P>
 At internal fragment boundaries, \b and \B behave in the same way as at the
 beginning and end of the subject. For example, a sequence such as \bWORD\b 
 would match an instance of WORD that is surrounded by invalid UTF code units.
 </P>
 <P>
 Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
 data, knowing that any matched strings that are returned are valid UTF. This
 can be useful when searching for UTF text in executable or other binary files.
 </P>
 <br><b>
 AUTHOR
@ -422,7 +484,7 @@ Cambridge, England.
 REVISION
 </b><br>
 <P>
-Last updated: 06 March 2019
+Last updated: 24 May 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
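
The new "MATCHING IN INVALID UTF STRINGS" section describes the subject as being split into fragments of valid UTF, with invalid code unit sequences acting as barriers that no match can cross. That mental model can be sketched as follows. This is an illustration in Python under stated assumptions, not how PCRE2 implements the feature: `valid_utf8_fragments` and `search_all` are hypothetical helpers, the subject is assumed to be UTF-8, invalid input is skipped one byte at a time, and Python's `re` module stands in for the matcher.

```python
import re


def _char_length(data: bytes, pos: int) -> int:
    """Length in bytes of the valid UTF-8 character at pos, or 0 if none."""
    for length in range(1, 5):
        chunk = data[pos:pos + length]
        if len(chunk) < length:
            break  # ran off the end of the subject
        try:
            chunk.decode("utf-8")  # strict decoding rejects malformed input
            return length
        except UnicodeDecodeError:
            continue
    return 0


def valid_utf8_fragments(subject: bytes) -> list:
    """Return (start, end) byte offsets of maximal runs of valid UTF-8."""
    fragments = []
    frag_start = None
    pos = 0
    while pos < len(subject):
        length = _char_length(subject, pos)
        if length == 0:
            # Invalid byte: close the current fragment and step over it.
            if frag_start is not None:
                fragments.append((frag_start, pos))
                frag_start = None
            pos += 1
        else:
            if frag_start is None:
                frag_start = pos
            pos += length
    if frag_start is not None:
        fragments.append((frag_start, len(subject)))
    return fragments


def search_all(pattern: str, subject: bytes) -> list:
    """Match fragment by fragment; report offsets in the whole subject."""
    results = []
    for frag_start, frag_end in valid_utf8_fragments(subject):
        text = subject[frag_start:frag_end].decode("utf-8")
        for m in re.finditer(pattern, text):
            # Convert character offsets back to byte (code unit) offsets.
            start = frag_start + len(text[:m.start()].encode("utf-8"))
            end = frag_start + len(text[:m.end()].encode("utf-8"))
            results.append((start, end))
    return results


print(valid_utf8_fragments(b"caf\xc3\xa9 \xff WORD \xc3\x28 ok"))
print(search_all(r"\bWORD\b", b"\xffWORD\xff"))
print(search_all(r"[^X]+", b"ab\xffcd"))
```

One deliberate simplification: because each fragment is handed to `re` as if it were a complete subject, circumflex and dollar would match at fragment edges in this sketch. The documentation above says the opposite for PCRE2, where internal boundaries are not treated as line beginnings or ends. The \b behaviour at fragment edges and the refusal of negated classes such as [^X] to consume invalid bytes do follow the description.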
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
--- a/doc/pcre2_compile.3
+++ b/doc/pcre2_compile.3
@ -1,4 +1,4 @@
-.TH PCRE2_COMPILE 3 "11 February 2019" "PCRE2 10.33"
+.TH PCRE2_COMPILE 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH SYNOPSIS
@ -53,6 +53,7 @@ The option bits are:
  PCRE2_EXTENDED           Ignore white space and # comments
  PCRE2_FIRSTLINE          Force matching to be before newline
  PCRE2_LITERAL            Pattern characters are all literal
  PCRE2_MATCH_INVALID_UTF  Enable support for matching invalid UTF 
  PCRE2_MATCH_UNSET_BACKREF  Match unset backreferences
  PCRE2_MULTILINE          ^ and $ match newlines within data
  PCRE2_NEVER_BACKSLASH_C  Lock out the use of \eC in patterns
--- a/doc/pcre2_jit_compile.3
+++ b/doc/pcre2_jit_compile.3
@ -1,4 +1,4 @@
-.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
+.TH PCRE2_JIT_COMPILE 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH SYNOPSIS
@ -29,8 +29,11 @@ bits:
  PCRE2_JIT_COMPLETE      compile code for full matching
  PCRE2_JIT_PARTIAL_SOFT  compile code for soft partial matching
  PCRE2_JIT_PARTIAL_HARD  compile code for hard partial matching
  PCRE2_JIT_INVALID_UTF   compile code to handle invalid UTF
 .sp
 There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been 
 superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF. The old 
 option is deprecated and may be removed in future.
 .P
 The yield of the function is 0 for success, or a negative error code otherwise.
 In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
 if an unknown bit is set in \fIoptions\fP.
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@ -1,4 +1,4 @@
-.TH PCRE2API 3 "14 February 2019" "PCRE2 10.33"
+.TH PCRE2API 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@ -1285,13 +1285,14 @@ and \fBpcre2_compile()\fP returns a non-NULL value.
 .P
 There are nearly 100 positive error codes that \fBpcre2_compile()\fP may return
 if it finds an error in the pattern. There are also some negative error codes
-that are used for invalid UTF strings. These are the same as given by
+that are used for invalid UTF strings when validity checking is in force. These
-\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described in the
+are the same as given by \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and
 are described in the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
-page. There is no separate documentation for the positive error codes, because
+documentation. There is no separate documentation for the positive error codes,
-the textual error messages that are obtained by calling the
+because the textual error messages that are obtained by calling the
 \fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
 message"
 .\" HTML <a href="#geterrormessage">
@ -1557,10 +1558,20 @@ expression engine is not the most efficient way of doing it. If you are doing a
 lot of literal matching and are worried about efficiency, you should consider
 using other approaches. The only other main options that are allowed with
 PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
-PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
+PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
-PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
+PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
-and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
-error.
+PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
 .sp
  PCRE2_MATCH_INVALID_UTF
 .sp
 This option forces PCRE2_UTF (see below) and also enables support for matching
 by \fBpcre2_match()\fP in subject strings that contain invalid UTF sequences.
 This facility is not supported for DFA matching. For details, see the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
 documentation.
 .sp
  PCRE2_MATCH_UNSET_BACKREF
 .sp
@ -2635,15 +2646,23 @@ of JIT; it forces matching to be done by the interpreter.
  PCRE2_NO_UTF_CHECK
 .sp
 When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
-string is checked by default when \fBpcre2_match()\fP is subsequently called.
+string is checked unless PCRE2_NO_UTF_CHECK is passed to \fBpcre2_match()\fP or
-If a non-zero starting offset is given, the check is applied only to that part
+PCRE2_MATCH_INVALID_UTF was passed to \fBpcre2_compile()\fP. The latter special
-of the subject that could be inspected during matching, and there is a check
+case is discussed in detail in the
-that the starting offset points to the first code unit of a character or to the
+.\" HREF
-end of the subject. If there are no lookbehind assertions in the pattern, the
+\fBpcre2unicode\fP
-check starts at the starting offset. Otherwise, it starts at the length of the
+.\"
-longest lookbehind before the starting offset, or at the start of the subject
+documentation.
-if there are not that many characters before the starting offset. Note that the
+.P
-sequences \eb and \eB are one-character lookbehinds.
+In the default case, if a non-zero starting offset is given, the check is
 applied only to that part of the subject that could be inspected during
 matching, and there is a check that the starting offset points to the first
 code unit of a character or to the end of the subject. If there are no
 lookbehind assertions in the pattern, the check starts at the starting offset.
 Otherwise, it starts at the length of the longest lookbehind before the
 starting offset, or at the start of the subject if there are not that many
 characters before the starting offset. Note that the sequences \eb and \eB are
 one-character lookbehinds.
 .P
 The check is carried out before any other processing takes place, and a
 negative error code is returned if the check fails. There are several UTF error
@ -2666,17 +2685,18 @@ in the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
-page.
+documentation.
 .P
-If you know that your subject is valid, and you want to skip these checks for
+If you know that your subject is valid, and you want to skip this check for
 performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
 \fBpcre2_match()\fP. You might want to do this for the second and subsequent
-calls to \fBpcre2_match()\fP if you are making repeated calls to find other
+calls to \fBpcre2_match()\fP if you are making repeated calls to find multiple
 matches in the same subject string.
 .P
-\fBWarning:\fP When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
+\fBWarning:\fP Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
 PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
 string as a subject, or an invalid value of \fIstartoffset\fP, is undefined.
-Your program may crash or loop indefinitely.
+Your program may crash or loop indefinitely or give wrong results.
 .sp
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
@ -3774,6 +3794,12 @@ a backreference.
 This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
 that uses a backreference for the condition, or a test for recursion in a
 specific capture group. These are not supported.
 .sp
  PCRE2_ERROR_DFA_UINVALID_UTF
 .sp
 This return is given if \fBpcre2_dfa_match()\fP is called for a pattern that
 was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
 matching.
 .sp
  PCRE2_ERROR_DFA_WSSIZE
 .sp
@ -3817,6 +3843,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 14 February 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
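
The option rules documented in the pcre2api changes above can be modelled in a few lines. This is a minimal sketch, not PCRE2 code: the `SK_` names and helper functions are hypothetical, the values of `SK_MATCH_INVALID_UTF` and the DFA error code are copied from the header changes in this commit, and the values of `SK_UTF` and `SK_NO_UTF_CHECK` are assumed to match pcre2.h.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for PCRE2 option bits (hypothetical names). */
#define SK_UTF                     0x00080000u  /* assumed value */
#define SK_NO_UTF_CHECK            0x40000000u  /* assumed value */
#define SK_MATCH_INVALID_UTF       0x04000000u  /* from this commit */
#define SK_ERROR_DFA_UINVALID_UTF  (-66)        /* from this commit */

/* PCRE2_MATCH_INVALID_UTF forces PCRE2_UTF at compile time. */
static uint32_t sk_effective_compile_options(uint32_t options)
{
  if ((options & SK_MATCH_INVALID_UTF) != 0) options |= SK_UTF;
  return options;
}

/* Returns 1 if the subject gets the up-front validity check that rejects
   invalid UTF with an error, following the documented rule: checked in UTF
   mode unless NO_UTF_CHECK is given at match time or MATCH_INVALID_UTF was
   given at compile time. */
static int sk_subject_is_prechecked(uint32_t compile_options,
  uint32_t match_options)
{
  uint32_t opts = sk_effective_compile_options(compile_options);
  /* Invariant: MATCH_INVALID_UTF never appears without UTF. */
  assert((opts & SK_MATCH_INVALID_UTF) == 0 || (opts & SK_UTF) != 0);
  if ((opts & SK_UTF) == 0) return 0;
  if ((opts & SK_MATCH_INVALID_UTF) != 0) return 0;
  return (match_options & SK_NO_UTF_CHECK) == 0;
}

/* DFA matching rejects patterns compiled with MATCH_INVALID_UTF. */
static int sk_dfa_precheck(uint32_t compile_options)
{
  uint32_t opts = sk_effective_compile_options(compile_options);
  return ((opts & SK_MATCH_INVALID_UTF) != 0) ? SK_ERROR_DFA_UINVALID_UTF : 0;
}
```

In this model the match-time NO_UTF_CHECK bit makes no difference once MATCH_INVALID_UTF is set, which mirrors the statement in pcre2unicode that the option is ignored at match time in that case.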
--- a/doc/pcre2jit.3
+++ b/doc/pcre2jit.3
@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
+.TH PCRE2JIT 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -123,23 +123,29 @@ pattern.
 .SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
 .rs
 .sp
-When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
+When a pattern is compiled with the PCRE2_UTF option, subject strings are
-function expects its subject string to be a valid sequence of UTF code units.
+normally expected to be a valid sequence of UTF code units. By default, this is
-If it is not, the result is undefined. This is also true by default of matching
+checked at the start of matching and an error is generated if invalid UTF is
-via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
+detected. The PCRE2_NO_UTF_CHECK option can be passed to \fBpcre2_match()\fP to
-\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
+skip the check (for improved performance) if you are sure that a subject string
-UTF is compiled.
+is valid. If this option is used with an invalid string, the result is
 undefined.
 .P
-In this mode, an invalid code unit sequence never matches any pattern item. It
+However, a way of running matches on strings that may contain invalid UTF
-does not match dot, it does not match \ep{Any}, it does not even match negative
+sequences is available. Calling \fBpcre2_compile()\fP with the
-items such as [^X]. A lookbehind assertion fails if it encounters an invalid
+PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
-sequence while moving the current point backwards. In other words, an invalid
+\fBpcre2_match()\fP to support invalid UTF, and, if \fBpcre2_jit_compile()\fP
-UTF code unit sequence acts as a barrier which no match can cross. Reaching an
+is called, the compiled JIT code also supports invalid UTF. Details of how this
-invalid sequence causes an immediate backtrack.
+support works, in both the JIT and the interpretive cases, are given in the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
 documentation.
 .P
-Using this option, an application can run matches in arbitrary data, knowing
+There is also an obsolete option for \fBpcre2_jit_compile()\fP called
-that any matched strings that are returned will be valid UTF. This can be
+PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
-useful when searching for text in executable or other binary files.
+It is superseded by the \fBpcre2_compile()\fP option PCRE2_MATCH_INVALID_UTF
 and should no longer be used. It may be removed in future.
 .
 .
 .SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
@ -438,6 +444,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 06 March 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2matching.3
+++ b/doc/pcre2matching.3
@ -1,4 +1,4 @@
-.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
+.TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 MATCHING ALGORITHMS"
@ -157,6 +157,9 @@ code unit) at a time, for all active paths through the tree.
 .P
 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
 supported. (*FAIL) is supported, and behaves like a failing negative assertion.
 .P
 10. The PCRE2_MATCH_INVALID_UTF option for \fBpcre2_compile()\fP is not 
 supported by \fBpcre2_dfa_match()\fP.
 .
 .
 .SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
@ -191,7 +194,8 @@ The alternative algorithm suffers from a number of disadvantages:
 because it has to search for all possible matches, but is also because it is
 less susceptible to optimization.
 .P
-2. Capturing parentheses, backreferences, and script runs are not supported.
+2. Capturing parentheses, backreferences, script runs, and matching within
 invalid UTF strings are not supported.
 .P
 3. Although atomic groups are supported, their use does not provide the
 performance advantage that it does for the standard algorithm.
@ -211,6 +215,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 October 2018
+Last updated: 23 May 2019
-Copyright (c) 1997-2018 University of Cambridge.
+Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "12 February 2019" "PCRE2 10.33"
+.TH PCRE2PATTERN 3 "23 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -52,10 +52,11 @@ single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
 specified for the 32-bit library, in which case it constrains the character
 values to valid Unicode code points. To process UTF strings, PCRE2 must be
 built to include Unicode support (which is the default). When using UTF strings
-you must either call the compiling function with the PCRE2_UTF option, or the
+you must either call the compiling function with one or both of the PCRE2_UTF
-pattern must start with the special sequence (*UTF), which is equivalent to
+or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
-setting the relevant option. How setting a UTF mode affects pattern matching is
+sequence (*UTF), which is equivalent to setting PCRE2_UTF. How
-mentioned in several places below. There is also a summary of features in the
+setting a UTF mode affects pattern matching is mentioned in several places
 below. There is also a summary of features in the
 .\" HREF
 \fBpcre2unicode\fP
 .\"
@ -398,11 +399,11 @@ PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
 There may be any number of hexadecimal digits. This syntax is from ECMAScript
 6.
 .P
-The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
+The \eN{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
-is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
+UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
-\eN{name} to specify characters by Unicode name; PCRE2 does not support this.
+does not support this. Note that when \eN is not followed by an opening brace
-Note that when \eN is not followed by an opening brace (curly bracket) it has
+(curly bracket) it has an entirely different meaning, matching any character
-an entirely different meaning, matching any character that is not a newline.
+that is not a newline.
 .P
 There are some legacy applications where the escape sequence \er is expected to
 match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
@ -1352,7 +1353,7 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
 with a malformed UTF character. This has undefined results, because PCRE2
 assumes that it is matching character by character in a valid UTF string (by
 default it checks the subject string's validity at the start of processing
-unless the PCRE2_NO_UTF_CHECK option is used).
+unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
 .P
 An application can lock out the use of \eC by setting the
 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
@ -3763,6 +3764,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 12 February 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "11 March 2019" "PCRE 10.33"
+.TH PCRE2TEST 1 "23 May 2019" "PCRE 10.34"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@ -572,6 +572,7 @@ for a description of the effects of these options.
      firstline                 set PCRE2_FIRSTLINE
      literal                   set PCRE2_LITERAL
      match_line                set PCRE2_EXTRA_MATCH_LINE
      match_invalid_utf         set PCRE2_MATCH_INVALID_UTF 
      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
      match_word                set PCRE2_EXTRA_MATCH_WORD
  /m  multiline                 set PCRE2_MULTILINE
@ -2059,6 +2060,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 11 March 2019
+Last updated: 23 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@ -551,6 +551,7 @@ PATTERN MODIFIERS
             firstline                 set PCRE2_FIRSTLINE
             literal                   set PCRE2_LITERAL
             match_line                set PCRE2_EXTRA_MATCH_LINE
             match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
             match_word                set PCRE2_EXTRA_MATCH_WORD
         /m  multiline                 set PCRE2_MULTILINE
@ -1890,5 +1891,5 @@ AUTHOR
 REVISION
-       Last updated: 11 March 2019
+       Last updated: 23 May 2019
       Copyright (c) 1997-2019 University of Cambridge.
--- a/doc/pcre2unicode.3
+++ b/doc/pcre2unicode.3
@ -1,26 +1,38 @@
-.TH PCRE2UNICODE 3 "11 May 2019" "PCRE2 10.33"
+.TH PCRE2UNICODE 3 "24 May 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "UNICODE AND UTF SUPPORT"
 .rs
 .sp
-When PCRE2 is built with Unicode support (which is the default), it has
+PCRE2 is normally built with Unicode support, though if you do not need it, you
-knowledge of Unicode character properties and can process text strings in
+can build it without, in which case the library will be smaller. With Unicode
-UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
+support, PCRE2 has knowledge of Unicode character properties and can process
-default, PCRE2 assumes that one code unit is one character. To process a
+text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
-pattern as a UTF string, where a character may require more than one code unit,
+width), but this is not the default. Unless specifically requested, PCRE2
-you must call
+treats each code unit in a string as one character.
 .P
 There are two ways of telling PCRE2 to switch to UTF mode, where characters may 
 consist of more than one code unit and the range of values is constrained. The 
 program can call
 .\" HREF
 \fBpcre2_compile()\fP
 .\"
-with the PCRE2_UTF option flag, or the pattern must start with the sequence
+with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
-(*UTF). When either of these is the case, both the pattern and any subject
+However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
-strings that are matched against it are treated as UTF strings instead of
+That is, the programmer can prevent the supplier of the pattern from switching 
-strings of individual one-code-unit characters. There are also some other
+to UTF mode.
 changes to the way characters are handled, as documented below.
 .P
-If you do not need Unicode support you can build PCRE2 without it, in which
+Note that the PCRE2_MATCH_INVALID_UTF option (see
-case the library will be smaller.
+.\" HTML <a href="#matchinvalid">
 .\" </a>
 below)
 .\"
 forces PCRE2_UTF to be set.
 .P
 In UTF mode, both the pattern and any subject strings that are matched against
 it are treated as UTF strings instead of strings of individual one-code-unit
 characters. There are also some other changes to the way characters are
 handled, as documented below.
 .
 .
 .SH "UNICODE PROPERTY SUPPORT"
@ -55,18 +67,18 @@ also recognized; larger ones can be coded using \eo{...}.
 .P
 The escape sequence \eN{U+<hex digits>} is recognized as another way of
 specifying a Unicode character by code point in a UTF mode. It is not allowed
-in non-UTF modes.
+in non-UTF mode.
 .P
-In UTF modes, repeat quantifiers apply to complete UTF characters, not to
+In UTF mode, repeat quantifiers apply to complete UTF characters, not to
 individual code units.
 .P
-In UTF modes, the dot metacharacter matches one UTF character instead of a
+In UTF mode, the dot metacharacter matches one UTF character instead of a
 single code unit.
 .P
-In UTF modes, capture group names are not restricted to ASCII, and may contain
+In UTF mode, capture group names are not restricted to ASCII, and may contain
 any Unicode letters and decimal digits, as well as underscore.
 .P
-The escape sequence \eC can be used to match a single code unit in a UTF mode,
+The escape sequence \eC can be used to match a single code unit in UTF mode,
 but its use can lead to some strange effects because it breaks up multi-unit
 characters (see the description of \eC in the
 .\" HREF
@ -82,7 +94,7 @@ may consist of more than one code unit. The use of \eC in these modes provokes
 a match-time error. Also, the JIT optimization does not support \eC in these
 modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
 contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
-the matching will be carried out by the normal interpretive function.
+the matching will be carried out by the interpretive function.
 .P
 The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
 characters of any code value, but, by default, the characters that PCRE2
@ -114,14 +126,14 @@ However, the special horizontal and vertical white space matching escapes (\eh,
 not PCRE2_UCP is set.
 .
 .
-.SH "CASE-EQUIVALENCE IN UTF MODES"
+.SH "CASE-EQUIVALENCE IN UTF MODE"
 .rs
 .sp
-Case-insensitive matching in a UTF mode makes use of Unicode properties except
+Case-insensitive matching in UTF mode makes use of Unicode properties except
 for characters whose code points are less than 128 and that have at most two
 case-equivalent values. For these, a direct table lookup is used for speed. A
 few Unicode characters such as Greek sigma have more than two code points that
-are case-equivalent, and these are treated as such.
+are case-equivalent, and these are treated specially.
 .
 .
 .\" HTML <a name="scriptruns"></a>
@ -231,7 +243,7 @@ adjacent characters.
 .sp
 When the PCRE2_UTF option is set, the strings passed as patterns and subjects
 are (by default) checked for validity on entry to the relevant functions. If an
-invalid UTF string is passed, an negative error code is returned. The code unit
+invalid UTF string is passed, a negative error code is returned. The code unit
 offset to the offending character can be extracted from the match data block by
 calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
 error.
@ -244,18 +256,15 @@ PCRE2 assumes that the pattern or subject it is given (respectively) contains
 only valid UTF code unit sequences.
 .P
 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
-is usually undefined and your program may crash or loop indefinitely. There is,
+is undefined and your program may crash or loop indefinitely or give incorrect
-however, one mode of matching that can handle invalid UTF subject strings. This
+results. There is, however, one mode of matching that can handle invalid UTF
-is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
+subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
-when calling \fBpcre2_jit_compile()\fP. For details, see the
+\fBpcre2_compile()\fP and is discussed below in the next section. The rest of
-.\" HREF
+this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
 \fBpcre2jit\fP
 .\"
 documentation.
 .P
-Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
+Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the UTF check
-the pattern; it does not also apply to subject strings. If you want to disable
+for the pattern; it does not also apply to subject strings. If you want to
-the check for a subject string you must pass this same option to
+disable the check for a subject string you must pass this same option to
 \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
 .P
 UTF-16 and UTF-32 strings can indicate their endianness by special code known
@ -386,6 +395,52 @@ The following negative error codes are given for invalid UTF-32 strings:
 .sp
 .
 .
 .\" HTML <a name="matchinvalid"></a>
 .SH "MATCHING IN INVALID UTF STRINGS"
 .rs
 .sp
 You can run pattern matches on subject strings that may contain invalid UTF
 sequences if you call \fBpcre2_compile()\fP with the PCRE2_MATCH_INVALID_UTF
 option. This is supported by \fBpcre2_match()\fP, including JIT matching, but
 not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
 PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a 
 valid UTF string.
 .P
 Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
 generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
 generate different code. If JIT is not used, the option affects the behaviour
 of the interpretive code in \fBpcre2_match()\fP. When PCRE2_MATCH_INVALID_UTF
 is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
 .P
 In this mode, an invalid code unit sequence in the subject never matches any
 pattern item. It does not match dot, it does not match \ep{Any}, it does not
 even match negative items such as [^X]. A lookbehind assertion fails if it
 encounters an invalid sequence while moving the current point backwards. In
 other words, an invalid UTF code unit sequence acts as a barrier which no match
 can cross.
 .P
 You can also think of this as the subject being split up into fragments of
 valid UTF, delimited internally by invalid code unit sequences. The pattern is
 matched fragment by fragment. The result of a successful match, however, is
 given as code unit offsets in the entire subject string in the usual way. There
 are a few points to consider:
 .P
 The internal boundaries are not interpreted as the beginnings or ends of lines
 and so do not match circumflex or dollar characters in the pattern.
 .P
 If \fBpcre2_match()\fP is called with an offset that points to an invalid
 UTF sequence, that sequence is skipped, and the match starts at the next valid
 UTF character, or the end of the subject.
 .P
 At internal fragment boundaries, \eb and \eB behave in the same way as at the
 beginning and end of the subject. For example, a sequence such as \ebWORD\eb 
 would match an instance of WORD that is surrounded by invalid UTF code units.
 .P
 Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
 data, knowing that any matched strings that are returned are valid UTF. This
 can be useful when searching for UTF text in executable or other binary files.
 .
 .
 .SH AUTHOR
 .rs
 .sp
@ -400,6 +455,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 11 May 2019
+Last updated: 24 May 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
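
The "fragments of valid UTF" picture in the new pcre2unicode section can be illustrated for UTF-8 with a small standalone scanner. This is a sketch under stated assumptions and is not how PCRE2 implements the feature: `utf8_char_len` and `next_fragment` are hypothetical helpers, and skipping one byte at a time past invalid data is a simplification chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Length (1-4) of the well-formed UTF-8 character starting at s[i], or 0 if
   the bytes there are not well-formed UTF-8. Requires i < len. */
static size_t utf8_char_len(const unsigned char *s, size_t len, size_t i)
{
  unsigned char b0, lo = 0x80, hi = 0xBF;  /* allowed range for 2nd byte */
  size_t need, k;

  assert(i < len);
  b0 = s[i];
  if (b0 < 0x80) return 1;
  else if (b0 >= 0xC2 && b0 <= 0xDF) need = 1;
  else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
    need = 2;
    if (b0 == 0xE0) lo = 0xA0;             /* reject overlong forms */
    if (b0 == 0xED) hi = 0x9F;             /* reject surrogates */
    }
  else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
    need = 3;
    if (b0 == 0xF0) lo = 0x90;             /* reject overlong forms */
    if (b0 == 0xF4) hi = 0x8F;             /* reject values > U+10FFFF */
    }
  else return 0;                           /* stray continuation, 0xC0, 0xC1, 0xF5+ */

  if (len - i <= need) return 0;           /* truncated at end of subject */
  if (s[i + 1] < lo || s[i + 1] > hi) return 0;
  for (k = 2; k <= need; k++)
    if ((s[i + k] & 0xC0) != 0x80) return 0;
  return need + 1;
}

/* Find the next non-empty run of valid UTF-8 at or after *pos. On success
   returns 1, sets [*begin, *end) to the run, and advances *pos to *end. */
static int next_fragment(const unsigned char *s, size_t len, size_t *pos,
  size_t *begin, size_t *end)
{
  size_t i = *pos, n;
  while (i < len && utf8_char_len(s, len, i) == 0) i++;   /* skip invalid */
  if (i >= len) { *pos = len; return 0; }
  *begin = i;
  while (i < len && (n = utf8_char_len(s, len, i)) != 0) i += n;
  *end = *pos = i;
  return 1;
}
```

The offsets are code unit offsets into the whole subject, in line with the documentation's statement that match results are reported relative to the entire string rather than to a fragment.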
--- a/src/pcre2.h.generic
+++ b/src/pcre2.h.generic
@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, second API, to be
 #included by applications that call PCRE2 functions.
-           Copyright (c) 2016-2018 University of Cambridge
+           Copyright (c) 2016-2019 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
 /* The current PCRE version information. */
 #define PCRE2_MAJOR           10
-#define PCRE2_MINOR           33
+#define PCRE2_MINOR           34
-#define PCRE2_PRERELEASE      
+#define PCRE2_PRERELEASE      -RC1
-#define PCRE2_DATE            2019-04-16
+#define PCRE2_DATE            2019-04-22
 /* When an application links to a PCRE DLL in Windows, the symbols that are
 imported have to be identified as such. When building PCRE2, the appropriate
@ -142,6 +142,7 @@ D   is inspected during pcre2_dfa_match() execution
 #define PCRE2_USE_OFFSET_LIMIT    0x00800000u  /*   J M D */
 #define PCRE2_EXTENDED_MORE       0x01000000u  /* C       */
 #define PCRE2_LITERAL             0x02000000u  /* C       */
 #define PCRE2_MATCH_INVALID_UTF   0x04000000u  /*   J M D */
 /* An additional compile options word is available in the compile context. */
@ -305,6 +306,7 @@ pcre2_pattern_convert(). */
 #define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS      194
 #define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN        195
 #define PCRE2_ERROR_SCRIPT_RUN_NOT_AVAILABLE       196
 #define PCRE2_ERROR_TOO_MANY_CAPTURES              197
 /* "Expected" matching error codes: no match and partial match. */
@ -390,6 +392,7 @@ released, the numbers must not be changed. */
 #define PCRE2_ERROR_HEAPLIMIT         (-63)
 #define PCRE2_ERROR_CONVERT_SYNTAX    (-64)
 #define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
 #define PCRE2_ERROR_DFA_UINVALID_UTF  (-66)
 /* Request types for pcre2_pattern_info() */
--- a/src/pcre2.h.in
+++ b/src/pcre2.h.in
@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, second API, to be
 #included by applications that call PCRE2 functions.
-           Copyright (c) 2016-2018 University of Cambridge
+           Copyright (c) 2016-2019 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@ -142,6 +142,7 @@ D   is inspected during pcre2_dfa_match() execution
 #define PCRE2_USE_OFFSET_LIMIT    0x00800000u  /*   J M D */
 #define PCRE2_EXTENDED_MORE       0x01000000u  /* C       */
 #define PCRE2_LITERAL             0x02000000u  /* C       */
 #define PCRE2_MATCH_INVALID_UTF   0x04000000u  /*   J M D */
 /* An additional compile options word is available in the compile context. */
@ -391,6 +392,7 @@ released, the numbers must not be changed. */
 #define PCRE2_ERROR_HEAPLIMIT         (-63)
 #define PCRE2_ERROR_CONVERT_SYNTAX    (-64)
 #define PCRE2_ERROR_INTERNAL_DUPMATCH (-65)
 #define PCRE2_ERROR_DFA_UINVALID_UTF  (-66)
 /* Request types for pcre2_pattern_info() */
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@ -746,8 +746,8 @@ are allowed. */
 #define PUBLIC_LITERAL_COMPILE_OPTIONS \
  (PCRE2_ANCHORED|PCRE2_AUTO_CALLOUT|PCRE2_CASELESS|PCRE2_ENDANCHORED| \
-   PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_NO_START_OPTIMIZE| \
+   PCRE2_FIRSTLINE|PCRE2_LITERAL|PCRE2_MATCH_INVALID_UTF| \
-   PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
+   PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK|PCRE2_USE_OFFSET_LIMIT|PCRE2_UTF)
 #define PUBLIC_COMPILE_OPTIONS \
  (PUBLIC_LITERAL_COMPILE_OPTIONS| \
@ -3615,7 +3615,7 @@ while (ptr < ptrend)
            {
            errorcode = ERR97;
            goto FAILED;
-            }    
+            }
          cb->bracount++;
          *parsed_pattern++ = META_CAPTURE | cb->bracount;
          }
@ -4444,7 +4444,7 @@ while (ptr < ptrend)
        {
        errorcode = ERR97;
        goto FAILED;
-        }    
+        }
      cb->bracount++;
      *parsed_pattern++ = META_CAPTURE | cb->bracount;
      nest_depth++;
@ -9503,6 +9503,10 @@ if (pattern == NULL)
 if (ccontext == NULL)
  ccontext = (pcre2_compile_context *)(&PRIV(default_compile_context));
 /* PCRE2_MATCH_INVALID_UTF implies UTF */
 if ((options & PCRE2_MATCH_INVALID_UTF) != 0) options |= PCRE2_UTF; 
 /* Check that all undefined public option bits are zero. */
@ -9682,7 +9686,7 @@ if ((options & PCRE2_LITERAL) == 0)
 ptr += skipatstart;
-/* Can't support UTF or UCP unless PCRE2 has been compiled with UTF support. */
+/* Can't support UTF or UCP if PCRE2 was built without Unicode support. */
 #ifndef SUPPORT_UNICODE
 if ((cb.external_options & (PCRE2_UTF|PCRE2_UCP)) != 0)
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@ -3294,6 +3294,11 @@ time. */
 if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
   ((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
  return PCRE2_ERROR_BADOPTION;
 /* Invalid UTF support is not available for DFA matching. */
 if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0) 
  return PCRE2_ERROR_DFA_UINVALID_UTF;
 /* Check that the first field in the block is the magic number. If it is not,
 return with PCRE2_ERROR_BADMAGIC. */
--- a/src/pcre2_error.c
+++ b/src/pcre2_error.c
@ -269,6 +269,7 @@ static const unsigned char match_error_texts[] =
  "invalid syntax\0"
  /* 65 */
  "internal error - duplicate substitution match\0"
  "PCRE2_MATCH_INVALID_UTF is not supported for DFA matching\0" 
  ;
--- a/src/pcre2_intmodedep.h
+++ b/src/pcre2_intmodedep.h
@ -866,6 +866,7 @@ typedef struct match_block {
  PCRE2_SPTR name_table;          /* Table of group names */
  PCRE2_SPTR start_code;          /* For use when recursing */
  PCRE2_SPTR start_subject;       /* Start of the subject string */
  PCRE2_SPTR check_subject;       /* Where UTF-checked from */
  PCRE2_SPTR end_subject;         /* End of the subject string */
  PCRE2_SPTR end_match_ptr;       /* Subject position at end match */
  PCRE2_SPTR start_used_ptr;      /* Earliest consulted character */
--- a/src/pcre2_jit_compile.c
+++ b/src/pcre2_jit_compile.c
@ -6,8 +6,9 @@
 and semantics are as close as possible to those of the Perl 5 language.
                       Written by Philip Hazel
                    This module by Zoltan Herczeg 
     Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016-2018 University of Cambridge
+          New API code Copyright (c) 2016-2019 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@ -7846,8 +7847,6 @@ if (needstype || needsscript)
  if (needsscript)
    {
 // PH hacking
 //fprintf(stderr, "~~B\n");
      OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
      OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
      OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7901,7 +7900,6 @@ if (needstype || needsscript)
    if (!needschar)
      {
 // PH hacking
 //fprintf(stderr, "~~C\n");
  OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
  OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
  OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, TMP1, 0);
@ -7916,7 +7914,6 @@ if (needstype || needsscript)
    else
      {
 // PH hacking
 //fprintf(stderr, "~~D\n");
  OP2(SLJIT_SHL, TMP1, 0, TMP2, 0, SLJIT_IMM, 2);
      OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 3);
@ -8594,8 +8591,8 @@ uint32_t c;
 /* Patch by PH */
 /* GETCHARINC(c, cc); */
 c = *cc++;
 #if PCRE2_CODE_UNIT_WIDTH == 32
 if (c >= 0x110000)
  return NULL;
@ -9257,8 +9254,6 @@ if (common->utf && *cc == OP_REFI)
  CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop);
 // PH hacking
 //fprintf(stderr, "~~E\n");
  OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);
  add_jump(compiler, &common->getucd, JUMP(SLJIT_FAST_CALL));
@ -14156,49 +14151,87 @@ Returns:        0: success or (*NOJIT) was used
 PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
 pcre2_jit_compile(pcre2_code *code, uint32_t options)
 {
 #ifndef SUPPORT_JIT
 (void)code;
 (void)options;
 return PCRE2_ERROR_JIT_BADOPTION;
 #else  /* SUPPORT_JIT */
 pcre2_real_code *re = (pcre2_real_code *)code;
 executable_functions *functions;
 uint32_t excluded_options;
 int result;
 if (code == NULL)
  return PCRE2_ERROR_NULL;
 if ((options & ~PUBLIC_JIT_COMPILE_OPTIONS) != 0)
  return PCRE2_ERROR_JIT_BADOPTION;
-
+  
 if ((re->flags & PCRE2_NOJIT) != 0) return 0;
 functions = (executable_functions *)re->executable_jit;
 /* Support for invalid UTF was first introduced in JIT, with the option 
 PCRE2_JIT_INVALID_UTF. Later, support was added to the interpreter, and the 
 compile-time option PCRE2_MATCH_INVALID_UTF was created. This is now the 
 preferred feature, with the earlier option deprecated. However, for backward 
 compatibility, if the earlier option is set, it forces the new option so that 
 if JIT matching falls back to the interpreter, there is still support for 
 invalid UTF. However, if this function has already been successfully called
 without PCRE2_JIT_INVALID_UTF and without PCRE2_MATCH_INVALID_UTF (meaning that 
 non-invalid-supporting JIT code was compiled), give an error. 
 If in the future support for PCRE2_JIT_INVALID_UTF is withdrawn, the following 
 actions are needed:
  1. Remove the definition from pcre2.h.in and from the list in
     PUBLIC_JIT_COMPILE_OPTIONS above.
  2. Replace PCRE2_JIT_INVALID_UTF with a local flag in this module.
  3. Replace PCRE2_JIT_INVALID_UTF in pcre2_jit_test.c.
  4. Delete the following short block of code. The setting of "re" and 
     "functions" can be moved into the JIT-only block below, but if that is 
     done, (void)re and (void)functions will be needed in the non-JIT case, to 
     avoid compiler warnings.
 */
 if ((options & PCRE2_JIT_INVALID_UTF) != 0)
  {
  if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) == 0)
    {
    if (functions != NULL) return PCRE2_ERROR_JIT_BADOPTION;
    re->overall_options |= PCRE2_MATCH_INVALID_UTF; 
    }  
  }
 /* The above tests are run with and without JIT support. This means that 
 PCRE2_JIT_INVALID_UTF propagates back into the regex options (ensuring 
 interpreter support) even in the absence of JIT. But now, if there is no JIT
 support, give an error return. */
 #ifndef SUPPORT_JIT
 return PCRE2_ERROR_JIT_BADOPTION;
 #else  /* SUPPORT_JIT */
 /* There is JIT support. Do the necessary. */
 if ((re->flags & PCRE2_NOJIT) != 0) return 0;
 if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
  options |= PCRE2_JIT_INVALID_UTF;  
 if ((options & PCRE2_JIT_COMPLETE) != 0 && (functions == NULL
    || functions->executable_funcs[0] == NULL)) {
-  excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
+  uint32_t excluded_options = (PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD);
-  result = jit_compile(code, options & ~excluded_options);
+  int result = jit_compile(code, options & ~excluded_options);
  if (result != 0)
    return result;
  }
 if ((options & PCRE2_JIT_PARTIAL_SOFT) != 0 && (functions == NULL
    || functions->executable_funcs[1] == NULL)) {
-  excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
+  uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD);
-  result = jit_compile(code, options & ~excluded_options);
+  int result = jit_compile(code, options & ~excluded_options);
  if (result != 0)
    return result;
  }
 if ((options & PCRE2_JIT_PARTIAL_HARD) != 0 && (functions == NULL
    || functions->executable_funcs[2] == NULL)) {
-  excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
+  uint32_t excluded_options = (PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT);
-  result = jit_compile(code, options & ~excluded_options);
+  int result = jit_compile(code, options & ~excluded_options);
  if (result != 0)
    return result;
  }
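The option handling in the pcre2_jit_compile.c hunk above can be sketched in isolation. This is a minimal illustration only, not the library's code: the `sk_` names, the flag values and the error value are hypothetical stand-ins (the real definitions are in pcre2.h), and `has_jit_code` stands in for the `functions != NULL` test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values are defined in pcre2.h. */
#define SK_MATCH_INVALID_UTF  0x00000001u  /* pattern (compile-time) option */
#define SK_JIT_INVALID_UTF    0x00000100u  /* deprecated JIT-compile option */
#define SK_ERROR_BADOPTION    (-1)         /* placeholder error code */

typedef struct {
  uint32_t overall_options;  /* options recorded on the compiled pattern */
  int has_jit_code;          /* non-zero once JIT code has been generated */
} sk_pattern;

/* Reconcile the deprecated JIT option with the pattern option:
   1. If the JIT option is given but the pattern option is not, copy it into
      the pattern, unless JIT code without invalid-UTF support already exists,
      in which case report an error.
   2. If the pattern option is set, make sure the JIT option is set too. */
static int sk_reconcile(sk_pattern *re, uint32_t *jit_options)
{
  assert(re != NULL && jit_options != NULL);
  if ((*jit_options & SK_JIT_INVALID_UTF) != 0 &&
      (re->overall_options & SK_MATCH_INVALID_UTF) == 0)
    {
    if (re->has_jit_code) return SK_ERROR_BADOPTION;
    re->overall_options |= SK_MATCH_INVALID_UTF;
    }
  if ((re->overall_options & SK_MATCH_INVALID_UTF) != 0)
    *jit_options |= SK_JIT_INVALID_UTF;
  return 0;
}
```

In this sketch the first step does not depend on JIT being available, which matches the comment in the hunk that the old option propagates back into the regex options even in the absence of JIT.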
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@ -5412,7 +5412,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
      {
      while (number-- > 0)
        {
-        if (Feptr <= mb->start_subject) RRETURN(MATCH_NOMATCH);
+        if (Feptr <= mb->check_subject) RRETURN(MATCH_NOMATCH);
        Feptr--;
        BACKCHAR(Feptr);
        }
@ -5420,7 +5420,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
    else
 #endif
-    /* No UTF-8 support, or not in UTF-8 mode: count is byte count */
+    /* No UTF-8 support, or not in UTF-8 mode: count is code unit count */
      {
      if ((ptrdiff_t)number > Feptr - mb->start_subject) RRETURN(MATCH_NOMATCH);
@ -5743,7 +5743,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
    case OP_NOT_WORD_BOUNDARY:
    case OP_WORD_BOUNDARY:
-    if (Feptr == mb->start_subject) prev_is_word = FALSE; else
+    if (Feptr == mb->check_subject) prev_is_word = FALSE; else
      {
      PCRE2_SPTR lastptr = Feptr - 1;
 #ifdef SUPPORT_UNICODE
@ -6014,7 +6014,6 @@ int was_zero_terminated = 0;
 const uint8_t *start_bits = NULL;
 const pcre2_real_code *re = (const pcre2_real_code *)code;
 BOOL anchored;
 BOOL firstline;
 BOOL has_first_cu = FALSE;
@ -6029,10 +6028,23 @@ PCRE2_UCHAR req_cu2 = 0;
 PCRE2_SPTR bumpalong_limit;
 PCRE2_SPTR end_subject;
 PCRE2_SPTR true_end_subject;
 PCRE2_SPTR start_match = subject + start_offset;
 PCRE2_SPTR req_cu_ptr = start_match - 1;
-PCRE2_SPTR start_partial = NULL;
+PCRE2_SPTR start_partial;
-PCRE2_SPTR match_partial = NULL;
+PCRE2_SPTR match_partial;
 #ifdef SUPPORT_JIT
 BOOL use_jit;
 #endif
 #ifdef SUPPORT_UNICODE
 BOOL allow_invalid;
 uint32_t fragment_options = 0;
 #ifdef SUPPORT_JIT
 BOOL jit_checked_utf = FALSE;
 #endif
 #endif
 PCRE2_SIZE frame_size;
@ -6059,7 +6071,7 @@ if (length == PCRE2_ZERO_TERMINATED)
  length = PRIV(strlen)(subject);
  was_zero_terminated = 1;
  }
-end_subject = subject + length;
+true_end_subject = end_subject = subject + length;
 /* Plausibility checks */
@ -6095,12 +6107,24 @@ options |= (re->flags & FF) / ((FF & (~FF+1)) / (OO & (~OO+1)));
 #undef FF
 #undef OO
-/* These two settings are used in the code for checking a UTF string that
+/* If the pattern was successfully studied with JIT support, we will run the
-follows immediately afterwards. Other values in the mb block are used only
+JIT executable instead of the rest of this function. Most options must be set
-during interpretive processing, not when the JIT support is in use, so they are
+at compile time for the JIT code to be usable. */
-set up later. */
+
 #ifdef SUPPORT_JIT
 use_jit = (re->executable_jit != NULL &&
          (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0);
 #endif
 /* Initialize UTF parameters. */
 utf = (re->overall_options & PCRE2_UTF) != 0;
 #ifdef SUPPORT_UNICODE
 allow_invalid = (re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0;
 #endif
 /* Convert the partial matching flags into an integer. */
 mb->partial = ((options & PCRE2_PARTIAL_HARD) != 0)? 2 :
              ((options & PCRE2_PARTIAL_SOFT) != 0)? 1 : 0;
@ -6111,61 +6135,6 @@ if (mb->partial != 0 &&
   ((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
  return PCRE2_ERROR_BADOPTION;
 /* Check a UTF string for validity if required. For 8-bit and 16-bit strings,
 we must also check that a starting offset does not point into the middle of a
 multiunit character. We check only the portion of the subject that is going to
 be inspected during matching - from the offset minus the maximum back reference
 to the given length. This saves time when a small part of a large subject is
 being matched by the use of a starting offset. Note that the maximum lookbehind
 is a number of characters, not code units. */
 #ifdef SUPPORT_UNICODE
 if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
  {
  PCRE2_SPTR check_subject = start_match;  /* start_match includes offset */
  if (start_offset > 0)
    {
 #if PCRE2_CODE_UNIT_WIDTH != 32
    unsigned int i;
    if (start_match < end_subject && NOT_FIRSTCU(*start_match))
      return PCRE2_ERROR_BADUTFOFFSET;
    for (i = re->max_lookbehind; i > 0 && check_subject > subject; i--)
      {
      check_subject--;
      while (check_subject > subject &&
 #if PCRE2_CODE_UNIT_WIDTH == 8
      (*check_subject & 0xc0) == 0x80)
 #else  /* 16-bit */
      (*check_subject & 0xfc00) == 0xdc00)
 #endif /* PCRE2_CODE_UNIT_WIDTH == 8 */
        check_subject--;
      }
 #else
    /* In the 32-bit library, one code unit equals one character. However,
    we cannot just subtract the lookbehind and then compare pointers, because
    a very large lookbehind could create an invalid pointer. */
    if (start_offset >= re->max_lookbehind)
      check_subject -= re->max_lookbehind;
    else
      check_subject = subject;
 #endif  /* PCRE2_CODE_UNIT_WIDTH != 32 */
    }
  /* Validate the relevant portion of the subject. After an error, adjust the
  offset to be an absolute offset in the whole string. */
  match_data->rc = PRIV(valid_utf)(check_subject,
    length - (check_subject - subject), &(match_data->startchar));
  if (match_data->rc != 0)
    {
    match_data->startchar += check_subject - subject;
    return match_data->rc;
    }
  }
 #endif  /* SUPPORT_UNICODE */
 /* It is an error to set an offset limit without setting the flag at compile
 time. */
@ -6184,15 +6153,85 @@ if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
  }
 match_data->subject = NULL;
-/* If the pattern was successfully studied with JIT support, run the JIT
+
-executable instead of the rest of this function. Most options must be set at
+/* ============================= JIT matching ============================== */
-compile time for the JIT code to be usable. Fallback to the normal code path if
+
-an unsupported option is set or if JIT returns BADOPTION (which means that the
+/* Prepare for JIT matching. Check a UTF string for validity unless no check is
-selected normal or partial matching mode was not compiled). */
+requested or invalid UTF can be handled. We check only the portion of the
 subject that might be inspected during matching - from the offset minus the
 maximum lookbehind to the given length. This saves time when a small part of a
 large subject is being matched by the use of a starting offset. Note that the
 maximum lookbehind is a number of characters, not code units. */
 #ifdef SUPPORT_JIT
-if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
+if (use_jit)
  {
 #ifdef SUPPORT_UNICODE
  if (utf && (options & PCRE2_NO_UTF_CHECK) == 0 && !allow_invalid)
    {
 #if PCRE2_CODE_UNIT_WIDTH != 32
    unsigned int i;
 #endif
    /* For 8-bit and 16-bit UTF, check that the first code unit is a valid
    character start. */
 #if PCRE2_CODE_UNIT_WIDTH != 32
    if (start_match < end_subject && NOT_FIRSTCU(*start_match))
      {
      if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
 #if PCRE2_CODE_UNIT_WIDTH == 8
      return PCRE2_ERROR_UTF8_ERR20;  /* Isolated 0x80 byte */
 #else
      return PCRE2_ERROR_UTF16_ERR3;  /* Isolated low surrogate */
 #endif
      }
 #endif  /* WIDTH != 32 */
    /* Move back by the maximum lookbehind, just in case it happens at the very
    start of matching. */
 #if PCRE2_CODE_UNIT_WIDTH != 32
    for (i = re->max_lookbehind; i > 0 && start_match > subject; i--)
      {
      start_match--;
      while (start_match > subject &&
 #if PCRE2_CODE_UNIT_WIDTH == 8
      (*start_match & 0xc0) == 0x80)
 #else  /* 16-bit */
      (*start_match & 0xfc00) == 0xdc00)
 #endif
        start_match--;
      }
 #else  /* PCRE2_CODE_UNIT_WIDTH != 32 */
    /* In the 32-bit library, one code unit equals one character. However,
    we cannot just subtract the lookbehind and then compare pointers, because
    a very large lookbehind could create an invalid pointer. */
    if (start_offset >= re->max_lookbehind)
      start_match -= re->max_lookbehind;
    else
      start_match = subject;
 #endif  /* PCRE2_CODE_UNIT_WIDTH != 32 */
    /* Validate the relevant portion of the subject. Adjust the offset of an
    invalid code point to be an absolute offset in the whole string. */
    match_data->rc = PRIV(valid_utf)(start_match,
      length - (start_match - subject), &(match_data->startchar));
    if (match_data->rc != 0)
      {
      match_data->startchar += start_match - subject;
      return match_data->rc;
      }
    jit_checked_utf = TRUE;
    }
 #endif  /* SUPPORT_UNICODE */
  /* If JIT returns BADOPTION, which means that the selected complete or
  partial matching mode was not compiled, fall through to the interpreter. */
  rc = pcre2_jit_match(code, subject, length, start_offset, options,
    match_data, mcontext);
  if (rc != PCRE2_ERROR_JIT_BADOPTION)
@ -6209,10 +6248,152 @@ if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
    return rc;
    }
  }
 #endif  /* SUPPORT_JIT */
 /* ========================= End of JIT matching ========================== */
 /* Proceed with non-JIT matching. The default is to allow lookbehinds to the
 start of the subject. A UTF check when there is a non-zero offset may change
 this. */
 mb->check_subject = subject;
 /* If a UTF subject string was not checked for validity in the JIT code above,
 check it here, and handle support for invalid UTF strings. The check above
 happens only when invalid UTF is not supported and PCRE2_NO_UTF_CHECK is unset.
 If we get here in those circumstances, it means the subject string is valid,
 but for some reason JIT matching was not successful. There is no need to check
 the subject again.
 We check only the portion of the subject that might be inspected during
 matching - from the offset minus the maximum lookbehind to the given length.
 This saves time when a small part of a large subject is being matched by the
 use of a starting offset. Note that the maximum lookbehind is a number of
 characters, not code units.
 Note also that support for invalid UTF forces a check, overriding the setting
 of PCRE2_NO_UTF_CHECK. */
 #ifdef SUPPORT_UNICODE
 if (utf &&
 #ifdef SUPPORT_JIT
    !jit_checked_utf &&
 #endif
    ((options & PCRE2_NO_UTF_CHECK) == 0 || allow_invalid))
  {
 #if PCRE2_CODE_UNIT_WIDTH != 32
  BOOL skipped_bad_start = FALSE;
 #endif
-/* Carry on with non-JIT matching. A NULL match context means "use a default
+  /* For 8-bit and 16-bit UTF, check that the first code unit is a valid
-context", but we take the memory control functions from the pattern. */
+  character start. If we are handling invalid UTF, just skip over such code
  units. Otherwise, give an appropriate error. */
 #if PCRE2_CODE_UNIT_WIDTH != 32
  if (allow_invalid)
    {
    while (start_match < end_subject && NOT_FIRSTCU(*start_match))
      {
      start_match++;
      skipped_bad_start = TRUE;
      }
    }
  else if (start_match < end_subject && NOT_FIRSTCU(*start_match))
    {
    if (start_offset > 0) return PCRE2_ERROR_BADUTFOFFSET;
 #if PCRE2_CODE_UNIT_WIDTH == 8
    return PCRE2_ERROR_UTF8_ERR20;  /* Isolated 0x80 byte */
 #else
    return PCRE2_ERROR_UTF16_ERR3;  /* Isolated low surrogate */
 #endif
    }
 #endif  /* WIDTH != 32 */
  /* The mb->check_subject field points to the start of UTF checking;
  lookbehinds can go back no further than this. */
  mb->check_subject = start_match;
  /* Move back by the maximum lookbehind, just in case it happens at the very
  start of matching, but don't do this if we skipped bad 8-bit or 16-bit code
  units above. */
 #if PCRE2_CODE_UNIT_WIDTH != 32
  if (!skipped_bad_start)
    {
    unsigned int i;
    for (i = re->max_lookbehind; i > 0 && mb->check_subject > subject; i--)
      {
      mb->check_subject--;
      while (mb->check_subject > subject &&
 #if PCRE2_CODE_UNIT_WIDTH == 8
      (*mb->check_subject & 0xc0) == 0x80)
 #else  /* 16-bit */
      (*mb->check_subject & 0xfc00) == 0xdc00)
 #endif
        mb->check_subject--;
      }
    }
 #else  /* PCRE2_CODE_UNIT_WIDTH != 32 */
  /* In the 32-bit library, one code unit equals one character. However,
  we cannot just subtract the lookbehind and then compare pointers, because
  a very large lookbehind could create an invalid pointer. */
  if (start_offset >= re->max_lookbehind)
    mb->check_subject -= re->max_lookbehind;
  else
    mb->check_subject = subject;
 #endif  /* PCRE2_CODE_UNIT_WIDTH != 32 */
  /* Validate the relevant portion of the subject. There's a loop in case we
  encounter bad UTF in the characters preceding start_match which we are
  scanning because of a lookbehind. */
  for (;;)
    {
    match_data->rc = PRIV(valid_utf)(mb->check_subject,
      length - (mb->check_subject - subject), &(match_data->startchar));
    if (match_data->rc == 0) break;   /* Valid UTF string */
    /* Invalid UTF string. Adjust the offset to be an absolute offset in the
    whole string. If we are handling invalid UTF strings, set end_subject to
    stop before the bad code unit, and set the options to "not end of line".
    Otherwise return the error. */
    match_data->startchar += mb->check_subject - subject;
    if (!allow_invalid || match_data->rc > 0) return match_data->rc;
    end_subject = subject + match_data->startchar;
    /* If the end precedes start_match, it means there is invalid UTF in the
    extra code units we reversed over because of a lookbehind. Advance past the
    first bad code unit, and then skip invalid character starting code units in
    8-bit and 16-bit modes, and try again. */
    if (end_subject < start_match)
      {
      mb->check_subject = end_subject + 1;
 #if PCRE2_CODE_UNIT_WIDTH != 32
      while (mb->check_subject < start_match && NOT_FIRSTCU(*mb->check_subject))
        mb->check_subject++;
 #endif
      }
    /* Otherwise, set the not end of line option, and do the match. */
    else
      {
      fragment_options = PCRE2_NOTEOL;
      break;
      }
    }
  }
 #endif  /* SUPPORT_UNICODE */
 /* A NULL match context means "use a default context", but we take the memory
 control functions from the pattern. */
 if (mcontext == NULL)
  {
@ -6224,8 +6405,8 @@ else mb->memctl = mcontext->memctl;
 anchored = ((re->overall_options | options) & PCRE2_ANCHORED) != 0;
 firstline = (re->overall_options & PCRE2_FIRSTLINE) != 0;
 startline = (re->flags & PCRE2_STARTLINE) != 0;
-bumpalong_limit =  (mcontext->offset_limit == PCRE2_UNSET)?
+bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
-  end_subject : subject + mcontext->offset_limit;
+  true_end_subject : subject + mcontext->offset_limit;
 /* Initialize and set up the fixed fields in the callout block, with a pointer
 in the match block. */
@ -6236,7 +6417,8 @@ cb.subject = subject;
 cb.subject_length = (PCRE2_SIZE)(end_subject - subject);
 cb.callout_flags = 0;
-/* Fill in the remaining fields in the match block. */
+/* Fill in the remaining fields in the match block, except for moptions, which
 gets set later. */
 mb->callout = mcontext->callout;
 mb->callout_data = mcontext->callout_data;
@ -6245,13 +6427,9 @@ mb->start_subject = subject;
 mb->start_offset = start_offset;
 mb->end_subject = end_subject;
 mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0;
 mb->moptions = options;                 /* Match options */
 mb->poptions = re->overall_options;     /* Pattern options */
 mb->ignore_skip_arg = 0;
 mb->mark = mb->nomatch_mark = NULL;     /* In case never set */
 mb->hitend = FALSE;
 /* The name table is needed for finding all the numbers associated with a
 given name, for condition testing. The code follows the name table. */
@ -6404,6 +6582,13 @@ if ((re->flags & PCRE2_LASTSET) != 0)
 /* Loop for handling unanchored repeated matching attempts; for anchored regexs
 the loop runs just once. */
 #ifdef SUPPORT_UNICODE
 FRAGMENT_RESTART:
 #endif
 start_partial = match_partial = NULL;
 mb->hitend = FALSE;
 for(;;)
  {
  PCRE2_SPTR new_start_match;
@ -6714,6 +6899,11 @@ for(;;)
  mb->start_used_ptr = start_match;
  mb->last_used_ptr = start_match;
 #ifdef SUPPORT_UNICODE
  mb->moptions = options | fragment_options;
 #else
  mb->moptions = options;
 #endif
  mb->match_call_count = 0;
  mb->end_offset_top = 0;
  mb->skip_arg_count = 0;
@ -6839,6 +7029,68 @@ for(;;)
 ENDLOOP:
 /* If end_subject != true_end_subject, it means we are handling invalid UTF,
 and have just processed a non-terminal fragment. If this resulted in no match
 or a partial match we must carry on to the next fragment (a partial match is
 returned to the caller only at the very end of the subject). A loop is used to
 avoid trying to match against empty fragments; if the pattern can match an
 empty string it would have done so already. */
 #ifdef SUPPORT_UNICODE
 if (utf && end_subject != true_end_subject &&
    (rc == MATCH_NOMATCH || rc == PCRE2_ERROR_PARTIAL))
  {
  for (;;)
    {
    /* Advance past the first bad code unit, and then skip invalid character
    starting code units in 8-bit and 16-bit modes. */
    start_match = end_subject + 1;
 #if PCRE2_CODE_UNIT_WIDTH != 32
    while (start_match < true_end_subject && NOT_FIRSTCU(*start_match))
      start_match++;
 #endif
    /* If we have hit the end of the subject, there isn't another non-empty
    fragment, so give up. */
    if (start_match >= true_end_subject)
      {
      rc = MATCH_NOMATCH;  /* In case it was partial */
      break;
      }
    /* Check the rest of the subject */
    mb->check_subject = start_match;
    rc = PRIV(valid_utf)(start_match, length - (start_match - subject),
      &(match_data->startchar));
    /* The rest of the subject is valid UTF. */
    if (rc == 0)
      {
      mb->end_subject = end_subject = true_end_subject;
      fragment_options = PCRE2_NOTBOL;
      goto FRAGMENT_RESTART;
      }
    /* A subsequent UTF error has been found; if the next fragment is
    non-empty, set up to process it. Otherwise, let the loop advance. */
    else if (rc < 0)
      {
      mb->end_subject = end_subject = start_match + match_data->startchar;
      if (end_subject > start_match)
        {
        fragment_options = PCRE2_NOTBOL|PCRE2_NOTEOL;
        goto FRAGMENT_RESTART;
        }
      }
    }
  }
 #endif  /* SUPPORT_UNICODE */
 /* Release an enlarged frame vector that is on the heap. */
 if (mb->match_frames != mb->stack_frames)
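The fragment handling in the pcre2_match.c hunks above (stop before a bad code unit, then restart after it) can be illustrated with a small standalone sketch for the 8-bit case. It is an assumption-laden simplification, not the interpreter's code: `sk_first_invalid` is a hypothetical, simplified stand-in for `PRIV(valid_utf)`, and lookbehind, start offsets and the NOTBOL/NOTEOL fragment options are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the first byte that begins an invalid sequence in
   s[0..len), or len if the whole range is valid UTF-8. Rejects stray
   continuation bytes, truncated and overlong sequences, surrogates, and
   code points above U+10FFFF. */
static size_t sk_first_invalid(const uint8_t *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
    uint8_t c = s[i];
    size_t extra, k;
    uint32_t cp, min;
    if (c < 0x80) { i++; continue; }
    if ((c & 0xe0) == 0xc0)      { extra = 1; cp = c & 0x1f; min = 0x80; }
    else if ((c & 0xf0) == 0xe0) { extra = 2; cp = c & 0x0f; min = 0x800; }
    else if ((c & 0xf8) == 0xf0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return i;                    /* continuation byte or 0xf8..0xff */
    if (len - i <= extra) return i;   /* sequence runs off the end */
    for (k = 1; k <= extra; k++)
      {
      if ((s[i + k] & 0xc0) != 0x80) return i;
      cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3f);
      }
    if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return i;
    i += extra + 1;
    }
  return len;
}

typedef struct { size_t start; size_t end; } sk_fragment;  /* [start, end) */

/* Find the next non-empty valid fragment at or after *pos. Returns 1 and
   fills *frag on success, or 0 when no fragment remains. */
static int sk_next_fragment(const uint8_t *s, size_t len, size_t *pos,
  sk_fragment *frag)
{
  size_t start = *pos;
  while (start < len)
    {
    size_t bad;
    /* Skip code units that cannot start a character. */
    while (start < len && (s[start] & 0xc0) == 0x80) start++;
    if (start >= len) break;
    bad = start + sk_first_invalid(s + start, len - start);
    if (bad > start)
      {
      frag->start = start;
      frag->end = bad;
      *pos = (bad < len)? bad + 1 : len;   /* step past the bad code unit */
      return 1;
      }
    start = bad + 1;                       /* empty fragment: keep going */
    }
  *pos = len;
  return 0;
}
```

Calling `sk_next_fragment` repeatedly on the bytes of `abcd\x80wxzy\x80pqrs` yields the ranges [0,4), [5,9) and [10,14), which are the three stretches in which the `/.../g,match_invalid_utf` test finds `abc`, `wxz` and `pqr`.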
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@ -212,6 +212,12 @@ be C99 don't support it (hence DISABLE_PERCENT_ZT). */
 #define REPLACE_MODSIZE 100       /* Field for reading 8-bit replacement */
 #define VERSION_SIZE 64           /* Size of buffer for the version strings */
 /* Default JIT compile options */
 #define JIT_DEFAULT (PCRE2_JIT_COMPLETE|\
                     PCRE2_JIT_PARTIAL_SOFT|\
                     PCRE2_JIT_PARTIAL_HARD)
 /* Make sure the buffer into which replacement strings are copied is big enough
 to hold them as 32-bit code units. */
@ -664,6 +670,7 @@ static modstruct modlist[] = {
  { "literal",                    MOD_PAT,  MOD_OPT, PCRE2_LITERAL,              PO(options) },
  { "locale",                     MOD_PAT,  MOD_STR, LOCALESIZE,                 PO(locale) },
  { "mark",                       MOD_PNDP, MOD_CTL, CTL_MARK,                   PO(control) },
  { "match_invalid_utf",          MOD_PAT,  MOD_OPT, PCRE2_MATCH_INVALID_UTF,    PO(options) },
  { "match_limit",                MOD_CTM,  MOD_INT, 0,                          MO(match_limit) },
  { "match_line",                 MOD_CTC,  MOD_OPT, PCRE2_EXTRA_MATCH_LINE,     CO(extra_options) },
  { "match_unset_backref",        MOD_PAT,  MOD_OPT, PCRE2_MATCH_UNSET_BACKREF,  PO(options) },
@ -4136,7 +4143,7 @@ static void
 show_compile_options(uint32_t options, const char *before, const char *after)
 {
 if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
-else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
  before,
  ((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
  ((options & PCRE2_ALT_CIRCUMFLEX) != 0)? " alt_circumflex" : "",
@ -4153,6 +4160,7 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%
  ((options & PCRE2_EXTENDED_MORE) != 0)? " extended_more" : "",
  ((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
  ((options & PCRE2_LITERAL) != 0)? " literal" : "",
  ((options & PCRE2_MATCH_INVALID_UTF) != 0)? " match_invalid_utf" : "",
  ((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
  ((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
  ((options & PCRE2_NEVER_BACKSLASH_C) != 0)? " never_backslash_c" : "",
@ -4867,7 +4875,7 @@ switch(cmd)
  case CMD_PATTERN:
  (void)decode_modifiers(argptr, CTX_DEFPAT, &def_patctl, NULL);
  if (def_patctl.jit == 0 && (def_patctl.control & CTL_JITVERIFY) != 0)
-    def_patctl.jit = 7;
+    def_patctl.jit = JIT_DEFAULT;
  break;
  /* Set default subject modifiers */
@ -5114,7 +5122,11 @@ patlen = p - buffer - 2;
 /* Look for modifiers and options after the final delimiter. */
 if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
-utf = (pat_patctl.options & PCRE2_UTF) != 0;
+
 /* Note that the match_invalid_utf option also sets utf when passed to 
 pcre2_compile(). */
 utf = (pat_patctl.options & (PCRE2_UTF|PCRE2_MATCH_INVALID_UTF)) != 0;
 /* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
 exclusive with the utf modifier. */
@ -5161,7 +5173,7 @@ specified. */
 if (pat_patctl.jit == 0 &&
    (pat_patctl.control & (CTL_JITVERIFY|CTL_JITFAST)) != 0)
-  pat_patctl.jit = 7;
+  pat_patctl.jit = JIT_DEFAULT;
 /* Now copy the pattern to pbuffer8 for use in 8-bit testing and for reflecting
 in callouts. Convert from hex if requested (literal strings in quotes may be
@ -5744,6 +5756,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
    {
    int i;
    clock_t time_taken = 0;
    for (i = 0; i < timeit; i++)
      {
      clock_t start_time;
@ -5752,7 +5765,7 @@ if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
        pat_patctl.options|use_forbid_utf, &errorcode, &erroroffset,
        use_pat_context);
      start_time = clock();
-      PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
+      PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
      time_taken += clock() - start_time;
      }
    total_jit_compile_time += time_taken;
@ -8615,7 +8628,7 @@ while (argc > 1 && argv[op][0] == '-' && argv[op][1] != 0)
  else if (strcmp(arg, "-jit") == 0 || strcmp(arg, "-jitverify") == 0)
    {
    if (arg[4] != 0) def_patctl.control |= CTL_JITVERIFY;
-    def_patctl.jit = 7;  /* full & partial */
+    def_patctl.jit = JIT_DEFAULT;  /* full & partial */
 #ifndef SUPPORT_JIT
    fprintf(stderr, "** Warning: JIT support is not available: "
                    "-jit[verify] calls functions that do nothing.\n");
--- a/testdata/testinput10
+++ b/testdata/testinput10
@ -1,7 +1,7 @@
 # This set of tests is for UTF-8 support and Unicode property support, with
 # relevance only for the 8-bit library.
-# The next 4 patterns have UTF-8 errors
+# The next 5 patterns have UTF-8 errors
 /[Ã]/utf
@ -11,6 +11,8 @@
 /Ã‚‚‚‚‚‚‚‚Ã/utf
 /Ã‚‚‚‚‚‚‚‚Ã/match_invalid_utf
 # Now test subjects
 /badutf/utf
@ -493,4 +495,66 @@
 /(?(Ã¡/utf
 # Invalid UTF-8 tests
 /.../g,match_invalid_utf
    abcd\x80wxzy\x80pqrs
    abcd\x{80}wxzy\x80pqrs
 /abc/match_invalid_utf
    ab\x80ab\=ph
 \= Expect no match
    ab\x80cdef\=ph
 /ab$/match_invalid_utf
    ab\x80cdeab
 \= Expect no match
    ab\x80cde
 /.../g,match_invalid_utf
    abcd\x{80}wxzy\x80pqrs
 /(?<=x)../g,match_invalid_utf
    abcd\x{80}wxzy\x80pqrs
    abcd\x{80}wxzy\x80xpqrs
 /X$/match_invalid_utf
 \= Expect no match
    X\xc4
 /(?<=..)X/match_invalid_utf,aftertext
    AB\x80AQXYZ
    AB\x80AQXYZ\=offset=5
    AB\x80\x80AXYZXC\=offset=5
 \= Expect no match
    AB\x80XYZ
    AB\x80XYZ\=offset=3 
    AB\xfeXYZ
    AB\xffXYZ\=offset=3 
    AB\x80AXYZ
    AB\x80AXYZ\=offset=4
    AB\x80\x80AXYZ\=offset=5
 /.../match_invalid_utf
    AB\xc4CCC
 \= Expect no match
    A\x{d800}B
    A\x{110000}B
    A\xc4B  
 /\bX/match_invalid_utf
    A\x80X
 /\BX/match_invalid_utf
 \= Expect no match
    A\x80X
 /(?<=...)X/match_invalid_utf
    AAA\x80BBBXYZ 
 \= Expect no match
    AAA\x80BXYZ 
    AAA\x80BBXYZ 
 # -------------------------------------
 # End of testinput10
--- a/testdata/testinput11
+++ b/testdata/testinput11
@ -368,6 +368,4 @@
    ab˙Az
    ab\x{80000041}z 
 /\[()]{65535}/expand
 # End of testinput11
--- a/testdata/testinput12
+++ b/testdata/testinput12
@ -402,4 +402,49 @@
 /(?(á/utf
 # Invalid UTF-16/32 tests.
 /.../g,match_invalid_utf
    abcd\x{df00}wxzy\x{df00}pqrs
    abcd\x{80}wxzy\x{df00}pqrs
 /abc/match_invalid_utf
    ab\x{df00}ab\=ph
 \= Expect no match
    ab\x{df00}cdef\=ph
 /ab$/match_invalid_utf
    ab\x{df00}cdeab
 \= Expect no match
    ab\x{df00}cde
 /.../g,match_invalid_utf
    abcd\x{80}wxzy\x{df00}pqrs
 /(?<=x)../g,match_invalid_utf
    abcd\x{80}wxzy\x{df00}pqrs
    abcd\x{80}wxzy\x{df00}xpqrs
 /X$/match_invalid_utf
 \= Expect no match
    X\x{df00}
 /(?<=..)X/match_invalid_utf,aftertext
    AB\x{df00}AQXYZ
    AB\x{df00}AQXYZ\=offset=5
    AB\x{df00}\x{df00}AXYZXC\=offset=5
 \= Expect no match
    AB\x{df00}XYZ
    AB\x{df00}XYZ\=offset=3 
    AB\x{df00}AXYZ
    AB\x{df00}AXYZ\=offset=4
    AB\x{df00}\x{df00}AXYZ\=offset=5
 /.../match_invalid_utf
 \= Expect no match
    A\x{d800}B
    A\x{110000}B 
 # ---------------------------------------------------- 
 # End of testinput12
--- a/testdata/testinput8
+++ b/testdata/testinput8
@ -182,4 +182,8 @@
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testinput9
+++ b/testdata/testinput9
@ -260,6 +260,4 @@
 /(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
 /\[()]{65535}/expand
 # End of testinput9
--- a/testdata/testoutput10
+++ b/testdata/testoutput10
@ -1,7 +1,7 @@
 # This set of tests is for UTF-8 support and Unicode property support, with
 # relevance only for the 8-bit library.
-# The next 4 patterns have UTF-8 errors
+# The next 5 patterns have UTF-8 errors
 /[Ã]/utf
 Failed: error -8 at offset 1: UTF-8 error: byte 2 top bits not 0x80
@ -15,6 +15,9 @@ Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
 /Ã‚‚‚‚‚‚‚‚Ã/utf
 Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
 /Ã‚‚‚‚‚‚‚‚Ã/match_invalid_utf
 Failed: error -22 at offset 2: UTF-8 error: isolated byte with 0x80 bit set
 # Now test subjects
 /badutf/utf
@ -1651,4 +1654,107 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
 /(?(Ã¡/utf
 Failed: error 142 at offset 5: syntax error in subpattern name (missing terminator?)
 # Invalid UTF-8 tests
 /.../g,match_invalid_utf
    abcd\x80wxzy\x80pqrs
 0: abc
 0: wxz
 0: pqr
    abcd\x{80}wxzy\x80pqrs
 0: abc
 0: d\x{80}w
 0: xzy
 0: pqr
 /abc/match_invalid_utf
    ab\x80ab\=ph
 Partial match: ab
 \= Expect no match
    ab\x80cdef\=ph
 No match
 /ab$/match_invalid_utf
    ab\x80cdeab
 0: ab
 \= Expect no match
    ab\x80cde
 No match
 /.../g,match_invalid_utf
    abcd\x{80}wxzy\x80pqrs
 0: abc
 0: d\x{80}w
 0: xzy
 0: pqr
 /(?<=x)../g,match_invalid_utf
    abcd\x{80}wxzy\x80pqrs
 0: zy
    abcd\x{80}wxzy\x80xpqrs
 0: zy
 0: pq
 /X$/match_invalid_utf
 \= Expect no match
    X\xc4
 No match
 /(?<=..)X/match_invalid_utf,aftertext
    AB\x80AQXYZ
 0: X
 0+ YZ
    AB\x80AQXYZ\=offset=5
 0: X
 0+ YZ
    AB\x80\x80AXYZXC\=offset=5
 0: X
 0+ C
 \= Expect no match
    AB\x80XYZ
 No match
    AB\x80XYZ\=offset=3 
 No match
    AB\xfeXYZ
 No match
    AB\xffXYZ\=offset=3 
 No match
    AB\x80AXYZ
 No match
    AB\x80AXYZ\=offset=4
 No match
    AB\x80\x80AXYZ\=offset=5
 No match
 /.../match_invalid_utf
    AB\xc4CCC
 0: CCC
 \= Expect no match
    A\x{d800}B
 No match
    A\x{110000}B
 No match
    A\xc4B  
 No match
 /\bX/match_invalid_utf
    A\x80X
 0: X
 /\BX/match_invalid_utf
 \= Expect no match
    A\x80X
 No match
 /(?<=...)X/match_invalid_utf
    AAA\x80BBBXYZ 
 0: X
 \= Expect no match
    AAA\x80BXYZ 
 No match
    AAA\x80BBXYZ 
 No match
 # -------------------------------------
 # End of testinput10
--- a/testdata/testoutput11-16
+++ b/testdata/testoutput11-16
@ -661,7 +661,4 @@ Subject length lower bound = 1
    ab˙Az
    ab\x{80000041}z 
 /\[()]{65535}/expand
 Failed: error 120 at offset 131070: regular expression is too large
 # End of testinput11
--- a/testdata/testoutput11-32
+++ b/testdata/testoutput11-32
@ -667,6 +667,4 @@ Subject length lower bound = 1
    ab\x{80000041}z 
 0: ab\x{80000041}z
 /\[()]{65535}/expand
 # End of testinput11
--- a/testdata/testoutput12-16
+++ b/testdata/testoutput12-16
@ -1502,4 +1502,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
 /(?(á/utf
 Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)
 # Invalid UTF-16/32 tests.
 /.../g,match_invalid_utf
    abcd\x{df00}wxzy\x{df00}pqrs
 0: abc
 0: wxz
 0: pqr
    abcd\x{80}wxzy\x{df00}pqrs
 0: abc
 0: d\x{80}w
 0: xzy
 0: pqr
 /abc/match_invalid_utf
    ab\x{df00}ab\=ph
 Partial match: ab
 \= Expect no match
    ab\x{df00}cdef\=ph
 No match
 /ab$/match_invalid_utf
    ab\x{df00}cdeab
 0: ab
 \= Expect no match
    ab\x{df00}cde
 No match
 /.../g,match_invalid_utf
    abcd\x{80}wxzy\x{df00}pqrs
 0: abc
 0: d\x{80}w
 0: xzy
 0: pqr
 /(?<=x)../g,match_invalid_utf
    abcd\x{80}wxzy\x{df00}pqrs
 0: zy
    abcd\x{80}wxzy\x{df00}xpqrs
 0: zy
 0: pq
 /X$/match_invalid_utf
 \= Expect no match
    X\x{df00}
 No match
 /(?<=..)X/match_invalid_utf,aftertext
    AB\x{df00}AQXYZ
 0: X
 0+ YZ
    AB\x{df00}AQXYZ\=offset=5
 0: X
 0+ YZ
    AB\x{df00}\x{df00}AXYZXC\=offset=5
 0: X
 0+ C
 \= Expect no match
    AB\x{df00}XYZ
 No match
    AB\x{df00}XYZ\=offset=3 
 No match
    AB\x{df00}AXYZ
 No match
    AB\x{df00}AXYZ\=offset=4
 No match
    AB\x{df00}\x{df00}AXYZ\=offset=5
 No match
 /.../match_invalid_utf
 \= Expect no match
    A\x{d800}B
 No match
    A\x{110000}B 
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
 # ---------------------------------------------------- 
 # End of testinput12
--- a/testdata/testoutput12-32
+++ b/testdata/testoutput12-32
@ -1500,4 +1500,81 @@ Failed: error 142 at offset 4: syntax error in subpattern name (missing terminat
 /(?(á/utf
 Failed: error 142 at offset 4: syntax error in subpattern name (missing terminator?)
 # Invalid UTF-16/32 tests.
 /.../g,match_invalid_utf
    abcd\x{df00}wxzy\x{df00}pqrs
 0: abc
 0: wxz
 0: pqr
    abcd\x{80}wxzy\x{df00}pqrs
 0: abc
 0: d\x{80}w
 0: xzy
 0: pqr
 /abc/match_invalid_utf
    ab\x{df00}ab\=ph
 Partial match: ab
 \= Expect no match
    ab\x{df00}cdef\=ph
 No match
 /ab$/match_invalid_utf
    ab\x{df00}cdeab
 0: ab
 \= Expect no match
    ab\x{df00}cde
 No match
 /.../g,match_invalid_utf
    abcd\x{80}wxzy\x{df00}pqrs
 0: abc
 0: d\x{80}w
 0: xzy
 0: pqr
 /(?<=x)../g,match_invalid_utf
    abcd\x{80}wxzy\x{df00}pqrs
 0: zy
    abcd\x{80}wxzy\x{df00}xpqrs
 0: zy
 0: pq
 /X$/match_invalid_utf
 \= Expect no match
    X\x{df00}
 No match
 /(?<=..)X/match_invalid_utf,aftertext
    AB\x{df00}AQXYZ
 0: X
 0+ YZ
    AB\x{df00}AQXYZ\=offset=5
 0: X
 0+ YZ
    AB\x{df00}\x{df00}AXYZXC\=offset=5
 0: X
 0+ C
 \= Expect no match
    AB\x{df00}XYZ
 No match
    AB\x{df00}XYZ\=offset=3 
 No match
    AB\x{df00}AXYZ
 No match
    AB\x{df00}AXYZ\=offset=4
 No match
    AB\x{df00}\x{df00}AXYZ\=offset=5
 No match
 /.../match_invalid_utf
 \= Expect no match
    A\x{d800}B
 No match
    A\x{110000}B 
 No match
 # ---------------------------------------------------- 
 # End of testinput12
--- a/testdata/testoutput8-16-2
+++ b/testdata/testoutput8-16-2
@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 Failed: error 120 at offset 131070: regular expression is too large
 # End of testinput8
--- a/testdata/testoutput8-16-3
+++ b/testdata/testoutput8-16-3
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testoutput8-16-4
+++ b/testdata/testoutput8-16-4
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testoutput8-32-2
+++ b/testdata/testoutput8-32-2
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testoutput8-32-3
+++ b/testdata/testoutput8-32-3
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testoutput8-32-4
+++ b/testdata/testoutput8-32-4
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testoutput8-8-2
+++ b/testdata/testoutput8-8-2
@ -1020,4 +1020,9 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 Failed: error 120 at offset 131070: regular expression is too large
 # End of testinput8
--- a/testdata/testoutput8-8-3
+++ b/testdata/testoutput8-8-3
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testoutput8-8-4
+++ b/testdata/testoutput8-8-4
@ -1019,4 +1019,8 @@ Failed: error 114 at offset 509: missing closing parenthesis
 /([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
 #pattern -fullbincode
 /\[()]{65535}/expand
 # End of testinput8
--- a/testdata/testoutput9
+++ b/testdata/testoutput9
@ -367,7 +367,4 @@ Failed: error 134 at offset 14: character code point value in \x{} or \o{} is to
 /(*:*++++++++++++''''''''''''''''''''+''+++'+++x+++++++++++++++++++++++++++++++++++(++++++++++++++++++++:++++++%++:''''''''''''''''''''''''+++++++++++++++++++++++++++++++++++++++++++++++++++++-++++++++k+++++++''''+++'+++++++++++++++++++++++''''++++++++++++':ƿ)/
 Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
 /\[()]{65535}/expand
 Failed: error 120 at offset 131070: regular expression is too large
 # End of testinput9