Documentation update re PCRE2_JIT_INVALID_UTF

2019-03-06 17:38:20 +00:00
parent 7375089fa5
commit 590f65f061
8 changed files with 263 additions and 177 deletions
--- a/doc/html/pcre2_jit_compile.html
+++ b/doc/html/pcre2_jit_compile.html
@@ -40,6 +40,7 @@ bits:
  PCRE2_JIT_COMPLETE      compile code for full matching
  PCRE2_JIT_PARTIAL_SOFT  compile code for soft partial matching
  PCRE2_JIT_PARTIAL_HARD  compile code for hard partial matching
+ PCRE2_JIT_INVALID_UTF   compile code to handle invalid UTF
 </pre>
 The yield of the function is 0 for success, or a negative error code otherwise.
 In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -312,7 +312,7 @@ document for an overview of all the PCRE2 documentation.
 <b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
 <br>
 <br>
-<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, </b>
+<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
 <b>  void *<i>where</i>);</b>
 <br>
 <br>
--- a/doc/html/pcre2jit.html
+++ b/doc/html/pcre2jit.html
@@ -16,16 +16,17 @@ please consult the man page, in case the conversion went wrong.
 <li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
 <li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
 <li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
-<li><a name="TOC4" href="#SEC4">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
+<li><a name="TOC4" href="#SEC4">MATCHING SUBJECTS CONTAINING INVALID UTF</a>
-<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT MATCHING</a>
+<li><a name="TOC5" href="#SEC5">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
-<li><a name="TOC6" href="#SEC6">CONTROLLING THE JIT STACK</a>
+<li><a name="TOC6" href="#SEC6">RETURN VALUES FROM JIT MATCHING</a>
-<li><a name="TOC7" href="#SEC7">JIT STACK FAQ</a>
+<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
-<li><a name="TOC8" href="#SEC8">FREEING JIT SPECULATIVE MEMORY</a>
+<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
-<li><a name="TOC9" href="#SEC9">EXAMPLE CODE</a>
+<li><a name="TOC9" href="#SEC9">FREEING JIT SPECULATIVE MEMORY</a>
-<li><a name="TOC10" href="#SEC10">JIT FAST PATH API</a>
+<li><a name="TOC10" href="#SEC10">EXAMPLE CODE</a>
-<li><a name="TOC11" href="#SEC11">SEE ALSO</a>
+<li><a name="TOC11" href="#SEC11">JIT FAST PATH API</a>
-<li><a name="TOC12" href="#SEC12">AUTHOR</a>
+<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
-<li><a name="TOC13" href="#SEC13">REVISION</a>
+<li><a name="TOC13" href="#SEC13">AUTHOR</a>
 <li><a name="TOC14" href="#SEC14">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
 <P>
@@ -144,7 +145,29 @@ support is not available, or the pattern was not processed by
 <b>pcre2_jit_compile()</b>, or the JIT compiler was not able to handle the
 pattern.
 </P>
-<br><a name="SEC4" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
+<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
+<P>
+When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
+function expects its subject string to be a valid sequence of UTF code units.
+If it is not, the result is undefined. This is also true by default of matching
+via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
+<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
+UTF is compiled.
+</P>
+<P>
+In this mode, an invalid code unit sequence never matches any pattern item. It
+does not match dot, it does not match \p{Any}, it does not even match negative
+items such as [^X]. A lookbehind assertion fails if it encounters an invalid
+sequence while moving the current point backwards. In other words, an invalid
+UTF code unit sequence acts as a barrier which no match can cross. Reaching an
+invalid sequence causes an immediate backtrack.
+</P>
+<P>
+Using this option, an application can run matches in arbitrary data, knowing
+that any matched strings that are returned will be valid UTF. This can be
+useful when searching for text in executable or other binary files.
+</P>
+<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
 <P>
 The <b>pcre2_match()</b> options that are supported for JIT matching are
 PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
@@ -161,7 +184,7 @@ The only unsupported pattern items are \C (match a single data unit) when
 running in a UTF mode, and a callout immediately before an assertion condition
 in a conditional group.
 </P>
-<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
+<br><a name="SEC6" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
 <P>
 When a pattern is matched using JIT matching, the return values are the same
 as those given by the interpretive <b>pcre2_match()</b> code, with the addition
@@ -177,7 +200,7 @@ circumstance when JIT is not used, but the details of exactly what is counted
 are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
 when JIT matching is used.
 <a name="stackcontrol"></a></P>
-<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
+<br><a name="SEC7" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
 <P>
 When the compiled JIT code runs, it needs a block of memory to use as a stack.
 By default, it uses 32KiB on the machine stack. However, some large or
@@ -270,7 +293,7 @@ non-default JIT stacks might operate:
 </pre>
 All the functions described in this section do nothing if JIT is not available.
 <a name="stackfaq"></a></P>
-<br><a name="SEC7" href="#TOC1">JIT STACK FAQ</a><br>
+<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
 <P>
 (1) Why do we need JIT stacks?
 <br>
@@ -349,7 +372,7 @@ stack handling?
 No, thanks to Windows. If POSIX threads were used everywhere, we could throw
 out this complicated API.
 </P>
-<br><a name="SEC8" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
+<br><a name="SEC9" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
 <P>
 <b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
 </P>
@@ -361,7 +384,7 @@ all possible memory. You can cause this to happen by calling
 pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
 memory management, or NULL for standard memory management.
 </P>
-<br><a name="SEC9" href="#TOC1">EXAMPLE CODE</a><br>
+<br><a name="SEC10" href="#TOC1">EXAMPLE CODE</a><br>
 <P>
 This is a single-threaded example that specifies a JIT stack without using a
 callback. A real program should include error checking after all the function
@@ -390,7 +413,7 @@ calls.
 </PRE>
 </P>
-<br><a name="SEC10" href="#TOC1">JIT FAST PATH API</a><br>
+<br><a name="SEC11" href="#TOC1">JIT FAST PATH API</a><br>
 <P>
 Because the API described above falls back to interpreted matching when JIT is
 not available, it is convenient for programs that are written for general use
@@ -423,11 +446,11 @@ invalid data is passed, the result is undefined.
 Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
 speedups of more than 10%.
 </P>
-<br><a name="SEC11" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
 <P>
 <b>pcre2api</b>(3)
 </P>
-<br><a name="SEC12" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
 <P>
 Philip Hazel (FAQ by Zoltan Herczeg)
 <br>
@@ -436,11 +459,11 @@ University Computing Service
 Cambridge, England.
 <br>
 </P>
-<br><a name="SEC13" href="#TOC1">REVISION</a><br>
+<br><a name="SEC14" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 16 October 2018
+Last updated: 06 March 2019
 <br>
-Copyright &copy; 1997-2018 University of Cambridge.
+Copyright &copy; 1997-2019 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@@ -247,11 +247,34 @@ VALIDITY OF UTF STRINGS
 </b><br>
 <P>
 When the PCRE2_UTF option is set, the strings passed as patterns and subjects
-are (by default) checked for validity on entry to the relevant functions.
+are (by default) checked for validity on entry to the relevant functions. If an
-If an invalid UTF string is passed, an negative error code is returned. The
+invalid UTF string is passed, a negative error code is returned. The code unit
-code unit offset to the offending character can be extracted from the match
+offset to the offending character can be extracted from the match data block by
-data block by calling <b>pcre2_get_startchar()</b>, which is used for this
+calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
-purpose after a UTF error.
+error.
 </P>
+<P>
+In some situations, you may already know that your strings are valid, and
+therefore want to skip these checks in order to improve performance, for
+example in the case of a long subject string that is being scanned repeatedly.
+If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+PCRE2 assumes that the pattern or subject it is given (respectively) contains
+only valid UTF code unit sequences.
+</P>
+<P>
+If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
+is usually undefined and your program may crash or loop indefinitely. There is,
+however, one mode of matching that can handle invalid UTF subject strings. This
+is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
+when calling <b>pcre2_jit_compile()</b>. For details, see the
+<a href="pcre2jit.html"><b>pcre2jit</b></a>
+documentation.
+</P>
+<P>
+Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
+the pattern; it does not also apply to subject strings. If you want to disable
+the check for a subject string you must pass this same option to
+<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
+</P>
 <P>
 UTF-16 and UTF-32 strings can indicate their endianness by a special code known
@@ -259,13 +282,14 @@ as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
 strings to be in host byte order.
 </P>
 <P>
-A UTF string is checked before any other processing takes place. In the case of
+Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
-<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> calls with a non-zero starting
+processing takes place. In the case of <b>pcre2_match()</b> and
-offset, the check is applied only to that part of the subject that could be
+<b>pcre2_dfa_match()</b> calls with a non-zero starting offset, the check is
-inspected during matching, and there is a check that the starting offset points
+applied only to that part of the subject that could be inspected during
-to the first code unit of a character or to the end of the subject. If there
+matching, and there is a check that the starting offset points to the first
-are no lookbehind assertions in the pattern, the check starts at the starting
+code unit of a character or to the end of the subject. If there are no
-offset. Otherwise, it starts at the length of the longest lookbehind before the
+lookbehind assertions in the pattern, the check starts at the starting offset.
+Otherwise, it starts at the length of the longest lookbehind before the
 starting offset, or at the start of the subject if there are not that many
 characters before the starting offset. Note that the sequences \b and \B are
 one-character lookbehinds.
@@ -285,31 +309,12 @@ surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
 UTF-32.)
 </P>
 <P>
-In some situations, you may already know that your strings are valid, and
+Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
-therefore want to skip these checks in order to improve performance, for
+given if an escape sequence for an invalid Unicode code point is encountered in
-example in the case of a long subject string that is being scanned repeatedly.
+the pattern. If you want to allow escape sequences such as \x{d800} (a
-If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
-PCRE2 assumes that the pattern or subject it is given (respectively) contains
+option. However, this is possible only in UTF-8 and UTF-32 modes, because these
-only valid UTF code unit sequences.
+values are not representable in UTF-16.
-</P>
-<P>
-Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
-the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this option to <b>pcre2_match()</b>
-or <b>pcre2_dfa_match()</b>.
-</P>
-<P>
-If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
-is undefined and your program may crash or loop indefinitely.
-</P>
-<P>
-Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
-that is given if an escape sequence for an invalid Unicode code point is
-encountered in the pattern. If you want to allow escape sequences such as
-\x{d800} (a surrogate code point) you can set the
-PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
-only in UTF-8 and UTF-32 modes, because these values are not representable in
-UTF-16.
 <a name="utf8strings"></a></P>
 <br><b>
 Errors in UTF-8 strings
@@ -417,7 +422,7 @@ Cambridge, England.
 REVISION
 </b><br>
 <P>
-Last updated: 03 February 2019
+Last updated: 06 March 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -180,8 +180,8 @@ REVISION
       Last updated: 17 September 2018
       Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2API(3)                Library Functions Manual                PCRE2API(3)
@@ -3681,8 +3681,8 @@ REVISION
       Last updated: 14 February 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
@@ -4254,8 +4254,8 @@ REVISION
       Last updated: 03 March 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
@@ -4685,8 +4685,8 @@ REVISION
       Last updated: 03 February 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
@@ -4890,8 +4890,8 @@ REVISION
       Last updated: 12 February 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
@@ -5010,6 +5010,29 @@ SIMPLE USE OF JIT
       to handle the pattern.
+MATCHING SUBJECTS CONTAINING INVALID UTF
+       When a pattern is compiled with the PCRE2_UTF option, the  interpretive
+       matching  function expects its subject string to be a valid sequence of
+       UTF code units.  If it is not, the result is undefined.  This  is  also
+       true   by   default  of  matching  via  JIT.  However,  if  the  option
+       PCRE2_JIT_INVALID_UTF is passed to pcre2_jit_compile(), code  that  can
+       process a subject containing invalid UTF is compiled.
+       In  this  mode, an invalid code unit sequence never matches any pattern
+       item. It does not match dot, it does not match  \p{Any},  it  does  not
+       even match negative items such as [^X]. A lookbehind assertion fails if
+       it encounters an invalid sequence while moving the current point  back-
+       wards. In other words, an invalid UTF code unit sequence acts as a bar-
+       rier which no match can cross. Reaching an invalid sequence  causes  an
+       immediate backtrack.
+       Using  this  option,  an application can run matches in arbitrary data,
+       knowing that any matched strings that are returned will be  valid  UTF.
+       This  can  be  useful  when  searching  for text in executable or other
+       binary files.
 UNSUPPORTED OPTIONS AND PATTERN ITEMS
       The pcre2_match() options that  are  supported  for  JIT  matching  are
@@ -5287,11 +5310,11 @@ AUTHOR
 REVISION
-       Last updated: 16 October 2018
+       Last updated: 06 March 2019
-       Copyright (c) 1997-2018 University of Cambridge.
+       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
@@ -5360,8 +5383,8 @@ REVISION
       Last updated: 02 February 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
@@ -5581,8 +5604,8 @@ REVISION
       Last updated: 10 October 2018
       Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
@@ -6021,8 +6044,8 @@ REVISION
       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)
@@ -9365,8 +9388,8 @@ REVISION
       Last updated: 12 February 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
@@ -9600,8 +9623,8 @@ REVISION
       Last updated: 03 February 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
@@ -9930,8 +9953,8 @@ REVISION
       Last updated: 30 January 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
@@ -10209,8 +10232,8 @@ REVISION
       Last updated: 27 June 2018
       Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)
@@ -10710,8 +10733,8 @@ REVISION
       Last updated: 11 February 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
 PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
@@ -10928,59 +10951,63 @@ VALIDITY OF UTF STRINGS
       When the PCRE2_UTF option is set, the strings passed  as  patterns  and
       subjects are (by default) checked for validity on entry to the relevant
-       functions.  If an invalid UTF string is passed, an negative error  code
+       functions. If an invalid UTF string is passed, a  negative  error  code
       is  returned.  The  code  unit offset to the offending character can be
       extracted from the match data block by  calling  pcre2_get_startchar(),
       which is used for this purpose after a UTF error.
-       UTF-16 and UTF-32 strings can indicate their endianness by special code
+       In  some  situations, you may already know that your strings are valid,
-       knows as a byte-order mark (BOM). The PCRE2  functions  do  not  handle
+       and therefore want to skip these checks in  order  to  improve  perfor-
-       this, expecting strings to be in host byte order.
+       mance,  for  example in the case of a long subject string that is being
-
+       scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com-
-       A UTF string is checked before any other processing takes place. In the
+       pile  time  or at match time, PCRE2 assumes that the pattern or subject
       case of pcre2_match()  and  pcre2_dfa_match()  calls  with  a  non-zero
       starting  offset, the check is applied only to that part of the subject
       that could be inspected during matching, and there is a check that  the
       starting  offset points to the first code unit of a character or to the
       end of the subject. If there are no lookbehind assertions in  the  pat-
       tern,  the check starts at the starting offset. Otherwise, it starts at
       the length of the longest lookbehind before the starting offset, or  at
       the  start  of the subject if there are not that many characters before
       the starting offset. Note that the sequences \b and \B are  one-charac-
       ter lookbehinds.
       In  addition  to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the  surrogate  area. The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.
       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with  values
       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
       other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)
       In some situations, you may already know that your strings  are  valid,
       and  therefore  want  to  skip these checks in order to improve perfor-
       mance, for example in the case of a long subject string that  is  being
       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
       pile time or at match time, PCRE2 assumes that the pattern  or  subject
       it is given (respectively) contains only valid UTF code unit sequences.
       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
       result  is usually undefined and your program may crash or loop indefi-
       nitely. There is, however, one mode of matching that can handle invalid
       UTF  subject  strings.  This is matching via the JIT optimization using
       the PCRE2_JIT_INVALID_UTF option when calling pcre2_jit_compile().  For
       details, see the pcre2jit documentation.
       Passing  PCRE2_NO_UTF_CHECK  to pcre2_compile() just disables the check
       for the pattern; it does not also apply to subject strings. If you want
-       to  disable the check for a subject string you must pass this option to
+       to  disable  the  check  for  a  subject string you must pass this same
-       pcre2_match() or pcre2_dfa_match().
+       option to pcre2_match() or pcre2_dfa_match().
-       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
+       UTF-16 and UTF-32 strings can indicate their endianness by special code
-       result is undefined and your program may crash or loop indefinitely.
+       known  as  a  byte-order  mark (BOM). The PCRE2 functions do not handle
       this, expecting strings to be in host byte order.
-       Note  that  setting PCRE2_NO_UTF_CHECK at compile time does not disable
+       Unless PCRE2_NO_UTF_CHECK is set, a UTF string is  checked  before  any
-       the error that is given if an escape sequence for  an  invalid  Unicode
+       other  processing  takes  place.  In  the  case  of  pcre2_match()  and
-       code  point  is encountered in the pattern. If you want to allow escape
+       pcre2_dfa_match() calls with a non-zero starting offset, the  check  is
-       sequences such as \x{d800} (a surrogate code point)  you  can  set  the
+       applied only to that part of the subject that could be inspected during
       matching, and there is a check that the starting offset points  to  the
       first  code  unit of a character or to the end of the subject. If there
       are no lookbehind assertions in the pattern, the check  starts  at  the
       starting  offset.   Otherwise,  it  starts at the length of the longest
       lookbehind before the starting offset, or at the start of  the  subject
       if  there are not that many characters before the starting offset. Note
       that the sequences \b and \B are one-character lookbehinds.
       In addition to checking the format of the string, there is a  check  to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the surrogate area. The so-called "non-character" code points  are  not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.
       Characters in the "Surrogate Area" of Unicode are reserved for  use  by
       UTF-16,  where they are used in pairs to encode code points with values
       greater than 0xFFFF. The code points that are encoded by  UTF-16  pairs
       are  available  independently  in  the  UTF-8 and UTF-32 encodings. (In
       other words, the whole surrogate thing is  a  fudge  for  UTF-16  which
       unfortunately messes up UTF-8 and UTF-32.)
       Setting  PCRE2_NO_UTF_CHECK  at compile time does not disable the error
       that is given if an escape sequence for an invalid Unicode  code  point
       is  encountered  in  the pattern. If you want to allow escape sequences
       such  as  \x{d800}  (a  surrogate  code  point)   you   can   set   the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
       sible only in UTF-8 and UTF-32 modes, because these values are not rep-
       resentable in UTF-16.
@@ -11079,8 +11106,8 @@ AUTHOR
 REVISION
-       Last updated: 03 February 2019
+       Last updated: 06 March 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
-
+ 
-
+ 
--- a/doc/pcre2_jit_compile.3
+++ b/doc/pcre2_jit_compile.3
@@ -1,4 +1,4 @@
-.TH PCRE2_JIT_COMPILE 3 "21 October 2014" "PCRE2 10.00"
+.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH SYNOPSIS
@@ -29,6 +29,7 @@ bits:
  PCRE2_JIT_COMPLETE      compile code for full matching
  PCRE2_JIT_PARTIAL_SOFT  compile code for soft partial matching
  PCRE2_JIT_PARTIAL_HARD  compile code for hard partial matching
+ PCRE2_JIT_INVALID_UTF   compile code to handle invalid UTF
 .sp
 The yield of the function is 0 for success, or a negative error code otherwise.
 In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
--- a/doc/pcre2jit.3
+++ b/doc/pcre2jit.3
@@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "16 October 2018" "PCRE2 10.33"
+.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@@ -120,6 +120,28 @@ support is not available, or the pattern was not processed by
 pattern.
 .
 .
+.SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
+.rs
+.sp
+When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
+function expects its subject string to be a valid sequence of UTF code units.
+If it is not, the result is undefined. This is also true by default of matching
+via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
+\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
+UTF is compiled.
+.P
+In this mode, an invalid code unit sequence never matches any pattern item. It
+does not match dot, it does not match \ep{Any}, it does not even match negative
+items such as [^X]. A lookbehind assertion fails if it encounters an invalid
+sequence while moving the current point backwards. In other words, an invalid
+UTF code unit sequence acts as a barrier which no match can cross. Reaching an
+invalid sequence causes an immediate backtrack.
+.P
+Using this option, an application can run matches in arbitrary data, knowing
+that any matched strings that are returned will be valid UTF. This can be
+useful when searching for text in executable or other binary files.
+.
+.
 .SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
 .rs
 .sp
@@ -416,6 +438,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 16 October 2018
+Last updated: 06 March 2019
-Copyright (c) 1997-2018 University of Cambridge.
+Copyright (c) 1997-2019 University of Cambridge.
 .fi
--- a/doc/pcre2unicode.3
+++ b/doc/pcre2unicode.3
@@ -1,4 +1,4 @@
-.TH PCRE2UNICODE 3 "03 February 2019" "PCRE2 10.33"
+.TH PCRE2UNICODE 3 "06 March 2019" "PCRE2 10.33"
 .SH NAME
 PCRE - Perl-compatible regular expressions (revised API)
 .SH "UNICODE AND UTF SUPPORT"
@@ -230,23 +230,46 @@ adjacent characters.
 .rs
 .sp
 When the PCRE2_UTF option is set, the strings passed as patterns and subjects
-are (by default) checked for validity on entry to the relevant functions.
+are (by default) checked for validity on entry to the relevant functions. If an
-If an invalid UTF string is passed, an negative error code is returned. The
+invalid UTF string is passed, a negative error code is returned. The code unit
-code unit offset to the offending character can be extracted from the match
+offset to the offending character can be extracted from the match data block by
-data block by calling \fBpcre2_get_startchar()\fP, which is used for this
+calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
-purpose after a UTF error.
+error.
 .P
+In some situations, you may already know that your strings are valid, and
+therefore want to skip these checks in order to improve performance, for
+example in the case of a long subject string that is being scanned repeatedly.
+If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+PCRE2 assumes that the pattern or subject it is given (respectively) contains
+only valid UTF code unit sequences.
+.P
+If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
+is usually undefined and your program may crash or loop indefinitely. There is,
+however, one mode of matching that can handle invalid UTF subject strings. This
+is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
+when calling \fBpcre2_jit_compile()\fP. For details, see the
+.\" HREF
+\fBpcre2jit\fP
+.\"
+documentation.
+.P
+Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
+the pattern; it does not also apply to subject strings. If you want to disable
+the check for a subject string you must pass this same option to
+\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
+.P
 UTF-16 and UTF-32 strings can indicate their endianness by a special code known
 as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
 strings to be in host byte order.
 .P
-A UTF string is checked before any other processing takes place. In the case of
+Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
-\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP calls with a non-zero starting
+processing takes place. In the case of \fBpcre2_match()\fP and
-offset, the check is applied only to that part of the subject that could be
+\fBpcre2_dfa_match()\fP calls with a non-zero starting offset, the check is
-inspected during matching, and there is a check that the starting offset points
+applied only to that part of the subject that could be inspected during
-to the first code unit of a character or to the end of the subject. If there
+matching, and there is a check that the starting offset points to the first
-are no lookbehind assertions in the pattern, the check starts at the starting
+code unit of a character or to the end of the subject. If there are no
-offset. Otherwise, it starts at the length of the longest lookbehind before the
+lookbehind assertions in the pattern, the check starts at the starting offset.
+Otherwise, it starts at the length of the longest lookbehind before the
 starting offset, or at the start of the subject if there are not that many
 characters before the starting offset. Note that the sequences \eb and \eB are
 one-character lookbehinds.
@@ -263,28 +286,12 @@ independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
 surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
 UTF-32.)
 .P
-In some situations, you may already know that your strings are valid, and
+Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
-therefore want to skip these checks in order to improve performance, for
+given if an escape sequence for an invalid Unicode code point is encountered in
-example in the case of a long subject string that is being scanned repeatedly.
+the pattern. If you want to allow escape sequences such as \ex{d800} (a
-If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
+surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
-PCRE2 assumes that the pattern or subject it is given (respectively) contains
+option. However, this is possible only in UTF-8 and UTF-32 modes, because these
-only valid UTF code unit sequences.
+values are not representable in UTF-16.
-.P
-Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
-the pattern; it does not also apply to subject strings. If you want to disable
-the check for a subject string you must pass this option to \fBpcre2_match()\fP
-or \fBpcre2_dfa_match()\fP.
-.P
-If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
-is undefined and your program may crash or loop indefinitely.
-.P
-Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
-that is given if an escape sequence for an invalid Unicode code point is
-encountered in the pattern. If you want to allow escape sequences such as
-\ex{d800} (a surrogate code point) you can set the
-PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
-only in UTF-8 and UTF-32 modes, because these values are not representable in
-UTF-16.
 .
 .
 .\" HTML <a name="utf8strings"></a>
@@ -393,6 +400,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 03 February 2019
+Last updated: 06 March 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
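
Note on the validity rules described in the pcre2unicode changes above: the text says a UTF string is valid when it is well formed and every code point lies in the range U+0 to U+10FFFF, excluding the surrogate area, with "non-character" code points allowed. The following is a minimal sketch of such a check for UTF-8 only. It is not PCRE2's implementation; the function name `first_invalid_utf8` is hypothetical, and the rejection of overlong encodings is an assumption taken from the usual UTF-8 well-formedness rules rather than from this commit. Like the offset that the documentation says `pcre2_get_startchar()` reports after a UTF error, the return value is the code unit offset of the offending sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Returns the offset of the first
   invalid UTF-8 sequence in s[0..len), or len if the whole string is valid. */
size_t first_invalid_utf8(const uint8_t *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL || len == 0);

  while (i < len) {
    uint8_t c = s[i];
    size_t extra;        /* number of continuation bytes expected */
    uint32_t cp, min;    /* decoded code point, smallest non-overlong value */

    if (c < 0x80) { i++; continue; }                     /* ASCII */
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return i;       /* stray continuation byte, or 0xF8..0xFF */

    if (len - i <= extra) return i;                      /* truncated sequence */

    for (size_t k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) return i;           /* bad continuation */
      cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
    }

    if (cp < min) return i;                              /* overlong encoding */
    if (cp > 0x10FFFF) return i;                         /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return i;          /* surrogate area */

    i += extra + 1;      /* non-characters such as U+FFFF are accepted */
  }
  return len;
}
```

An application that cannot guarantee this property for its subject strings is the case the new PCRE2_JIT_INVALID_UTF text addresses; a pre-check like the one sketched here is only needed when the default validity check has been disabled with PCRE2_NO_UTF_CHECK and that JIT option is not in use.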