Update pcre2test with the /utf8_input option, for generating wide characters in non-UTF 16-bit and 32-bit modes.
2016-08-03 09:01:02 +00:00
parent 5b6c797a4d
commit 69c9d81e43
14 changed files with 589 additions and 304 deletions
--- a/7
+++ b/7
@@ -2,6 +2,13 @@ Change Log for PCRE2
 --------------------
 Version 10.23 xx-xxxxxx-2016
 ----------------------------
 1. Extended pcre2test with the utf8_input modifier so that it is able to
 generate all possible 16-bit and 32-bit code unit values in non-UTF modes.
 Version 10.22 29-July-2016
 --------------------------
--- a/configure.ac
+++ b/configure.ac
@@ -9,9 +9,9 @@ dnl The PCRE2_PRERELEASE feature is for identifying release candidates. It might
 dnl be defined as -RC2, for example. For real releases, it should be empty.
 m4_define(pcre2_major, [10])
-m4_define(pcre2_minor, [22])
+m4_define(pcre2_minor, [23])
-m4_define(pcre2_prerelease, [])
+m4_define(pcre2_prerelease, [-RC1])
-m4_define(pcre2_date, [2016-07-29])
+m4_define(pcre2_date, [2016-08-01])
 # NOTE: The CMakeLists.txt file searches for the above variables in the first
 # 50 lines of this file. Please update that if the variables above are moved.
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.
 <P>
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original <b>pcretest</b> program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
@@ -77,32 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 <b>pcre2test</b> program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
-before being passed to the library functions. Results are converted back to
+format before being passed to the library functions. Results are converted back
-8-bit code units for output.
+to 8-bit code units for output.
 </P>
 <P>
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, <b>pcre_compile()</b>. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
-</P>
+<a name="inputencoding"></a></P>
 <br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
 <P>
 Input to <b>pcre2test</b> is processed line by line, either by calling the C
-library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
+library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
-below). The input is processed using using C's string functions, so must not
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
-contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
+no further data is read, so this character should be avoided unless you really
-treats any bytes other than newline as data characters. In some Windows
+want that action.
 environments character 26 (hex 1A) causes an immediate end of file, and no
 further data is read.
 </P>
 <P>
-For maximum portability, therefore, it is safest to avoid non-printing
+The input is processed using using C's string functions, so must not
-characters in <b>pcre2test</b> input files. There is a facility for specifying
+contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
-some or all of a pattern's characters as hexadecimal pairs, thus making it
+treats any bytes other than newline as data characters. An error is generated
-possible to include binary zeroes in a pattern for testing purposes. Subject
+if a binary zero is encountered. Subject lines are processed for backslash
-lines are processed for backslash escapes, which makes it possible to include
+escapes, which makes it possible to include any data value in strings that are
-any data value.
+passed to the library for matching. For patterns, there is a facility for
 specifying some or all of the 8-bit input characters as hexadecimal pairs,
 which makes it possible to include binary zeros.
 </P>
 <br><b>
 Input for the 16-bit and 32-bit libraries
 </b><br>
 <P>
 When testing the 16-bit or 32-bit libraries, there is a need to be able to
 generate character code points greater than 255 in the strings that are passed
 to the library. For subject lines, backslash escapes can be used. In addition,
 when the <b>utf</b> modifier (see
 <a href="#optionmodifiers">"Setting compilation options"</a>
 below) is set, the pattern and any following subject lines are interpreted as
 UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate. 
 </P>
 <P>
 For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
 used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
 or 32-bit mode. It causes the pattern and following subject lines to be treated
 as UTF-8 according to the original definition (RFC 2279), which allows for
 character values up to 0x7fffffff. Each character is placed in one 16-bit or
 32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
 to occur).
 </P>
 <P>
 UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
 values can be handled by the 32-bit library. When testing this library in
 non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
 byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
 character's value. This is the only way of passing such code points in a
 pattern string. For subject strings, using an escape sequence is preferable.
 </P>
 <br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
 <P>
@@ -553,7 +582,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
 non-printing characters in output strings to be printed using the \x{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and 
 subject strings to be translated to UTF-16 or UTF-32, respectively, before
 being passed to library functions.
 <a name="controlmodifiers"></a></P>
 <br><b>
 Setting compilation controls
@@ -584,6 +615,7 @@ about the pattern:
      pushcopy                  push a copy onto the stack
      stackguard=&#60;number&#62;       test the stackguard feature
      tables=[0|1|2]            select internal tables
      utf8_input                treat input as UTF-8 
 </pre>
 The effects of these modifiers are described in the following sections.
 </P>
@@ -684,7 +716,8 @@ nine characters, only two of which are specified in hexadecimal:
  /ab "literal" 32/hex
 </pre>
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
 mutually exclusive.
 </P>
 <P>
 By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
@@ -693,6 +726,19 @@ patterns specified with the <b>hex</b> modifier, the actual length of the
 pattern is passed.
 </P>
 <br><b>
 Specifying wide characters in 16-bit and 32-bit modes
 </b><br>
 <P>
 In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and 
 translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing 
 the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
 can be used. It is mutually exclusive with <b>utf</b>. Input lines are
 interpreted as UTF-8 as a means of specifying wide characters. More details are
 given in
 <a href="#inputencoding">"Input encoding"</a>
 above.
 </P>
 <br><b>
 Generating long repetitive patterns
 </b><br>
 <P>
@@ -708,7 +754,8 @@ are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
 example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
 by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
 mutually exclusive.
 </P>
 <P>
 If part of an expanded pattern looks like an expansion, but is really part of
@@ -1706,7 +1753,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 <br>
 Copyright &copy; 1997-2016 University of Cambridge.
 <br>
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
+.TH PCRE2TEST 1 "02 August 2016" "PCRE 10.23"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -29,7 +29,7 @@ subject is processed, and what output is produced.
 .P
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original \fBpcretest\fP program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
@@ -47,32 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 \fBpcre2test\fP program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
-before being passed to the library functions. Results are converted back to
+format before being passed to the library functions. Results are converted back
-8-bit code units for output.
+to 8-bit code units for output.
 .P
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, \fBpcre_compile()\fP. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
 .
 .
 .\" HTML <a name="inputencoding"></a>
 .SH "INPUT ENCODING"
 .rs
 .sp
 Input to \fBpcre2test\fP is processed line by line, either by calling the C
-library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
-below). The input is processed using using C's string functions, so must not
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
-contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
+no further data is read, so this character should be avoided unless you really
-treats any bytes other than newline as data characters. In some Windows
+want that action.
 environments character 26 (hex 1A) causes an immediate end of file, and no
 further data is read.
 .P
-For maximum portability, therefore, it is safest to avoid non-printing
+The input is processed using using C's string functions, so must not
-characters in \fBpcre2test\fP input files. There is a facility for specifying
+contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
-some or all of a pattern's characters as hexadecimal pairs, thus making it
+treats any bytes other than newline as data characters. An error is generated
-possible to include binary zeroes in a pattern for testing purposes. Subject
+if a binary zero is encountered. Subject lines are processed for backslash
-lines are processed for backslash escapes, which makes it possible to include
+escapes, which makes it possible to include any data value in strings that are
-any data value.
+passed to the library for matching. For patterns, there is a facility for
 specifying some or all of the 8-bit input characters as hexadecimal pairs,
 which makes it possible to include binary zeros.
 .
 .
 .SS "Input for the 16-bit and 32-bit libraries"
 .rs
 .sp
 When testing the 16-bit or 32-bit libraries, there is a need to be able to
 generate character code points greater than 255 in the strings that are passed
 to the library. For subject lines, backslash escapes can be used. In addition,
 when the \fButf\fP modifier (see
 .\" HTML <a href="#optionmodifiers">
 .\" </a>
 "Setting compilation options"
 .\"
 below) is set, the pattern and any following subject lines are interpreted as
 UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate. 
 .P
 For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
 used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
 or 32-bit mode. It causes the pattern and following subject lines to be treated
 as UTF-8 according to the original definition (RFC 2279), which allows for
 character values up to 0x7fffffff. Each character is placed in one 16-bit or
 32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
 to occur).
 .P
 UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
 values can be handled by the 32-bit library. When testing this library in
 non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
 byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
 character's value. This is the only way of passing such code points in a
 pattern string. For subject strings, using an escape sequence is preferable.
 .
 .
 .SH "COMMAND LINE OPTIONS"
@@ -515,7 +546,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
 non-printing characters in output strings to be printed using the \ex{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and 
 subject strings to be translated to UTF-16 or UTF-32, respectively, before
 being passed to library functions.
 .
 .
 .\" HTML <a name="controlmodifiers"></a>
@@ -547,6 +580,7 @@ about the pattern:
      pushcopy                  push a copy onto the stack
      stackguard=<number>       test the stackguard feature
      tables=[0|1|2]            select internal tables
      utf8_input                treat input as UTF-8 
 .sp
 The effects of these modifiers are described in the following sections.
 .
@@ -642,7 +676,8 @@ nine characters, only two of which are specified in hexadecimal:
  /ab "literal" 32/hex
 .sp
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
 mutually exclusive.
 .P
 By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
 \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
@@ -650,6 +685,22 @@ patterns specified with the \fBhex\fP modifier, the actual length of the
 pattern is passed.
 .
 .
 .SS "Specifying wide characters in 16-bit and 32-bit modes"
 .rs
 .sp
 In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and 
 translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing 
 the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
 can be used. It is mutually exclusive with \fButf\fP. Input lines are
 interpreted as UTF-8 as a means of specifying wide characters. More details are
 given in
 .\" HTML <a href="#inputencoding">
 .\" </a>
 "Input encoding"
 .\"
 above.
 .
 .
 .SS "Generating long repetitive patterns"
 .rs
 .sp
@@ -665,7 +716,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
 example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
 by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
 mutually exclusive.
 .P
 If part of an expanded pattern looks like an expansion, but is really part of
 the actual pattern, unwanted expansion can be avoided by giving two values in
@@ -1682,6 +1734,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 Copyright (c) 1997-2016 University of Cambridge.
 .fi
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@@ -26,7 +26,7 @@ SYNOPSIS
       As the original fairly simple PCRE library evolved,  it  acquired  many
       different  features,  and  as  a  result, the original pcretest program
-       ended up with a lot of options in a messy, arcane syntax,  for  testing
+       ended up with a lot of options in a messy, arcane  syntax  for  testing
       all the features. The move to the new PCRE2 API provided an opportunity
       to re-implement the test program as pcre2test, with a cleaner  modifier
       syntax.  Nevertheless,  there are still many obscure modifiers, some of
@@ -45,7 +45,7 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
       installed. The pcre2test program can be used to test all the libraries.
       However, its own input and output are  always  in  8-bit  format.  When
       testing  the  16-bit  or 32-bit libraries, patterns and subject strings
-       are converted to 16- or  32-bit  format  before  being  passed  to  the
+       are converted to 16-bit or 32-bit format before  being  passed  to  the
       library  functions.  Results are converted back to 8-bit code units for
       output.
@@ -58,19 +58,46 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
 INPUT ENCODING
       Input  to  pcre2test is processed line by line, either by calling the C
-       library's fgets() function, or via the libreadline library (see below).
+       library's fgets() function, or via the  libreadline  library.  In  some
       Windows  environments  character 26 (hex 1A) causes an immediate end of
       file, and no further data is read, so this character should be  avoided
       unless you really want that action.
       The  input  is  processed using using C's string functions, so must not
       contain binary zeroes, even though in Unix-like  environments,  fgets()
-       treats any bytes other than newline as data characters. In some Windows
+       treats  any  bytes  other  than newline as data characters. An error is
-       environments character 26 (hex 1A) causes an immediate end of file, and
+       generated if a binary zero is encountered. Subject lines are  processed
-       no further data is read.
+       for  backslash  escapes,  which  makes  it possible to include any data
       value in strings that are passed to the library for matching. For  pat-
       terns,  there  is  a  facility  for specifying some or all of the 8-bit
       input characters as hexadecimal  pairs,  which  makes  it  possible  to
       include binary zeros.
-       For  maximum portability, therefore, it is safest to avoid non-printing
+   Input for the 16-bit and 32-bit libraries
-       characters in pcre2test input files. There is a facility for specifying
+
-       some or all of a pattern's characters as hexadecimal pairs, thus making
+       When testing the 16-bit or 32-bit libraries, there is a need to be able
-       it possible to include binary zeroes in a pattern for testing purposes.
+       to generate character code points greater than 255 in the strings  that
-       Subject  lines are processed for backslash escapes, which makes it pos-
+       are  passed to the library. For subject lines, backslash escapes can be
-       sible to include any data value.
+       used. In addition, when the  utf  modifier  (see  "Setting  compilation
       options" below) is set, the pattern and any following subject lines are
       interpreted as UTF-8 strings and translated  to  UTF-16  or  UTF-32  as
       appropriate.
       For  non-UTF testing of wide characters, the utf8_input modifier can be
       used. This is mutually exclusive with  utf,  and  is  allowed  only  in
       16-bit  or  32-bit  mode.  It  causes the pattern and following subject
       lines to be treated as UTF-8 according to the original definition  (RFC
       2279), which allows for character values up to 0x7fffffff. Each charac-
       ter is placed in one 16-bit or 32-bit code unit (in  the  16-bit  case,
       values greater than 0xffff cause an error to occur).
       UTF-8  is  not  capable of encoding values greater than 0x7fffffff, but
       such values can be handled by the 32-bit  library.  When  testing  this
       library  in  non-UTF mode with utf8_input set, if any character is pre-
       ceded by the byte 0xff (which is an illegal byte in  UTF-8)  0x80000000
       is added to the character's value. This is the only way of passing such
       code points in a pattern string. For subject strings, using  an  escape
       sequence is preferable.
 COMMAND LINE OPTIONS
@@ -500,7 +527,9 @@ PATTERN MODIFIERS
       As well as turning on the PCRE2_UTF option, the utf modifier causes all
       non-printing  characters  in  output  strings  to  be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in  hex
-       without the curly brackets.
+       without  the  curly brackets. Setting utf in 16-bit or 32-bit mode also
       causes pattern and subject  strings  to  be  translated  to  UTF-16  or
       UTF-32, respectively, before being passed to library functions.
   Setting compilation controls
@@ -529,6 +558,7 @@ PATTERN MODIFIERS
             pushcopy                  push a copy onto the stack
             stackguard=<number>       test the stackguard feature
             tables=[0|1|2]            select internal tables
             utf8_input                treat input as UTF-8
       The effects of these modifiers are described in the following sections.
@@ -619,13 +649,23 @@ PATTERN MODIFIERS
         /ab "literal" 32/hex
       Either  single or double quotes may be used. There is no way of includ-
-       ing the delimiter within a substring.
+       ing the delimiter within a substring. The hex and expand modifiers  are
       mutually exclusive.
       By  default,  pcre2test  passes  patterns as zero-terminated strings to
       pcre2_compile(), giving the length as  PCRE2_ZERO_TERMINATED.  However,
       for  patterns specified with the hex modifier, the actual length of the
       pattern is passed.
   Specifying wide characters in 16-bit and 32-bit modes
       In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
       and  translated  to  UTF-16 or UTF-32 when the utf modifier is set. For
       testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
       modifier  can  be  used. It is mutually exclusive with utf. Input lines
       are interpreted as UTF-8 as a means of specifying wide characters. More
       details are given in "Input encoding" above.
   Generating long repetitive patterns
       Some  tests use long patterns that are very repetitive. Instead of cre-
@@ -640,7 +680,8 @@ PATTERN MODIFIERS
       ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
       cannot  be  nested. An initial "\[" sequence is recognized only if "]{"
       followed by decimal digits and "}" is found later in  the  pattern.  If
-       not, the characters remain in the pattern unaltered.
+       not, the characters remain in the pattern unaltered. The expand and hex
       modifiers are mutually exclusive.
       If part of an expanded pattern looks like an expansion, but  is  really
       part of the actual pattern, unwanted expansion can be avoided by giving
@@ -1548,5 +1589,5 @@ AUTHOR
 REVISION
-       Last updated: 06 July 2016
+       Last updated: 02 August 2016
       Copyright (c) 1997-2016 University of Cambridge.
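Review note: the documentation hunks above describe utf8_input in terms of the original RFC 2279 form of UTF-8 (up to 6 bytes, values up to 0x7fffffff), with a leading 0xff byte adding 0x80000000 in 32-bit non-UTF mode. The following is a minimal sketch of how the bytes for such a test input might be produced. The helper name encode_rfc2279 is hypothetical and is not part of pcre2test; the byte ranges follow RFC 2279, and the 0xff handling follows the description in this commit's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of pcre2test): encode one 32-bit value as
   RFC 2279 UTF-8. Values above 0x7fffffff get a leading 0xff byte, matching
   the utf8_input convention described in the documentation above.
   The output buffer must hold at least 7 bytes. Returns the byte count. */
static size_t encode_rfc2279(uint32_t c, uint8_t *out)
{
  /* Largest value representable with 0..5 continuation bytes. */
  static const uint32_t limit[] =
    { 0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff, 0x7fffffff };
  /* Lead-byte marker bits for each sequence length. */
  static const uint8_t lead[] = { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
  size_t n = 0, extra, i;

  assert(out != NULL);

  if (c > 0x7fffffffu)          /* top bit set: emit the 0xff prefix */
    {
    out[n++] = 0xff;
    c &= 0x7fffffffu;
    }

  for (extra = 0; c > limit[extra]; extra++) ;   /* continuation bytes needed */

  for (i = extra; i > 0; i--)   /* fill continuation bytes from the end */
    {
    out[n + i] = (uint8_t)(0x80 | (c & 0x3f));
    c >>= 6;
    }
  out[n] = (uint8_t)(lead[extra] | c);
  return n + extra + 1;
}
```

Under these assumptions, 0x20ac encodes as the three bytes e2 82 ac, and 0x80000041 encodes as ff 41. In 16-bit mode the documentation says values above 0xffff cause an error, so only the 1-byte to 3-byte forms are useful there.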
--- a/src/pcre2.h
+++ b/src/pcre2.h
@@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
 /* The current PCRE version information. */
 #define PCRE2_MAJOR          10
-#define PCRE2_MINOR          22
+#define PCRE2_MINOR          23
-#define PCRE2_PRERELEASE     
+#define PCRE2_PRERELEASE     -RC1
-#define PCRE2_DATE           2016-07-29
+#define PCRE2_DATE           2016-08-01
 /* When an application links to a PCRE DLL in Windows, the symbols that are
 imported have to be identified as such. When building PCRE2, the appropriate
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@@ -430,8 +430,8 @@ so many of them that they are split into two fields. */
 #define CTL_PUSH                         0x01000000u
 #define CTL_PUSHCOPY                     0x02000000u
 #define CTL_STARTCHAR                    0x04000000u
-#define CTL_ZERO_TERMINATE               0x08000000u
+#define CTL_UTF8_INPUT                   0x08000000u
-/* Spare                                 0x10000000u  */
+#define CTL_ZERO_TERMINATE               0x10000000u
 /* Spare                                 0x20000000u  */
 #define CTL_NL_SET                       0x40000000u  /* Informational */
 #define CTL_BSR_SET                      0x80000000u  /* Informational */
@@ -460,7 +460,8 @@ data line. */
                    CTL_GLOBAL|\
                    CTL_MARK|\
                    CTL_MEMORY|\
-                    CTL_STARTCHAR)
+                    CTL_STARTCHAR|\
                    CTL_UTF8_INPUT)
 #define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
                    CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
@@ -621,6 +622,7 @@ static modstruct modlist[] = {
  { "ungreedy",                   MOD_PAT,  MOD_OPT, PCRE2_UNGREEDY,             PO(options) },
  { "use_offset_limit",           MOD_PAT,  MOD_OPT, PCRE2_USE_OFFSET_LIMIT,     PO(options) },
  { "utf",                        MOD_PATP, MOD_OPT, PCRE2_UTF,                  PO(options) },
  { "utf8_input",                 MOD_PAT,  MOD_CTL, CTL_UTF8_INPUT,             PO(control) },
  { "zero_terminate",             MOD_DAT,  MOD_CTL, CTL_ZERO_TERMINATE,         DO(control) }
 };
@@ -673,6 +675,7 @@ static uint32_t exclusive_pat_controls[] = {
 /* Data controls that are mutually exclusive. At present these are all in the
 first control word. */
 static uint32_t exclusive_dat_controls[] = {
  CTL_ALLUSEDTEXT | CTL_STARTCHAR,
  CTL_FINDLIMITS  | CTL_NULLCONTEXT };
@@ -2715,16 +2718,22 @@ return i + 1;
 #ifdef SUPPORT_PCRE2_16
 /*************************************************
-*          Convert pattern to 16-bit             *
+*           Convert string to 16-bit             *
 *************************************************/
-/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
+/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
-all the input bytes are ASCII, the space needed for a 16-bit string is exactly
+the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
-double the 8-bit size. Otherwise, the size needed for a 16-bit string is no
+code values from 0 to 0x7fffffff. However, values greater than the later UTF
-more than double, because up to 0xffff uses no more than 3 bytes in UTF-8 but
+limit of 0x10ffff cause an error. In non-UTF mode the input is interpreted as
-possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes in
+UTF-8 if the utf8_input modifier is set, but an error is generated for values
-UTF-16. The result is always left in pbuffer16. Impose a minimum size to save
+greater than 0xffff.
-repeated re-sizing.
+
 If all the input bytes are ASCII, the space needed for a 16-bit string is
 exactly double the 8-bit size. Otherwise, the size needed for a 16-bit string
 is no more than double, because up to 0xffff uses no more than 3 bytes in UTF-8
 but possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes
 in UTF-16. The result is always left in pbuffer16. Impose a minimum size to
 save repeated re-sizing.
 Note that this function does not object to surrogate values. This is
 deliberate; it makes it possible to construct UTF-16 strings that are invalid,
@@ -2732,7 +2741,7 @@ for the purpose of testing that they are correctly faulted.
 Arguments:
  p          points to a byte string
-  utf        non-zero if converting to UTF-16
+  utf        true in UTF mode
  lenptr     points to number of bytes in the string (excluding trailing zero)
 Returns:     0 on success, with the length updated to the number of 16-bit
@@ -2763,7 +2772,7 @@ if (pbuffer16_size < 2*len + 2)
  }
 pp = pbuffer16;
-if (!utf)
+if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
  {
  for (; len > 0; len--) *pp++ = *p++;
  }
@@ -2772,12 +2781,12 @@ else while (len > 0)
  uint32_t c;
  int chlen = utf82ord(p, &c);
  if (chlen <= 0) return -1;
  if (!utf && c > 0xffff) return -3;
  if (c > 0x10ffff) return -2;
  p += chlen;
  len -= chlen;
  if (c < 0x10000) *pp++ = c; else
    {
    if (!utf) return -3;
    c -= 0x10000;
    *pp++ = 0xD800 | (c >> 10);
    *pp++ = 0xDC00 | (c & 0x3ff);
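Review note: the per-character logic in this hunk can be read in isolation as the sketch below. It assumes the code point has already been decoded from UTF-8; the function name put16_sketch and its signature are hypothetical, while the return codes -2 and -3 and the surrogate arithmetic are taken from the lines above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-alone version of the per-character step shown in the
   hunk above. Writes one or two 16-bit code units to pp.
   Returns the number of units written, -2 if the value is beyond the UTF
   limit of 0x10ffff, or -3 if a non-UTF value does not fit in 16 bits. */
static int put16_sketch(uint32_t c, int utf, uint16_t *pp)
{
  assert(pp != NULL);

  if (!utf && c > 0xffff) return -3;   /* utf8_input: one code unit only */
  if (c > 0x10ffff) return -2;         /* UTF mode: beyond the Unicode range */

  if (c < 0x10000)
    {
    pp[0] = (uint16_t)c;               /* fits in a single code unit */
    return 1;
    }

  /* UTF mode only from here on: split into a surrogate pair. */
  c -= 0x10000;
  pp[0] = (uint16_t)(0xD800 | (c >> 10));
  pp[1] = (uint16_t)(0xDC00 | (c & 0x3ff));
  return 2;
}
```

With these checks in this order, the second `if (!utf) return -3;` inside the surrogate branch of the diff appears to be unreachable, because any non-UTF value of 0x10000 or more has already been rejected by the first test. It is harmless as a defensive check.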
@@ -2794,15 +2803,25 @@ return 0;
 #ifdef SUPPORT_PCRE2_32
 /*************************************************
-*          Convert pattern to 32-bit             *
+*           Convert string to 32-bit             *
 *************************************************/
-/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
+/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
-all the input bytes are ASCII, the space needed for a 32-bit string is exactly
+the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
-four times the 8-bit size. Otherwise, the size needed for a 32-bit string is no
+code values from 0 to 0x7fffffff. However, values greater than the later UTF 
-more than four times, because the number of characters must be less than the
+limit of 0x10ffff cause an error.
-number of bytes. The result is always left in pbuffer32. Impose a minimum size
+
-to save repeated re-sizing.
+In non-UTF mode the input is interpreted as UTF-8 if the utf8_input modifier
 is set, and no limit is imposed. There is special interpretation of the 0xff
 byte (which is illegal in UTF-8) in this case: it causes the top bit of the
 next character to be set. This provides a way of generating 32-bit characters
 greater than 0x7fffffff.
 If all the input bytes are ASCII, the space needed for a 32-bit string is
 exactly four times the 8-bit size. Otherwise, the size needed for a 32-bit
 string is no more than four times, because the number of characters must be
 less than the number of bytes. The result is always left in pbuffer32. Impose a
 minimum size to save repeated re-sizing.
 Note that this function does not object to surrogate values. This is
 deliberate; it makes it possible to construct UTF-32 strings that are invalid,
@@ -2810,7 +2829,7 @@ for the purpose of testing that they are correctly faulted.
 Arguments:
  p          points to a byte string
-  utf        true if UTF-8 (to be converted to UTF-32)
+  utf        true in UTF mode
  lenptr     points to number of bytes in the string (excluding trailing zero)
 Returns:     0 on success, with the length updated to the number of 32-bit
@@ -2840,19 +2859,29 @@ if (pbuffer32_size < 4*len + 4)
  }
 pp = pbuffer32;
-if (!utf)
+
 if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
  {
  for (; len > 0; len--) *pp++ = *p++;
  }
 else while (len > 0)
  {
  int chlen; 
  uint32_t c;
-  int chlen = utf82ord(p, &c);
+  uint32_t topbit = 0;
  if (!utf && *p == 0xff && len > 1)
    {
    topbit = 0x80000000u;
    p++;
    len--;
    }     
  chlen = utf82ord(p, &c);
  if (chlen <= 0) return -1;
  if (utf && c > 0x10ffff) return -2;
  p += chlen;
  len -= chlen;
-  *pp++ = c;
+  *pp++ = c | topbit;
  }
 *pp = 0;
@ -3627,7 +3656,7 @@ Returns:      nothing
 static void
 show_controls(uint32_t controls, uint32_t controls2, const char *before)
 {
-fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
  before,
  ((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
  ((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
@ -3662,6 +3691,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
  ((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
  ((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
  ((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
  ((controls & CTL_UTF8_INPUT) != 0)? " utf8_input" : "",
  ((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
 }
@ -3759,13 +3789,13 @@ warning we must initialize cblock_size. */
 cblock_size = 0;
 #ifdef SUPPORT_PCRE2_8
-if (test_mode == 8) cblock_size = sizeof(pcre2_real_code_8);
+if (test_mode == PCRE8_MODE) cblock_size = sizeof(pcre2_real_code_8);
 #endif
 #ifdef SUPPORT_PCRE2_16
-if (test_mode == 16) cblock_size = sizeof(pcre2_real_code_16);
+if (test_mode == PCRE16_MODE) cblock_size = sizeof(pcre2_real_code_16);
 #endif
 #ifdef SUPPORT_PCRE2_32
-if (test_mode == 32) cblock_size = sizeof(pcre2_real_code_32);
+if (test_mode == PCRE32_MODE) cblock_size = sizeof(pcre2_real_code_32);
 #endif
 (void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
@ -4507,6 +4537,23 @@ patlen = p - buffer - 2;
 if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
 utf = (pat_patctl.options & PCRE2_UTF) != 0;
 /* The utf8_input modifier is not allowed in 8-bit mode, and is mutually 
 exclusive with the utf modifier. */
 if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
  {
  if (test_mode == PCRE8_MODE)
    {
    fprintf(outfile, "** The utf8_input modifier is not allowed in 8-bit mode\n");
    return PR_SKIP;
    }
  if (utf)
    {
    fprintf(outfile, "** The utf and utf8_input modifiers are mutually exclusive\n");
    return PR_SKIP; 
    }   
  }
 /* Check for mutually exclusive modifiers. At present, these are all in the
 first control word. */
@ -4738,7 +4785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
  const char *msg = "** Ignored with POSIX interface:";
 #endif
-  if (test_mode != 8)
+  if (test_mode != PCRE8_MODE)
    {
    fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
    return PR_SKIP;
@ -5622,7 +5669,9 @@ if (dbuffer == NULL || needlen >= dbuffer_size)
 SETCASTPTR(q, dbuffer);  /* Sets q8, q16, or q32, as appropriate. */
 /* Scan the data line, interpreting data escapes, and put the result into a
-buffer of the appropriate width. In UTF mode, input can be UTF-8. */
+buffer of the appropriate width. In UTF mode, input is always UTF-8; otherwise,
 in 16- and 32-bit modes, it can be forced to UTF-8 by the utf8_input modifier.
 */
 while ((c = *p++) != 0)
  {
@ -5691,11 +5740,20 @@ while ((c = *p++) != 0)
    continue;
    }
-  /* Handle a non-escaped character */
+  /* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input 
  set, do the fudge for setting the top bit. */
  if (c != '\\')
    {
-    if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
+    uint32_t topbit = 0;
    if (test_mode == PCRE32_MODE && c == 0xff && *p != 0) 
      {
      topbit = 0x80000000;
      c = *p++;
      }  
    if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) && 
      HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
    c |= topbit;
    }
  /* Handle backslash escapes */
--- a/testdata/testinput11
+++ b/testdata/testinput11
@ -353,4 +353,19 @@
 /(*THEN:\[A]{65501})/expand
 # We can use pcre2test's utf8_input modifier to create wide pattern characters,
 # even though this test is run when UTF is not supported.
 /abý¿¿¿¿¿z/utf8_input
    abý¿¿¿¿¿z
    ab\x{7fffffff}z
 /abÿý¿¿¿¿¿z/utf8_input
    abÿý¿¿¿¿¿z
    ab\x{ffffffff}z 
 /abÿAz/utf8_input
    abÿAz
    ab\x{80000041}z 
 # End of testinput11
--- a/testdata/testinput12
+++ b/testdata/testinput12
@ -343,4 +343,8 @@
 /./utf
    \x{110000}
 /(*UTF)abý¿¿¿¿¿z/B
 /abý¿¿¿¿¿z/utf
 # End of testinput12
--- a/testdata/testoutput11-16
+++ b/testdata/testoutput11-16
@ -643,4 +643,22 @@ Subject length lower bound = 1
 /(*THEN:\[A]{65501})/expand
 # We can use pcre2test's utf8_input modifier to create wide pattern characters,
 # even though this test is run when UTF is not supported.
 /abý¿¿¿¿¿z/utf8_input
 ** Failed: character value greater than 0xffff cannot be converted to 16-bit in non-UTF mode
    abý¿¿¿¿¿z
    ab\x{7fffffff}z
 /abÿý¿¿¿¿¿z/utf8_input
 ** Failed: invalid UTF-8 string cannot be converted to 16-bit string
    abÿý¿¿¿¿¿z
    ab\x{ffffffff}z 
 /abÿAz/utf8_input
 ** Failed: invalid UTF-8 string cannot be converted to 16-bit string
    abÿAz
    ab\x{80000041}z 
 # End of testinput11
--- a/testdata/testoutput11-32
+++ b/testdata/testoutput11-32
@ -646,4 +646,25 @@ Subject length lower bound = 1
 /(*THEN:\[A]{65501})/expand
 # We can use pcre2test's utf8_input modifier to create wide pattern characters,
 # even though this test is run when UTF is not supported.
 /abý¿¿¿¿¿z/utf8_input
    abý¿¿¿¿¿z
 0: ab\x{7fffffff}z
    ab\x{7fffffff}z
 0: ab\x{7fffffff}z
 /abÿý¿¿¿¿¿z/utf8_input
    abÿý¿¿¿¿¿z
 0: ab\x{ffffffff}z
    ab\x{ffffffff}z 
 0: ab\x{ffffffff}z
 /abÿAz/utf8_input
    abÿAz
 0: ab\x{80000041}z
    ab\x{80000041}z 
 0: ab\x{80000041}z
 # End of testinput11
--- a/testdata/testoutput12-16
+++ b/testdata/testoutput12-16
@ -1367,4 +1367,15 @@ Subject length lower bound = 2
    \x{110000}
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
 /(*UTF)abý¿¿¿¿¿z/B
 ------------------------------------------------------------------
        Bra
        ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
        Ket
        End
 ------------------------------------------------------------------
 /abý¿¿¿¿¿z/utf
 ** Failed: character value greater than 0x10ffff cannot be converted to UTF
 # End of testinput12
--- a/testdata/testoutput12-32
+++ b/testdata/testoutput12-32
@ -1361,4 +1361,15 @@ Subject length lower bound = 2
    \x{110000}
 Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
 /(*UTF)abý¿¿¿¿¿z/B
 ------------------------------------------------------------------
        Bra
        ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
        Ket
        End
 ------------------------------------------------------------------
 /abý¿¿¿¿¿z/utf
 ** Failed: character value greater than 0x10ffff cannot be converted to UTF
 # End of testinput12
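
The 32-bit conversion changed above, including the 0xff top-bit prefix, can be sketched as a small standalone C file. This is only an illustration under stated assumptions, not the pcre2test code: decode_utf8() and to32_sketch() are hypothetical names, decode_utf8() stands in for pcre2test's utf82ord() and does not reject overlong encodings, and the utf8_input flag is assumed to be set whenever utf is zero. The inputs in main() are the three patterns added to testinput11.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for utf82ord(): decode one RFC 2279 UTF-8 character
(1 to 6 bytes). Returns the number of bytes used, or -1 if the sequence is
malformed or truncated. Overlong forms are not checked in this sketch. */
static int decode_utf8(const uint8_t *p, size_t len, uint32_t *cp)
{
static const uint8_t first_mask[] = { 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 };
int extra, i;
uint32_t c;
if (len == 0) return -1;
c = p[0];
if (c < 0x80) { *cp = c; return 1; }
if (c < 0xc0 || c >= 0xfe) return -1;   /* continuation byte, 0xfe, or 0xff */
extra = (c < 0xe0)? 1 : (c < 0xf0)? 2 : (c < 0xf8)? 3 : (c < 0xfc)? 4 : 5;
if ((size_t)extra >= len) return -1;    /* not enough bytes left */
c &= first_mask[extra];
for (i = 1; i <= extra; i++)
  {
  if ((p[i] & 0xc0) != 0x80) return -1;
  c = (c << 6) | (p[i] & 0x3f);
  }
*cp = c;
return extra + 1;
}

/* Sketch of the conversion loop: returns 0 on success, -1 for invalid UTF-8,
-2 for a value above 0x10ffff in UTF mode. The caller must supply an output
buffer with room for at least len code units. */
static int to32_sketch(const uint8_t *p, size_t len, int utf, uint32_t *out,
  size_t *outlen)
{
size_t n = 0;
assert(p != NULL && out != NULL && outlen != NULL);
while (len > 0)
  {
  int chlen;
  uint32_t c;
  uint32_t topbit = 0;
  if (!utf && *p == 0xff && len > 1)    /* 0xff prefix: set bit 31 of next */
    {
    topbit = 0x80000000u;
    p++;
    len--;
    }
  chlen = decode_utf8(p, len, &c);
  if (chlen <= 0) return -1;
  if (utf && c > 0x10ffff) return -2;
  p += chlen;
  len -= (size_t)chlen;
  out[n++] = c | topbit;
  }
*outlen = n;
return 0;
}

#ifdef SKETCH_MAIN
int main(void)
{
/* "ab\xff" "Az" is split so that 'A' is not read as part of the hex escape. */
static const char *inputs[] = {
  "ab\xfd\xbf\xbf\xbf\xbf\xbfz", "ab\xff\xfd\xbf\xbf\xbf\xbf\xbfz", "ab\xff" "Az" };
static const size_t lengths[] = { 9, 10, 5 };
uint32_t out[16];
size_t n, i;
for (i = 0; i < 3; i++)
  {
  if (to32_sketch((const uint8_t *)inputs[i], lengths[i], 0, out, &n) == 0)
    printf("third code unit: 0x%08" PRIx32 "\n", out[2]);
  }
return 0;
}
#endif
```

Compiled with -DSKETCH_MAIN, the program prints 0x7fffffff, 0xffffffff, and 0x80000041 for the three inputs, matching the values in testoutput11-32. Calling to32_sketch() with utf set to 1 on the first input returns -2, which corresponds to the "greater than 0x10ffff" failure recorded in testoutput12-32.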