Update pcre2test with the /utf8_input option, for generating wide characters in
non-UTF 16-bit and 32-bit modes.
2016-08-03 09:01:02 +00:00
commit 69c9d81e43
parent 5b6c797a4d
14 changed files with 589 additions and 304 deletions
--- a/7
+++ b/7
@ -2,6 +2,13 @@ Change Log for PCRE2
 --------------------


+Version 10.23 xx-xxxxxx-2016
+----------------------------
+
+1. Extended pcre2test with the utf8_input modifier so that it is able to
+generate all possible 16-bit and 32-bit code unit values in non-UTF modes.
+
+
 Version 10.22 29-July-2016
 --------------------------

--- a/configure.ac
+++ b/configure.ac
@ -9,9 +9,9 @@ dnl The PCRE2_PRERELEASE feature is for identifying release candidates. It might
 dnl be defined as -RC2, for example. For real releases, it should be empty.

 m4_define(pcre2_major, [10])
-m4_define(pcre2_minor, [22])
-m4_define(pcre2_prerelease, [])
-m4_define(pcre2_date, [2016-07-29])
+m4_define(pcre2_minor, [23])
+m4_define(pcre2_prerelease, [-RC1])
+m4_define(pcre2_date, [2016-08-01])

 # NOTE: The CMakeLists.txt file searches for the above variables in the first
 # 50 lines of this file. Please update that if the variables above are moved.
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@ -61,7 +61,7 @@ subject is processed, and what output is produced.
 <P>
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original <b>pcretest</b> program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
@ -77,32 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 <b>pcre2test</b> program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
 </P>
 <P>
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, <b>pcre_compile()</b>. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
-</P>
+<a name="inputencoding"></a></P>
 <br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
 <P>
 Input to <b>pcre2test</b> is processed line by line, either by calling the C
-library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
 </P>
 <P>
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in <b>pcre2test</b> input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using C's string functions, so must not
+contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. Subject lines are processed for backslash
+escapes, which makes it possible to include any data value in strings that are
+passed to the library for matching. For patterns, there is a facility for
+specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+</P>
+<br><b>
+Input for the 16-bit and 32-bit libraries
+</b><br>
+<P>
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the <b>utf</b> modifier (see
+<a href="#optionmodifiers">"Setting compilation options"</a>
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate. 
+</P>
+<P>
+For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
+used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+</P>
+<P>
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
+values can be handled by the 32-bit library. When testing this library in
+non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
+byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
+character's value. This is the only way of passing such code points in a
+pattern string. For subject strings, using an escape sequence is preferable.
 </P>
 <br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
 <P>
@ -553,7 +582,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
 non-printing characters in output strings to be printed using the \x{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and 
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
 <a name="controlmodifiers"></a></P>
 <br><b>
 Setting compilation controls
@ -584,6 +615,7 @@ about the pattern:
      pushcopy                  push a copy onto the stack
      stackguard=&#60;number&#62;       test the stackguard feature
      tables=[0|1|2]            select internal tables
+      utf8_input                treat input as UTF-8 
 </pre>
 The effects of these modifiers are described in the following sections.
 </P>
@ -684,7 +716,8 @@ nine characters, only two of which are specified in hexadecimal:
  /ab "literal" 32/hex
 </pre>
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
+mutually exclusive.
 </P>
 <P>
 By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
@ -693,6 +726,19 @@ patterns specified with the <b>hex</b> modifier, the actual length of the
 pattern is passed.
 </P>
 <br><b>
+Specifying wide characters in 16-bit and 32-bit modes
+</b><br>
+<P>
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and 
+translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing 
+the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
+can be used. It is mutually exclusive with <b>utf</b>. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+<a href="#inputencoding">"Input encoding"</a>
+above.
+</P>
+<br><b>
 Generating long repetitive patterns
 </b><br>
 <P>
@ -708,7 +754,8 @@ are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
 example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
 by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
+mutually exclusive.
 </P>
 <P>
 If part of an expanded pattern looks like an expansion, but is really part of
@ -1706,7 +1753,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 <br>
 Copyright &copy; 1997-2016 University of Cambridge.
 <br>
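
The documentation above describes how utf8_input fills 32-bit code units in non-UTF mode: each RFC 2279 UTF-8 character becomes one unit, and a preceding 0xff byte adds 0x80000000. The following is a minimal standalone sketch of that rule, not the pcre2test implementation. The names decode_rfc2279() and to_units32() are hypothetical, and the decoder here is a simplified one that does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one RFC 2279 UTF-8 character (1 to 6 bytes).
   Returns the number of bytes consumed, or 0 if the sequence is malformed
   or runs past the end of the buffer. Overlong forms are not rejected. */
static size_t decode_rfc2279(const uint8_t *p, size_t len, uint32_t *out)
{
  static const uint8_t lead_mask[] = { 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 };
  size_t extra, i;
  uint32_t c;

  assert(p != NULL && out != NULL);
  if (len == 0) return 0;
  if (p[0] < 0x80) extra = 0;
  else if ((p[0] & 0xe0) == 0xc0) extra = 1;
  else if ((p[0] & 0xf0) == 0xe0) extra = 2;
  else if ((p[0] & 0xf8) == 0xf0) extra = 3;
  else if ((p[0] & 0xfc) == 0xf8) extra = 4;
  else if ((p[0] & 0xfe) == 0xfc) extra = 5;
  else return 0;                      /* continuation byte, 0xfe, or 0xff */
  if (extra >= len) return 0;         /* truncated sequence */

  c = p[0] & lead_mask[extra];
  for (i = 1; i <= extra; i++)
    {
    if ((p[i] & 0xc0) != 0x80) return 0;
    c = (c << 6) | (p[i] & 0x3f);
    }
  *out = c;
  return extra + 1;
}

/* Hypothetical converter in the style of non-UTF utf8_input for the 32-bit
   library: one character per code unit, with a 0xff prefix setting the top
   bit of the following character. Returns the number of units written, or
   -1 on malformed input. The caller must supply room for len units. */
static long to_units32(const uint8_t *p, size_t len, uint32_t *units)
{
  long n = 0;
  while (len > 0)
    {
    uint32_t c, topbit = 0;
    size_t used;
    if (*p == 0xff && len > 1)
      {
      topbit = 0x80000000u;
      p++;
      len--;
      }
    used = decode_rfc2279(p, len, &c);
    if (used == 0) return -1;
    p += used;
    len -= used;
    units[n++] = c | topbit;
    }
  return n;
}
```

In this sketch the bytes 0xc4 0x80 yield the single unit 0x100, the six-byte sequence 0xfd 0xbf 0xbf 0xbf 0xbf 0xbf yields 0x7fffffff, and 0xff 0x41 yields 0x80000041. A lone trailing 0xff is reported as malformed.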
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
+.TH PCRE2TEST 1 "02 August 2016" "PCRE 10.23"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@ -29,7 +29,7 @@ subject is processed, and what output is produced.
 .P
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original \fBpcretest\fP program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
@ -47,32 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 \fBpcre2test\fP program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
 .P
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, \fBpcre_compile()\fP. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
 .
 .
+.\" HTML <a name="inputencoding"></a>
 .SH "INPUT ENCODING"
 .rs
 .sp
 Input to \fBpcre2test\fP is processed line by line, either by calling the C
-library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
 .P
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in \fBpcre2test\fP input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using C's string functions, so must not
+contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. Subject lines are processed for backslash
+escapes, which makes it possible to include any data value in strings that are
+passed to the library for matching. For patterns, there is a facility for
+specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+.
+.
+.SS "Input for the 16-bit and 32-bit libraries"
+.rs
+.sp
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the \fButf\fP modifier (see
+.\" HTML <a href="#optionmodifiers">
+.\" </a>
+"Setting compilation options"
+.\"
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate. 
+.P
+For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
+used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+.P
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
+values can be handled by the 32-bit library. When testing this library in
+non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
+byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
+character's value. This is the only way of passing such code points in a
+pattern string. For subject strings, using an escape sequence is preferable.
 .
 .
 .SH "COMMAND LINE OPTIONS"
@ -515,7 +546,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
 non-printing characters in output strings to be printed using the \ex{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and 
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
 .
 .
 .\" HTML <a name="controlmodifiers"></a>
@ -547,6 +580,7 @@ about the pattern:
      pushcopy                  push a copy onto the stack
      stackguard=<number>       test the stackguard feature
      tables=[0|1|2]            select internal tables
+      utf8_input                treat input as UTF-8 
 .sp
 The effects of these modifiers are described in the following sections.
 .
@ -642,7 +676,8 @@ nine characters, only two of which are specified in hexadecimal:
  /ab "literal" 32/hex
 .sp
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
+mutually exclusive.
 .P
 By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
 \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
@ -650,6 +685,22 @@ patterns specified with the \fBhex\fP modifier, the actual length of the
 pattern is passed.
 .
 .
+.SS "Specifying wide characters in 16-bit and 32-bit modes"
+.rs
+.sp
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and 
+translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing 
+the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
+can be used. It is mutually exclusive with \fButf\fP. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+.\" HTML <a href="#inputencoding">
+.\" </a>
+"Input encoding"
+.\"
+above.
+.
+.
 .SS "Generating long repetitive patterns"
 .rs
 .sp
@ -665,7 +716,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
 example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
 by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
+mutually exclusive.
 .P
 If part of an expanded pattern looks like an expansion, but is really part of
 the actual pattern, unwanted expansion can be avoided by giving two values in
@ -1682,6 +1734,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 Copyright (c) 1997-2016 University of Cambridge.
 .fi
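
For the 16-bit library, the text above says that utf8_input in non-UTF mode faults any value greater than 0xffff, whereas UTF mode accepts values up to 0x10ffff and encodes the upper range as surrogate pairs. The sketch below shows that decision for a single, already decoded character. The function name put16() is hypothetical. The return codes -2 and -3 follow the ones visible in the pcre2test.c hunk of this commit, but this is an illustration rather than the project's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store code point c as 16-bit code units.
   utf != 0 selects UTF mode (surrogate pairs allowed, limit 0x10ffff);
   utf == 0 models non-UTF mode with utf8_input (limit 0xffff).
   Returns the number of units written (1 or 2), -2 if c exceeds the UTF
   limit, or -3 if c does not fit in one unit in non-UTF mode.
   Like the original, it does not object to surrogate values. */
static int put16(uint32_t c, int utf, uint16_t *out)
{
  assert(out != NULL);
  if (!utf && c > 0xffff) return -3;
  if (c > 0x10ffff) return -2;
  if (c < 0x10000)
    {
    out[0] = (uint16_t)c;
    return 1;
    }
  c -= 0x10000;
  out[0] = (uint16_t)(0xD800 | (c >> 10));    /* high surrogate */
  out[1] = (uint16_t)(0xDC00 | (c & 0x3ff));  /* low surrogate */
  return 2;
}
```

With these definitions, put16(0x1F600, 1, buf) writes the pair 0xD83D 0xDE00, while put16(0x10000, 0, buf) returns -3 because a non-UTF 16-bit string has no way to represent the value in one code unit.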
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@ -26,7 +26,7 @@ SYNOPSIS

       As the original fairly simple PCRE library evolved,  it  acquired  many
       different  features,  and  as  a  result, the original pcretest program
-       ended up with a lot of options in a messy, arcane syntax,  for  testing
+       ended up with a lot of options in a messy, arcane  syntax  for  testing
       all the features. The move to the new PCRE2 API provided an opportunity
       to re-implement the test program as pcre2test, with a cleaner  modifier
       syntax.  Nevertheless,  there are still many obscure modifiers, some of
@ -45,7 +45,7 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
       installed. The pcre2test program can be used to test all the libraries.
       However, its own input and output are  always  in  8-bit  format.  When
       testing  the  16-bit  or 32-bit libraries, patterns and subject strings
-       are converted to 16- or  32-bit  format  before  being  passed  to  the
+       are converted to 16-bit or 32-bit format before  being  passed  to  the
       library  functions.  Results are converted back to 8-bit code units for
       output.

@ -58,19 +58,46 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
 INPUT ENCODING

       Input  to  pcre2test is processed line by line, either by calling the C
-       library's fgets() function, or via the libreadline library (see below).
+       library's fgets() function, or via the  libreadline  library.  In  some
+       Windows  environments  character 26 (hex 1A) causes an immediate end of
+       file, and no further data is read, so this character should be  avoided
+       unless you really want that action.
+
       The  input  is  processed using using C's string functions, so must not
       contain binary zeroes, even though in Unix-like  environments,  fgets()
-       treats any bytes other than newline as data characters. In some Windows
-       environments character 26 (hex 1A) causes an immediate end of file, and
-       no further data is read.
+       treats  any  bytes  other  than newline as data characters. An error is
+       generated if a binary zero is encountered. Subject lines are  processed
+       for  backslash  escapes,  which  makes  it possible to include any data
+       value in strings that are passed to the library for matching. For  pat-
+       terns,  there  is  a  facility  for specifying some or all of the 8-bit
+       input characters as hexadecimal  pairs,  which  makes  it  possible  to
+       include binary zeros.

-       For  maximum portability, therefore, it is safest to avoid non-printing
-       characters in pcre2test input files. There is a facility for specifying
-       some or all of a pattern's characters as hexadecimal pairs, thus making
-       it possible to include binary zeroes in a pattern for testing purposes.
-       Subject  lines are processed for backslash escapes, which makes it pos-
-       sible to include any data value.
+   Input for the 16-bit and 32-bit libraries
+
+       When testing the 16-bit or 32-bit libraries, there is a need to be able
+       to generate character code points greater than 255 in the strings  that
+       are  passed to the library. For subject lines, backslash escapes can be
+       used. In addition, when the  utf  modifier  (see  "Setting  compilation
+       options" below) is set, the pattern and any following subject lines are
+       interpreted as UTF-8 strings and translated  to  UTF-16  or  UTF-32  as
+       appropriate.
+
+       For  non-UTF testing of wide characters, the utf8_input modifier can be
+       used. This is mutually exclusive with  utf,  and  is  allowed  only  in
+       16-bit  or  32-bit  mode.  It  causes the pattern and following subject
+       lines to be treated as UTF-8 according to the original definition  (RFC
+       2279), which allows for character values up to 0x7fffffff. Each charac-
+       ter is placed in one 16-bit or 32-bit code unit (in  the  16-bit  case,
+       values greater than 0xffff cause an error to occur).
+
+       UTF-8  is  not  capable of encoding values greater than 0x7fffffff, but
+       such values can be handled by the 32-bit  library.  When  testing  this
+       library  in  non-UTF mode with utf8_input set, if any character is pre-
+       ceded by the byte 0xff (which is an illegal byte in  UTF-8)  0x80000000
+       is added to the character's value. This is the only way of passing such
+       code points in a pattern string. For subject strings, using  an  escape
+       sequence is preferable.


 COMMAND LINE OPTIONS
@ -500,7 +527,9 @@ PATTERN MODIFIERS
       As well as turning on the PCRE2_UTF option, the utf modifier causes all
       non-printing  characters  in  output  strings  to  be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in  hex
-       without the curly brackets.
+       without  the  curly brackets. Setting utf in 16-bit or 32-bit mode also
+       causes pattern and subject  strings  to  be  translated  to  UTF-16  or
+       UTF-32, respectively, before being passed to library functions.

   Setting compilation controls

@ -529,6 +558,7 @@ PATTERN MODIFIERS
             pushcopy                  push a copy onto the stack
             stackguard=<number>       test the stackguard feature
             tables=[0|1|2]            select internal tables
+             utf8_input                treat input as UTF-8

       The effects of these modifiers are described in the following sections.

@ -619,13 +649,23 @@ PATTERN MODIFIERS
         /ab "literal" 32/hex

       Either  single or double quotes may be used. There is no way of includ-
-       ing the delimiter within a substring.
+       ing the delimiter within a substring. The hex and expand modifiers  are
+       mutually exclusive.

       By  default,  pcre2test  passes  patterns as zero-terminated strings to
       pcre2_compile(), giving the length as  PCRE2_ZERO_TERMINATED.  However,
       for  patterns specified with the hex modifier, the actual length of the
       pattern is passed.

+   Specifying wide characters in 16-bit and 32-bit modes
+
+       In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
+       and  translated  to  UTF-16 or UTF-32 when the utf modifier is set. For
+       testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
+       modifier  can  be  used. It is mutually exclusive with utf. Input lines
+       are interpreted as UTF-8 as a means of specifying wide characters. More
+       details are given in "Input encoding" above.
+
   Generating long repetitive patterns

       Some  tests use long patterns that are very repetitive. Instead of cre-
@ -640,7 +680,8 @@ PATTERN MODIFIERS
       ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
       cannot  be  nested. An initial "\[" sequence is recognized only if "]{"
       followed by decimal digits and "}" is found later in  the  pattern.  If
-       not, the characters remain in the pattern unaltered.
+       not, the characters remain in the pattern unaltered. The expand and hex
+       modifiers are mutually exclusive.

       If part of an expanded pattern looks like an expansion, but  is  really
       part of the actual pattern, unwanted expansion can be avoided by giving
@ -1548,5 +1589,5 @@ AUTHOR

 REVISION

-       Last updated: 06 July 2016
+       Last updated: 02 August 2016
       Copyright (c) 1997-2016 University of Cambridge.
--- a/src/pcre2.h
+++ b/src/pcre2.h
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
 /* The current PCRE version information. */

 #define PCRE2_MAJOR          10
-#define PCRE2_MINOR          22
-#define PCRE2_PRERELEASE     
-#define PCRE2_DATE           2016-07-29
+#define PCRE2_MINOR          23
+#define PCRE2_PRERELEASE     -RC1
+#define PCRE2_DATE           2016-08-01

 /* When an application links to a PCRE DLL in Windows, the symbols that are
 imported have to be identified as such. When building PCRE2, the appropriate
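
The version macros changed above are bare tokens, not string literals, so code that wants a printable version has to stringify them. The following sketch shows one common way to do that with two-level macro expansion. The DEMO_ names, STR, XSTR, and demo_version() are hypothetical and chosen to avoid clashing with pcre2.h. Note that this simple form assumes a non-empty prerelease value; handling the empty value used for real releases needs extra care and is not shown.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the values set in this commit (hypothetical names). */
#define DEMO_MAJOR       10
#define DEMO_MINOR       23
#define DEMO_PRERELEASE  -RC1
#define DEMO_DATE        2016-08-01

/* Two levels are needed so that the argument is macro-expanded before
   the # operator turns it into a string literal. */
#define STR(x)  #x
#define XSTR(x) STR(x)

/* Build "major.minor<prerelease> date" from the macros above using
   adjacent string literal concatenation. */
static const char *demo_version(void)
{
  static const char v[] =
    XSTR(DEMO_MAJOR) "." XSTR(DEMO_MINOR) XSTR(DEMO_PRERELEASE)
    " " XSTR(DEMO_DATE);
  assert(strlen(v) > 0);
  return v;
}
```

Calling demo_version() here returns "10.23-RC1 2016-08-01".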
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@ -430,8 +430,8 @@ so many of them that they are split into two fields. */
 #define CTL_PUSH                         0x01000000u
 #define CTL_PUSHCOPY                     0x02000000u
 #define CTL_STARTCHAR                    0x04000000u
-#define CTL_ZERO_TERMINATE               0x08000000u
-/* Spare                                 0x10000000u  */
+#define CTL_UTF8_INPUT                   0x08000000u
+#define CTL_ZERO_TERMINATE               0x10000000u
 /* Spare                                 0x20000000u  */
 #define CTL_NL_SET                       0x40000000u  /* Informational */
 #define CTL_BSR_SET                      0x80000000u  /* Informational */
@ -460,7 +460,8 @@ data line. */
                    CTL_GLOBAL|\
                    CTL_MARK|\
                    CTL_MEMORY|\
-                    CTL_STARTCHAR)
+                    CTL_STARTCHAR|\
+                    CTL_UTF8_INPUT)

 #define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
                    CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
@ -621,6 +622,7 @@ static modstruct modlist[] = {
  { "ungreedy",                   MOD_PAT,  MOD_OPT, PCRE2_UNGREEDY,             PO(options) },
  { "use_offset_limit",           MOD_PAT,  MOD_OPT, PCRE2_USE_OFFSET_LIMIT,     PO(options) },
  { "utf",                        MOD_PATP, MOD_OPT, PCRE2_UTF,                  PO(options) },
+  { "utf8_input",                 MOD_PAT,  MOD_CTL, CTL_UTF8_INPUT,             PO(control) },
  { "zero_terminate",             MOD_DAT,  MOD_CTL, CTL_ZERO_TERMINATE,         DO(control) }
 };

@ -673,6 +675,7 @@ static uint32_t exclusive_pat_controls[] = {

 /* Data controls that are mutually exclusive. At present these are all in the
 first control word. */
+
 static uint32_t exclusive_dat_controls[] = {
  CTL_ALLUSEDTEXT | CTL_STARTCHAR,
  CTL_FINDLIMITS  | CTL_NULLCONTEXT };
@ -2715,16 +2718,22 @@ return i + 1;

 #ifdef SUPPORT_PCRE2_16
 /*************************************************
-*          Convert pattern to 16-bit             *
+*           Convert string to 16-bit             *
 *************************************************/

-/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
-all the input bytes are ASCII, the space needed for a 16-bit string is exactly
-double the 8-bit size. Otherwise, the size needed for a 16-bit string is no
-more than double, because up to 0xffff uses no more than 3 bytes in UTF-8 but
-possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes in
-UTF-16. The result is always left in pbuffer16. Impose a minimum size to save
-repeated re-sizing.
+/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
+the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
+code values from 0 to 0x7fffffff. However, values greater than the later UTF
+limit of 0x10ffff cause an error. In non-UTF mode the input is interpreted as
+UTF-8 if the utf8_input modifier is set, but an error is generated for values
+greater than 0xffff.
+
+If all the input bytes are ASCII, the space needed for a 16-bit string is
+exactly double the 8-bit size. Otherwise, the size needed for a 16-bit string
+is no more than double, because up to 0xffff uses no more than 3 bytes in UTF-8
+but possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes
+in UTF-16. The result is always left in pbuffer16. Impose a minimum size to
+save repeated re-sizing.

 Note that this function does not object to surrogate values. This is
 deliberate; it makes it possible to construct UTF-16 strings that are invalid,
@ -2732,7 +2741,7 @@ for the purpose of testing that they are correctly faulted.

 Arguments:
  p          points to a byte string
-  utf        non-zero if converting to UTF-16
+  utf        true in UTF mode
  lenptr     points to number of bytes in the string (excluding trailing zero)

 Returns:     0 on success, with the length updated to the number of 16-bit
@ -2763,7 +2772,7 @@ if (pbuffer16_size < 2*len + 2)
  }

 pp = pbuffer16;
-if (!utf)
+if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
  {
  for (; len > 0; len--) *pp++ = *p++;
  }
@ -2772,12 +2781,12 @@ else while (len > 0)
  uint32_t c;
  int chlen = utf82ord(p, &c);
  if (chlen <= 0) return -1;
+  if (!utf && c > 0xffff) return -3;
  if (c > 0x10ffff) return -2;
  p += chlen;
  len -= chlen;
  if (c < 0x10000) *pp++ = c; else
    {
-    if (!utf) return -3;
    c -= 0x10000;
    *pp++ = 0xD800 | (c >> 10);
    *pp++ = 0xDC00 | (c & 0x3ff);
@ -2794,15 +2803,25 @@ return 0;

 #ifdef SUPPORT_PCRE2_32
 /*************************************************
-*          Convert pattern to 32-bit             *
+*           Convert string to 32-bit             *
 *************************************************/

-/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
-all the input bytes are ASCII, the space needed for a 32-bit string is exactly
-four times the 8-bit size. Otherwise, the size needed for a 32-bit string is no
-more than four times, because the number of characters must be less than the
-number of bytes. The result is always left in pbuffer32. Impose a minimum size
-to save repeated re-sizing.
+/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
+the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
+code values from 0 to 0x7fffffff. However, values greater than the later UTF 
+limit of 0x10ffff cause an error.
+
+In non-UTF mode the input is interpreted as UTF-8 if the utf8_input modifier
+is set, and no limit is imposed. There is special interpretation of the 0xff
+byte (which is illegal in UTF-8) in this case: it causes the top bit of the
+next character to be set. This provides a way of generating 32-bit characters
+greater than 0x7fffffff.
+
+If all the input bytes are ASCII, the space needed for a 32-bit string is
+exactly four times the 8-bit size. Otherwise, the size needed for a 32-bit
+string is no more than four times, because the number of characters must be
+less than the number of bytes. The result is always left in pbuffer32. Impose a
+minimum size to save repeated re-sizing.

 Note that this function does not object to surrogate values. This is
 deliberate; it makes it possible to construct UTF-32 strings that are invalid,
@ -2810,7 +2829,7 @@ for the purpose of testing that they are correctly faulted.

 Arguments:
  p          points to a byte string
-  utf        true if UTF-8 (to be converted to UTF-32)
+  utf        true in UTF mode
  lenptr     points to number of bytes in the string (excluding trailing zero)

 Returns:     0 on success, with the length updated to the number of 32-bit
@ -2840,19 +2859,29 @@ if (pbuffer32_size < 4*len + 4)
  }

 pp = pbuffer32;
-if (!utf)
+
+if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
  {
  for (; len > 0; len--) *pp++ = *p++;
  }
+
 else while (len > 0)
  {
+  int chlen; 
  uint32_t c;
-  int chlen = utf82ord(p, &c);
+  uint32_t topbit = 0;
+  if (!utf && *p == 0xff && len > 1)
+    {
+    topbit = 0x80000000u;
+    p++;
+    len--;
+    }     
+  chlen = utf82ord(p, &c);
  if (chlen <= 0) return -1;
  if (utf && c > 0x10ffff) return -2;
  p += chlen;
  len -= chlen;
-  *pp++ = c;
+  *pp++ = c | topbit;
  }

 *pp = 0;
@ -3627,7 +3656,7 @@ Returns:      nothing
 static void
 show_controls(uint32_t controls, uint32_t controls2, const char *before)
 {
-fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
  before,
  ((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
  ((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
@ -3662,6 +3691,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
  ((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
  ((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
  ((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
+  ((controls & CTL_UTF8_INPUT) != 0)? " utf8_input" : "",
  ((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
 }

@ -3759,13 +3789,13 @@ warning we must initialize cblock_size. */

 cblock_size = 0;
 #ifdef SUPPORT_PCRE2_8
-if (test_mode == 8) cblock_size = sizeof(pcre2_real_code_8);
+if (test_mode == PCRE8_MODE) cblock_size = sizeof(pcre2_real_code_8);
 #endif
 #ifdef SUPPORT_PCRE2_16
-if (test_mode == 16) cblock_size = sizeof(pcre2_real_code_16);
+if (test_mode == PCRE16_MODE) cblock_size = sizeof(pcre2_real_code_16);
 #endif
 #ifdef SUPPORT_PCRE2_32
-if (test_mode == 32) cblock_size = sizeof(pcre2_real_code_32);
+if (test_mode == PCRE32_MODE) cblock_size = sizeof(pcre2_real_code_32);
 #endif

 (void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
@ -4507,6 +4537,23 @@ patlen = p - buffer - 2;
 if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
 utf = (pat_patctl.options & PCRE2_UTF) != 0;

+/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually 
+exclusive with the utf modifier. */
+
+if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
+  {
+  if (test_mode == PCRE8_MODE)
+    {
+    fprintf(outfile, "** The utf8_input modifier is not allowed in 8-bit mode\n");
+    return PR_SKIP;
+    }
+  if (utf)
+    {
+    fprintf(outfile, "** The utf and utf8_input modifiers are mutually exclusive\n");
+    return PR_SKIP; 
+    }   
+  }
+
 /* Check for mutually exclusive modifiers. At present, these are all in the
 first control word. */

@ -4738,7 +4785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
  const char *msg = "** Ignored with POSIX interface:";
 #endif

-  if (test_mode != 8)
+  if (test_mode != PCRE8_MODE)
    {
    fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
    return PR_SKIP;
@ -5622,7 +5669,9 @@ if (dbuffer == NULL || needlen >= dbuffer_size)
 SETCASTPTR(q, dbuffer);  /* Sets q8, q16, or q32, as appropriate. */

 /* Scan the data line, interpreting data escapes, and put the result into a
-buffer of the appropriate width. In UTF mode, input can be UTF-8. */
+buffer of the appropriate width. In UTF mode, input is always UTF-8; otherwise,
+in 16- and 32-bit modes, it can be forced to UTF-8 by the utf8_input modifier.
+*/

 while ((c = *p++) != 0)
  {
@ -5691,11 +5740,20 @@ while ((c = *p++) != 0)
    continue;
    }

-  /* Handle a non-escaped character */
+  /* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input 
+  set, do the fudge for setting the top bit. */

  if (c != '\\')
    {
-    if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
+    uint32_t topbit = 0;
+    if (test_mode == PCRE32_MODE && c == 0xff && *p != 0) 
+      {
+      topbit = 0x80000000;
+      c = *p++;
+      }  
+    if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) && 
+      HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
+    c |= topbit;
    }

  /* Handle backslash escapes */
--- a/testdata/testinput11
+++ b/testdata/testinput11
@ -353,4 +353,19 @@

 /(*THEN:\[A]{65501})/expand

+# We can use pcre2test's utf8_input modifier to create wide pattern characters,
+# even though this test is run when UTF is not supported.
+
+/abý¿¿¿¿¿z/utf8_input
+    abý¿¿¿¿¿z
+    ab\x{7fffffff}z
+
+/abÿý¿¿¿¿¿z/utf8_input
+    abÿý¿¿¿¿¿z
+    ab\x{ffffffff}z 
+
+/abÿAz/utf8_input
+    abÿAz
+    ab\x{80000041}z 
+
 # End of testinput11
--- a/testdata/testinput12
+++ b/testdata/testinput12
@ -343,4 +343,8 @@
 /./utf
    \x{110000}

+/(*UTF)abý¿¿¿¿¿z/B
+
+/abý¿¿¿¿¿z/utf
+
 # End of testinput12
--- a/testdata/testoutput11-16
+++ b/testdata/testoutput11-16
@ -643,4 +643,22 @@ Subject length lower bound = 1

 /(*THEN:\[A]{65501})/expand

+# We can use pcre2test's utf8_input modifier to create wide pattern characters,
+# even though this test is run when UTF is not supported.
+
+/abý¿¿¿¿¿z/utf8_input
+** Failed: character value greater than 0xffff cannot be converted to 16-bit in non-UTF mode
+    abý¿¿¿¿¿z
+    ab\x{7fffffff}z
+
+/abÿý¿¿¿¿¿z/utf8_input
+** Failed: invalid UTF-8 string cannot be converted to 16-bit string
+    abÿý¿¿¿¿¿z
+    ab\x{ffffffff}z 
+
+/abÿAz/utf8_input
+** Failed: invalid UTF-8 string cannot be converted to 16-bit string
+    abÿAz
+    ab\x{80000041}z 
+
 # End of testinput11
--- a/testdata/testoutput11-32
+++ b/testdata/testoutput11-32
@ -646,4 +646,25 @@ Subject length lower bound = 1

 /(*THEN:\[A]{65501})/expand

+# We can use pcre2test's utf8_input modifier to create wide pattern characters,
+# even though this test is run when UTF is not supported.
+
+/abý¿¿¿¿¿z/utf8_input
+    abý¿¿¿¿¿z
+ 0: ab\x{7fffffff}z
+    ab\x{7fffffff}z
+ 0: ab\x{7fffffff}z
+
+/abÿý¿¿¿¿¿z/utf8_input
+    abÿý¿¿¿¿¿z
+ 0: ab\x{ffffffff}z
+    ab\x{ffffffff}z 
+ 0: ab\x{ffffffff}z
+
+/abÿAz/utf8_input
+    abÿAz
+ 0: ab\x{80000041}z
+    ab\x{80000041}z 
+ 0: ab\x{80000041}z
+
 # End of testinput11
--- a/testdata/testoutput12-16
+++ b/testdata/testoutput12-16
@ -1367,4 +1367,15 @@ Subject length lower bound = 2
    \x{110000}
 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16

+/(*UTF)abý¿¿¿¿¿z/B
+------------------------------------------------------------------
+        Bra
+        ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
+        Ket
+        End
+------------------------------------------------------------------
+
+/abý¿¿¿¿¿z/utf
+** Failed: character value greater than 0x10ffff cannot be converted to UTF
+
 # End of testinput12
--- a/testdata/testoutput12-32
+++ b/testdata/testoutput12-32
@ -1361,4 +1361,15 @@ Subject length lower bound = 2
    \x{110000}
 Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0

+/(*UTF)abý¿¿¿¿¿z/B
+------------------------------------------------------------------
+        Bra
+        ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
+        Ket
+        End
+------------------------------------------------------------------
+
+/abý¿¿¿¿¿z/utf
+** Failed: character value greater than 0x10ffff cannot be converted to UTF
+
 # End of testinput12
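
The top-bit convention described in the "Convert string to 32-bit" comment above can be illustrated with a small standalone sketch. This is not the code from pcre2test.c: the names decode_rfc2279 and bytes_to_wide32 are hypothetical stand-ins for utf82ord and the 32-bit conversion routine, the caller is assumed to supply a large enough output buffer, and the sketch models only the non-UTF utf8_input case (no 0x10ffff limit, and overlong sequences are not rejected).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence using the original RFC 2279
   layout (1 to 6 bytes, values 0 to 0x7fffffff). Returns the number of bytes
   consumed, or 0 if the sequence is malformed or truncated. */
static size_t decode_rfc2279(const uint8_t *p, size_t len, uint32_t *out)
{
  /* Payload bits carried by the lead byte, indexed by continuation count. */
  static const uint8_t lead_mask[] = { 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 };
  size_t extra, i;
  uint32_t c;

  if (len == 0) return 0;
  c = p[0];
  if (c < 0x80) { *out = c; return 1; }
  if (c < 0xc0 || c >= 0xfe) return 0;  /* stray continuation, 0xfe or 0xff */

  if (c < 0xe0) extra = 1;
  else if (c < 0xf0) extra = 2;
  else if (c < 0xf8) extra = 3;
  else if (c < 0xfc) extra = 4;
  else extra = 5;
  if (extra >= len) return 0;           /* sequence runs past the input */

  c &= lead_mask[extra];
  for (i = 1; i <= extra; i++)
    {
    if ((p[i] & 0xc0) != 0x80) return 0;
    c = (c << 6) | (p[i] & 0x3f);
    }
  *out = c;
  return extra + 1;
}

/* Hypothetical helper: convert a byte string to 32-bit code units, treating a
   0xff byte as a prefix that sets the top bit of the following character.
   dst must have room for at least len units. Returns the number of units
   written, or -1 if the input is not valid under these rules. */
static long bytes_to_wide32(const uint8_t *src, size_t len, uint32_t *dst)
{
  long n = 0;
  assert(src != NULL && dst != NULL);
  while (len > 0)
    {
    uint32_t c;
    uint32_t topbit = 0;
    size_t used;
    if (*src == 0xff && len > 1)        /* prefix applies only if a byte follows */
      {
      topbit = 0x80000000u;
      src++;
      len--;
      }
    used = decode_rfc2279(src, len, &c);
    if (used == 0) return -1;
    src += used;
    len -= used;
    dst[n++] = c | topbit;
    }
  return n;
}
```

Under these assumptions, the three patterns added to testinput11 map as the 32-bit expected output shows: the bytes FD BF BF BF BF BF yield 0x7fffffff, the same bytes preceded by FF yield 0xffffffff, and FF followed by 'A' yields 0x80000041. A trailing FF with nothing after it is not treated as a prefix and is rejected as invalid.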