Update pcre2test with the /utf8_input option, for generating wide characters in
non-UTF 16-bit and 32-bit modes.
Philip.Hazel 2016-08-03 09:01:02 +00:00
parent 5b6c797a4d
commit 69c9d81e43
14 changed files with 589 additions and 304 deletions


@ -2,6 +2,13 @@ Change Log for PCRE2
--------------------
Version 10.23 xx-xxxxxx-2016
----------------------------
1. Extended pcre2test with the utf8_input modifier so that it is able to
generate all possible 16-bit and 32-bit code unit values in non-UTF modes.
Version 10.22 29-July-2016
--------------------------


@ -9,9 +9,9 @@ dnl The PCRE2_PRERELEASE feature is for identifying release candidates. It might
dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre2_major, [10])
m4_define(pcre2_minor, [22])
m4_define(pcre2_prerelease, [])
m4_define(pcre2_date, [2016-07-29])
m4_define(pcre2_minor, [23])
m4_define(pcre2_prerelease, [-RC1])
m4_define(pcre2_date, [2016-08-01])
# NOTE: The CMakeLists.txt file searches for the above variables in the first
# 50 lines of this file. Please update that if the variables above are moved.


@ -61,7 +61,7 @@ subject is processed, and what output is produced.
<P>
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original <b>pcretest</b> program ended up with a
lot of options in a messy, arcane syntax, for testing all the features. The
lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@ -77,32 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
<b>pcre2test</b> program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
libraries, patterns and subject strings are converted to 16- or 32-bit format
before being passed to the library functions. Results are converted back to
8-bit code units for output.
libraries, patterns and subject strings are converted to 16-bit or 32-bit
format before being passed to the library functions. Results are converted back
to 8-bit code units for output.
</P>
<P>
In the rest of this document, the names of library functions and structures
are given in generic form, for example, <b>pcre_compile()</b>. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
</P>
<a name="inputencoding"></a></P>
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
<P>
Input to <b>pcre2test</b> is processed line by line, either by calling the C
library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
below). The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and no
further data is read.
library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
Windows environments character 26 (hex 1A) causes an immediate end of file, and
no further data is read, so this character should be avoided unless you really
want that action.
</P>
<P>
For maximum portability, therefore, it is safest to avoid non-printing
characters in <b>pcre2test</b> input files. There is a facility for specifying
some or all of a pattern's characters as hexadecimal pairs, thus making it
possible to include binary zeroes in a pattern for testing purposes. Subject
lines are processed for backslash escapes, which makes it possible to include
any data value.
The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. An error is generated
if a binary zero is encountered. Subject lines are processed for backslash
escapes, which makes it possible to include any data value in strings that are
passed to the library for matching. For patterns, there is a facility for
specifying some or all of the 8-bit input characters as hexadecimal pairs,
which makes it possible to include binary zeros.
</P>
<br><b>
Input for the 16-bit and 32-bit libraries
</b><br>
<P>
When testing the 16-bit or 32-bit libraries, there is a need to be able to
generate character code points greater than 255 in the strings that are passed
to the library. For subject lines, backslash escapes can be used. In addition,
when the <b>utf</b> modifier (see
<a href="#optionmodifiers">"Setting compilation options"</a>
below) is set, the pattern and any following subject lines are interpreted as
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
</P>
<P>
For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
or 32-bit mode. It causes the pattern and following subject lines to be treated
as UTF-8 according to the original definition (RFC 2279), which allows for
character values up to 0x7fffffff. Each character is placed in one 16-bit or
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
to occur).
</P>
<P>
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
values can be handled by the 32-bit library. When testing this library in
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
character's value. This is the only way of passing such code points in a
pattern string. For subject strings, using an escape sequence is preferable.
</P>
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
@ -553,7 +582,9 @@ for a description of their effects.
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
non-printing characters in output strings to be printed using the \x{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets.
brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
<a name="controlmodifiers"></a></P>
<br><b>
Setting compilation controls
@ -584,6 +615,7 @@ about the pattern:
pushcopy push a copy onto the stack
stackguard=&#60;number&#62; test the stackguard feature
tables=[0|1|2] select internal tables
utf8_input treat input as UTF-8
</pre>
The effects of these modifiers are described in the following sections.
</P>
@ -684,7 +716,8 @@ nine characters, only two of which are specified in hexadecimal:
/ab "literal" 32/hex
</pre>
Either single or double quotes may be used. There is no way of including
the delimiter within a substring.
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
mutually exclusive.
</P>
<P>
By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
@ -693,6 +726,19 @@ patterns specified with the <b>hex</b> modifier, the actual length of the
pattern is passed.
</P>
<br><b>
Specifying wide characters in 16-bit and 32-bit modes
</b><br>
<P>
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
can be used. It is mutually exclusive with <b>utf</b>. Input lines are
interpreted as UTF-8 as a means of specifying wide characters. More details are
given in
<a href="#inputencoding">"Input encoding"</a>
above.
</P>
<br><b>
Generating long repetitive patterns
</b><br>
<P>
@ -708,7 +754,8 @@ are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
remain in the pattern unaltered.
remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
mutually exclusive.
</P>
<P>
If part of an expanded pattern looks like an expansion, but is really part of
@ -1706,7 +1753,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 06 July 2016
Last updated: 02 August 2016
<br>
Copyright &copy; 1997-2016 University of Cambridge.
<br>
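
The 16-bit rule in the documentation change above (one code unit per character under utf8_input, an error above 0xffff, surrogate pairs only in UTF mode) can be sketched in isolation. This is a minimal illustration that assumes the code point has already been decoded; put_units16 is a hypothetical helper name, and the negative return codes simply mirror the checks visible in this commit's pcre2test.c hunk rather than any documented API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store code point c as 16-bit code units in out[]
   (which must have room for 2 units). Returns the number of units written,
   or a negative code mirroring the checks in the diff:
     -3  value above 0xffff in non-UTF mode (cannot fit in one code unit)
     -2  value above the UTF limit 0x10ffff */
static int put_units16(uint32_t c, int utf, uint16_t *out)
{
  assert(out != NULL);
  if (!utf && c > 0xffff) return -3;
  if (c > 0x10ffff) return -2;
  if (c < 0x10000)
  {
    out[0] = (uint16_t)c;          /* fits in a single code unit */
    return 1;
  }
  c -= 0x10000;                    /* UTF mode only: build a surrogate pair */
  out[0] = (uint16_t)(0xD800 | (c >> 10));
  out[1] = (uint16_t)(0xDC00 | (c & 0x3ff));
  return 2;
}
```

Like the function in the diff, this sketch does not reject surrogate values passed in directly, so invalid UTF-16 can still be constructed for testing.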


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
.TH PCRE2TEST 1 "02 August 2016" "PCRE 10.23"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -29,7 +29,7 @@ subject is processed, and what output is produced.
.P
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original \fBpcretest\fP program ended up with a
lot of options in a messy, arcane syntax, for testing all the features. The
lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@ -47,32 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
\fBpcre2test\fP program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
libraries, patterns and subject strings are converted to 16- or 32-bit format
before being passed to the library functions. Results are converted back to
8-bit code units for output.
libraries, patterns and subject strings are converted to 16-bit or 32-bit
format before being passed to the library functions. Results are converted back
to 8-bit code units for output.
.P
In the rest of this document, the names of library functions and structures
are given in generic form, for example, \fBpcre_compile()\fP. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
.
.
.\" HTML <a name="inputencoding"></a>
.SH "INPUT ENCODING"
.rs
.sp
Input to \fBpcre2test\fP is processed line by line, either by calling the C
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
below). The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and no
further data is read.
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
Windows environments character 26 (hex 1A) causes an immediate end of file, and
no further data is read, so this character should be avoided unless you really
want that action.
.P
For maximum portability, therefore, it is safest to avoid non-printing
characters in \fBpcre2test\fP input files. There is a facility for specifying
some or all of a pattern's characters as hexadecimal pairs, thus making it
possible to include binary zeroes in a pattern for testing purposes. Subject
lines are processed for backslash escapes, which makes it possible to include
any data value.
The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
treats any bytes other than newline as data characters. An error is generated
if a binary zero is encountered. Subject lines are processed for backslash
escapes, which makes it possible to include any data value in strings that are
passed to the library for matching. For patterns, there is a facility for
specifying some or all of the 8-bit input characters as hexadecimal pairs,
which makes it possible to include binary zeros.
.
.
.SS "Input for the 16-bit and 32-bit libraries"
.rs
.sp
When testing the 16-bit or 32-bit libraries, there is a need to be able to
generate character code points greater than 255 in the strings that are passed
to the library. For subject lines, backslash escapes can be used. In addition,
when the \fButf\fP modifier (see
.\" HTML <a href="#optionmodifiers">
.\" </a>
"Setting compilation options"
.\"
below) is set, the pattern and any following subject lines are interpreted as
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
.P
For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
or 32-bit mode. It causes the pattern and following subject lines to be treated
as UTF-8 according to the original definition (RFC 2279), which allows for
character values up to 0x7fffffff. Each character is placed in one 16-bit or
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
to occur).
.P
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
values can be handled by the 32-bit library. When testing this library in
non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
character's value. This is the only way of passing such code points in a
pattern string. For subject strings, using an escape sequence is preferable.
.
.
.SH "COMMAND LINE OPTIONS"
@ -515,7 +546,9 @@ for a description of their effects.
As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
non-printing characters in output strings to be printed using the \ex{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets.
brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
.
.
.\" HTML <a name="controlmodifiers"></a>
@ -547,6 +580,7 @@ about the pattern:
pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature
tables=[0|1|2] select internal tables
utf8_input treat input as UTF-8
.sp
The effects of these modifiers are described in the following sections.
.
@ -642,7 +676,8 @@ nine characters, only two of which are specified in hexadecimal:
/ab "literal" 32/hex
.sp
Either single or double quotes may be used. There is no way of including
the delimiter within a substring.
the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
mutually exclusive.
.P
By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
\fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
@ -650,6 +685,22 @@ patterns specified with the \fBhex\fP modifier, the actual length of the
pattern is passed.
.
.
.SS "Specifying wide characters in 16-bit and 32-bit modes"
.rs
.sp
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
can be used. It is mutually exclusive with \fButf\fP. Input lines are
interpreted as UTF-8 as a means of specifying wide characters. More details are
given in
.\" HTML <a href="#inputencoding">
.\" </a>
"Input encoding"
.\"
above.
.
.
.SS "Generating long repetitive patterns"
.rs
.sp
@ -665,7 +716,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
remain in the pattern unaltered.
remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
mutually exclusive.
.P
If part of an expanded pattern looks like an expansion, but is really part of
the actual pattern, unwanted expansion can be avoided by giving two values in
@ -1682,6 +1734,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 06 July 2016
Last updated: 02 August 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi


@ -26,7 +26,7 @@ SYNOPSIS
As the original fairly simple PCRE library evolved, it acquired many
different features, and as a result, the original pcretest program
ended up with a lot of options in a messy, arcane syntax, for testing
ended up with a lot of options in a messy, arcane syntax for testing
all the features. The move to the new PCRE2 API provided an opportunity
to re-implement the test program as pcre2test, with a cleaner modifier
syntax. Nevertheless, there are still many obscure modifiers, some of
@ -45,7 +45,7 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
installed. The pcre2test program can be used to test all the libraries.
However, its own input and output are always in 8-bit format. When
testing the 16-bit or 32-bit libraries, patterns and subject strings
are converted to 16- or 32-bit format before being passed to the
are converted to 16-bit or 32-bit format before being passed to the
library functions. Results are converted back to 8-bit code units for
output.
@ -58,19 +58,46 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
INPUT ENCODING
Input to pcre2test is processed line by line, either by calling the C
library's fgets() function, or via the libreadline library (see below).
library's fgets() function, or via the libreadline library. In some
Windows environments character 26 (hex 1A) causes an immediate end of
file, and no further data is read, so this character should be avoided
unless you really want that action.
The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, fgets()
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and
no further data is read.
treats any bytes other than newline as data characters. An error is
generated if a binary zero is encountered. Subject lines are processed
for backslash escapes, which makes it possible to include any data
value in strings that are passed to the library for matching. For pat-
terns, there is a facility for specifying some or all of the 8-bit
input characters as hexadecimal pairs, which makes it possible to
include binary zeros.
For maximum portability, therefore, it is safest to avoid non-printing
characters in pcre2test input files. There is a facility for specifying
some or all of a pattern's characters as hexadecimal pairs, thus making
it possible to include binary zeroes in a pattern for testing purposes.
Subject lines are processed for backslash escapes, which makes it pos-
sible to include any data value.
Input for the 16-bit and 32-bit libraries
When testing the 16-bit or 32-bit libraries, there is a need to be able
to generate character code points greater than 255 in the strings that
are passed to the library. For subject lines, backslash escapes can be
used. In addition, when the utf modifier (see "Setting compilation
options" below) is set, the pattern and any following subject lines are
interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
appropriate.
For non-UTF testing of wide characters, the utf8_input modifier can be
used. This is mutually exclusive with utf, and is allowed only in
16-bit or 32-bit mode. It causes the pattern and following subject
lines to be treated as UTF-8 according to the original definition (RFC
2279), which allows for character values up to 0x7fffffff. Each charac-
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
values greater than 0xffff cause an error to occur).
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
such values can be handled by the 32-bit library. When testing this
library in non-UTF mode with utf8_input set, if any character is pre-
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
is added to the character's value. This is the only way of passing such
code points in a pattern string. For subject strings, using an escape
sequence is preferable.
COMMAND LINE OPTIONS
@ -500,7 +527,9 @@ PATTERN MODIFIERS
As well as turning on the PCRE2_UTF option, the utf modifier causes all
non-printing characters in output strings to be printed using the
\x{hh...} notation. Otherwise, those less than 0x100 are output in hex
without the curly brackets.
without the curly brackets. Setting utf in 16-bit or 32-bit mode also
causes pattern and subject strings to be translated to UTF-16 or
UTF-32, respectively, before being passed to library functions.
Setting compilation controls
@ -529,6 +558,7 @@ PATTERN MODIFIERS
pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature
tables=[0|1|2] select internal tables
utf8_input treat input as UTF-8
The effects of these modifiers are described in the following sections.
@ -619,13 +649,23 @@ PATTERN MODIFIERS
/ab "literal" 32/hex
Either single or double quotes may be used. There is no way of includ-
ing the delimiter within a substring.
ing the delimiter within a substring. The hex and expand modifiers are
mutually exclusive.
By default, pcre2test passes patterns as zero-terminated strings to
pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
for patterns specified with the hex modifier, the actual length of the
pattern is passed.
Specifying wide characters in 16-bit and 32-bit modes
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
modifier can be used. It is mutually exclusive with utf. Input lines
are interpreted as UTF-8 as a means of specifying wide characters. More
details are given in "Input encoding" above.
Generating long repetitive patterns
Some tests use long patterns that are very repetitive. Instead of cre-
@ -640,7 +680,8 @@ PATTERN MODIFIERS
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{"
followed by decimal digits and "}" is found later in the pattern. If
not, the characters remain in the pattern unaltered.
not, the characters remain in the pattern unaltered. The expand and hex
modifiers are mutually exclusive.
If part of an expanded pattern looks like an expansion, but is really
part of the actual pattern, unwanted expansion can be avoided by giving
@ -1548,5 +1589,5 @@ AUTHOR
REVISION
Last updated: 06 July 2016
Last updated: 02 August 2016
Copyright (c) 1997-2016 University of Cambridge.


@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */
#define PCRE2_MAJOR 10
#define PCRE2_MINOR 22
#define PCRE2_PRERELEASE
#define PCRE2_DATE 2016-07-29
#define PCRE2_MINOR 23
#define PCRE2_PRERELEASE -RC1
#define PCRE2_DATE 2016-08-01
/* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE2, the appropriate


@ -430,8 +430,8 @@ so many of them that they are split into two fields. */
#define CTL_PUSH 0x01000000u
#define CTL_PUSHCOPY 0x02000000u
#define CTL_STARTCHAR 0x04000000u
#define CTL_ZERO_TERMINATE 0x08000000u
/* Spare 0x10000000u */
#define CTL_UTF8_INPUT 0x08000000u
#define CTL_ZERO_TERMINATE 0x10000000u
/* Spare 0x20000000u */
#define CTL_NL_SET 0x40000000u /* Informational */
#define CTL_BSR_SET 0x80000000u /* Informational */
@ -460,7 +460,8 @@ data line. */
CTL_GLOBAL|\
CTL_MARK|\
CTL_MEMORY|\
CTL_STARTCHAR)
CTL_STARTCHAR|\
CTL_UTF8_INPUT)
#define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
@ -621,6 +622,7 @@ static modstruct modlist[] = {
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
{ "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) },
{ "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) },
{ "utf8_input", MOD_PAT, MOD_CTL, CTL_UTF8_INPUT, PO(control) },
{ "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) }
};
@ -673,6 +675,7 @@ static uint32_t exclusive_pat_controls[] = {
/* Data controls that are mutually exclusive. At present these are all in the
first control word. */
static uint32_t exclusive_dat_controls[] = {
CTL_ALLUSEDTEXT | CTL_STARTCHAR,
CTL_FINDLIMITS | CTL_NULLCONTEXT };
@ -2715,16 +2718,22 @@ return i + 1;
#ifdef SUPPORT_PCRE2_16
/*************************************************
* Convert pattern to 16-bit *
* Convert string to 16-bit *
*************************************************/
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
all the input bytes are ASCII, the space needed for a 16-bit string is exactly
double the 8-bit size. Otherwise, the size needed for a 16-bit string is no
more than double, because up to 0xffff uses no more than 3 bytes in UTF-8 but
possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes in
UTF-16. The result is always left in pbuffer16. Impose a minimum size to save
repeated re-sizing.
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
code values from 0 to 0x7fffffff. However, values greater than the later UTF
limit of 0x10ffff cause an error. In non-UTF mode the input is interpreted as
UTF-8 if the utf8_input modifier is set, but an error is generated for values
greater than 0xffff.
If all the input bytes are ASCII, the space needed for a 16-bit string is
exactly double the 8-bit size. Otherwise, the size needed for a 16-bit string
is no more than double, because up to 0xffff uses no more than 3 bytes in UTF-8
but possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes
in UTF-16. The result is always left in pbuffer16. Impose a minimum size to
save repeated re-sizing.
Note that this function does not object to surrogate values. This is
deliberate; it makes it possible to construct UTF-16 strings that are invalid,
@ -2732,7 +2741,7 @@ for the purpose of testing that they are correctly faulted.
Arguments:
p points to a byte string
utf non-zero if converting to UTF-16
utf true in UTF mode
lenptr points to number of bytes in the string (excluding trailing zero)
Returns: 0 on success, with the length updated to the number of 16-bit
@ -2763,7 +2772,7 @@ if (pbuffer16_size < 2*len + 2)
}
pp = pbuffer16;
if (!utf)
if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
{
for (; len > 0; len--) *pp++ = *p++;
}
@ -2772,12 +2781,12 @@ else while (len > 0)
uint32_t c;
int chlen = utf82ord(p, &c);
if (chlen <= 0) return -1;
if (!utf && c > 0xffff) return -3;
if (c > 0x10ffff) return -2;
p += chlen;
len -= chlen;
if (c < 0x10000) *pp++ = c; else
{
if (!utf) return -3;
c -= 0x10000;
*pp++ = 0xD800 | (c >> 10);
*pp++ = 0xDC00 | (c & 0x3ff);
@ -2794,15 +2803,25 @@ return 0;
#ifdef SUPPORT_PCRE2_32
/*************************************************
* Convert pattern to 32-bit *
* Convert string to 32-bit *
*************************************************/
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
all the input bytes are ASCII, the space needed for a 32-bit string is exactly
four times the 8-bit size. Otherwise, the size needed for a 32-bit string is no
more than four times, because the number of characters must be less than the
number of bytes. The result is always left in pbuffer32. Impose a minimum size
to save repeated re-sizing.
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
code values from 0 to 0x7fffffff. However, values greater than the later UTF
limit of 0x10ffff cause an error.
In non-UTF mode the input is interpreted as UTF-8 if the utf8_input modifier
is set, and no limit is imposed. There is special interpretation of the 0xff
byte (which is illegal in UTF-8) in this case: it causes the top bit of the
next character to be set. This provides a way of generating 32-bit characters
greater than 0x7fffffff.
If all the input bytes are ASCII, the space needed for a 32-bit string is
exactly four times the 8-bit size. Otherwise, the size needed for a 32-bit
string is no more than four times, because the number of characters must be
less than the number of bytes. The result is always left in pbuffer32. Impose a
minimum size to save repeated re-sizing.
Note that this function does not object to surrogate values. This is
deliberate; it makes it possible to construct UTF-32 strings that are invalid,
@ -2810,7 +2829,7 @@ for the purpose of testing that they are correctly faulted.
Arguments:
p points to a byte string
utf true if UTF-8 (to be converted to UTF-32)
utf true in UTF mode
lenptr points to number of bytes in the string (excluding trailing zero)
Returns: 0 on success, with the length updated to the number of 32-bit
@ -2840,19 +2859,29 @@ if (pbuffer32_size < 4*len + 4)
}
pp = pbuffer32;
if (!utf)
if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
{
for (; len > 0; len--) *pp++ = *p++;
}
else while (len > 0)
{
int chlen;
uint32_t c;
int chlen = utf82ord(p, &c);
uint32_t topbit = 0;
if (!utf && *p == 0xff && len > 1)
{
topbit = 0x80000000u;
p++;
len--;
}
chlen = utf82ord(p, &c);
if (chlen <= 0) return -1;
if (utf && c > 0x10ffff) return -2;
p += chlen;
len -= chlen;
*pp++ = c;
*pp++ = c | topbit;
}
*pp = 0;
@ -3627,7 +3656,7 @@ Returns: nothing
static void
show_controls(uint32_t controls, uint32_t controls2, const char *before)
{
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
@ -3662,6 +3691,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
((controls & CTL_UTF8_INPUT) != 0)? " utf8_input" : "",
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
}
@ -3759,13 +3789,13 @@ warning we must initialize cblock_size. */
cblock_size = 0;
#ifdef SUPPORT_PCRE2_8
if (test_mode == 8) cblock_size = sizeof(pcre2_real_code_8);
if (test_mode == PCRE8_MODE) cblock_size = sizeof(pcre2_real_code_8);
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == 16) cblock_size = sizeof(pcre2_real_code_16);
if (test_mode == PCRE16_MODE) cblock_size = sizeof(pcre2_real_code_16);
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == 32) cblock_size = sizeof(pcre2_real_code_32);
if (test_mode == PCRE32_MODE) cblock_size = sizeof(pcre2_real_code_32);
#endif
(void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
@ -4507,6 +4537,23 @@ patlen = p - buffer - 2;
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
utf = (pat_patctl.options & PCRE2_UTF) != 0;
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
exclusive with the utf modifier. */
if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
{
if (test_mode == PCRE8_MODE)
{
fprintf(outfile, "** The utf8_input modifier is not allowed in 8-bit mode\n");
return PR_SKIP;
}
if (utf)
{
fprintf(outfile, "** The utf and utf8_input modifiers are mutually exclusive\n");
return PR_SKIP;
}
}
/* Check for mutually exclusive modifiers. At present, these are all in the
first control word. */
@ -4738,7 +4785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
const char *msg = "** Ignored with POSIX interface:";
#endif
if (test_mode != 8)
if (test_mode != PCRE8_MODE)
{
fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
return PR_SKIP;
@ -5622,7 +5669,9 @@ if (dbuffer == NULL || needlen >= dbuffer_size)
SETCASTPTR(q, dbuffer); /* Sets q8, q16, or q32, as appropriate. */
/* Scan the data line, interpreting data escapes, and put the result into a
buffer of the appropriate width. In UTF mode, input can be UTF-8. */
buffer of the appropriate width. In UTF mode, input is always UTF-8; otherwise,
in 16- and 32-bit modes, it can be forced to UTF-8 by the utf8_input modifier.
*/
while ((c = *p++) != 0)
{
@ -5691,11 +5740,20 @@ while ((c = *p++) != 0)
continue;
}
/* Handle a non-escaped character */
/* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input
set, do the fudge for setting the top bit. */
if (c != '\\')
{
if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
uint32_t topbit = 0;
if (test_mode == PCRE32_MODE && c == 0xff && *p != 0)
{
topbit = 0x80000000;
c = *p++;
}
if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) &&
HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
c |= topbit;
}
/* Handle backslash escapes */
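
The non-escaped character handling above can be illustrated in isolation. The following is a minimal sketch, not the pcre2test code itself: `extra_len` and `decode_wide_char` are hypothetical names standing in for the `HASUTF8EXTRALEN` and `GETUTF8INC` macros, and the decoder assumes the extended (up to 6-byte) UTF-8 scheme with well-formed, NUL-terminated input. It shows how a leading 0xff byte can be used to request that bit 31 be set, which is how values above 0x7fffffff become reachable in 32-bit mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of continuation bytes implied by a lead byte under the extended
   UTF-8 scheme (sequences of up to 6 bytes, values up to 0x7fffffff).
   Returns 0 for bytes that do not start a multi-byte sequence.
   Hypothetical helper; not part of pcre2test. */
static int extra_len(uint8_t c)
{
  if (c < 0xc0) return 0;   /* ASCII or a stray continuation byte */
  if (c < 0xe0) return 1;
  if (c < 0xf0) return 2;
  if (c < 0xf8) return 3;
  if (c < 0xfc) return 4;
  if (c < 0xfe) return 5;
  return 0;                 /* 0xfe and 0xff never start a sequence */
}

/* Decode one character from *pp and advance *pp past it. A 0xff byte that
   is followed by another byte is consumed as a "set bit 31" prefix.
   Hypothetical helper; assumes NUL-terminated input. */
static uint32_t decode_wide_char(const uint8_t **pp)
{
  const uint8_t *p = *pp;
  uint32_t topbit = 0;
  uint32_t c = *p++;

  if (c == 0xff && *p != 0)
  {
    topbit = 0x80000000u;
    c = *p++;
  }

  int n = extra_len((uint8_t)c);
  if (n > 0)
  {
    c &= (1u << (6 - n)) - 1;              /* keep the lead byte's payload bits */
    while (n-- > 0 && (*p & 0xc0) == 0x80) /* stop early on a malformed tail */
      c = (c << 6) | (*p++ & 0x3fu);
  }

  *pp = p;
  return c | topbit;
}
```

With this sketch, the bytes fd bf bf bf bf bf decode to 0x7fffffff, the same bytes preceded by ff decode to 0xffffffff, and ff 41 decodes to 0x80000041, matching the three subject lines added to testinput11.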

testdata/testinput11 vendored
View File

@ -353,4 +353,19 @@
/(*THEN:\[A]{65501})/expand
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
# even though this test is run when UTF is not supported.
/abý¿¿¿¿¿z/utf8_input
abý¿¿¿¿¿z
ab\x{7fffffff}z
/abÿý¿¿¿¿¿z/utf8_input
abÿý¿¿¿¿¿z
ab\x{ffffffff}z
/abÿAz/utf8_input
abÿAz
ab\x{80000041}z
# End of testinput11

View File

@ -343,4 +343,8 @@
/./utf
\x{110000}
/(*UTF)abý¿¿¿¿¿z/B
/abý¿¿¿¿¿z/utf
# End of testinput12

View File

@ -643,4 +643,22 @@ Subject length lower bound = 1
/(*THEN:\[A]{65501})/expand
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
# even though this test is run when UTF is not supported.
/abý¿¿¿¿¿z/utf8_input
** Failed: character value greater than 0xffff cannot be converted to 16-bit in non-UTF mode
abý¿¿¿¿¿z
ab\x{7fffffff}z
/abÿý¿¿¿¿¿z/utf8_input
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
abÿý¿¿¿¿¿z
ab\x{ffffffff}z
/abÿAz/utf8_input
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
abÿAz
ab\x{80000041}z
# End of testinput11
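
The first 16-bit failure above arises because a decoded value such as 0x7fffffff does not fit in one 16-bit code unit when UTF is off. A minimal sketch of that range check follows; `narrow_to_16` and the status codes are hypothetical names chosen for illustration, and the real conversion routine in pcre2test may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch. */
enum { CONV_OK = 0, CONV_TOO_BIG = -1 };

/* Copy already-decoded character values into 16-bit code units for non-UTF
   mode. Returns CONV_OK and sets *outlen on success, or CONV_TOO_BIG at the
   first value greater than 0xffff (no surrogate pairs without UTF). */
static int narrow_to_16(const uint32_t *in, size_t n,
                        uint16_t *out, size_t *outlen)
{
  for (size_t i = 0; i < n; i++)
  {
    if (in[i] > 0xffffu) return CONV_TOO_BIG;
    out[i] = (uint16_t)in[i];
  }
  *outlen = n;
  return CONV_OK;
}
```

In non-UTF 16-bit mode every value up to 0xffff is therefore usable, which is the range the utf8_input modifier is meant to open up; anything larger is rejected rather than truncated.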

View File

@ -646,4 +646,25 @@ Subject length lower bound = 1
/(*THEN:\[A]{65501})/expand
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
# even though this test is run when UTF is not supported.
/abý¿¿¿¿¿z/utf8_input
abý¿¿¿¿¿z
0: ab\x{7fffffff}z
ab\x{7fffffff}z
0: ab\x{7fffffff}z
/abÿý¿¿¿¿¿z/utf8_input
abÿý¿¿¿¿¿z
0: ab\x{ffffffff}z
ab\x{ffffffff}z
0: ab\x{ffffffff}z
/abÿAz/utf8_input
abÿAz
0: ab\x{80000041}z
ab\x{80000041}z
0: ab\x{80000041}z
# End of testinput11

View File

@ -1367,4 +1367,15 @@ Subject length lower bound = 2
\x{110000}
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
/(*UTF)abý¿¿¿¿¿z/B
------------------------------------------------------------------
Bra
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
Ket
End
------------------------------------------------------------------
/abý¿¿¿¿¿z/utf
** Failed: character value greater than 0x10ffff cannot be converted to UTF
# End of testinput12

View File

@ -1361,4 +1361,15 @@ Subject length lower bound = 2
\x{110000}
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
/(*UTF)abý¿¿¿¿¿z/B
------------------------------------------------------------------
Bra
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
Ket
End
------------------------------------------------------------------
/abý¿¿¿¿¿z/utf
** Failed: character value greater than 0x10ffff cannot be converted to UTF
# End of testinput12