Update pcre2test with the /utf8_input option, for generating wide characters in
non-UTF 16-bit and 32-bit modes.
parent 5b6c797a4d
commit 69c9d81e43
|
@@ -2,6 +2,13 @@ Change Log for PCRE2
 --------------------
 
 
+Version 10.23 xx-xxxxxx-2016
+----------------------------
+
+1. Extended pcre2test with the utf8_input modifier so that it is able to
+generate all possible 16-bit and 32-bit code unit values in non-UTF modes.
+
+
 Version 10.22 29-July-2016
 --------------------------
 
|
|
|
@@ -9,9 +9,9 @@ dnl The PCRE2_PRERELEASE feature is for identifying release candidates. It might
 dnl be defined as -RC2, for example. For real releases, it should be empty.
 
 m4_define(pcre2_major, [10])
-m4_define(pcre2_minor, [22])
-m4_define(pcre2_prerelease, [])
-m4_define(pcre2_date, [2016-07-29])
+m4_define(pcre2_minor, [23])
+m4_define(pcre2_prerelease, [-RC1])
+m4_define(pcre2_date, [2016-08-01])
 
 # NOTE: The CMakeLists.txt file searches for the above variables in the first
 # 50 lines of this file. Please update that if the variables above are moved.
|
|
|
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.
 <P>
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original <b>pcretest</b> program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
|
@@ -77,32 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 <b>pcre2test</b> program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
 </P>
 <P>
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, <b>pcre_compile()</b>. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
-</P>
+<a name="inputencoding"></a></P>
 <br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
 <P>
 Input to <b>pcre2test</b> is processed line by line, either by calling the C
-library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
 </P>
 <P>
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in <b>pcre2test</b> input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using using C's string functions, so must not
+contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. Subject lines are processed for backslash
+escapes, which makes it possible to include any data value in strings that are
+passed to the library for matching. For patterns, there is a facility for
+specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+</P>
+<br><b>
+Input for the 16-bit and 32-bit libraries
+</b><br>
+<P>
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the <b>utf</b> modifier (see
+<a href="#optionmodifiers">"Setting compilation options"</a>
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
+</P>
+<P>
+For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
+used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+</P>
+<P>
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
+values can be handled by the 32-bit library. When testing this library in
+non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
+byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
+character's value. This is the only way of passing such code points in a
+pattern string. For subject strings, using an escape sequence is preferable.
 </P>
 <br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
 <P>
|
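The utf8_input behaviour described in the documentation hunk above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the code in pcre2test.c: the names `decode_rfc2279` and `store_unit16` are hypothetical, and the sketch does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not pcre2test's own decoder). Decodes one character
   in the original RFC 2279 form of UTF-8: up to 6 bytes, values up to
   0x7fffffff. Returns the number of bytes consumed, or -1 for a malformed or
   truncated sequence. Overlong encodings are not checked in this sketch. */
static int decode_rfc2279(const uint8_t *p, size_t len, uint32_t *out)
{
  static const uint8_t lead_mask[] = { 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 };
  int extra, i;
  uint32_t c;

  assert(p != NULL && out != NULL);
  if (len == 0) return -1;

  /* The lead byte determines how many continuation bytes follow. */
  if (p[0] < 0x80) extra = 0;
  else if ((p[0] & 0xe0) == 0xc0) extra = 1;
  else if ((p[0] & 0xf0) == 0xe0) extra = 2;
  else if ((p[0] & 0xf8) == 0xf0) extra = 3;
  else if ((p[0] & 0xfc) == 0xf8) extra = 4;
  else if ((p[0] & 0xfe) == 0xfc) extra = 5;
  else return -1;                     /* 0x80-0xbf, 0xfe, or 0xff */

  if ((size_t)extra >= len) return -1;
  c = p[0] & lead_mask[extra];
  for (i = 1; i <= extra; i++)
    {
    if ((p[i] & 0xc0) != 0x80) return -1;
    c = (c << 6) | (uint32_t)(p[i] & 0x3f);
    }
  *out = c;
  return extra + 1;
}

/* In 16-bit non-UTF mode each character must fit in one code unit, so a
   value above 0xffff is an error. Returns 0 on success, -1 on error. */
static int store_unit16(uint32_t c, uint16_t *unit)
{
  assert(unit != NULL);
  if (c > 0xffff) return -1;
  *unit = (uint16_t)c;
  return 0;
}
```

For instance, the two bytes 0xc4 0x80 decode to 0x100, which fits in one 16-bit unit, whereas the four bytes 0xf0 0x90 0x80 0x80 decode to 0x10000, which `store_unit16` refuses.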
@@ -553,7 +582,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
 non-printing characters in output strings to be printed using the \x{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
 <a name="controlmodifiers"></a></P>
 <br><b>
 Setting compilation controls
|
@@ -584,6 +615,7 @@ about the pattern:
 pushcopy push a copy onto the stack
 stackguard=<number> test the stackguard feature
 tables=[0|1|2] select internal tables
+utf8_input treat input as UTF-8
 </pre>
 The effects of these modifiers are described in the following sections.
 </P>
|
@@ -684,7 +716,8 @@ nine characters, only two of which are specified in hexadecimal:
 /ab "literal" 32/hex
 </pre>
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
+mutually exclusive.
 </P>
 <P>
 By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
|
@@ -693,6 +726,19 @@ patterns specified with the <b>hex</b> modifier, the actual length of the
 pattern is passed.
 </P>
 <br><b>
+Specifying wide characters in 16-bit and 32-bit modes
+</b><br>
+<P>
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
+can be used. It is mutually exclusive with <b>utf</b>. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+<a href="#inputencoding">"Input encoding"</a>
+above.
+</P>
+<br><b>
 Generating long repetitive patterns
 </b><br>
 <P>
|
@@ -708,7 +754,8 @@ are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
 example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
 by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
+mutually exclusive.
 </P>
 <P>
 If part of an expanded pattern looks like an expansion, but is really part of
|
@@ -1706,7 +1753,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 <br>
 Copyright © 1997-2016 University of Cambridge.
 <br>
|
|
|
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
+.TH PCRE2TEST 1 "02 August 2016" "PCRE 10.23"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
|
@@ -29,7 +29,7 @@ subject is processed, and what output is produced.
 .P
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original \fBpcretest\fP program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
|
@@ -47,32 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 \fBpcre2test\fP program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
 .P
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, \fBpcre_compile()\fP. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
 .
 .
+.\" HTML <a name="inputencoding"></a>
 .SH "INPUT ENCODING"
 .rs
 .sp
 Input to \fBpcre2test\fP is processed line by line, either by calling the C
-library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
 .P
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in \fBpcre2test\fP input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using using C's string functions, so must not
+contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. Subject lines are processed for backslash
+escapes, which makes it possible to include any data value in strings that are
+passed to the library for matching. For patterns, there is a facility for
+specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+.
+.
+.SS "Input for the 16-bit and 32-bit libraries"
+.rs
+.sp
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the \fButf\fP modifier (see
+.\" HTML <a href="#optionmodifiers">
+.\" </a>
+"Setting compilation options"
+.\"
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
+.P
+For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
+used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+.P
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
+values can be handled by the 32-bit library. When testing this library in
+non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
+byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
+character's value. This is the only way of passing such code points in a
+pattern string. For subject strings, using an escape sequence is preferable.
 .
 .
 .SH "COMMAND LINE OPTIONS"
|
@@ -515,7 +546,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
 non-printing characters in output strings to be printed using the \ex{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
 .
 .
 .\" HTML <a name="controlmodifiers"></a>
|
@@ -547,6 +580,7 @@ about the pattern:
 pushcopy push a copy onto the stack
 stackguard=<number> test the stackguard feature
 tables=[0|1|2] select internal tables
+utf8_input treat input as UTF-8
 .sp
 The effects of these modifiers are described in the following sections.
 .
|
@@ -642,7 +676,8 @@ nine characters, only two of which are specified in hexadecimal:
 /ab "literal" 32/hex
 .sp
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
+mutually exclusive.
 .P
 By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
 \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
|
@@ -650,6 +685,22 @@ patterns specified with the \fBhex\fP modifier, the actual length of the
 pattern is passed.
 .
 .
+.SS "Specifying wide characters in 16-bit and 32-bit modes"
+.rs
+.sp
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
+can be used. It is mutually exclusive with \fButf\fP. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+.\" HTML <a href="#inputencoding">
+.\" </a>
+"Input encoding"
+.\"
+above.
+.
+.
 .SS "Generating long repetitive patterns"
 .rs
 .sp
|
@@ -665,7 +716,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
 example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
 by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
+mutually exclusive.
 .P
 If part of an expanded pattern looks like an expansion, but is really part of
 the actual pattern, unwanted expansion can be avoided by giving two values in
|
@@ -1682,6 +1734,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 Copyright (c) 1997-2016 University of Cambridge.
 .fi
|
|
|
@@ -26,7 +26,7 @@ SYNOPSIS
 
 As the original fairly simple PCRE library evolved, it acquired many
 different features, and as a result, the original pcretest program
-ended up with a lot of options in a messy, arcane syntax, for testing
+ended up with a lot of options in a messy, arcane syntax for testing
 all the features. The move to the new PCRE2 API provided an opportunity
 to re-implement the test program as pcre2test, with a cleaner modifier
 syntax. Nevertheless, there are still many obscure modifiers, some of
|
@@ -45,7 +45,7 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
 installed. The pcre2test program can be used to test all the libraries.
 However, its own input and output are always in 8-bit format. When
 testing the 16-bit or 32-bit libraries, patterns and subject strings
-are converted to 16- or 32-bit format before being passed to the
+are converted to 16-bit or 32-bit format before being passed to the
 library functions. Results are converted back to 8-bit code units for
 output.
 
|
@@ -58,19 +58,46 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
 INPUT ENCODING
 
 Input to pcre2test is processed line by line, either by calling the C
-library's fgets() function, or via the libreadline library (see below).
+library's fgets() function, or via the libreadline library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of
+file, and no further data is read, so this character should be avoided
+unless you really want that action.
+
 The input is processed using using C's string functions, so must not
 contain binary zeroes, even though in Unix-like environments, fgets()
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and
-no further data is read.
+treats any bytes other than newline as data characters. An error is
+generated if a binary zero is encountered. Subject lines are processed
+for backslash escapes, which makes it possible to include any data
+value in strings that are passed to the library for matching. For pat-
+terns, there is a facility for specifying some or all of the 8-bit
+input characters as hexadecimal pairs, which makes it possible to
+include binary zeros.
 
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in pcre2test input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making
-it possible to include binary zeroes in a pattern for testing purposes.
-Subject lines are processed for backslash escapes, which makes it pos-
-sible to include any data value.
+Input for the 16-bit and 32-bit libraries
+
+When testing the 16-bit or 32-bit libraries, there is a need to be able
+to generate character code points greater than 255 in the strings that
+are passed to the library. For subject lines, backslash escapes can be
+used. In addition, when the utf modifier (see "Setting compilation
+options" below) is set, the pattern and any following subject lines are
+interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
+appropriate.
+
+For non-UTF testing of wide characters, the utf8_input modifier can be
+used. This is mutually exclusive with utf, and is allowed only in
+16-bit or 32-bit mode. It causes the pattern and following subject
+lines to be treated as UTF-8 according to the original definition (RFC
+2279), which allows for character values up to 0x7fffffff. Each charac-
+ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
+values greater than 0xffff cause an error to occur).
+
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but
+such values can be handled by the 32-bit library. When testing this
+library in non-UTF mode with utf8_input set, if any character is pre-
+ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
+is added to the character's value. This is the only way of passing such
+code points in a pattern string. For subject strings, using an escape
+sequence is preferable.
 
 
 COMMAND LINE OPTIONS
|
@@ -500,7 +527,9 @@ PATTERN MODIFIERS
 As well as turning on the PCRE2_UTF option, the utf modifier causes all
 non-printing characters in output strings to be printed using the
 \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
-without the curly brackets.
+without the curly brackets. Setting utf in 16-bit or 32-bit mode also
+causes pattern and subject strings to be translated to UTF-16 or
+UTF-32, respectively, before being passed to library functions.
 
 Setting compilation controls
 
|
@@ -529,6 +558,7 @@ PATTERN MODIFIERS
 pushcopy push a copy onto the stack
 stackguard=<number> test the stackguard feature
 tables=[0|1|2] select internal tables
+utf8_input treat input as UTF-8
 
 The effects of these modifiers are described in the following sections.
 
|
@@ -619,13 +649,23 @@ PATTERN MODIFIERS
 /ab "literal" 32/hex
 
 Either single or double quotes may be used. There is no way of includ-
-ing the delimiter within a substring.
+ing the delimiter within a substring. The hex and expand modifiers are
+mutually exclusive.
 
 By default, pcre2test passes patterns as zero-terminated strings to
 pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
 for patterns specified with the hex modifier, the actual length of the
 pattern is passed.
 
+Specifying wide characters in 16-bit and 32-bit modes
+
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
+and translated to UTF-16 or UTF-32 when the utf modifier is set. For
+testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
+modifier can be used. It is mutually exclusive with utf. Input lines
+are interpreted as UTF-8 as a means of specifying wide characters. More
+details are given in "Input encoding" above.
+
 Generating long repetitive patterns
 
 Some tests use long patterns that are very repetitive. Instead of cre-
|
@@ -640,7 +680,8 @@ PATTERN MODIFIERS
 ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\[" sequence is recognized only if "]{"
 followed by decimal digits and "}" is found later in the pattern. If
-not, the characters remain in the pattern unaltered.
+not, the characters remain in the pattern unaltered. The expand and hex
+modifiers are mutually exclusive.
 
 If part of an expanded pattern looks like an expansion, but is really
 part of the actual pattern, unwanted expansion can be avoided by giving
|
@@ -1548,5 +1589,5 @@ AUTHOR
 
 REVISION
 
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 Copyright (c) 1997-2016 University of Cambridge.
|
|
|
@@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
 /* The current PCRE version information. */
 
 #define PCRE2_MAJOR 10
-#define PCRE2_MINOR 22
-#define PCRE2_PRERELEASE
-#define PCRE2_DATE 2016-07-29
+#define PCRE2_MINOR 23
+#define PCRE2_PRERELEASE -RC1
+#define PCRE2_DATE 2016-08-01
 
 /* When an application links to a PCRE DLL in Windows, the symbols that are
 imported have to be identified as such. When building PCRE2, the appropriate
|
|
124
src/pcre2test.c
|
@@ -430,8 +430,8 @@ so many of them that they are split into two fields. */
 #define CTL_PUSH 0x01000000u
 #define CTL_PUSHCOPY 0x02000000u
 #define CTL_STARTCHAR 0x04000000u
-#define CTL_ZERO_TERMINATE 0x08000000u
-/* Spare 0x10000000u */
+#define CTL_UTF8_INPUT 0x08000000u
+#define CTL_ZERO_TERMINATE 0x10000000u
 /* Spare 0x20000000u */
 #define CTL_NL_SET 0x40000000u /* Informational */
 #define CTL_BSR_SET 0x80000000u /* Informational */
|
@@ -460,7 +460,8 @@ data line. */
 CTL_GLOBAL|\
 CTL_MARK|\
 CTL_MEMORY|\
-CTL_STARTCHAR)
+CTL_STARTCHAR|\
+CTL_UTF8_INPUT)
 
 #define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
 CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
|
@@ -621,6 +622,7 @@ static modstruct modlist[] = {
 { "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
 { "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) },
 { "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) },
+{ "utf8_input", MOD_PAT, MOD_CTL, CTL_UTF8_INPUT, PO(control) },
 { "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) }
 };
 
|
@@ -673,6 +675,7 @@ static uint32_t exclusive_pat_controls[] = {
 
 /* Data controls that are mutually exclusive. At present these are all in the
 first control word. */
+
 static uint32_t exclusive_dat_controls[] = {
 CTL_ALLUSEDTEXT | CTL_STARTCHAR,
 CTL_FINDLIMITS | CTL_NULLCONTEXT };
|
@@ -2715,16 +2718,22 @@ return i + 1;
 
 #ifdef SUPPORT_PCRE2_16
 /*************************************************
-* Convert pattern to 16-bit *
+* Convert string to 16-bit *
 *************************************************/
 
-/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
-all the input bytes are ASCII, the space needed for a 16-bit string is exactly
-double the 8-bit size. Otherwise, the size needed for a 16-bit string is no
-more than double, because up to 0xffff uses no more than 3 bytes in UTF-8 but
-possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes in
-UTF-16. The result is always left in pbuffer16. Impose a minimum size to save
-repeated re-sizing.
+/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
+the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
+code values from 0 to 0x7fffffff. However, values greater than the later UTF
+limit of 0x10ffff cause an error. In non-UTF mode the input is interpreted as
+UTF-8 if the utf8_input modifier is set, but an error is generated for values
+greater than 0xffff.
+
+If all the input bytes are ASCII, the space needed for a 16-bit string is
+exactly double the 8-bit size. Otherwise, the size needed for a 16-bit string
+is no more than double, because up to 0xffff uses no more than 3 bytes in UTF-8
+but possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes
+in UTF-16. The result is always left in pbuffer16. Impose a minimum size to
+save repeated re-sizing.
 
 Note that this function does not object to surrogate values. This is
 deliberate; it makes it possible to construct UTF-16 strings that are invalid,
|
@@ -2732,7 +2741,7 @@ for the purpose of testing that they are correctly faulted.
 
 Arguments:
 p points to a byte string
-utf non-zero if converting to UTF-16
+utf true in UTF mode
 lenptr points to number of bytes in the string (excluding trailing zero)
 
 Returns: 0 on success, with the length updated to the number of 16-bit
|
@@ -2763,7 +2772,7 @@ if (pbuffer16_size < 2*len + 2)
   }
 
 pp = pbuffer16;
-if (!utf)
+if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
   {
   for (; len > 0; len--) *pp++ = *p++;
   }
|
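The new condition in the hunk above selects between a plain byte copy and UTF-8 decoding. A minimal sketch of that decision, together with the restrictions stated in the documentation (utf8_input is mutually exclusive with utf and allowed only in 16-bit or 32-bit mode), might look like this. The function names are hypothetical and do not appear in pcre2test.c; only the `CTL_UTF8_INPUT` value is taken from this diff.

```c
#include <assert.h>
#include <stdint.h>

#define CTL_UTF8_INPUT 0x08000000u   /* value from the CTL_ definitions in this diff */

/* Hypothetical helper: non-zero when the 8-bit input has to be decoded as
   UTF-8 rather than copied one byte per code unit. */
static int decode_as_utf8(int utf, uint32_t control)
{
  return utf || (control & CTL_UTF8_INPUT) != 0;
}

/* Hypothetical check for the documented restrictions: utf8_input cannot be
   combined with utf and is accepted only for 16-bit or 32-bit code units. */
static int utf8_input_allowed(int utf, int code_unit_width)
{
  assert(code_unit_width == 8 || code_unit_width == 16 ||
         code_unit_width == 32);
  return !utf && code_unit_width != 8;
}
```

Keeping the test as a single bit in the existing control word means no new field is needed; the price, visible in the first pcre2test.c hunk, is that `CTL_ZERO_TERMINATE` had to move to the next free bit.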
@@ -2772,12 +2781,12 @@ else while (len > 0)
   uint32_t c;
   int chlen = utf82ord(p, &c);
   if (chlen <= 0) return -1;
+  if (!utf && c > 0xffff) return -3;
   if (c > 0x10ffff) return -2;
   p += chlen;
   len -= chlen;
   if (c < 0x10000) *pp++ = c; else
     {
-    if (!utf) return -3;
     c -= 0x10000;
     *pp++ = 0xD800 | (c >> 10);
     *pp++ = 0xDC00 | (c & 0x3ff);
|
@ -2794,15 +2803,25 @@ return 0;
|
|||
|
||||
#ifdef SUPPORT_PCRE2_32
|
||||
/*************************************************
|
||||
* Convert pattern to 32-bit *
|
||||
* Convert string to 32-bit *
|
||||
*************************************************/
|
||||
|
||||
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
|
||||
all the input bytes are ASCII, the space needed for a 32-bit string is exactly
|
||||
four times the 8-bit size. Otherwise, the size needed for a 32-bit string is no
|
||||
more than four times, because the number of characters must be less than the
|
||||
number of bytes. The result is always left in pbuffer32. Impose a minimum size
|
||||
to save repeated re-sizing.
|
||||
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
|
||||
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
|
||||
code values from 0 to 0x7fffffff. However, values greater than the later UTF
|
||||
limit of 0x10ffff cause an error.
|
||||
|
||||
In non-UTF mode the input is interpreted as UTF-8 if the utf8_input modifier
|
||||
is set, and no limit is imposed. There is special interpretation of the 0xff
|
||||
byte (which is illegal in UTF-8) in this case: it causes the top bit of the
|
||||
next character to be set. This provides a way of generating 32-bit characters
|
||||
greater than 0x7fffffff.
|
||||
|
||||
If all the input bytes are ASCII, the space needed for a 32-bit string is
|
||||
exactly four times the 8-bit size. Otherwise, the size needed for a 32-bit
|
||||
string is no more than four times, because the number of characters must be
|
||||
less than the number of bytes. The result is always left in pbuffer32. Impose a
|
||||
minimum size to save repeated re-sizing.
|
||||
|
||||
Note that this function does not object to surrogate values. This is
|
||||
deliberate; it makes it possible to construct UTF-32 strings that are invalid,
@ -2810,7 +2829,7 @@ for the purpose of testing that they are correctly faulted.

Arguments:
p points to a byte string
utf true if UTF-8 (to be converted to UTF-32)
utf true in UTF mode
lenptr points to number of bytes in the string (excluding trailing zero)

Returns: 0 on success, with the length updated to the number of 32-bit
@ -2840,19 +2859,29 @@ if (pbuffer32_size < 4*len + 4)
}

pp = pbuffer32;
if (!utf)

if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
{
for (; len > 0; len--) *pp++ = *p++;
}

else while (len > 0)
{
int chlen;
uint32_t c;
int chlen = utf82ord(p, &c);
uint32_t topbit = 0;
if (!utf && *p == 0xff && len > 1)
{
topbit = 0x80000000u;
p++;
len--;
}
chlen = utf82ord(p, &c);
if (chlen <= 0) return -1;
if (utf && c > 0x10ffff) return -2;
p += chlen;
len -= chlen;
*pp++ = c;
*pp++ = c | topbit;
}

*pp = 0;
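The 32-bit conversion described above can be sketched in a self-contained form. This is a hedged illustration, not the pcre2test implementation: `decode_rfc2279` stands in for the real `utf82ord` helper (whose body is not shown in this diff) and, as a simplification, does not reject overlong encodings. `to32_sketch` is likewise a hypothetical name, and it assumes that utf8_input is in effect whenever `utf` is zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for utf82ord: decode one RFC 2279 style sequence of
   1 to 6 bytes. Returns the number of bytes consumed, or -1 if malformed. */
static int decode_rfc2279(const uint8_t *p, size_t len, uint32_t *out)
{
  static const uint8_t lead_mask[] = { 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 };
  uint32_t c;
  int n, i;

  if (len == 0) return -1;
  if (p[0] < 0x80) n = 1;
  else if ((p[0] & 0xe0) == 0xc0) n = 2;
  else if ((p[0] & 0xf0) == 0xe0) n = 3;
  else if ((p[0] & 0xf8) == 0xf0) n = 4;
  else if ((p[0] & 0xfc) == 0xf8) n = 5;
  else if ((p[0] & 0xfe) == 0xfc) n = 6;
  else return -1;                  /* continuation byte, 0xfe, or 0xff */
  if ((size_t)n > len) return -1;

  c = p[0] & lead_mask[n - 1];
  for (i = 1; i < n; i++)
    {
    if ((p[i] & 0xc0) != 0x80) return -1;
    c = (c << 6) | (p[i] & 0x3f);
    }
  *out = c;
  return n;
}

/* Sketch of the conversion loop. In non-UTF mode a 0xff byte sets the top
   bit of the following character; in UTF mode values above 0x10ffff are
   rejected with -2. The caller must supply room for len output units. */
static int to32_sketch(const uint8_t *p, size_t len, int utf,
  uint32_t *out, size_t *outlen)
{
  size_t n = 0;
  while (len > 0)
    {
    uint32_t c, topbit = 0;
    int chlen;
    if (!utf && *p == 0xff && len > 1)
      {
      topbit = 0x80000000u;
      p++;
      len--;
      }
    chlen = decode_rfc2279(p, len, &c);
    if (chlen <= 0) return -1;
    if (utf && c > 0x10ffff) return -2;
    p += chlen;
    len -= (size_t)chlen;
    out[n++] = c | topbit;
    }
  *outlen = n;
  return 0;
}
```

Under these assumptions, the byte sequence 0xfd followed by five 0xbf bytes decodes to 0x7fffffff (one payload bit from the lead byte plus five groups of six bits). Prefixing it with 0xff gives 0xffffffff, and 0xff followed by 'A' gives 0x80000041. These are the same values the new test cases in this commit expect in 32-bit mode.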
@ -3627,7 +3656,7 @@ Returns: nothing
static void
show_controls(uint32_t controls, uint32_t controls2, const char *before)
{
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
@ -3662,6 +3691,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
((controls & CTL_UTF8_INPUT) != 0)? " utf8_input" : "",
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
}

@ -3759,13 +3789,13 @@ warning we must initialize cblock_size. */

cblock_size = 0;
#ifdef SUPPORT_PCRE2_8
if (test_mode == 8) cblock_size = sizeof(pcre2_real_code_8);
if (test_mode == PCRE8_MODE) cblock_size = sizeof(pcre2_real_code_8);
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == 16) cblock_size = sizeof(pcre2_real_code_16);
if (test_mode == PCRE16_MODE) cblock_size = sizeof(pcre2_real_code_16);
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == 32) cblock_size = sizeof(pcre2_real_code_32);
if (test_mode == PCRE32_MODE) cblock_size = sizeof(pcre2_real_code_32);
#endif

(void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
@ -4507,6 +4537,23 @@ patlen = p - buffer - 2;
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
utf = (pat_patctl.options & PCRE2_UTF) != 0;

/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
exclusive with the utf modifier. */

if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
{
if (test_mode == PCRE8_MODE)
{
fprintf(outfile, "** The utf8_input modifier is not allowed in 8-bit mode\n");
return PR_SKIP;
}
if (utf)
{
fprintf(outfile, "** The utf and utf8_input modifiers are mutually exclusive\n");
return PR_SKIP;
}
}

/* Check for mutually exclusive modifiers. At present, these are all in the
first control word. */

@ -4738,7 +4785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
const char *msg = "** Ignored with POSIX interface:";
#endif

if (test_mode != 8)
if (test_mode != PCRE8_MODE)
{
fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
return PR_SKIP;
@ -5622,7 +5669,9 @@ if (dbuffer == NULL || needlen >= dbuffer_size)
SETCASTPTR(q, dbuffer); /* Sets q8, q16, or q32, as appropriate. */

/* Scan the data line, interpreting data escapes, and put the result into a
buffer of the appropriate width. In UTF mode, input can be UTF-8. */
buffer of the appropriate width. In UTF mode, input is always UTF-8; otherwise,
in 16- and 32-bit modes, it can be forced to UTF-8 by the utf8_input modifier.
*/

while ((c = *p++) != 0)
{
@ -5691,11 +5740,20 @@ while ((c = *p++) != 0)
continue;
}

/* Handle a non-escaped character */
/* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input
set, do the fudge for setting the top bit. */

if (c != '\\')
{
if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
uint32_t topbit = 0;
if (test_mode == PCRE32_MODE && c == 0xff && *p != 0)
{
topbit = 0x80000000;
c = *p++;
}
if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) &&
HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
c |= topbit;
}

/* Handle backslash escapes */

@ -353,4 +353,19 @@

/(*THEN:\[A]{65501})/expand

# We can use pcre2test's utf8_input modifier to create wide pattern characters,
# even though this test is run when UTF is not supported.

/abý¿¿¿¿¿z/utf8_input
abý¿¿¿¿¿z
ab\x{7fffffff}z

/abÿý¿¿¿¿¿z/utf8_input
abÿý¿¿¿¿¿z
ab\x{ffffffff}z

/abÿAz/utf8_input
abÿAz
ab\x{80000041}z

# End of testinput11

@ -343,4 +343,8 @@
/./utf
\x{110000}

/(*UTF)abý¿¿¿¿¿z/B

/abý¿¿¿¿¿z/utf

# End of testinput12

@ -643,4 +643,22 @@ Subject length lower bound = 1

/(*THEN:\[A]{65501})/expand

# We can use pcre2test's utf8_input modifier to create wide pattern characters,
# even though this test is run when UTF is not supported.

/abý¿¿¿¿¿z/utf8_input
** Failed: character value greater than 0xffff cannot be converted to 16-bit in non-UTF mode
abý¿¿¿¿¿z
ab\x{7fffffff}z

/abÿý¿¿¿¿¿z/utf8_input
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
abÿý¿¿¿¿¿z
ab\x{ffffffff}z

/abÿAz/utf8_input
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
abÿAz
ab\x{80000041}z

# End of testinput11

@ -646,4 +646,25 @@ Subject length lower bound = 1

/(*THEN:\[A]{65501})/expand

# We can use pcre2test's utf8_input modifier to create wide pattern characters,
# even though this test is run when UTF is not supported.

/abý¿¿¿¿¿z/utf8_input
abý¿¿¿¿¿z
0: ab\x{7fffffff}z
ab\x{7fffffff}z
0: ab\x{7fffffff}z

/abÿý¿¿¿¿¿z/utf8_input
abÿý¿¿¿¿¿z
0: ab\x{ffffffff}z
ab\x{ffffffff}z
0: ab\x{ffffffff}z

/abÿAz/utf8_input
abÿAz
0: ab\x{80000041}z
ab\x{80000041}z
0: ab\x{80000041}z

# End of testinput11

@ -1367,4 +1367,15 @@ Subject length lower bound = 2
\x{110000}
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16

/(*UTF)abý¿¿¿¿¿z/B
------------------------------------------------------------------
Bra
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
Ket
End
------------------------------------------------------------------

/abý¿¿¿¿¿z/utf
** Failed: character value greater than 0x10ffff cannot be converted to UTF

# End of testinput12

@ -1361,4 +1361,15 @@ Subject length lower bound = 2
\x{110000}
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0

/(*UTF)abý¿¿¿¿¿z/B
------------------------------------------------------------------
Bra
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
Ket
End
------------------------------------------------------------------

/abý¿¿¿¿¿z/utf
** Failed: character value greater than 0x10ffff cannot be converted to UTF

# End of testinput12