Update pcre2test with the /utf8_input option, for generating wide characters in
non-UTF 16-bit and 32-bit modes.
parent 5b6c797a4d
commit 69c9d81e43
|
@@ -2,6 +2,13 @@ Change Log for PCRE2
|
||||||
--------------------
|
--------------------
|
||||||
|
|
||||||
|
|
||||||
|
Version 10.23 xx-xxxxxx-2016
|
||||||
|
----------------------------
|
||||||
|
|
||||||
|
1. Extended pcre2test with the utf8_input modifier so that it is able to
|
||||||
|
generate all possible 16-bit and 32-bit code unit values in non-UTF modes.
|
||||||
|
|
||||||
|
|
||||||
Version 10.22 29-July-2016
|
Version 10.22 29-July-2016
|
||||||
--------------------------
|
--------------------------
|
||||||
|
|
||||||
|
|
|
@@ -9,9 +9,9 @@ dnl The PCRE2_PRERELEASE feature is for identifying release candidates. It might
|
||||||
dnl be defined as -RC2, for example. For real releases, it should be empty.
|
dnl be defined as -RC2, for example. For real releases, it should be empty.
|
||||||
|
|
||||||
m4_define(pcre2_major, [10])
|
m4_define(pcre2_major, [10])
|
||||||
m4_define(pcre2_minor, [22])
|
m4_define(pcre2_minor, [23])
|
||||||
m4_define(pcre2_prerelease, [])
|
m4_define(pcre2_prerelease, [-RC1])
|
||||||
m4_define(pcre2_date, [2016-07-29])
|
m4_define(pcre2_date, [2016-08-01])
|
||||||
|
|
||||||
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
||||||
# 50 lines of this file. Please update that if the variables above are moved.
|
# 50 lines of this file. Please update that if the variables above are moved.
|
||||||
|
|
|
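The NOTE in the hunk above says that CMakeLists.txt searches the first 50 lines of this file for the version variables. As a rough illustration only, a Python analogue of such a scan might look like the sketch below. The regular expression, the function name `read_version_fields`, and the way the version string is assembled are assumptions made for this example; they are not taken from the PCRE2 build files.

```python
import re

# Hypothetical pattern for lines such as: m4_define(pcre2_minor, [23])
M4_DEFINE = re.compile(r"m4_define\(pcre2_(\w+),\s*\[([^\]]*)\]\)")


def read_version_fields(text, max_lines=50):
    """Collect pcre2_* m4 definitions found in the first max_lines lines."""
    fields = {}
    for line in text.splitlines()[:max_lines]:
        match = M4_DEFINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# The new-side values shown in the hunk above.
sample = """m4_define(pcre2_major, [10])
m4_define(pcre2_minor, [23])
m4_define(pcre2_prerelease, [-RC1])
m4_define(pcre2_date, [2016-08-01])"""

fields = read_version_fields(sample)
version = f"{fields['major']}.{fields['minor']}{fields['prerelease']}"
print(version, fields["date"])
```

If the definitions were moved below line 50, this scan would return an empty dictionary, which is the failure mode the NOTE warns about.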
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.
|
||||||
<P>
|
<P>
|
||||||
As the original fairly simple PCRE library evolved, it acquired many different
|
As the original fairly simple PCRE library evolved, it acquired many different
|
||||||
features, and as a result, the original <b>pcretest</b> program ended up with a
|
features, and as a result, the original <b>pcretest</b> program ended up with a
|
||||||
lot of options in a messy, arcane syntax, for testing all the features. The
|
lot of options in a messy, arcane syntax for testing all the features. The
|
||||||
move to the new PCRE2 API provided an opportunity to re-implement the test
|
move to the new PCRE2 API provided an opportunity to re-implement the test
|
||||||
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
|
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
|
||||||
are still many obscure modifiers, some of which are specifically designed for
|
are still many obscure modifiers, some of which are specifically designed for
|
||||||
|
@@ -77,32 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
|
||||||
all three of these libraries may be simultaneously installed. The
|
all three of these libraries may be simultaneously installed. The
|
||||||
<b>pcre2test</b> program can be used to test all the libraries. However, its own
|
<b>pcre2test</b> program can be used to test all the libraries. However, its own
|
||||||
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
|
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
|
||||||
libraries, patterns and subject strings are converted to 16- or 32-bit format
|
libraries, patterns and subject strings are converted to 16-bit or 32-bit
|
||||||
before being passed to the library functions. Results are converted back to
|
format before being passed to the library functions. Results are converted back
|
||||||
8-bit code units for output.
|
to 8-bit code units for output.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In the rest of this document, the names of library functions and structures
|
In the rest of this document, the names of library functions and structures
|
||||||
are given in generic form, for example, <b>pcre_compile()</b>. The actual
|
are given in generic form, for example, <b>pcre_compile()</b>. The actual
|
||||||
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
|
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
|
||||||
</P>
|
<a name="inputencoding"></a></P>
|
||||||
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
|
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
|
||||||
<P>
|
<P>
|
||||||
Input to <b>pcre2test</b> is processed line by line, either by calling the C
|
Input to <b>pcre2test</b> is processed line by line, either by calling the C
|
||||||
library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
|
library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
|
||||||
below). The input is processed using using C's string functions, so must not
|
Windows environments character 26 (hex 1A) causes an immediate end of file, and
|
||||||
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
no further data is read, so this character should be avoided unless you really
|
||||||
treats any bytes other than newline as data characters. In some Windows
|
want that action.
|
||||||
environments character 26 (hex 1A) causes an immediate end of file, and no
|
|
||||||
further data is read.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For maximum portability, therefore, it is safest to avoid non-printing
|
The input is processed using using C's string functions, so must not
|
||||||
characters in <b>pcre2test</b> input files. There is a facility for specifying
|
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
||||||
some or all of a pattern's characters as hexadecimal pairs, thus making it
|
treats any bytes other than newline as data characters. An error is generated
|
||||||
possible to include binary zeroes in a pattern for testing purposes. Subject
|
if a binary zero is encountered. Subject lines are processed for backslash
|
||||||
lines are processed for backslash escapes, which makes it possible to include
|
escapes, which makes it possible to include any data value in strings that are
|
||||||
any data value.
|
passed to the library for matching. For patterns, there is a facility for
|
||||||
|
specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||||
|
which makes it possible to include binary zeros.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Input for the 16-bit and 32-bit libraries
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
When testing the 16-bit or 32-bit libraries, there is a need to be able to
|
||||||
|
generate character code points greater than 255 in the strings that are passed
|
||||||
|
to the library. For subject lines, backslash escapes can be used. In addition,
|
||||||
|
when the <b>utf</b> modifier (see
|
||||||
|
<a href="#optionmodifiers">"Setting compilation options"</a>
|
||||||
|
below) is set, the pattern and any following subject lines are interpreted as
|
||||||
|
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
|
||||||
|
used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
|
||||||
|
or 32-bit mode. It causes the pattern and following subject lines to be treated
|
||||||
|
as UTF-8 according to the original definition (RFC 2279), which allows for
|
||||||
|
character values up to 0x7fffffff. Each character is placed in one 16-bit or
|
||||||
|
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
|
||||||
|
to occur).
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
|
||||||
|
values can be handled by the 32-bit library. When testing this library in
|
||||||
|
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
|
||||||
|
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
|
||||||
|
character's value. This is the only way of passing such code points in a
|
||||||
|
pattern string. For subject strings, using an escape sequence is preferable.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
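The utf8_input behaviour documented in the hunk above (RFC 2279 sequences of up to six bytes, one code unit per character, an error above 0xffff in 16-bit mode, and the 0xff prefix adding 0x80000000 in 32-bit mode) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pcre2test implementation: the function names are hypothetical, overlong encodings are not rejected, and the exact error conditions in pcre2test may differ.

```python
def _lead_info(lead):
    """Return (initial value bits, number of continuation bytes) per RFC 2279."""
    if lead < 0x80:
        return lead, 0
    if lead < 0xC0:
        raise ValueError("unexpected continuation byte")
    if lead < 0xE0:
        return lead & 0x1F, 1
    if lead < 0xF0:
        return lead & 0x0F, 2
    if lead < 0xF8:
        return lead & 0x07, 3
    if lead < 0xFC:
        return lead & 0x03, 4
    if lead < 0xFE:
        return lead & 0x01, 5
    raise ValueError("invalid lead byte")


def decode_utf8_input(data, width):
    """Decode bytes into a list of code units, one per character (hypothetical helper)."""
    if width not in (16, 32):
        raise ValueError("utf8_input is allowed only in 16-bit or 32-bit mode")
    units = []
    i = 0
    while i < len(data):
        offset = 0
        if data[i] == 0xFF:
            # Assumption: the 0xff prefix is meaningful only for the 32-bit library.
            if width != 32:
                raise ValueError("0xff prefix used outside 32-bit mode")
            offset = 0x80000000
            i += 1
            if i >= len(data):
                raise ValueError("0xff must be followed by a character")
        value, extra = _lead_info(data[i])
        i += 1
        for _ in range(extra):
            if i >= len(data) or (data[i] & 0xC0) != 0x80:
                raise ValueError("truncated or malformed sequence")
            value = (value << 6) | (data[i] & 0x3F)
            i += 1
        value += offset
        if width == 16 and value > 0xFFFF:
            raise ValueError("value does not fit in a 16-bit code unit")
        units.append(value)
    return units
```

For example, `decode_utf8_input(b"A\xc4\x80", 16)` yields the two code units 0x41 and 0x100, while the same call with a four-byte sequence raises an error in 16-bit mode because the value exceeds 0xffff.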
@@ -553,7 +582,9 @@ for a description of their effects.
|
||||||
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
|
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
|
||||||
non-printing characters in output strings to be printed using the \x{hh...}
|
non-printing characters in output strings to be printed using the \x{hh...}
|
||||||
notation. Otherwise, those less than 0x100 are output in hex without the curly
|
notation. Otherwise, those less than 0x100 are output in hex without the curly
|
||||||
brackets.
|
brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
|
||||||
|
subject strings to be translated to UTF-16 or UTF-32, respectively, before
|
||||||
|
being passed to library functions.
|
||||||
<a name="controlmodifiers"></a></P>
|
<a name="controlmodifiers"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting compilation controls
|
Setting compilation controls
|
||||||
|
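The sentence added in the hunk above says that setting utf in 16-bit or 32-bit mode translates strings to UTF-16 or UTF-32. To show how that differs from placing each character in a single code unit, here is a small Python sketch of the standard UTF-16 and UTF-32 encodings of a list of code points. The function names are hypothetical and the validity checks follow the Unicode definitions; pcre2test's own checks are not shown in the source and may differ.

```python
def utf16_units(code_points):
    """Encode code points as UTF-16 code units, using surrogate pairs above 0xffff."""
    units = []
    for cp in code_points:
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
    return units


def utf32_units(code_points):
    """Encode code points as UTF-32 code units: one unit per code point."""
    for cp in code_points:
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    return list(code_points)
```

In UTF-16, a code point such as 0x10000 occupies two code units, whereas in non-UTF 16-bit mode a value that large cannot be represented at all.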
@@ -584,6 +615,7 @@ about the pattern:
|
||||||
pushcopy push a copy onto the stack
|
pushcopy push a copy onto the stack
|
||||||
stackguard=<number> test the stackguard feature
|
stackguard=<number> test the stackguard feature
|
||||||
tables=[0|1|2] select internal tables
|
tables=[0|1|2] select internal tables
|
||||||
|
utf8_input treat input as UTF-8
|
||||||
</pre>
|
</pre>
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
</P>
|
</P>
|
||||||
|
@@ -684,7 +716,8 @@ nine characters, only two of which are specified in hexadecimal:
|
||||||
/ab "literal" 32/hex
|
/ab "literal" 32/hex
|
||||||
</pre>
|
</pre>
|
||||||
Either single or double quotes may be used. There is no way of including
|
Either single or double quotes may be used. There is no way of including
|
||||||
the delimiter within a substring.
|
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
|
||||||
|
mutually exclusive.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
|
By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
|
||||||
|
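The hex example in the hunk above, /ab "literal" 32/hex, is described as nine characters, two of which are given in hexadecimal. The Python sketch below shows one way to read such a pattern body: quoted substrings are taken literally, other characters are read as pairs of hexadecimal digits, and white space between items is skipped. The function name `decode_hex_pattern` is hypothetical, and the treatment of white space and of characters outside Latin-1 is an assumption of this sketch rather than something stated in the diff.

```python
import string


def decode_hex_pattern(text):
    """Turn the body of a /.../hex pattern into bytes (illustrative only)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # assumed: white space between items is ignored
        elif ch in "'\"":
            # A quoted substring runs to the next matching delimiter; the
            # delimiter itself cannot appear inside the substring.
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError("unterminated quoted substring")
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            pair = text[i:i + 2]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError(f"expected two hex digits at offset {i}")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)


pattern = decode_hex_pattern('ab "literal" 32')
print(len(pattern), pattern)
```

Because the decoded length is known, a pattern built this way can contain a binary zero, which is why the documentation notes that the actual length is passed instead of PCRE2_ZERO_TERMINATED.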
@@ -693,6 +726,19 @@ patterns specified with the <b>hex</b> modifier, the actual length of the
|
||||||
pattern is passed.
|
pattern is passed.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
Specifying wide characters in 16-bit and 32-bit modes
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
|
||||||
|
translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
|
||||||
|
the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
|
||||||
|
can be used. It is mutually exclusive with <b>utf</b>. Input lines are
|
||||||
|
interpreted as UTF-8 as a means of specifying wide characters. More details are
|
||||||
|
given in
|
||||||
|
<a href="#inputencoding">"Input encoding"</a>
|
||||||
|
above.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
Generating long repetitive patterns
|
Generating long repetitive patterns
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
|
@@ -708,7 +754,8 @@ are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
|
||||||
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||||
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
|
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
|
||||||
by decimal digits and "}" is found later in the pattern. If not, the characters
|
by decimal digits and "}" is found later in the pattern. If not, the characters
|
||||||
remain in the pattern unaltered.
|
remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
|
||||||
|
mutually exclusive.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If part of an expanded pattern looks like an expansion, but is really part of
|
If part of an expanded pattern looks like an expansion, but is really part of
|
||||||
|
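The expand construction described in the hunk above replaces \[AB]{6000} with "AB" repeated 6000 times and leaves the text alone when no "]{digits}" follows. A rough Python sketch of that rule is shown below. The regular expression and the function name `expand_pattern` are assumptions for illustration; how pcre2test pairs an initial "\[" with a later "]{" in awkward cases may differ from this non-greedy match.

```python
import re

# Backslash, "[", the characters to repeat, "]{", decimal digits, "}".
REPEAT = re.compile(r"\\\[(.*?)\]\{(\d+)\}", re.DOTALL)


def expand_pattern(pattern):
    """Expand each \\[chars]{count} item; anything else is left unaltered."""
    return REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), pattern)


print(expand_pattern(r"x\[AB]{3}y"))
```

A quantifier with two values, such as {6000,6000}, does not match the single-count form, so the text is passed through unchanged.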
@@ -1706,7 +1753,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 06 July 2016
|
Last updated: 02 August 2016
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2016 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@@ -1,4 +1,4 @@
|
||||||
.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
|
.TH PCRE2TEST 1 "02 August 2016" "PCRE 10.23"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@@ -29,7 +29,7 @@ subject is processed, and what output is produced.
|
||||||
.P
|
.P
|
||||||
As the original fairly simple PCRE library evolved, it acquired many different
|
As the original fairly simple PCRE library evolved, it acquired many different
|
||||||
features, and as a result, the original \fBpcretest\fP program ended up with a
|
features, and as a result, the original \fBpcretest\fP program ended up with a
|
||||||
lot of options in a messy, arcane syntax, for testing all the features. The
|
lot of options in a messy, arcane syntax for testing all the features. The
|
||||||
move to the new PCRE2 API provided an opportunity to re-implement the test
|
move to the new PCRE2 API provided an opportunity to re-implement the test
|
||||||
program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
|
program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
|
||||||
are still many obscure modifiers, some of which are specifically designed for
|
are still many obscure modifiers, some of which are specifically designed for
|
||||||
|
@@ -47,32 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
|
||||||
all three of these libraries may be simultaneously installed. The
|
all three of these libraries may be simultaneously installed. The
|
||||||
\fBpcre2test\fP program can be used to test all the libraries. However, its own
|
\fBpcre2test\fP program can be used to test all the libraries. However, its own
|
||||||
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
|
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
|
||||||
libraries, patterns and subject strings are converted to 16- or 32-bit format
|
libraries, patterns and subject strings are converted to 16-bit or 32-bit
|
||||||
before being passed to the library functions. Results are converted back to
|
format before being passed to the library functions. Results are converted back
|
||||||
8-bit code units for output.
|
to 8-bit code units for output.
|
||||||
.P
|
.P
|
||||||
In the rest of this document, the names of library functions and structures
|
In the rest of this document, the names of library functions and structures
|
||||||
are given in generic form, for example, \fBpcre_compile()\fP. The actual
|
are given in generic form, for example, \fBpcre_compile()\fP. The actual
|
||||||
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
|
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.\" HTML <a name="inputencoding"></a>
|
||||||
.SH "INPUT ENCODING"
|
.SH "INPUT ENCODING"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Input to \fBpcre2test\fP is processed line by line, either by calling the C
|
Input to \fBpcre2test\fP is processed line by line, either by calling the C
|
||||||
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
|
library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
|
||||||
below). The input is processed using using C's string functions, so must not
|
Windows environments character 26 (hex 1A) causes an immediate end of file, and
|
||||||
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
|
no further data is read, so this character should be avoided unless you really
|
||||||
treats any bytes other than newline as data characters. In some Windows
|
want that action.
|
||||||
environments character 26 (hex 1A) causes an immediate end of file, and no
|
|
||||||
further data is read.
|
|
||||||
.P
|
.P
|
||||||
For maximum portability, therefore, it is safest to avoid non-printing
|
The input is processed using using C's string functions, so must not
|
||||||
characters in \fBpcre2test\fP input files. There is a facility for specifying
|
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
|
||||||
some or all of a pattern's characters as hexadecimal pairs, thus making it
|
treats any bytes other than newline as data characters. An error is generated
|
||||||
possible to include binary zeroes in a pattern for testing purposes. Subject
|
if a binary zero is encountered. Subject lines are processed for backslash
|
||||||
lines are processed for backslash escapes, which makes it possible to include
|
escapes, which makes it possible to include any data value in strings that are
|
||||||
any data value.
|
passed to the library for matching. For patterns, there is a facility for
|
||||||
|
specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||||
|
which makes it possible to include binary zeros.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Input for the 16-bit and 32-bit libraries"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
When testing the 16-bit or 32-bit libraries, there is a need to be able to
|
||||||
|
generate character code points greater than 255 in the strings that are passed
|
||||||
|
to the library. For subject lines, backslash escapes can be used. In addition,
|
||||||
|
when the \fButf\fP modifier (see
|
||||||
|
.\" HTML <a href="#optionmodifiers">
|
||||||
|
.\" </a>
|
||||||
|
"Setting compilation options"
|
||||||
|
.\"
|
||||||
|
below) is set, the pattern and any following subject lines are interpreted as
|
||||||
|
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
|
||||||
|
.P
|
||||||
|
For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
|
||||||
|
used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
|
||||||
|
or 32-bit mode. It causes the pattern and following subject lines to be treated
|
||||||
|
as UTF-8 according to the original definition (RFC 2279), which allows for
|
||||||
|
character values up to 0x7fffffff. Each character is placed in one 16-bit or
|
||||||
|
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
|
||||||
|
to occur).
|
||||||
|
.P
|
||||||
|
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
|
||||||
|
values can be handled by the 32-bit library. When testing this library in
|
||||||
|
non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
|
||||||
|
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
|
||||||
|
character's value. This is the only way of passing such code points in a
|
||||||
|
pattern string. For subject strings, using an escape sequence is preferable.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "COMMAND LINE OPTIONS"
|
.SH "COMMAND LINE OPTIONS"
|
||||||
|
@@ -515,7 +546,9 @@ for a description of their effects.
|
||||||
As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
|
As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
|
||||||
non-printing characters in output strings to be printed using the \ex{hh...}
|
non-printing characters in output strings to be printed using the \ex{hh...}
|
||||||
notation. Otherwise, those less than 0x100 are output in hex without the curly
|
notation. Otherwise, those less than 0x100 are output in hex without the curly
|
||||||
brackets.
|
brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
|
||||||
|
subject strings to be translated to UTF-16 or UTF-32, respectively, before
|
||||||
|
being passed to library functions.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="controlmodifiers"></a>
|
.\" HTML <a name="controlmodifiers"></a>
|
||||||
|
@@ -547,6 +580,7 @@ about the pattern:
|
||||||
pushcopy push a copy onto the stack
|
pushcopy push a copy onto the stack
|
||||||
stackguard=<number> test the stackguard feature
|
stackguard=<number> test the stackguard feature
|
||||||
tables=[0|1|2] select internal tables
|
tables=[0|1|2] select internal tables
|
||||||
|
utf8_input treat input as UTF-8
|
||||||
.sp
|
.sp
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
.
|
.
|
||||||
|
@@ -642,7 +676,8 @@ nine characters, only two of which are specified in hexadecimal:
|
||||||
/ab "literal" 32/hex
|
/ab "literal" 32/hex
|
||||||
.sp
|
.sp
|
||||||
Either single or double quotes may be used. There is no way of including
|
Either single or double quotes may be used. There is no way of including
|
||||||
the delimiter within a substring.
|
the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
|
||||||
|
mutually exclusive.
|
||||||
.P
|
.P
|
||||||
By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
|
By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
|
||||||
\fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
|
\fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
|
||||||
|
@@ -650,6 +685,22 @@ patterns specified with the \fBhex\fP modifier, the actual length of the
|
||||||
pattern is passed.
|
pattern is passed.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SS "Specifying wide characters in 16-bit and 32-bit modes"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
|
||||||
|
translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
|
||||||
|
the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
|
||||||
|
can be used. It is mutually exclusive with \fButf\fP. Input lines are
|
||||||
|
interpreted as UTF-8 as a means of specifying wide characters. More details are
|
||||||
|
given in
|
||||||
|
.\" HTML <a href="#inputencoding">
|
||||||
|
.\" </a>
|
||||||
|
"Input encoding"
|
||||||
|
.\"
|
||||||
|
above.
|
||||||
|
.
|
||||||
|
.
|
||||||
.SS "Generating long repetitive patterns"
|
.SS "Generating long repetitive patterns"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@@ -665,7 +716,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
|
||||||
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||||
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
|
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
|
||||||
by decimal digits and "}" is found later in the pattern. If not, the characters
|
by decimal digits and "}" is found later in the pattern. If not, the characters
|
||||||
remain in the pattern unaltered.
|
remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
|
||||||
|
mutually exclusive.
|
||||||
.P
|
.P
|
||||||
If part of an expanded pattern looks like an expansion, but is really part of
|
If part of an expanded pattern looks like an expansion, but is really part of
|
||||||
the actual pattern, unwanted expansion can be avoided by giving two values in
|
the actual pattern, unwanted expansion can be avoided by giving two values in
|
||||||
|
@@ -1682,6 +1734,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 06 July 2016
|
Last updated: 02 August 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@@ -26,7 +26,7 @@ SYNOPSIS
|
||||||
|
|
||||||
As the original fairly simple PCRE library evolved, it acquired many
|
As the original fairly simple PCRE library evolved, it acquired many
|
||||||
different features, and as a result, the original pcretest program
|
different features, and as a result, the original pcretest program
|
||||||
ended up with a lot of options in a messy, arcane syntax, for testing
|
ended up with a lot of options in a messy, arcane syntax for testing
|
||||||
all the features. The move to the new PCRE2 API provided an opportunity
|
all the features. The move to the new PCRE2 API provided an opportunity
|
||||||
to re-implement the test program as pcre2test, with a cleaner modifier
|
to re-implement the test program as pcre2test, with a cleaner modifier
|
||||||
syntax. Nevertheless, there are still many obscure modifiers, some of
|
syntax. Nevertheless, there are still many obscure modifiers, some of
|
||||||
|
@@ -45,7 +45,7 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
||||||
installed. The pcre2test program can be used to test all the libraries.
|
installed. The pcre2test program can be used to test all the libraries.
|
||||||
However, its own input and output are always in 8-bit format. When
|
However, its own input and output are always in 8-bit format. When
|
||||||
testing the 16-bit or 32-bit libraries, patterns and subject strings
|
testing the 16-bit or 32-bit libraries, patterns and subject strings
|
||||||
are converted to 16- or 32-bit format before being passed to the
|
are converted to 16-bit or 32-bit format before being passed to the
|
||||||
library functions. Results are converted back to 8-bit code units for
|
library functions. Results are converted back to 8-bit code units for
|
||||||
output.
|
output.
|
||||||
|
|
||||||
|
@@ -58,19 +58,46 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
||||||
INPUT ENCODING
|
INPUT ENCODING
|
||||||
|
|
||||||
Input to pcre2test is processed line by line, either by calling the C
|
Input to pcre2test is processed line by line, either by calling the C
|
||||||
library's fgets() function, or via the libreadline library (see below).
|
library's fgets() function, or via the libreadline library. In some
|
||||||
|
Windows environments character 26 (hex 1A) causes an immediate end of
|
||||||
|
file, and no further data is read, so this character should be avoided
|
||||||
|
unless you really want that action.
|
||||||
|
|
||||||
The input is processed using using C's string functions, so must not
|
The input is processed using using C's string functions, so must not
|
||||||
contain binary zeroes, even though in Unix-like environments, fgets()
|
contain binary zeroes, even though in Unix-like environments, fgets()
|
||||||
treats any bytes other than newline as data characters. In some Windows
|
treats any bytes other than newline as data characters. An error is
|
||||||
environments character 26 (hex 1A) causes an immediate end of file, and
|
generated if a binary zero is encountered. Subject lines are processed
|
||||||
no further data is read.
|
for backslash escapes, which makes it possible to include any data
|
||||||
|
value in strings that are passed to the library for matching. For pat-
|
||||||
|
terns, there is a facility for specifying some or all of the 8-bit
|
||||||
|
input characters as hexadecimal pairs, which makes it possible to
|
||||||
|
include binary zeros.
|
||||||
|
|
||||||
For maximum portability, therefore, it is safest to avoid non-printing
|
Input for the 16-bit and 32-bit libraries
|
||||||
characters in pcre2test input files. There is a facility for specifying
|
|
||||||
some or all of a pattern's characters as hexadecimal pairs, thus making
|
When testing the 16-bit or 32-bit libraries, there is a need to be able
|
||||||
it possible to include binary zeroes in a pattern for testing purposes.
|
to generate character code points greater than 255 in the strings that
|
||||||
Subject lines are processed for backslash escapes, which makes it pos-
|
are passed to the library. For subject lines, backslash escapes can be
|
||||||
sible to include any data value.
|
used. In addition, when the utf modifier (see "Setting compilation
|
||||||
|
options" below) is set, the pattern and any following subject lines are
|
||||||
|
interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
|
||||||
|
appropriate.
|
||||||
|
|
||||||
|
For non-UTF testing of wide characters, the utf8_input modifier can be
|
||||||
|
used. This is mutually exclusive with utf, and is allowed only in
|
||||||
|
16-bit or 32-bit mode. It causes the pattern and following subject
|
||||||
|
lines to be treated as UTF-8 according to the original definition (RFC
|
||||||
|
2279), which allows for character values up to 0x7fffffff. Each charac-
|
||||||
|
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
|
||||||
|
values greater than 0xffff cause an error to occur).
|
||||||
|
|
||||||
|
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
|
||||||
|
such values can be handled by the 32-bit library. When testing this
|
||||||
|
library in non-UTF mode with utf8_input set, if any character is pre-
|
||||||
|
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
|
||||||
|
is added to the character's value. This is the only way of passing such
|
||||||
|
code points in a pattern string. For subject strings, using an escape
|
||||||
|
sequence is preferable.
|
||||||
|
|
||||||
|
|
||||||
COMMAND LINE OPTIONS
|
COMMAND LINE OPTIONS
|
||||||
|
@@ -500,7 +527,9 @@ PATTERN MODIFIERS
|
||||||
As well as turning on the PCRE2_UTF option, the utf modifier causes all
|
As well as turning on the PCRE2_UTF option, the utf modifier causes all
|
||||||
non-printing characters in output strings to be printed using the
|
non-printing characters in output strings to be printed using the
|
||||||
\x{hh...} notation. Otherwise, those less than 0x100 are output in hex
|
\x{hh...} notation. Otherwise, those less than 0x100 are output in hex
|
||||||
without the curly brackets.
|
without the curly brackets. Setting utf in 16-bit or 32-bit mode also
|
||||||
|
causes pattern and subject strings to be translated to UTF-16 or
|
||||||
|
UTF-32, respectively, before being passed to library functions.
|
||||||
|
|
||||||
Setting compilation controls
|
Setting compilation controls
|
||||||
|
|
||||||
|
@@ -529,6 +558,7 @@ PATTERN MODIFIERS
|
||||||
pushcopy push a copy onto the stack
|
pushcopy push a copy onto the stack
|
||||||
stackguard=<number> test the stackguard feature
|
stackguard=<number> test the stackguard feature
|
||||||
tables=[0|1|2] select internal tables
|
tables=[0|1|2] select internal tables
|
||||||
|
utf8_input treat input as UTF-8
|
||||||
|
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
|
|
||||||
|
@@ -619,13 +649,23 @@ PATTERN MODIFIERS
|
||||||
/ab "literal" 32/hex
|
/ab "literal" 32/hex
|
||||||
|
|
||||||
Either single or double quotes may be used. There is no way of includ-
|
Either single or double quotes may be used. There is no way of includ-
|
||||||
ing the delimiter within a substring.
|
ing the delimiter within a substring. The hex and expand modifiers are
|
||||||
|
mutually exclusive.
|
||||||
|
|
||||||
By default, pcre2test passes patterns as zero-terminated strings to
|
By default, pcre2test passes patterns as zero-terminated strings to
|
||||||
pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
|
pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
|
||||||
for patterns specified with the hex modifier, the actual length of the
|
for patterns specified with the hex modifier, the actual length of the
|
||||||
pattern is passed.
|
pattern is passed.
|
||||||
|
|
||||||
|
Specifying wide characters in 16-bit and 32-bit modes
|
||||||
|
|
||||||
|
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
|
||||||
|
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
|
||||||
|
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
|
||||||
|
modifier can be used. It is mutually exclusive with utf. Input lines
|
||||||
|
are interpreted as UTF-8 as a means of specifying wide characters. More
|
||||||
|
details are given in "Input encoding" above.
|
||||||
|
|
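The interaction between the code unit width, utf, and utf8_input described in this section can be summarized as a small decision function. This is only an illustrative sketch: `classify_input` and the `input_kind` enum are hypothetical names, and treating 8-bit mode as a plain byte pass-through is an assumption of the sketch rather than something stated here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of how an input line is handled. */
enum input_kind {
  INPUT_BYTES,     /* each byte becomes one code unit, unchanged */
  INPUT_UTF8,      /* bytes are decoded as UTF-8 into wider code units */
  INPUT_REJECTED   /* the modifier combination is faulted */
};

/* width is the code unit width of the library under test: 8, 16, or 32. */
static enum input_kind classify_input(int width, bool utf, bool utf8_input)
{
  assert(width == 8 || width == 16 || width == 32);

  /* utf8_input is not allowed in 8-bit mode and is mutually exclusive
     with utf. */
  if (utf8_input && (width == 8 || utf)) return INPUT_REJECTED;

  /* Assumption of this sketch: 8-bit input is passed through as bytes. */
  if (width == 8) return INPUT_BYTES;

  /* In 16-bit and 32-bit modes, either modifier causes UTF-8 decoding. */
  return (utf || utf8_input) ? INPUT_UTF8 : INPUT_BYTES;
}
```

For example, `classify_input(32, false, true)` yields `INPUT_UTF8`, which is the case the utf8_input modifier was added for, while `classify_input(16, true, true)` yields `INPUT_REJECTED`.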
||||||
Generating long repetitive patterns
|
Generating long repetitive patterns
|
||||||
|
|
||||||
Some tests use long patterns that are very repetitive. Instead of cre-
|
Some tests use long patterns that are very repetitive. Instead of cre-
|
||||||
|
@ -640,7 +680,8 @@ PATTERN MODIFIERS
|
||||||
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||||
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
||||||
followed by decimal digits and "}" is found later in the pattern. If
|
followed by decimal digits and "}" is found later in the pattern. If
|
||||||
not, the characters remain in the pattern unaltered.
|
not, the characters remain in the pattern unaltered. The expand and hex
|
||||||
|
modifiers are mutually exclusive.
|
||||||
|
|
||||||
If part of an expanded pattern looks like an expansion, but is really
|
If part of an expanded pattern looks like an expansion, but is really
|
||||||
part of the actual pattern, unwanted expansion can be avoided by giving
|
part of the actual pattern, unwanted expansion can be avoided by giving
|
||||||
|
@ -1548,5 +1589,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 06 July 2016
|
Last updated: 02 August 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
|
|
|
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
|
||||||
/* The current PCRE version information. */
|
/* The current PCRE version information. */
|
||||||
|
|
||||||
#define PCRE2_MAJOR 10
|
#define PCRE2_MAJOR 10
|
||||||
#define PCRE2_MINOR 22
|
#define PCRE2_MINOR 23
|
||||||
#define PCRE2_PRERELEASE
|
#define PCRE2_PRERELEASE -RC1
|
||||||
#define PCRE2_DATE 2016-07-29
|
#define PCRE2_DATE 2016-08-01
|
||||||
|
|
||||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||||
imported have to be identified as such. When building PCRE2, the appropriate
|
imported have to be identified as such. When building PCRE2, the appropriate
|
||||||
|
|
124
src/pcre2test.c
|
@ -430,8 +430,8 @@ so many of them that they are split into two fields. */
|
||||||
#define CTL_PUSH 0x01000000u
|
#define CTL_PUSH 0x01000000u
|
||||||
#define CTL_PUSHCOPY 0x02000000u
|
#define CTL_PUSHCOPY 0x02000000u
|
||||||
#define CTL_STARTCHAR 0x04000000u
|
#define CTL_STARTCHAR 0x04000000u
|
||||||
#define CTL_ZERO_TERMINATE 0x08000000u
|
#define CTL_UTF8_INPUT 0x08000000u
|
||||||
/* Spare 0x10000000u */
|
#define CTL_ZERO_TERMINATE 0x10000000u
|
||||||
/* Spare 0x20000000u */
|
/* Spare 0x20000000u */
|
||||||
#define CTL_NL_SET 0x40000000u /* Informational */
|
#define CTL_NL_SET 0x40000000u /* Informational */
|
||||||
#define CTL_BSR_SET 0x80000000u /* Informational */
|
#define CTL_BSR_SET 0x80000000u /* Informational */
|
||||||
|
@ -460,7 +460,8 @@ data line. */
|
||||||
CTL_GLOBAL|\
|
CTL_GLOBAL|\
|
||||||
CTL_MARK|\
|
CTL_MARK|\
|
||||||
CTL_MEMORY|\
|
CTL_MEMORY|\
|
||||||
CTL_STARTCHAR)
|
CTL_STARTCHAR|\
|
||||||
|
CTL_UTF8_INPUT)
|
||||||
|
|
||||||
#define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
|
#define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
|
||||||
CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
|
CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
|
||||||
|
@ -621,6 +622,7 @@ static modstruct modlist[] = {
|
||||||
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
|
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
|
||||||
{ "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) },
|
{ "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) },
|
||||||
{ "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) },
|
{ "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) },
|
||||||
|
{ "utf8_input", MOD_PAT, MOD_CTL, CTL_UTF8_INPUT, PO(control) },
|
||||||
{ "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) }
|
{ "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) }
|
||||||
};
|
};
|
||||||
|
|
||||||
|
@ -673,6 +675,7 @@ static uint32_t exclusive_pat_controls[] = {
|
||||||
|
|
||||||
/* Data controls that are mutually exclusive. At present these are all in the
|
/* Data controls that are mutually exclusive. At present these are all in the
|
||||||
first control word. */
|
first control word. */
|
||||||
|
|
||||||
static uint32_t exclusive_dat_controls[] = {
|
static uint32_t exclusive_dat_controls[] = {
|
||||||
CTL_ALLUSEDTEXT | CTL_STARTCHAR,
|
CTL_ALLUSEDTEXT | CTL_STARTCHAR,
|
||||||
CTL_FINDLIMITS | CTL_NULLCONTEXT };
|
CTL_FINDLIMITS | CTL_NULLCONTEXT };
|
||||||
|
@ -2715,16 +2718,22 @@ return i + 1;
|
||||||
|
|
||||||
#ifdef SUPPORT_PCRE2_16
|
#ifdef SUPPORT_PCRE2_16
|
||||||
/*************************************************
|
/*************************************************
|
||||||
* Convert pattern to 16-bit *
|
* Convert string to 16-bit *
|
||||||
*************************************************/
|
*************************************************/
|
||||||
|
|
||||||
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
|
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
|
||||||
all the input bytes are ASCII, the space needed for a 16-bit string is exactly
|
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
|
||||||
double the 8-bit size. Otherwise, the size needed for a 16-bit string is no
|
code values from 0 to 0x7fffffff. However, values greater than the later UTF
|
||||||
more than double, because up to 0xffff uses no more than 3 bytes in UTF-8 but
|
limit of 0x10ffff cause an error. In non-UTF mode the input is interpreted as
|
||||||
possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes in
|
UTF-8 if the utf8_input modifier is set, but an error is generated for values
|
||||||
UTF-16. The result is always left in pbuffer16. Impose a minimum size to save
|
greater than 0xffff.
|
||||||
repeated re-sizing.
|
|
||||||
|
If all the input bytes are ASCII, the space needed for a 16-bit string is
|
||||||
|
exactly double the 8-bit size. Otherwise, the size needed for a 16-bit string
|
||||||
|
is no more than double, because up to 0xffff uses no more than 3 bytes in UTF-8
|
||||||
|
but possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes
|
||||||
|
in UTF-16. The result is always left in pbuffer16. Impose a minimum size to
|
||||||
|
save repeated re-sizing.
|
||||||
|
|
||||||
Note that this function does not object to surrogate values. This is
|
Note that this function does not object to surrogate values. This is
|
||||||
deliberate; it makes it possible to construct UTF-16 strings that are invalid,
|
deliberate; it makes it possible to construct UTF-16 strings that are invalid,
|
||||||
|
@ -2732,7 +2741,7 @@ for the purpose of testing that they are correctly faulted.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
p points to a byte string
|
p points to a byte string
|
||||||
utf non-zero if converting to UTF-16
|
utf true in UTF mode
|
||||||
lenptr points to number of bytes in the string (excluding trailing zero)
|
lenptr points to number of bytes in the string (excluding trailing zero)
|
||||||
|
|
||||||
Returns: 0 on success, with the length updated to the number of 16-bit
|
Returns: 0 on success, with the length updated to the number of 16-bit
|
||||||
|
@ -2763,7 +2772,7 @@ if (pbuffer16_size < 2*len + 2)
|
||||||
}
|
}
|
||||||
|
|
||||||
pp = pbuffer16;
|
pp = pbuffer16;
|
||||||
if (!utf)
|
if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
|
||||||
{
|
{
|
||||||
for (; len > 0; len--) *pp++ = *p++;
|
for (; len > 0; len--) *pp++ = *p++;
|
||||||
}
|
}
|
||||||
|
@ -2772,12 +2781,12 @@ else while (len > 0)
|
||||||
uint32_t c;
|
uint32_t c;
|
||||||
int chlen = utf82ord(p, &c);
|
int chlen = utf82ord(p, &c);
|
||||||
if (chlen <= 0) return -1;
|
if (chlen <= 0) return -1;
|
||||||
|
if (!utf && c > 0xffff) return -3;
|
||||||
if (c > 0x10ffff) return -2;
|
if (c > 0x10ffff) return -2;
|
||||||
p += chlen;
|
p += chlen;
|
||||||
len -= chlen;
|
len -= chlen;
|
||||||
if (c < 0x10000) *pp++ = c; else
|
if (c < 0x10000) *pp++ = c; else
|
||||||
{
|
{
|
||||||
if (!utf) return -3;
|
|
||||||
c -= 0x10000;
|
c -= 0x10000;
|
||||||
*pp++ = 0xD800 | (c >> 10);
|
*pp++ = 0xD800 | (c >> 10);
|
||||||
*pp++ = 0xDC00 | (c & 0x3ff);
|
*pp++ = 0xDC00 | (c & 0x3ff);
|
||||||
|
@ -2794,15 +2803,25 @@ return 0;
|
||||||
|
|
||||||
#ifdef SUPPORT_PCRE2_32
|
#ifdef SUPPORT_PCRE2_32
|
||||||
/*************************************************
|
/*************************************************
|
||||||
* Convert pattern to 32-bit *
|
* Convert string to 32-bit *
|
||||||
*************************************************/
|
*************************************************/
|
||||||
|
|
||||||
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
|
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
|
||||||
all the input bytes are ASCII, the space needed for a 32-bit string is exactly
|
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
|
||||||
four times the 8-bit size. Otherwise, the size needed for a 32-bit string is no
|
code values from 0 to 0x7fffffff. However, values greater than the later UTF
|
||||||
more than four times, because the number of characters must be less than the
|
limit of 0x10ffff cause an error.
|
||||||
number of bytes. The result is always left in pbuffer32. Impose a minimum size
|
|
||||||
to save repeated re-sizing.
|
In non-UTF mode the input is interpreted as UTF-8 if the utf8_input modifier
|
||||||
|
is set, and no limit is imposed. There is special interpretation of the 0xff
|
||||||
|
byte (which is illegal in UTF-8) in this case: it causes the top bit of the
|
||||||
|
next character to be set. This provides a way of generating 32-bit characters
|
||||||
|
greater than 0x7fffffff.
|
||||||
|
|
||||||
|
If all the input bytes are ASCII, the space needed for a 32-bit string is
|
||||||
|
exactly four times the 8-bit size. Otherwise, the size needed for a 32-bit
|
||||||
|
string is no more than four times, because the number of characters must be
|
||||||
|
less than the number of bytes. The result is always left in pbuffer32. Impose a
|
||||||
|
minimum size to save repeated re-sizing.
|
||||||
|
|
||||||
Note that this function does not object to surrogate values. This is
|
Note that this function does not object to surrogate values. This is
|
||||||
deliberate; it makes it possible to construct UTF-32 strings that are invalid,
|
deliberate; it makes it possible to construct UTF-32 strings that are invalid,
|
||||||
|
@ -2810,7 +2829,7 @@ for the purpose of testing that they are correctly faulted.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
p points to a byte string
|
p points to a byte string
|
||||||
utf true if UTF-8 (to be converted to UTF-32)
|
utf true in UTF mode
|
||||||
lenptr points to number of bytes in the string (excluding trailing zero)
|
lenptr points to number of bytes in the string (excluding trailing zero)
|
||||||
|
|
||||||
Returns: 0 on success, with the length updated to the number of 32-bit
|
Returns: 0 on success, with the length updated to the number of 32-bit
|
||||||
|
@ -2840,19 +2859,29 @@ if (pbuffer32_size < 4*len + 4)
|
||||||
}
|
}
|
||||||
|
|
||||||
pp = pbuffer32;
|
pp = pbuffer32;
|
||||||
if (!utf)
|
|
||||||
|
if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
|
||||||
{
|
{
|
||||||
for (; len > 0; len--) *pp++ = *p++;
|
for (; len > 0; len--) *pp++ = *p++;
|
||||||
}
|
}
|
||||||
|
|
||||||
else while (len > 0)
|
else while (len > 0)
|
||||||
{
|
{
|
||||||
|
int chlen;
|
||||||
uint32_t c;
|
uint32_t c;
|
||||||
int chlen = utf82ord(p, &c);
|
uint32_t topbit = 0;
|
||||||
|
if (!utf && *p == 0xff && len > 1)
|
||||||
|
{
|
||||||
|
topbit = 0x80000000u;
|
||||||
|
p++;
|
||||||
|
len--;
|
||||||
|
}
|
||||||
|
chlen = utf82ord(p, &c);
|
||||||
if (chlen <= 0) return -1;
|
if (chlen <= 0) return -1;
|
||||||
if (utf && c > 0x10ffff) return -2;
|
if (utf && c > 0x10ffff) return -2;
|
||||||
p += chlen;
|
p += chlen;
|
||||||
len -= chlen;
|
len -= chlen;
|
||||||
*pp++ = c;
|
*pp++ = c | topbit;
|
||||||
}
|
}
|
||||||
|
|
||||||
*pp = 0;
|
*pp = 0;
|
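The top-bit handling in the 32-bit conversion above can be tried in isolation with the sketch below. It is not the code from pcre2test.c: `decode_rfc2279` is a hypothetical stand-in for the real decoder (whose body is not shown in this diff) and performs no overlong-form checks, and `decode_wide32` is a hypothetical wrapper that processes a single character in the way the loop body does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in decoder for RFC 2279 UTF-8 (up to 6 bytes).
   Returns the number of bytes used, or -1 for a malformed sequence. */
static int decode_rfc2279(const uint8_t *p, size_t len, uint32_t *out)
{
  uint32_t c;
  int extra, i;

  if (len == 0) return -1;
  c = p[0];
  if (c < 0x80) { *out = c; return 1; }

  /* The lead byte determines the number of continuation bytes. */
  if      ((c & 0xe0) == 0xc0) { extra = 1; c &= 0x1f; }
  else if ((c & 0xf0) == 0xe0) { extra = 2; c &= 0x0f; }
  else if ((c & 0xf8) == 0xf0) { extra = 3; c &= 0x07; }
  else if ((c & 0xfc) == 0xf8) { extra = 4; c &= 0x03; }
  else if ((c & 0xfe) == 0xfc) { extra = 5; c &= 0x01; }
  else return -1;   /* 0x80-0xbf, 0xfe and 0xff cannot start a character */

  if ((size_t)extra >= len) return -1;   /* truncated sequence */
  for (i = 1; i <= extra; i++)
    {
    if ((p[i] & 0xc0) != 0x80) return -1;
    c = (c << 6) | (p[i] & 0x3f);
    }
  *out = c;
  return extra + 1;
}

/* Hypothetical wrapper: decode one character for the 32-bit library.
   In non-UTF mode, a 0xff byte sets the top bit of the next character.
   Returns bytes used, -1 for bad UTF-8, -2 for a value above 0x10ffff
   in UTF mode. */
static int decode_wide32(const uint8_t *p, size_t len, bool utf, uint32_t *out)
{
  uint32_t c, topbit = 0;
  int used = 0, chlen;

  assert(p != NULL && out != NULL);

  if (!utf && len > 1 && p[0] == 0xff)
    {
    topbit = 0x80000000u;
    p++;
    len--;
    used = 1;
    }

  chlen = decode_rfc2279(p, len, &c);
  if (chlen <= 0) return -1;
  if (utf && c > 0x10ffff) return -2;

  *out = c | topbit;
  return used + chlen;
}
```

Under these assumptions, the bytes 0xff 0x41 decode to 0x80000041 in non-UTF mode but are rejected as invalid UTF-8 in UTF mode, and 0xfd followed by five 0xbf bytes decodes to 0x7fffffff in non-UTF mode but returns -2 in UTF mode.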
||||||
|
@ -3627,7 +3656,7 @@ Returns: nothing
|
||||||
static void
|
static void
|
||||||
show_controls(uint32_t controls, uint32_t controls2, const char *before)
|
show_controls(uint32_t controls, uint32_t controls2, const char *before)
|
||||||
{
|
{
|
||||||
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||||
before,
|
before,
|
||||||
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
|
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
|
||||||
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
|
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
|
||||||
|
@ -3662,6 +3691,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
|
||||||
((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
|
((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
|
||||||
((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
|
((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
|
||||||
((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
|
((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
|
||||||
|
((controls & CTL_UTF8_INPUT) != 0)? " utf8_input" : "",
|
||||||
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
|
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -3759,13 +3789,13 @@ warning we must initialize cblock_size. */
|
||||||
|
|
||||||
cblock_size = 0;
|
cblock_size = 0;
|
||||||
#ifdef SUPPORT_PCRE2_8
|
#ifdef SUPPORT_PCRE2_8
|
||||||
if (test_mode == 8) cblock_size = sizeof(pcre2_real_code_8);
|
if (test_mode == PCRE8_MODE) cblock_size = sizeof(pcre2_real_code_8);
|
||||||
#endif
|
#endif
|
||||||
#ifdef SUPPORT_PCRE2_16
|
#ifdef SUPPORT_PCRE2_16
|
||||||
if (test_mode == 16) cblock_size = sizeof(pcre2_real_code_16);
|
if (test_mode == PCRE16_MODE) cblock_size = sizeof(pcre2_real_code_16);
|
||||||
#endif
|
#endif
|
||||||
#ifdef SUPPORT_PCRE2_32
|
#ifdef SUPPORT_PCRE2_32
|
||||||
if (test_mode == 32) cblock_size = sizeof(pcre2_real_code_32);
|
if (test_mode == PCRE32_MODE) cblock_size = sizeof(pcre2_real_code_32);
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
(void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
|
(void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
|
||||||
|
@ -4507,6 +4537,23 @@ patlen = p - buffer - 2;
|
||||||
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
|
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
|
||||||
utf = (pat_patctl.options & PCRE2_UTF) != 0;
|
utf = (pat_patctl.options & PCRE2_UTF) != 0;
|
||||||
|
|
||||||
|
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
|
||||||
|
exclusive with the utf modifier. */
|
||||||
|
|
||||||
|
if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
|
||||||
|
{
|
||||||
|
if (test_mode == PCRE8_MODE)
|
||||||
|
{
|
||||||
|
fprintf(outfile, "** The utf8_input modifier is not allowed in 8-bit mode\n");
|
||||||
|
return PR_SKIP;
|
||||||
|
}
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
fprintf(outfile, "** The utf and utf8_input modifiers are mutually exclusive\n");
|
||||||
|
return PR_SKIP;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/* Check for mutually exclusive modifiers. At present, these are all in the
|
/* Check for mutually exclusive modifiers. At present, these are all in the
|
||||||
first control word. */
|
first control word. */
|
||||||
|
|
||||||
|
@ -4738,7 +4785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
|
||||||
const char *msg = "** Ignored with POSIX interface:";
|
const char *msg = "** Ignored with POSIX interface:";
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
if (test_mode != 8)
|
if (test_mode != PCRE8_MODE)
|
||||||
{
|
{
|
||||||
fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
|
fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
|
||||||
return PR_SKIP;
|
return PR_SKIP;
|
||||||
|
@ -5622,7 +5669,9 @@ if (dbuffer == NULL || needlen >= dbuffer_size)
|
||||||
SETCASTPTR(q, dbuffer); /* Sets q8, q16, or q32, as appropriate. */
|
SETCASTPTR(q, dbuffer); /* Sets q8, q16, or q32, as appropriate. */
|
||||||
|
|
||||||
/* Scan the data line, interpreting data escapes, and put the result into a
|
/* Scan the data line, interpreting data escapes, and put the result into a
|
||||||
buffer of the appropriate width. In UTF mode, input can be UTF-8. */
|
buffer of the appropriate width. In UTF mode, input is always UTF-8; otherwise,
|
||||||
|
in 16- and 32-bit modes, it can be forced to UTF-8 by the utf8_input modifier.
|
||||||
|
*/
|
||||||
|
|
||||||
while ((c = *p++) != 0)
|
while ((c = *p++) != 0)
|
||||||
{
|
{
|
||||||
|
@ -5691,11 +5740,20 @@ while ((c = *p++) != 0)
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Handle a non-escaped character */
|
/* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input
|
||||||
|
set, do the fudge for setting the top bit. */
|
||||||
|
|
||||||
if (c != '\\')
|
if (c != '\\')
|
||||||
{
|
{
|
||||||
if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
|
uint32_t topbit = 0;
|
||||||
|
if (test_mode == PCRE32_MODE && c == 0xff && *p != 0)
|
||||||
|
{
|
||||||
|
topbit = 0x80000000;
|
||||||
|
c = *p++;
|
||||||
|
}
|
||||||
|
if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) &&
|
||||||
|
HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
|
||||||
|
c |= topbit;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Handle backslash escapes */
|
/* Handle backslash escapes */
|
||||||
|
|
|
@ -353,4 +353,19 @@
|
||||||
|
|
||||||
/(*THEN:\[A]{65501})/expand
|
/(*THEN:\[A]{65501})/expand
|
||||||
|
|
||||||
|
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
|
||||||
|
# even though this test is run when UTF is not supported.
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf8_input
|
||||||
|
abý¿¿¿¿¿z
|
||||||
|
ab\x{7fffffff}z
|
||||||
|
|
||||||
|
/abÿý¿¿¿¿¿z/utf8_input
|
||||||
|
abÿý¿¿¿¿¿z
|
||||||
|
ab\x{ffffffff}z
|
||||||
|
|
||||||
|
/abÿAz/utf8_input
|
||||||
|
abÿAz
|
||||||
|
ab\x{80000041}z
|
||||||
|
|
||||||
# End of testinput11
|
# End of testinput11
|
||||||
|
|
|
@ -343,4 +343,8 @@
|
||||||
/./utf
|
/./utf
|
||||||
\x{110000}
|
\x{110000}
|
||||||
|
|
||||||
|
/(*UTF)abý¿¿¿¿¿z/B
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf
|
||||||
|
|
||||||
# End of testinput12
|
# End of testinput12
|
||||||
|
|
|
@ -643,4 +643,22 @@ Subject length lower bound = 1
|
||||||
|
|
||||||
/(*THEN:\[A]{65501})/expand
|
/(*THEN:\[A]{65501})/expand
|
||||||
|
|
||||||
|
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
|
||||||
|
# even though this test is run when UTF is not supported.
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf8_input
|
||||||
|
** Failed: character value greater than 0xffff cannot be converted to 16-bit in non-UTF mode
|
||||||
|
abý¿¿¿¿¿z
|
||||||
|
ab\x{7fffffff}z
|
||||||
|
|
||||||
|
/abÿý¿¿¿¿¿z/utf8_input
|
||||||
|
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
|
||||||
|
abÿý¿¿¿¿¿z
|
||||||
|
ab\x{ffffffff}z
|
||||||
|
|
||||||
|
/abÿAz/utf8_input
|
||||||
|
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
|
||||||
|
abÿAz
|
||||||
|
ab\x{80000041}z
|
||||||
|
|
||||||
# End of testinput11
|
# End of testinput11
|
||||||
|
|
|
@ -646,4 +646,25 @@ Subject length lower bound = 1
|
||||||
|
|
||||||
/(*THEN:\[A]{65501})/expand
|
/(*THEN:\[A]{65501})/expand
|
||||||
|
|
||||||
|
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
|
||||||
|
# even though this test is run when UTF is not supported.
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf8_input
|
||||||
|
abý¿¿¿¿¿z
|
||||||
|
0: ab\x{7fffffff}z
|
||||||
|
ab\x{7fffffff}z
|
||||||
|
0: ab\x{7fffffff}z
|
||||||
|
|
||||||
|
/abÿý¿¿¿¿¿z/utf8_input
|
||||||
|
abÿý¿¿¿¿¿z
|
||||||
|
0: ab\x{ffffffff}z
|
||||||
|
ab\x{ffffffff}z
|
||||||
|
0: ab\x{ffffffff}z
|
||||||
|
|
||||||
|
/abÿAz/utf8_input
|
||||||
|
abÿAz
|
||||||
|
0: ab\x{80000041}z
|
||||||
|
ab\x{80000041}z
|
||||||
|
0: ab\x{80000041}z
|
||||||
|
|
||||||
# End of testinput11
|
# End of testinput11
|
||||||
|
|
|
@ -1367,4 +1367,15 @@ Subject length lower bound = 2
|
||||||
\x{110000}
|
\x{110000}
|
||||||
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
|
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
|
||||||
|
|
||||||
|
/(*UTF)abý¿¿¿¿¿z/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf
|
||||||
|
** Failed: character value greater than 0x10ffff cannot be converted to UTF
|
||||||
|
|
||||||
# End of testinput12
|
# End of testinput12
|
||||||
|
|
|
@ -1361,4 +1361,15 @@ Subject length lower bound = 2
|
||||||
\x{110000}
|
\x{110000}
|
||||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
|
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
|
||||||
|
|
||||||
|
/(*UTF)abý¿¿¿¿¿z/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf
|
||||||
|
** Failed: character value greater than 0x10ffff cannot be converted to UTF
|
||||||
|
|
||||||
# End of testinput12
|
# End of testinput12
|
||||||
|
|