Update pcre2test with the /utf8_input option, for generating wide characters in
non-UTF 16-bit and 32-bit modes.
parent 5b6c797a4d
commit 69c9d81e43
@@ -2,6 +2,13 @@ Change Log for PCRE2
 --------------------
 
 
+Version 10.23 xx-xxxxxx-2016
+----------------------------
+
+1. Extended pcre2test with the utf8_input modifier so that it is able to
+generate all possible 16-bit and 32-bit code unit values in non-UTF modes.
+
+
 Version 10.22 29-July-2016
 --------------------------
 
@@ -9,9 +9,9 @@ dnl The PCRE2_PRERELEASE feature is for identifying release candidates. It might
 dnl be defined as -RC2, for example. For real releases, it should be empty.
 
 m4_define(pcre2_major, [10])
-m4_define(pcre2_minor, [22])
-m4_define(pcre2_prerelease, [])
-m4_define(pcre2_date, [2016-07-29])
+m4_define(pcre2_minor, [23])
+m4_define(pcre2_prerelease, [-RC1])
+m4_define(pcre2_date, [2016-08-01])
 
 # NOTE: The CMakeLists.txt file searches for the above variables in the first
 # 50 lines of this file. Please update that if the variables above are moved.
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.
 <P>
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original <b>pcretest</b> program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
@@ -77,32 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 <b>pcre2test</b> program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
 </P>
 <P>
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, <b>pcre_compile()</b>. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
-</P>
+<a name="inputencoding"></a></P>
 <br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
 <P>
 Input to <b>pcre2test</b> is processed line by line, either by calling the C
-library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
 </P>
 <P>
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in <b>pcre2test</b> input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using using C's string functions, so must not
+contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. Subject lines are processed for backslash
+escapes, which makes it possible to include any data value in strings that are
+passed to the library for matching. For patterns, there is a facility for
+specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+</P>
+<br><b>
+Input for the 16-bit and 32-bit libraries
+</b><br>
+<P>
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the <b>utf</b> modifier (see
+<a href="#optionmodifiers">"Setting compilation options"</a>
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
+</P>
+<P>
+For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
+used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+</P>
+<P>
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
+values can be handled by the 32-bit library. When testing this library in
+non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
+byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
+character's value. This is the only way of passing such code points in a
+pattern string. For subject strings, using an escape sequence is preferable.
 </P>
 <br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
 <P>
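The utf8_input decoding rules in the added documentation (RFC 2279 sequences of up to six bytes, the 0xff prefix that adds 0x80000000, and the 0xffff limit in 16-bit mode) can be sketched in C as follows. This is only an illustration under stated assumptions, not code taken from pcre2test: the names decode_utf8_input and fits_code_unit are hypothetical, and the sketch does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character from RFC 2279 style UTF-8 (1 to 6 bytes, values up
   to 0x7fffffff). A leading 0xff byte adds 0x80000000 to the value that
   follows, as the documentation describes for the 32-bit library.
   Returns the number of bytes consumed, or 0 if the input is malformed or
   truncated. Hypothetical helper; not the pcre2test implementation. */
static size_t decode_utf8_input(const unsigned char *p, size_t len,
                                uint32_t *value)
{
    size_t used = 0;
    uint32_t high = 0;

    if (len > 0 && p[0] == 0xff) {        /* extension prefix */
        high = 0x80000000u;
        p++;
        len--;
        used = 1;
    }
    if (len == 0) return 0;

    unsigned char c = p[0];
    size_t extra;                          /* number of continuation bytes */
    uint32_t v;
    if (c < 0x80)      { extra = 0; v = c; }
    else if (c < 0xc0) return 0;           /* stray continuation byte */
    else if (c < 0xe0) { extra = 1; v = c & 0x1f; }
    else if (c < 0xf0) { extra = 2; v = c & 0x0f; }
    else if (c < 0xf8) { extra = 3; v = c & 0x07; }
    else if (c < 0xfc) { extra = 4; v = c & 0x03; }
    else if (c < 0xfe) { extra = 5; v = c & 0x01; }
    else return 0;                         /* 0xfe, or a second 0xff */

    if (len < extra + 1) return 0;         /* truncated sequence */
    for (size_t i = 1; i <= extra; i++) {
        if ((p[i] & 0xc0) != 0x80) return 0;
        v = (v << 6) | (uint32_t)(p[i] & 0x3f);
    }
    *value = high | v;                     /* v <= 0x7fffffff, so OR adds */
    return used + extra + 1;
}

/* Returns 1 if the value fits in a single code unit of the given width
   (16 or 32 bits), 0 otherwise. In 16-bit mode values above 0xffff are
   the error case mentioned in the documentation. */
static int fits_code_unit(uint32_t value, int width)
{
    return width == 32 || (width == 16 && value <= 0xffff);
}
```

A caller would decode the input line character by character, advancing by the returned byte count, and report an error when the count is 0 or when fits_code_unit() returns 0 for the library width under test.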
@@ -553,7 +582,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
 non-printing characters in output strings to be printed using the \x{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
 <a name="controlmodifiers"></a></P>
 <br><b>
 Setting compilation controls
@@ -584,6 +615,7 @@ about the pattern:
 pushcopy push a copy onto the stack
 stackguard=<number> test the stackguard feature
 tables=[0|1|2] select internal tables
+utf8_input treat input as UTF-8
 </pre>
 The effects of these modifiers are described in the following sections.
 </P>
@@ -684,7 +716,8 @@ nine characters, only two of which are specified in hexadecimal:
 /ab "literal" 32/hex
 </pre>
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
+mutually exclusive.
 </P>
 <P>
 By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
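To make the hex example in this hunk concrete, here is a small C sketch that turns text such as ab "literal" 32 into the nine bytes the documentation mentions. It is a rough illustration, not pcre2test's own parser: the function names are hypothetical, and the handling of white space between hex pairs is an assumption read off the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Convert hex-modifier style text to bytes: pairs of hex digits become one
   byte each, text inside single or double quotes is copied literally, and
   white space between items is skipped (an assumption). Returns the number
   of bytes written, or -1 on a syntax error or if out is too small.
   Hypothetical helper; not the pcre2test implementation. */
static long parse_hex_pattern(const char *text, unsigned char *out,
                              size_t outsize)
{
    size_t n = 0;
    const char *p = text;

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) { p++; continue; }

        if (*p == '\'' || *p == '"') {
            char quote = *p++;
            while (*p != '\0' && *p != quote) {
                if (n >= outsize) return -1;
                out[n++] = (unsigned char)*p++;
            }
            if (*p != quote) return -1;    /* unterminated literal */
            p++;
            continue;
        }

        int hi = hex_value((unsigned char)p[0]);
        int lo = hi < 0 ? -1 : hex_value((unsigned char)p[1]);
        if (lo < 0 || n >= outsize) return -1;
        out[n++] = (unsigned char)(hi * 16 + lo);
        p += 2;
    }
    return (long)n;
}
```

Because the result carries an explicit length, it may contain binary zeros, which is why the documentation says the actual length is passed for patterns that use the hex modifier.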
@@ -693,6 +726,19 @@ patterns specified with the <b>hex</b> modifier, the actual length of the
 pattern is passed.
 </P>
 <br><b>
+Specifying wide characters in 16-bit and 32-bit modes
+</b><br>
+<P>
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
+can be used. It is mutually exclusive with <b>utf</b>. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+<a href="#inputencoding">"Input encoding"</a>
+above.
+</P>
+<br><b>
 Generating long repetitive patterns
 </b><br>
 <P>
@@ -708,7 +754,8 @@ are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
 example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
 cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
 by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
+mutually exclusive.
 </P>
 <P>
 If part of an expanded pattern looks like an expansion, but is really part of
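The expansion rule quoted in this hunk (\[text]{count} is replaced by count copies of text, and a "\[" with no later "]{digits}" is left alone) can be sketched in C as below. This is an illustration under assumptions, not pcre2test's code: expand_pattern is a hypothetical name, the sketch takes the first "]{digits}" it finds after "\[", and it ignores the two-value form that the documentation mentions for suppressing unwanted expansion.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of src in which every \[text]{count} is
   replaced by count copies of text. A "\[" that is not followed later by
   "]{", decimal digits and "}" is copied unchanged. Returns NULL if memory
   runs out or the expanded size overflows. The caller frees the result.
   Hypothetical helper; not the pcre2test implementation. */
static char *expand_pattern(const char *src)
{
    size_t cap = strlen(src) + 1;
    size_t len = 0;
    char *out = malloc(cap);
    if (out == NULL) return NULL;

    while (*src != '\0') {
        const char *piece = src;           /* default: copy one character */
        size_t piece_len = 1;
        unsigned long count = 1;
        const char *next = src + 1;

        if (src[0] == '\\' && src[1] == '[') {
            /* Look for "]{" followed by digits and "}". */
            const char *close = strstr(src + 2, "]{");
            while (close != NULL) {
                const char *d = close + 2;
                while (isdigit((unsigned char)*d)) d++;
                if (d > close + 2 && *d == '}') break;
                close = strstr(close + 1, "]{");
            }
            if (close != NULL) {
                char *end;
                piece = src + 2;
                piece_len = (size_t)(close - piece);
                count = strtoul(close + 2, &end, 10);
                next = end + 1;            /* skip the closing brace */
            }
        }

        size_t add = piece_len * count;
        if (count != 0 && add / count != piece_len) { free(out); return NULL; }
        if (len + add + 1 > cap) {
            /* Grow enough for this piece plus the unexpanded remainder. */
            cap = len + add + 1 + strlen(next);
            char *bigger = realloc(out, cap);
            if (bigger == NULL) { free(out); return NULL; }
            out = bigger;
        }
        for (unsigned long i = 0; i < count; i++) {
            memcpy(out + len, piece, piece_len);
            len += piece_len;
        }
        src = next;
    }
    out[len] = '\0';
    return out;
}
```

Doing the expansion on the raw text before compilation keeps the compiled pattern identical to what a hand-written 12000-character pattern would produce for the documentation's \[AB]{6000} example.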
@@ -1706,7 +1753,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 06 July 2016
+Last updated: 02 August 2016
 <br>
 Copyright © 1997-2016 University of Cambridge.
 <br>
@ -169,8 +169,8 @@ REVISION
|
||||||
Last updated: 16 October 2015
|
Last updated: 16 October 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2API(3) Library Functions Manual PCRE2API(3)
|
PCRE2API(3) Library Functions Manual PCRE2API(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -3154,8 +3154,8 @@ REVISION
|
||||||
Last updated: 17 June 2016
|
Last updated: 17 June 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -3647,8 +3647,8 @@ REVISION
|
||||||
Last updated: 01 April 2016
|
Last updated: 01 April 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4011,8 +4011,8 @@ REVISION
|
||||||
Last updated: 23 March 2015
|
Last updated: 23 March 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4196,8 +4196,8 @@ REVISION
|
||||||
Last updated: 15 March 2015
|
Last updated: 15 March 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4593,8 +4593,8 @@ REVISION
|
||||||
Last updated: 05 June 2016
|
Last updated: 05 June 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4671,8 +4671,8 @@ REVISION
|
||||||
Last updated: 05 November 2015
|
Last updated: 05 November 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4890,8 +4890,8 @@ REVISION
|
||||||
Last updated: 29 September 2014
|
Last updated: 29 September 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -5330,8 +5330,8 @@ REVISION
|
||||||
Last updated: 22 December 2014
|
Last updated: 22 December 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
|
PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -8370,8 +8370,8 @@ REVISION
|
||||||
Last updated: 20 June 2016
|
Last updated: 20 June 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
|
PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -8543,8 +8543,8 @@ REVISION
|
||||||
Last updated: 02 January 2015
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
|
PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -8819,8 +8819,8 @@ REVISION
|
||||||
Last updated: 31 January 2016
|
Last updated: 31 January 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
|
PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -9085,8 +9085,8 @@ REVISION
|
||||||
Last updated: 24 May 2016
|
Last updated: 24 May 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2STACK(3) Library Functions Manual PCRE2STACK(3)
|
PCRE2STACK(3) Library Functions Manual PCRE2STACK(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -9251,8 +9251,8 @@ REVISION
|
||||||
Last updated: 21 November 2014
|
Last updated: 21 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
|
PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -9687,8 +9687,8 @@ REVISION
|
||||||
Last updated: 16 October 2015
|
Last updated: 16 October 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
|
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -9930,5 +9930,5 @@ REVISION
|
||||||
Last updated: 03 July 2016
|
Last updated: 03 July 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
+.TH PCRE2TEST 1 "02 August 2016" "PCRE 10.23"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -29,7 +29,7 @@ subject is processed, and what output is produced.
 .P
 As the original fairly simple PCRE library evolved, it acquired many different
 features, and as a result, the original \fBpcretest\fP program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
 move to the new PCRE2 API provided an opportunity to re-implement the test
 program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
 are still many obscure modifiers, some of which are specifically designed for
@@ -47,32 +47,63 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
 all three of these libraries may be simultaneously installed. The
 \fBpcre2test\fP program can be used to test all the libraries. However, its own
 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
 .P
 In the rest of this document, the names of library functions and structures
 are given in generic form, for example, \fBpcre_compile()\fP. The actual
 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
 .
 .
+.\" HTML <a name="inputencoding"></a>
 .SH "INPUT ENCODING"
 .rs
 .sp
 Input to \fBpcre2test\fP is processed line by line, either by calling the C
-library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
 .P
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in \fBpcre2test\fP input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using using C's string functions, so must not
+contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. Subject lines are processed for backslash
+escapes, which makes it possible to include any data value in strings that are
+passed to the library for matching. For patterns, there is a facility for
+specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+.
+.
+.SS "Input for the 16-bit and 32-bit libraries"
+.rs
+.sp
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the \fButf\fP modifier (see
+.\" HTML <a href="#optionmodifiers">
+.\" </a>
+"Setting compilation options"
+.\"
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
+.P
+For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
+used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+.P
+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
+values can be handled by the 32-bit library. When testing this library in
+non-UTF mode with \fButf8_input\fP set, if any character is preceded by the
+byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
+character's value. This is the only way of passing such code points in a
+pattern string. For subject strings, using an escape sequence is preferable.
 .
 .
 .SH "COMMAND LINE OPTIONS"
@@ -515,7 +546,9 @@ for a description of their effects.
 As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
 non-printing characters in output strings to be printed using the \ex{hh...}
 notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
 .
 .
 .\" HTML <a name="controlmodifiers"></a>
@@ -547,6 +580,7 @@ about the pattern:
 pushcopy push a copy onto the stack
 stackguard=<number> test the stackguard feature
 tables=[0|1|2] select internal tables
+utf8_input treat input as UTF-8
 .sp
 The effects of these modifiers are described in the following sections.
 .
@@ -642,7 +676,8 @@ nine characters, only two of which are specified in hexadecimal:
 /ab "literal" 32/hex
 .sp
 Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
+mutually exclusive.
 .P
 By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
 \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
@@ -650,6 +685,22 @@ patterns specified with the \fBhex\fP modifier, the actual length of the
 pattern is passed.
 .
 .
+.SS "Specifying wide characters in 16-bit and 32-bit modes"
+.rs
+.sp
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
+can be used. It is mutually exclusive with \fButf\fP. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+.\" HTML <a href="#inputencoding">
+.\" </a>
+"Input encoding"
+.\"
+above.
+.
+.
 .SS "Generating long repetitive patterns"
 .rs
 .sp
@ -665,7 +716,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
|
||||||
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||||
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
|
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
|
||||||
by decimal digits and "}" is found later in the pattern. If not, the characters
|
by decimal digits and "}" is found later in the pattern. If not, the characters
|
||||||
remain in the pattern unaltered.
|
remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
|
||||||
|
mutually exclusive.
|
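The expansion rule can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the pcre2test implementation: `expand_pattern` is a hypothetical name, the construct is written in its rendered form (a backslash followed by "[AB]{6000}"), and the characters are taken to end at the first "]{digits}" that follows.

```python
import re

# A backslash, "[", the characters to repeat, then "]{" digits "}".
# If the closing "]{digits}" is absent, nothing matches and the text is kept.
_EXPAND = re.compile(r"\\\[(.*?)\]\{(\d+)\}", re.DOTALL)


def expand_pattern(pattern):
    """Replace each \\[chars]{count} item by chars repeated count times."""
    return _EXPAND.sub(lambda m: m.group(1) * int(m.group(2)), pattern)


print(expand_pattern(r"x\[AB]{3}y"))
print(len(expand_pattern(r"\[AB]{6000}")))
```

The sketch does no nesting, in line with the restriction stated above.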
||||||
.P
|
.P
|
||||||
If part of an expanded pattern looks like an expansion, but is really part of
|
If part of an expanded pattern looks like an expansion, but is really part of
|
||||||
the actual pattern, unwanted expansion can be avoided by giving two values in
|
the actual pattern, unwanted expansion can be avoided by giving two values in
|
||||||
|
@ -1682,6 +1734,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 06 July 2016
|
Last updated: 02 August 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -26,7 +26,7 @@ SYNOPSIS
|
||||||
|
|
||||||
As the original fairly simple PCRE library evolved, it acquired many
|
As the original fairly simple PCRE library evolved, it acquired many
|
||||||
different features, and as a result, the original pcretest program
|
different features, and as a result, the original pcretest program
|
||||||
ended up with a lot of options in a messy, arcane syntax, for testing
|
ended up with a lot of options in a messy, arcane syntax for testing
|
||||||
all the features. The move to the new PCRE2 API provided an opportunity
|
all the features. The move to the new PCRE2 API provided an opportunity
|
||||||
to re-implement the test program as pcre2test, with a cleaner modifier
|
to re-implement the test program as pcre2test, with a cleaner modifier
|
||||||
syntax. Nevertheless, there are still many obscure modifiers, some of
|
syntax. Nevertheless, there are still many obscure modifiers, some of
|
||||||
|
@ -45,7 +45,7 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
||||||
installed. The pcre2test program can be used to test all the libraries.
|
installed. The pcre2test program can be used to test all the libraries.
|
||||||
However, its own input and output are always in 8-bit format. When
|
However, its own input and output are always in 8-bit format. When
|
||||||
testing the 16-bit or 32-bit libraries, patterns and subject strings
|
testing the 16-bit or 32-bit libraries, patterns and subject strings
|
||||||
are converted to 16- or 32-bit format before being passed to the
|
are converted to 16-bit or 32-bit format before being passed to the
|
||||||
library functions. Results are converted back to 8-bit code units for
|
library functions. Results are converted back to 8-bit code units for
|
||||||
output.
|
output.
|
||||||
|
|
||||||
|
@ -58,49 +58,76 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
||||||
INPUT ENCODING
|
INPUT ENCODING
|
||||||
|
|
||||||
Input to pcre2test is processed line by line, either by calling the C
|
Input to pcre2test is processed line by line, either by calling the C
|
||||||
library's fgets() function, or via the libreadline library (see below).
|
library's fgets() function, or via the libreadline library. In some
|
||||||
|
Windows environments character 26 (hex 1A) causes an immediate end of
|
||||||
|
file, and no further data is read, so this character should be avoided
|
||||||
|
unless you really want that action.
|
||||||
|
|
||||||
The input is processed using C's string functions, so must not
|
The input is processed using C's string functions, so must not
|
||||||
contain binary zeroes, even though in Unix-like environments, fgets()
|
contain binary zeroes, even though in Unix-like environments, fgets()
|
||||||
treats any bytes other than newline as data characters. In some Windows
|
treats any bytes other than newline as data characters. An error is
|
||||||
environments character 26 (hex 1A) causes an immediate end of file, and
|
generated if a binary zero is encountered. Subject lines are processed
|
||||||
no further data is read.
|
for backslash escapes, which makes it possible to include any data
|
||||||
|
value in strings that are passed to the library for matching. For pat-
|
||||||
|
terns, there is a facility for specifying some or all of the 8-bit
|
||||||
|
input characters as hexadecimal pairs, which makes it possible to
|
||||||
|
include binary zeros.
|
||||||
|
|
||||||
For maximum portability, therefore, it is safest to avoid non-printing
|
Input for the 16-bit and 32-bit libraries
|
||||||
characters in pcre2test input files. There is a facility for specifying
|
|
||||||
some or all of a pattern's characters as hexadecimal pairs, thus making
|
When testing the 16-bit or 32-bit libraries, there is a need to be able
|
||||||
it possible to include binary zeroes in a pattern for testing purposes.
|
to generate character code points greater than 255 in the strings that
|
||||||
Subject lines are processed for backslash escapes, which makes it pos-
|
are passed to the library. For subject lines, backslash escapes can be
|
||||||
sible to include any data value.
|
used. In addition, when the utf modifier (see "Setting compilation
|
||||||
|
options" below) is set, the pattern and any following subject lines are
|
||||||
|
interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
|
||||||
|
appropriate.
|
||||||
|
|
||||||
|
For non-UTF testing of wide characters, the utf8_input modifier can be
|
||||||
|
used. This is mutually exclusive with utf, and is allowed only in
|
||||||
|
16-bit or 32-bit mode. It causes the pattern and following subject
|
||||||
|
lines to be treated as UTF-8 according to the original definition (RFC
|
||||||
|
2279), which allows for character values up to 0x7fffffff. Each charac-
|
||||||
|
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
|
||||||
|
values greater than 0xffff cause an error to occur).
|
||||||
|
|
||||||
|
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
|
||||||
|
such values can be handled by the 32-bit library. When testing this
|
||||||
|
library in non-UTF mode with utf8_input set, if any character is pre-
|
||||||
|
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
|
||||||
|
is added to the character's value. This is the only way of passing such
|
||||||
|
code points in a pattern string. For subject strings, using an escape
|
||||||
|
sequence is preferable.
|
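The utf8_input rules described above can be sketched in Python as follows. This is an illustration only, not the code used by pcre2test: `decode_utf8_input` and `_LEAD_FORMS` are hypothetical names, and the sketch assumes that no check is made for overlong encodings or surrogates.

```python
# (mask, marker, number of continuation bytes) for each RFC 2279 lead byte form.
_LEAD_FORMS = [
    (0x80, 0x00, 0),
    (0xE0, 0xC0, 1),
    (0xF0, 0xE0, 2),
    (0xF8, 0xF0, 3),
    (0xFC, 0xF8, 4),
    (0xFE, 0xFC, 5),
]


def decode_utf8_input(data, width):
    """Turn RFC 2279 style UTF-8 bytes into one code unit per character."""
    if width not in (16, 32):
        raise ValueError("utf8_input is allowed only in 16-bit or 32-bit mode")
    units = []
    i = 0
    while i < len(data):
        offset = 0
        if data[i] == 0xFF:
            # 0xff is illegal in UTF-8; here it flags "add 0x80000000".
            if width != 32:
                raise ValueError("0xff prefix applies only to the 32-bit library")
            offset = 0x80000000
            i += 1
            if i == len(data):
                raise ValueError("0xff prefix at end of input")
        lead = data[i]
        for mask, marker, extra in _LEAD_FORMS:
            if (lead & mask) == marker:
                value = lead & ~mask & 0xFF
                break
        else:
            raise ValueError("invalid lead byte 0x%02x" % lead)
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("invalid or missing continuation byte")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        value += offset
        if width == 16 and value > 0xFFFF:
            raise ValueError("value 0x%x does not fit in a 16-bit code unit" % value)
        units.append(value)
        i += 1 + extra
    return units


print([hex(u) for u in decode_utf8_input(b"\xc4\x80A", 16)])
```

The six-byte form is what allows values up to 0x7fffffff, and the 0xff prefix covers the remaining half of the 32-bit range.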
||||||
|
|
||||||
|
|
||||||
COMMAND LINE OPTIONS
|
COMMAND LINE OPTIONS
|
||||||
|
|
||||||
-8 If the 8-bit library has been built, this option causes it to
|
-8 If the 8-bit library has been built, this option causes it to
|
||||||
be used (this is the default). If the 8-bit library has not
|
be used (this is the default). If the 8-bit library has not
|
||||||
been built, this option causes an error.
|
been built, this option causes an error.
|
||||||
|
|
||||||
-16 If the 16-bit library has been built, this option causes it
|
-16 If the 16-bit library has been built, this option causes it
|
||||||
to be used. If only the 16-bit library has been built, this
|
to be used. If only the 16-bit library has been built, this
|
||||||
is the default. If the 16-bit library has not been built,
|
is the default. If the 16-bit library has not been built,
|
||||||
this option causes an error.
|
this option causes an error.
|
||||||
|
|
||||||
-32 If the 32-bit library has been built, this option causes it
|
-32 If the 32-bit library has been built, this option causes it
|
||||||
to be used. If only the 32-bit library has been built, this
|
to be used. If only the 32-bit library has been built, this
|
||||||
is the default. If the 32-bit library has not been built,
|
is the default. If the 32-bit library has not been built,
|
||||||
this option causes an error.
|
this option causes an error.
|
||||||
|
|
||||||
-b Behave as if each pattern has the /fullbincode modifier; the
|
-b Behave as if each pattern has the /fullbincode modifier; the
|
||||||
full internal binary form of the pattern is output after com-
|
full internal binary form of the pattern is output after com-
|
||||||
pilation.
|
pilation.
|
||||||
|
|
||||||
-C Output the version number of the PCRE2 library, and all
|
-C Output the version number of the PCRE2 library, and all
|
||||||
available information about the optional features that are
|
available information about the optional features that are
|
||||||
included, and then exit with zero exit code. All other
|
included, and then exit with zero exit code. All other
|
||||||
options are ignored.
|
options are ignored.
|
||||||
|
|
||||||
-C option Output information about a specific build-time option, then
|
-C option Output information about a specific build-time option, then
|
||||||
exit. This functionality is intended for use in scripts such
|
exit. This functionality is intended for use in scripts such
|
||||||
as RunTest. The following options output the value and set
|
as RunTest. The following options output the value and set
|
||||||
the exit code as indicated:
|
the exit code as indicated:
|
||||||
|
|
||||||
ebcdic-nl the code for LF (= NL) in an EBCDIC environment:
|
ebcdic-nl the code for LF (= NL) in an EBCDIC environment:
|
||||||
|
@ -116,7 +143,7 @@ COMMAND LINE OPTIONS
|
||||||
ANYCRLF or ANY
|
ANYCRLF or ANY
|
||||||
exit code is always 0
|
exit code is always 0
|
||||||
|
|
||||||
The following options output 1 for true or 0 for false, and
|
The following options output 1 for true or 0 for false, and
|
||||||
set the exit code to the same value:
|
set the exit code to the same value:
|
||||||
|
|
||||||
backslash-C \C is supported (not locked out)
|
backslash-C \C is supported (not locked out)
|
||||||
|
@ -127,22 +154,22 @@ COMMAND LINE OPTIONS
|
||||||
pcre2-8 the 8-bit library was built
|
pcre2-8 the 8-bit library was built
|
||||||
unicode Unicode support is available
|
unicode Unicode support is available
|
||||||
|
|
||||||
If an unknown option is given, an error message is output;
|
If an unknown option is given, an error message is output;
|
||||||
the exit code is 0.
|
the exit code is 0.
|
||||||
|
|
||||||
-d Behave as if each pattern has the debug modifier; the inter-
|
-d Behave as if each pattern has the debug modifier; the inter-
|
||||||
nal form and information about the compiled pattern is output
|
nal form and information about the compiled pattern is output
|
||||||
after compilation; -d is equivalent to -b -i.
|
after compilation; -d is equivalent to -b -i.
|
||||||
|
|
||||||
-dfa Behave as if each subject line has the dfa modifier; matching
|
-dfa Behave as if each subject line has the dfa modifier; matching
|
||||||
is done using the pcre2_dfa_match() function instead of the
|
is done using the pcre2_dfa_match() function instead of the
|
||||||
default pcre2_match().
|
default pcre2_match().
|
||||||
|
|
||||||
-error number[,number,...]
|
-error number[,number,...]
|
||||||
Call pcre2_get_error_message() for each of the error numbers
|
Call pcre2_get_error_message() for each of the error numbers
|
||||||
in the comma-separated list, display the resulting messages
|
in the comma-separated list, display the resulting messages
|
||||||
on the standard output, then exit with zero exit code. The
|
on the standard output, then exit with zero exit code. The
|
||||||
numbers may be positive or negative. This is a convenience
|
numbers may be positive or negative. This is a convenience
|
||||||
facility for PCRE2 maintainers.
|
facility for PCRE2 maintainers.
|
||||||
|
|
||||||
-help Output a brief summary of these options and then exit.
|
-help Output a brief summary of these options and then exit.
|
||||||
|
@ -150,8 +177,8 @@ COMMAND LINE OPTIONS
|
||||||
-i Behave as if each pattern has the /info modifier; information
|
-i Behave as if each pattern has the /info modifier; information
|
||||||
about the compiled pattern is given after compilation.
|
about the compiled pattern is given after compilation.
|
||||||
|
|
||||||
-jit Behave as if each pattern line has the jit modifier; after
|
-jit Behave as if each pattern line has the jit modifier; after
|
||||||
successful compilation, each pattern is passed to the just-
|
successful compilation, each pattern is passed to the just-
|
||||||
in-time compiler, if available.
|
in-time compiler, if available.
|
||||||
|
|
||||||
-pattern modifier-list
|
-pattern modifier-list
|
||||||
|
@ -160,25 +187,25 @@ COMMAND LINE OPTIONS
|
||||||
-q Do not output the version number of pcre2test at the start of
|
-q Do not output the version number of pcre2test at the start of
|
||||||
execution.
|
execution.
|
||||||
|
|
||||||
-S size On Unix-like systems, set the size of the run-time stack to
|
-S size On Unix-like systems, set the size of the run-time stack to
|
||||||
size megabytes.
|
size megabytes.
|
||||||
|
|
||||||
-subject modifier-list
|
-subject modifier-list
|
||||||
Behave as if each subject line contains the given modifiers.
|
Behave as if each subject line contains the given modifiers.
|
||||||
|
|
||||||
-t Run each compile and match many times with a timer, and out-
|
-t Run each compile and match many times with a timer, and out-
|
||||||
put the resulting times per compile or match. When JIT is
|
put the resulting times per compile or match. When JIT is
|
||||||
used, separate times are given for the initial compile and
|
used, separate times are given for the initial compile and
|
||||||
the JIT compile. You can control the number of iterations
|
the JIT compile. You can control the number of iterations
|
||||||
that are used for timing by following -t with a number (as a
|
that are used for timing by following -t with a number (as a
|
||||||
separate item on the command line). For example, "-t 1000"
|
separate item on the command line). For example, "-t 1000"
|
||||||
iterates 1000 times. The default is to iterate 500,000 times.
|
iterates 1000 times. The default is to iterate 500,000 times.
|
||||||
|
|
||||||
-tm This is like -t except that it times only the matching phase,
|
-tm This is like -t except that it times only the matching phase,
|
||||||
not the compile phase.
|
not the compile phase.
|
||||||
|
|
||||||
-T -TM These behave like -t and -tm, but in addition, at the end of
|
-T -TM These behave like -t and -tm, but in addition, at the end of
|
||||||
a run, the total times for all compiles and matches are out-
|
a run, the total times for all compiles and matches are out-
|
||||||
put.
|
put.
|
||||||
|
|
||||||
-version Output the PCRE2 version number and then exit.
|
-version Output the PCRE2 version number and then exit.
|
||||||
|
@ -186,139 +213,139 @@ COMMAND LINE OPTIONS
|
||||||
|
|
||||||
DESCRIPTION
|
DESCRIPTION
|
||||||
|
|
||||||
If pcre2test is given two filename arguments, it reads from the first
|
If pcre2test is given two filename arguments, it reads from the first
|
||||||
and writes to the second. If the first name is "-", input is taken from
|
and writes to the second. If the first name is "-", input is taken from
|
||||||
the standard input. If pcre2test is given only one argument, it reads
|
the standard input. If pcre2test is given only one argument, it reads
|
||||||
from that file and writes to stdout. Otherwise, it reads from stdin and
|
from that file and writes to stdout. Otherwise, it reads from stdin and
|
||||||
writes to stdout.
|
writes to stdout.
|
||||||
|
|
||||||
When pcre2test is built, a configuration option can specify that it
|
When pcre2test is built, a configuration option can specify that it
|
||||||
should be linked with the libreadline or libedit library. When this is
|
should be linked with the libreadline or libedit library. When this is
|
||||||
done, if the input is from a terminal, it is read using the readline()
|
done, if the input is from a terminal, it is read using the readline()
|
||||||
function. This provides line-editing and history facilities. The output
|
function. This provides line-editing and history facilities. The output
|
||||||
from the -help option states whether or not readline() will be used.
|
from the -help option states whether or not readline() will be used.
|
||||||
|
|
||||||
The program handles any number of tests, each of which consists of a
|
The program handles any number of tests, each of which consists of a
|
||||||
set of input lines. Each set starts with a regular expression pattern,
|
set of input lines. Each set starts with a regular expression pattern,
|
||||||
followed by any number of subject lines to be matched against that pat-
|
followed by any number of subject lines to be matched against that pat-
|
||||||
tern. In between sets of test data, command lines that begin with # may
|
tern. In between sets of test data, command lines that begin with # may
|
||||||
appear. This file format, with some restrictions, can also be processed
|
appear. This file format, with some restrictions, can also be processed
|
||||||
by the perltest.sh script that is distributed with PCRE2 as a means of
|
by the perltest.sh script that is distributed with PCRE2 as a means of
|
||||||
checking that the behaviour of PCRE2 and Perl is the same.
|
checking that the behaviour of PCRE2 and Perl is the same.
|
||||||
|
|
||||||
When the input is a terminal, pcre2test prompts for each line of input,
|
When the input is a terminal, pcre2test prompts for each line of input,
|
||||||
using "re>" to prompt for regular expression patterns, and "data>" to
|
using "re>" to prompt for regular expression patterns, and "data>" to
|
||||||
prompt for subject lines. Command lines starting with # can be entered
|
prompt for subject lines. Command lines starting with # can be entered
|
||||||
only in response to the "re>" prompt.
|
only in response to the "re>" prompt.
|
||||||
|
|
||||||
Each subject line is matched separately and independently. If you want
|
Each subject line is matched separately and independently. If you want
|
||||||
to do multi-line matches, you have to use the \n escape sequence (or \r
|
to do multi-line matches, you have to use the \n escape sequence (or \r
|
||||||
or \r\n, etc., depending on the newline setting) in a single line of
|
or \r\n, etc., depending on the newline setting) in a single line of
|
||||||
input to encode the newline sequences. There is no limit on the length
|
input to encode the newline sequences. There is no limit on the length
|
||||||
of subject lines; the input buffer is automatically extended if it is
|
of subject lines; the input buffer is automatically extended if it is
|
||||||
too small. There are replication features that make it possible to
|
too small. There are replication features that make it possible to
|
||||||
generate long repetitive pattern or subject lines without having to
|
generate long repetitive pattern or subject lines without having to
|
||||||
supply them explicitly.
|
supply them explicitly.
|
||||||
|
|
||||||
An empty line or the end of the file signals the end of the subject
|
An empty line or the end of the file signals the end of the subject
|
||||||
lines for a test, at which point a new pattern or command line is
|
lines for a test, at which point a new pattern or command line is
|
||||||
expected if there is still input to be read.
|
expected if there is still input to be read.
|
||||||
|
|
||||||
|
|
||||||
COMMAND LINES
|
COMMAND LINES
|
||||||
|
|
||||||
In between sets of test data, a line that begins with # is interpreted
|
In between sets of test data, a line that begins with # is interpreted
|
||||||
as a command line. If the first character is followed by white space or
|
as a command line. If the first character is followed by white space or
|
||||||
an exclamation mark, the line is treated as a comment, and ignored.
|
an exclamation mark, the line is treated as a comment, and ignored.
|
||||||
Otherwise, the following commands are recognized:
|
Otherwise, the following commands are recognized:
|
||||||
|
|
||||||
#forbid_utf
|
#forbid_utf
|
||||||
|
|
||||||
Subsequent patterns automatically have the PCRE2_NEVER_UTF and
|
Subsequent patterns automatically have the PCRE2_NEVER_UTF and
|
||||||
PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
|
PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
|
||||||
and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
|
and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
|
||||||
patterns. This command also forces an error if a subsequent pattern
|
patterns. This command also forces an error if a subsequent pattern
|
||||||
contains any occurrences of \P, \p, or \X, which are still supported
|
contains any occurrences of \P, \p, or \X, which are still supported
|
||||||
when PCRE2_UTF is not set, but which require Unicode property support
|
when PCRE2_UTF is not set, but which require Unicode property support
|
||||||
to be included in the library.
|
to be included in the library.
|
||||||
|
|
||||||
This is a trigger guard that is used in test files to ensure that UTF
|
This is a trigger guard that is used in test files to ensure that UTF
|
||||||
or Unicode property tests are not accidentally added to files that are
|
or Unicode property tests are not accidentally added to files that are
|
||||||
used when Unicode support is not included in the library. Setting
|
used when Unicode support is not included in the library. Setting
|
||||||
PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
|
PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
|
||||||
by the use of #pattern; the difference is that #forbid_utf cannot be
|
by the use of #pattern; the difference is that #forbid_utf cannot be
|
||||||
unset, and the automatic options are not displayed in pattern informa-
|
unset, and the automatic options are not displayed in pattern informa-
|
||||||
tion, to avoid cluttering up test output.
|
tion, to avoid cluttering up test output.
|
||||||
|
|
||||||
#load <filename>
|
#load <filename>
|
||||||
|
|
||||||
This command is used to load a set of precompiled patterns from a file,
|
This command is used to load a set of precompiled patterns from a file,
|
||||||
as described in the section entitled "Saving and restoring compiled
|
as described in the section entitled "Saving and restoring compiled
|
||||||
patterns" below.
|
patterns" below.
|
||||||
|
|
||||||
#newline_default [<newline-list>]
|
#newline_default [<newline-list>]
|
||||||
|
|
||||||
When PCRE2 is built, a default newline convention can be specified.
|
When PCRE2 is built, a default newline convention can be specified.
|
||||||
This determines which characters and/or character pairs are recognized
|
This determines which characters and/or character pairs are recognized
|
||||||
as indicating a newline in a pattern or subject string. The default can
|
as indicating a newline in a pattern or subject string. The default can
|
||||||
be overridden when a pattern is compiled. The standard test files con-
|
be overridden when a pattern is compiled. The standard test files con-
|
||||||
tain tests of various newline conventions, but the majority of the
|
tain tests of various newline conventions, but the majority of the
|
||||||
tests expect a single linefeed to be recognized as a newline by
|
tests expect a single linefeed to be recognized as a newline by
|
||||||
default. Without special action the tests would fail when PCRE2 is com-
|
default. Without special action the tests would fail when PCRE2 is com-
|
||||||
piled with either CR or CRLF as the default newline.
|
piled with either CR or CRLF as the default newline.
|
||||||
|
|
||||||
The #newline_default command specifies a list of newline types that are
|
The #newline_default command specifies a list of newline types that are
|
||||||
acceptable as the default. The types must be one of CR, LF, CRLF, ANY-
|
acceptable as the default. The types must be one of CR, LF, CRLF, ANY-
|
||||||
CRLF, or ANY (in upper or lower case), for example:
|
CRLF, or ANY (in upper or lower case), for example:
|
||||||
|
|
||||||
#newline_default LF Any anyCRLF
|
#newline_default LF Any anyCRLF
|
||||||
|
|
||||||
If the default newline is in the list, this command has no effect. Oth-
|
If the default newline is in the list, this command has no effect. Oth-
|
||||||
erwise, except when testing the POSIX API, a newline modifier that
|
erwise, except when testing the POSIX API, a newline modifier that
|
||||||
specifies the first newline convention in the list (LF in the above
|
specifies the first newline convention in the list (LF in the above
|
||||||
example) is added to any pattern that does not already have a newline
|
example) is added to any pattern that does not already have a newline
|
||||||
modifier. If the newline list is empty, the feature is turned off. This
|
modifier. If the newline list is empty, the feature is turned off. This
|
||||||
command is present in a number of the standard test input files.
|
command is present in a number of the standard test input files.
|
||||||
|
|
||||||
When the POSIX API is being tested there is no way to override the
|
When the POSIX API is being tested there is no way to override the
|
||||||
default newline convention, though it is possible to set the newline
|
default newline convention, though it is possible to set the newline
|
||||||
convention from within the pattern. A warning is given if the posix
|
convention from within the pattern. A warning is given if the posix
|
||||||
modifier is used when #newline_default would set a default for the non-
|
modifier is used when #newline_default would set a default for the non-
|
||||||
POSIX API.
|
POSIX API.
|
||||||
|
|
||||||
#pattern <modifier-list>
|
#pattern <modifier-list>
|
||||||
|
|
||||||
This command sets a default modifier list that applies to all subse-
|
This command sets a default modifier list that applies to all subse-
|
||||||
quent patterns. Modifiers on a pattern can change these settings.
|
quent patterns. Modifiers on a pattern can change these settings.
|
||||||
|
|
||||||
#perltest
|
#perltest
|
||||||
|
|
||||||
The appearance of this line causes all subsequent modifier settings to
|
The appearance of this line causes all subsequent modifier settings to
|
||||||
be checked for compatibility with the perltest.sh script, which is used
|
be checked for compatibility with the perltest.sh script, which is used
|
||||||
to confirm that Perl gives the same results as PCRE2. Also, apart from
|
to confirm that Perl gives the same results as PCRE2. Also, apart from
|
||||||
comment lines, none of the other command lines are permitted, because
|
comment lines, none of the other command lines are permitted, because
|
||||||
they and many of the modifiers are specific to pcre2test, and should
|
they and many of the modifiers are specific to pcre2test, and should
|
||||||
not be used in test files that are also processed by perltest.sh. The
|
not be used in test files that are also processed by perltest.sh. The
|
||||||
#perltest command helps detect tests that are accidentally put in the
|
#perltest command helps detect tests that are accidentally put in the
|
||||||
wrong file.
|
wrong file.
|
||||||
|
|
||||||
#pop [<modifiers>]
|
#pop [<modifiers>]
|
||||||
#popcopy [<modifiers>]
|
#popcopy [<modifiers>]
|
||||||
|
|
||||||
These commands are used to manipulate the stack of compiled patterns,
|
These commands are used to manipulate the stack of compiled patterns,
|
||||||
as described in the section entitled "Saving and restoring compiled
|
as described in the section entitled "Saving and restoring compiled
|
||||||
patterns" below.
|
patterns" below.
|
||||||
|
|
||||||
#save <filename>
|
#save <filename>
|
||||||
|
|
||||||
This command is used to save a set of compiled patterns to a file, as
|
This command is used to save a set of compiled patterns to a file, as
|
||||||
described in the section entitled "Saving and restoring compiled pat-
|
described in the section entitled "Saving and restoring compiled pat-
|
||||||
terns" below.
|
terns" below.
|
||||||
|
|
||||||
#subject <modifier-list>
|
#subject <modifier-list>
|
||||||
|
|
||||||
This command sets a default modifier list that applies to all subse-
|
This command sets a default modifier list that applies to all subse-
|
||||||
quent subject lines. Modifiers on a subject line can change these set-
|
quent subject lines. Modifiers on a subject line can change these set-
|
||||||
tings.
|
tings.
|
||||||
|
|
||||||
|
|
||||||
|
@ -326,58 +353,58 @@ MODIFIER SYNTAX
|
||||||
|
|
||||||
Modifier lists are used with both pattern and subject lines. Items in a
|
Modifier lists are used with both pattern and subject lines. Items in a
|
||||||
list are separated by commas followed by optional white space. Trailing
|
list are separated by commas followed by optional white space. Trailing
|
||||||
whitespace in a modifier list is ignored. Some modifiers may be given
|
whitespace in a modifier list is ignored. Some modifiers may be given
|
||||||
for both patterns and subject lines, whereas others are valid only for
|
for both patterns and subject lines, whereas others are valid only for
|
||||||
one or the other. Each modifier has a long name, for example
|
one or the other. Each modifier has a long name, for example
|
||||||
"anchored", and some of them must be followed by an equals sign and a
|
"anchored", and some of them must be followed by an equals sign and a
|
||||||
value, for example, "offset=12". Values cannot contain comma charac-
|
value, for example, "offset=12". Values cannot contain comma charac-
|
||||||
ters, but may contain spaces. Modifiers that do not take values may be
|
ters, but may contain spaces. Modifiers that do not take values may be
|
||||||
preceded by a minus sign to turn off a previous setting.
|
preceded by a minus sign to turn off a previous setting.
|
||||||
|
|
||||||
A few of the more common modifiers can also be specified as single let-
|
A few of the more common modifiers can also be specified as single let-
|
||||||
ters, for example "i" for "caseless". In documentation, following the
|
ters, for example "i" for "caseless". In documentation, following the
|
||||||
Perl convention, these are written with a slash ("the /i modifier") for
|
Perl convention, these are written with a slash ("the /i modifier") for
|
||||||
clarity. Abbreviated modifiers must all be concatenated in the first
|
clarity. Abbreviated modifiers must all be concatenated in the first
|
||||||
item of a modifier list. If the first item is not recognized as a long
|
item of a modifier list. If the first item is not recognized as a long
|
||||||
modifier name, it is interpreted as a sequence of these abbreviations.
|
modifier name, it is interpreted as a sequence of these abbreviations.
|
||||||
For example:
|
For example:
|
||||||
|
|
||||||
/abc/ig,newline=cr,jit=3
|
/abc/ig,newline=cr,jit=3
|
||||||
|
|
||||||
This is a pattern line whose modifier list starts with two one-letter
|
This is a pattern line whose modifier list starts with two one-letter
|
||||||
modifiers (/i and /g). The lower-case abbreviated modifiers are the
|
modifiers (/i and /g). The lower-case abbreviated modifiers are the
|
||||||
same as used in Perl.
|
same as used in Perl.
|
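The modifier list rules can be illustrated with a small Python sketch. It is not how pcre2test parses its input: `parse_modifiers` is a hypothetical name, and `LONG_NAMES` and `ABBREVIATIONS` are small illustrative subsets rather than the real tables.

```python
# Illustrative subsets only; the real program knows many more modifiers.
LONG_NAMES = {"anchored", "caseless", "global", "newline", "jit", "offset"}
ABBREVIATIONS = {"i": "caseless", "g": "global"}


def parse_modifiers(text):
    """Parse a modifier list such as 'ig,newline=cr,jit=3' into a dict."""
    settings = {}
    # Trailing white space is ignored; items are separated by commas that
    # may be followed by white space.
    items = [item.lstrip() for item in text.rstrip().split(",")]
    for index, item in enumerate(items):
        if not item:
            continue
        name, has_value, value = item.partition("=")
        # Only modifiers without a value may be turned off with a minus sign.
        negate = not has_value and name.startswith("-")
        key = name[1:] if negate else name
        if key in LONG_NAMES:
            settings[key] = value if has_value else not negate
        elif index == 0 and not has_value and all(c in ABBREVIATIONS for c in item):
            # A first item that is not a long name is a run of abbreviations.
            for c in item:
                settings[ABBREVIATIONS[c]] = True
        else:
            raise ValueError("unrecognized modifier: %r" % item)
    return settings


print(parse_modifiers("ig,newline=cr,jit=3"))
```

Values are kept as strings, since the sketch does not know which modifiers take numbers.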
||||||
|
|
||||||
|
|
||||||
PATTERN SYNTAX
|
PATTERN SYNTAX
|
||||||
|
|
||||||
A pattern line must start with one of the following characters (common
|
A pattern line must start with one of the following characters (common
|
||||||
symbols, excluding pattern meta-characters):
|
symbols, excluding pattern meta-characters):
|
||||||
|
|
||||||
/ ! " ' ` - = _ : ; , % & @ ~
|
/ ! " ' ` - = _ : ; , % & @ ~
|
||||||
|
|
||||||
This is interpreted as the pattern's delimiter. A regular expression
|
This is interpreted as the pattern's delimiter. A regular expression
|
||||||
may be continued over several input lines, in which case the newline
|
may be continued over several input lines, in which case the newline
|
||||||
characters are included within it. It is possible to include the delim-
|
characters are included within it. It is possible to include the delim-
|
||||||
iter within the pattern by escaping it with a backslash, for example
|
iter within the pattern by escaping it with a backslash, for example
|
||||||
|
|
||||||
/abc\/def/
|
/abc\/def/
|
||||||
|
|
||||||
If you do this, the escape and the delimiter form part of the pattern,
|
If you do this, the escape and the delimiter form part of the pattern,
|
||||||
but since the delimiters are all non-alphanumeric, this does not affect
|
but since the delimiters are all non-alphanumeric, this does not affect
|
||||||
its interpretation. If the terminating delimiter is immediately fol-
|
its interpretation. If the terminating delimiter is immediately fol-
|
||||||
lowed by a backslash, for example,
|
lowed by a backslash, for example,
|
||||||
|
|
||||||
/abc/\
|
/abc/\
|
||||||
|
|
||||||
then a backslash is added to the end of the pattern. This is done to
|
then a backslash is added to the end of the pattern. This is done to
|
||||||
provide a way of testing the error condition that arises if a pattern
|
provide a way of testing the error condition that arises if a pattern
|
||||||
finishes with a backslash, because
|
finishes with a backslash, because
|
||||||
|
|
||||||
/abc\/
|
/abc\/
|
||||||
|
|
||||||
is interpreted as the first line of a pattern that starts with "abc/",
|
is interpreted as the first line of a pattern that starts with "abc/",
|
||||||
causing pcre2test to read the next line as a continuation of the regu-
|
causing pcre2test to read the next line as a continuation of the regu-
|
||||||
lar expression.
|
lar expression.
|
||||||
|
|
||||||
A pattern can be followed by a modifier list (details below).
|
A pattern can be followed by a modifier list (details below).
|
||||||
|
@ -385,7 +412,7 @@ PATTERN SYNTAX
|
||||||
|
|
||||||
SUBJECT LINE SYNTAX
|
SUBJECT LINE SYNTAX
|
||||||
|
|
||||||
Before each subject line is passed to pcre2_match() or
|
Before each subject line is passed to pcre2_match() or
|
||||||
pcre2_dfa_match(), leading and trailing white space is removed, and the
|
pcre2_dfa_match(), leading and trailing white space is removed, and the
|
||||||
line is scanned for backslash escapes. The following provide a means of
|
line is scanned for backslash escapes. The following provide a means of
|
||||||
encoding non-printing characters in a visible way:
|
encoding non-printing characters in a visible way:
|
||||||
|
@ -405,23 +432,23 @@ SUBJECT LINE SYNTAX
|
||||||
\x{hh...} hexadecimal character (any number of hex digits)
|
\x{hh...} hexadecimal character (any number of hex digits)
|
||||||
|
|
||||||
The use of \x{hh...} is not dependent on the use of the utf modifier on
|
The use of \x{hh...} is not dependent on the use of the utf modifier on
|
||||||
the pattern. It is recognized always. There may be any number of hexa-
|
the pattern. It is recognized always. There may be any number of hexa-
|
||||||
decimal digits inside the braces; invalid values provoke error mes-
|
decimal digits inside the braces; invalid values provoke error mes-
|
||||||
sages.
|
sages.
|
||||||
|
|
||||||
Note that \xhh specifies one byte rather than one character in UTF-8
|
Note that \xhh specifies one byte rather than one character in UTF-8
|
||||||
mode; this makes it possible to construct invalid UTF-8 sequences for
|
mode; this makes it possible to construct invalid UTF-8 sequences for
|
||||||
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
|
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
|
||||||
character in UTF-8 mode, generating more than one byte if the value is
|
character in UTF-8 mode, generating more than one byte if the value is
|
||||||
greater than 127. When testing the 8-bit library not in UTF-8 mode,
|
greater than 127. When testing the 8-bit library not in UTF-8 mode,
|
||||||
\x{hh} generates one byte for values less than 256, and causes an error
|
\x{hh} generates one byte for values less than 256, and causes an error
|
||||||
for greater values.
|
for greater values.
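
The difference between \xhh (one byte) and \x{hh} (one UTF-8 character) can be illustrated with a small encoder. This is only a sketch: sketch_encode_utf8 is a hypothetical name, not a function in pcre2test, and the 0x10ffff upper limit is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not pcre2test's own code: encode one code point as
   UTF-8. Returns the number of bytes written (1 to 4), or 0 if cp is above
   0x10ffff (an assumed limit for this sketch). Surrogate values are not
   rejected, so invalid strings can still be built on purpose. */
static size_t sketch_encode_utf8(uint32_t cp, unsigned char out[4])
{
  assert(out != NULL);
  if (cp < 0x80) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

Under these assumptions, \x{e9} in UTF-8 mode corresponds to the two bytes C3 A9, whereas \xe9 remains the single byte E9.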
|
||||||
|
|
||||||
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
|
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
|
||||||
possible to construct invalid UTF-16 sequences for testing purposes.
|
possible to construct invalid UTF-16 sequences for testing purposes.
|
||||||
|
|
||||||
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
|
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
|
||||||
makes it possible to construct invalid UTF-32 sequences for testing
|
makes it possible to construct invalid UTF-32 sequences for testing
|
||||||
purposes.
|
purposes.
|
||||||
|
|
||||||
There is a special backslash sequence that specifies replication of one
|
There is a special backslash sequence that specifies replication of one
|
||||||
|
@ -429,45 +456,45 @@ SUBJECT LINE SYNTAX
|
||||||
|
|
||||||
\[<characters>]{<count>}
|
\[<characters>]{<count>}
|
||||||
|
|
||||||
This makes it possible to test long strings without having to provide
|
This makes it possible to test long strings without having to provide
|
||||||
them as part of the file. For example:
|
them as part of the file. For example:
|
||||||
|
|
||||||
\[abc]{4}
|
\[abc]{4}
|
||||||
|
|
||||||
is converted to "abcabcabcabc". This feature does not support nesting.
|
is converted to "abcabcabcabc". This feature does not support nesting.
|
||||||
To include a closing square bracket in the characters, code it as \x5D.
|
To include a closing square bracket in the characters, code it as \x5D.
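
The replication described above amounts to copying the bracketed characters a fixed number of times. A minimal sketch follows; sketch_replicate is a hypothetical helper for illustration and is not how pcre2test itself is written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write `count` copies of `chars` into `out` and add a
   terminating zero. Returns the expanded length, or 0 if `chars` is empty or
   the result (plus terminator) does not fit in `outsize` bytes. */
static size_t sketch_replicate(const char *chars, size_t count,
                               char *out, size_t outsize)
{
  size_t n, i;
  assert(chars != NULL && out != NULL);
  n = strlen(chars);
  if (outsize == 0 || n == 0 || count > (outsize - 1) / n) return 0;
  for (i = 0; i < count; i++) memcpy(out + i * n, chars, n);
  out[count * n] = '\0';
  return count * n;
}
```

Calling sketch_replicate("abc", 4, buf, sizeof buf) with a large enough buffer leaves "abcabcabcabc" in buf, matching the \[abc]{4} example.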
|
||||||
|
|
||||||
A backslash followed by an equals sign marks the end of the subject
|
A backslash followed by an equals sign marks the end of the subject
|
||||||
string and the start of a modifier list. For example:
|
string and the start of a modifier list. For example:
|
||||||
|
|
||||||
abc\=notbol,notempty
|
abc\=notbol,notempty
|
||||||
|
|
||||||
If the subject string is empty and \= is followed by whitespace, the
|
If the subject string is empty and \= is followed by whitespace, the
|
||||||
line is treated as a comment line, and is not used for matching. For
|
line is treated as a comment line, and is not used for matching. For
|
||||||
example:
|
example:
|
||||||
|
|
||||||
\= This is a comment.
|
\= This is a comment.
|
||||||
abc\= This is an invalid modifier list.
|
abc\= This is an invalid modifier list.
|
||||||
|
|
||||||
A backslash followed by any other non-alphanumeric character just
|
A backslash followed by any other non-alphanumeric character just
|
||||||
escapes that character. A backslash followed by anything else causes an
|
escapes that character. A backslash followed by anything else causes an
|
||||||
error. However, if the very last character in the line is a backslash
|
error. However, if the very last character in the line is a backslash
|
||||||
(and there is no modifier list), it is ignored. This gives a way of
|
(and there is no modifier list), it is ignored. This gives a way of
|
||||||
passing an empty line as data, since a real empty line terminates the
|
passing an empty line as data, since a real empty line terminates the
|
||||||
data input.
|
data input.
|
||||||
|
|
||||||
|
|
||||||
PATTERN MODIFIERS
|
PATTERN MODIFIERS
|
||||||
|
|
||||||
There are several types of modifier that can appear in pattern lines.
|
There are several types of modifier that can appear in pattern lines.
|
||||||
Except where noted below, they may also be used in #pattern commands. A
|
Except where noted below, they may also be used in #pattern commands. A
|
||||||
pattern's modifier list can add to or override default modifiers that
|
pattern's modifier list can add to or override default modifiers that
|
||||||
were set by a previous #pattern command.
|
were set by a previous #pattern command.
|
||||||
|
|
||||||
Setting compilation options
|
Setting compilation options
|
||||||
|
|
||||||
The following modifiers set options for pcre2_compile(). The most com-
|
The following modifiers set options for pcre2_compile(). The most com-
|
||||||
mon ones have single-letter abbreviations. See pcre2api for a descrip-
|
mon ones have single-letter abbreviations. See pcre2api for a descrip-
|
||||||
tion of their effects.
|
tion of their effects.
|
||||||
|
|
||||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||||
|
@ -498,13 +525,15 @@ PATTERN MODIFIERS
|
||||||
utf set PCRE2_UTF
|
utf set PCRE2_UTF
|
||||||
|
|
||||||
As well as turning on the PCRE2_UTF option, the utf modifier causes all
|
As well as turning on the PCRE2_UTF option, the utf modifier causes all
|
||||||
non-printing characters in output strings to be printed using the
|
non-printing characters in output strings to be printed using the
|
||||||
\x{hh...} notation. Otherwise, those less than 0x100 are output in hex
|
\x{hh...} notation. Otherwise, those less than 0x100 are output in hex
|
||||||
without the curly brackets.
|
without the curly brackets. Setting utf in 16-bit or 32-bit mode also
|
||||||
|
causes pattern and subject strings to be translated to UTF-16 or
|
||||||
|
UTF-32, respectively, before being passed to library functions.
|
||||||
|
|
||||||
Setting compilation controls
|
Setting compilation controls
|
||||||
|
|
||||||
The following modifiers affect the compilation process or request
|
The following modifiers affect the compilation process or request
|
||||||
information about the pattern:
|
information about the pattern:
|
||||||
|
|
||||||
bsr=[anycrlf|unicode] specify \R handling
|
bsr=[anycrlf|unicode] specify \R handling
|
||||||
|
@ -529,39 +558,40 @@ PATTERN MODIFIERS
|
||||||
pushcopy push a copy onto the stack
|
pushcopy push a copy onto the stack
|
||||||
stackguard=<number> test the stackguard feature
|
stackguard=<number> test the stackguard feature
|
||||||
tables=[0|1|2] select internal tables
|
tables=[0|1|2] select internal tables
|
||||||
|
utf8_input treat input as UTF-8
|
||||||
|
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
|
|
||||||
Newline and \R handling
|
Newline and \R handling
|
||||||
|
|
||||||
The bsr modifier specifies what \R in a pattern should match. If it is
|
The bsr modifier specifies what \R in a pattern should match. If it is
|
||||||
set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
|
set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
|
||||||
"unicode", \R matches any Unicode newline sequence. The default is
|
"unicode", \R matches any Unicode newline sequence. The default is
|
||||||
specified when PCRE2 is built, with the default default being Unicode.
|
specified when PCRE2 is built, with the default default being Unicode.
|
||||||
|
|
||||||
The newline modifier specifies which characters are to be interpreted
|
The newline modifier specifies which characters are to be interpreted
|
||||||
as newlines, both in the pattern and in subject lines. The type must be
|
as newlines, both in the pattern and in subject lines. The type must be
|
||||||
one of CR, LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
|
one of CR, LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
|
||||||
|
|
||||||
Information about a pattern
|
Information about a pattern
|
||||||
|
|
||||||
The debug modifier is a shorthand for info,fullbincode, requesting all
|
The debug modifier is a shorthand for info,fullbincode, requesting all
|
||||||
available information.
|
available information.
|
||||||
|
|
||||||
The bincode modifier causes a representation of the compiled code to be
|
The bincode modifier causes a representation of the compiled code to be
|
||||||
output after compilation. This information does not contain length and
|
output after compilation. This information does not contain length and
|
||||||
offset values, which ensures that the same output is generated for dif-
|
offset values, which ensures that the same output is generated for dif-
|
||||||
ferent internal link sizes and different code unit widths. By using
|
ferent internal link sizes and different code unit widths. By using
|
||||||
bincode, the same regression tests can be used in different environ-
|
bincode, the same regression tests can be used in different environ-
|
||||||
ments.
|
ments.
|
||||||
|
|
||||||
The fullbincode modifier, by contrast, does include length and offset
|
The fullbincode modifier, by contrast, does include length and offset
|
||||||
values. This is used in a few special tests that run only for specific
|
values. This is used in a few special tests that run only for specific
|
||||||
code unit widths and link sizes, and is also useful for one-off tests.
|
code unit widths and link sizes, and is also useful for one-off tests.
|
||||||
|
|
||||||
The info modifier requests information about the compiled pattern
|
The info modifier requests information about the compiled pattern
|
||||||
(whether it is anchored, has a fixed first character, and so on). The
|
(whether it is anchored, has a fixed first character, and so on). The
|
||||||
information is obtained from the pcre2_pattern_info() function. Here
|
information is obtained from the pcre2_pattern_info() function. Here
|
||||||
are some typical examples:
|
are some typical examples:
|
||||||
|
|
||||||
re> /(?i)(^a|^b)/m,info
|
re> /(?i)(^a|^b)/m,info
|
||||||
|
@ -579,68 +609,79 @@ PATTERN MODIFIERS
|
||||||
Last code unit = 'c' (caseless)
|
Last code unit = 'c' (caseless)
|
||||||
Subject length lower bound = 3
|
Subject length lower bound = 3
|
||||||
|
|
||||||
"Compile options" are those specified by modifiers; "overall options"
|
"Compile options" are those specified by modifiers; "overall options"
|
||||||
have added options that are taken or deduced from the pattern. If both
|
have added options that are taken or deduced from the pattern. If both
|
||||||
sets of options are the same, just a single "options" line is output;
|
sets of options are the same, just a single "options" line is output;
|
||||||
if there are no options, the line is omitted. "First code unit" is
|
if there are no options, the line is omitted. "First code unit" is
|
||||||
where any match must start; if there is more than one they are listed
|
where any match must start; if there is more than one they are listed
|
||||||
as "starting code units". "Last code unit" is the last literal code
|
as "starting code units". "Last code unit" is the last literal code
|
||||||
unit that must be present in any match. This is not necessarily the
|
unit that must be present in any match. This is not necessarily the
|
||||||
last character. These lines are omitted if no starting or ending code
|
last character. These lines are omitted if no starting or ending code
|
||||||
units are recorded.
|
units are recorded.
|
||||||
|
|
||||||
The callout_info modifier requests information about all the callouts
|
The callout_info modifier requests information about all the callouts
|
||||||
in the pattern. A list of them is output at the end of any other infor-
|
in the pattern. A list of them is output at the end of any other infor-
|
||||||
mation that is requested. For each callout, either its number or string
|
mation that is requested. For each callout, either its number or string
|
||||||
is given, followed by the item that follows it in the pattern.
|
is given, followed by the item that follows it in the pattern.
|
||||||
|
|
||||||
Passing a NULL context
|
Passing a NULL context
|
||||||
|
|
||||||
Normally, pcre2test passes a context block to pcre2_compile(). If the
|
Normally, pcre2test passes a context block to pcre2_compile(). If the
|
||||||
null_context modifier is set, however, NULL is passed. This is for
|
null_context modifier is set, however, NULL is passed. This is for
|
||||||
testing that pcre2_compile() behaves correctly in this case (it uses
|
testing that pcre2_compile() behaves correctly in this case (it uses
|
||||||
default values).
|
default values).
|
||||||
|
|
||||||
Specifying pattern characters in hexadecimal
|
Specifying pattern characters in hexadecimal
|
||||||
|
|
||||||
The hex modifier specifies that the characters of the pattern, except
|
The hex modifier specifies that the characters of the pattern, except
|
||||||
for substrings enclosed in single or double quotes, are to be inter-
|
for substrings enclosed in single or double quotes, are to be inter-
|
||||||
preted as pairs of hexadecimal digits. This feature is provided as a
|
preted as pairs of hexadecimal digits. This feature is provided as a
|
||||||
way of creating patterns that contain binary zeros and other non-print-
|
way of creating patterns that contain binary zeros and other non-print-
|
||||||
ing characters. White space is permitted between pairs of digits. For
|
ing characters. White space is permitted between pairs of digits. For
|
||||||
example, this pattern contains three characters:
|
example, this pattern contains three characters:
|
||||||
|
|
||||||
/ab 32 59/hex
|
/ab 32 59/hex
|
||||||
|
|
||||||
Parts of such a pattern are taken literally if quoted. This pattern
|
Parts of such a pattern are taken literally if quoted. This pattern
|
||||||
contains nine characters, only two of which are specified in hexadeci-
|
contains nine characters, only two of which are specified in hexadeci-
|
||||||
mal:
|
mal:
|
||||||
|
|
||||||
/ab "literal" 32/hex
|
/ab "literal" 32/hex
|
||||||
|
|
||||||
Either single or double quotes may be used. There is no way of includ-
|
Either single or double quotes may be used. There is no way of includ-
|
||||||
ing the delimiter within a substring.
|
ing the delimiter within a substring. The hex and expand modifiers are
|
||||||
|
mutually exclusive.
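
The rules for the hex modifier (pairs of hex digits, optional white space between pairs, quoted substrings taken literally) can be sketched as a small decoder. The names sketch_hexval and sketch_decode_hex are hypothetical, and the error handling is an assumption; the real pcre2test code may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if ch is not a hex digit. */
static int sketch_hexval(int ch)
{
  if (ch >= '0' && ch <= '9') return ch - '0';
  if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
  if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
  return -1;
}

/* Hypothetical decoder for the text between the pattern delimiters.
   Returns the number of bytes produced, or -1 for malformed input
   (odd digit, unterminated quote) or if `out` is too small. */
static long sketch_decode_hex(const char *in, unsigned char *out,
                              size_t outsize)
{
  size_t n = 0;
  assert(in != NULL && out != NULL);
  while (*in != '\0') {
    unsigned char ch = (unsigned char)*in;
    if (isspace(ch)) { in++; continue; }
    if (ch == '"' || ch == '\'') {
      /* Copy everything up to the matching quote literally. */
      const char *q = in + 1;
      while (*q != '\0' && *q != (char)ch) {
        if (n >= outsize) return -1;
        out[n++] = (unsigned char)*q++;
      }
      if (*q == '\0') return -1;
      in = q + 1;
      continue;
    }
    int hi = sketch_hexval(in[0]);
    int lo = (hi < 0) ? -1 : sketch_hexval(in[1]);
    if (hi < 0 || lo < 0 || n >= outsize) return -1;
    out[n++] = (unsigned char)((hi << 4) | lo);
    in += 2;
  }
  return (long)n;
}
```

With this sketch, the text `ab 32 59` decodes to three bytes and `ab "literal" 32` to nine, in line with the two examples above.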
|
||||||
|
|
||||||
By default, pcre2test passes patterns as zero-terminated strings to
|
By default, pcre2test passes patterns as zero-terminated strings to
|
||||||
pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
|
pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
|
||||||
for patterns specified with the hex modifier, the actual length of the
|
for patterns specified with the hex modifier, the actual length of the
|
||||||
pattern is passed.
|
pattern is passed.
|
||||||
|
|
||||||
|
Specifying wide characters in 16-bit and 32-bit modes
|
||||||
|
|
||||||
|
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
|
||||||
|
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
|
||||||
|
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
|
||||||
|
modifier can be used. It is mutually exclusive with utf. Input lines
|
||||||
|
are interpreted as UTF-8 as a means of specifying wide characters. More
|
||||||
|
details are given in "Input encoding" above.
|
||||||
|
|
||||||
Generating long repetitive patterns
|
Generating long repetitive patterns
|
||||||
|
|
||||||
Some tests use long patterns that are very repetitive. Instead of cre-
|
Some tests use long patterns that are very repetitive. Instead of cre-
|
||||||
ating a very long input line for such a pattern, you can use a special
|
ating a very long input line for such a pattern, you can use a special
|
||||||
repetition feature, similar to the one described for subject lines
|
repetition feature, similar to the one described for subject lines
|
||||||
above. If the expand modifier is present on a pattern, parts of the
|
above. If the expand modifier is present on a pattern, parts of the
|
||||||
pattern that have the form
|
pattern that have the form
|
||||||
|
|
||||||
\[<characters>]{<count>}
|
\[<characters>]{<count>}
|
||||||
|
|
||||||
are expanded before the pattern is passed to pcre2_compile(). For exam-
|
are expanded before the pattern is passed to pcre2_compile(). For exam-
|
||||||
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||||
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
||||||
followed by decimal digits and "}" is found later in the pattern. If
|
followed by decimal digits and "}" is found later in the pattern. If
|
||||||
not, the characters remain in the pattern unaltered.
|
not, the characters remain in the pattern unaltered. The expand and hex
|
||||||
|
modifiers are mutually exclusive.
|
||||||
|
|
||||||
If part of an expanded pattern looks like an expansion, but is really
|
If part of an expanded pattern looks like an expansion, but is really
|
||||||
part of the actual pattern, unwanted expansion can be avoided by giving
|
part of the actual pattern, unwanted expansion can be avoided by giving
|
||||||
|
@ -1548,5 +1589,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 06 July 2016
|
Last updated: 02 August 2016
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2016 University of Cambridge.
|
||||||
|
|
|
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
|
||||||
/* The current PCRE version information. */
|
/* The current PCRE version information. */
|
||||||
|
|
||||||
#define PCRE2_MAJOR 10
|
#define PCRE2_MAJOR 10
|
||||||
#define PCRE2_MINOR 22
|
#define PCRE2_MINOR 23
|
||||||
#define PCRE2_PRERELEASE
|
#define PCRE2_PRERELEASE -RC1
|
||||||
#define PCRE2_DATE 2016-07-29
|
#define PCRE2_DATE 2016-08-01
|
||||||
|
|
||||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||||
imported have to be identified as such. When building PCRE2, the appropriate
|
imported have to be identified as such. When building PCRE2, the appropriate
|
||||||
|
|
124
src/pcre2test.c
|
@ -430,8 +430,8 @@ so many of them that they are split into two fields. */
|
||||||
#define CTL_PUSH 0x01000000u
|
#define CTL_PUSH 0x01000000u
|
||||||
#define CTL_PUSHCOPY 0x02000000u
|
#define CTL_PUSHCOPY 0x02000000u
|
||||||
#define CTL_STARTCHAR 0x04000000u
|
#define CTL_STARTCHAR 0x04000000u
|
||||||
#define CTL_ZERO_TERMINATE 0x08000000u
|
#define CTL_UTF8_INPUT 0x08000000u
|
||||||
/* Spare 0x10000000u */
|
#define CTL_ZERO_TERMINATE 0x10000000u
|
||||||
/* Spare 0x20000000u */
|
/* Spare 0x20000000u */
|
||||||
#define CTL_NL_SET 0x40000000u /* Informational */
|
#define CTL_NL_SET 0x40000000u /* Informational */
|
||||||
#define CTL_BSR_SET 0x80000000u /* Informational */
|
#define CTL_BSR_SET 0x80000000u /* Informational */
|
||||||
|
@ -460,7 +460,8 @@ data line. */
|
||||||
CTL_GLOBAL|\
|
CTL_GLOBAL|\
|
||||||
CTL_MARK|\
|
CTL_MARK|\
|
||||||
CTL_MEMORY|\
|
CTL_MEMORY|\
|
||||||
CTL_STARTCHAR)
|
CTL_STARTCHAR|\
|
||||||
|
CTL_UTF8_INPUT)
|
||||||
|
|
||||||
#define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
|
#define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\
|
||||||
CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
|
CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
|
||||||
|
@ -621,6 +622,7 @@ static modstruct modlist[] = {
|
||||||
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
|
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
|
||||||
{ "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) },
|
{ "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) },
|
||||||
{ "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) },
|
{ "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) },
|
||||||
|
{ "utf8_input", MOD_PAT, MOD_CTL, CTL_UTF8_INPUT, PO(control) },
|
||||||
{ "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) }
|
{ "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) }
|
||||||
};
|
};
|
||||||
|
|
||||||
|
@ -673,6 +675,7 @@ static uint32_t exclusive_pat_controls[] = {
|
||||||
|
|
||||||
/* Data controls that are mutually exclusive. At present these are all in the
|
/* Data controls that are mutually exclusive. At present these are all in the
|
||||||
first control word. */
|
first control word. */
|
||||||
|
|
||||||
static uint32_t exclusive_dat_controls[] = {
|
static uint32_t exclusive_dat_controls[] = {
|
||||||
CTL_ALLUSEDTEXT | CTL_STARTCHAR,
|
CTL_ALLUSEDTEXT | CTL_STARTCHAR,
|
||||||
CTL_FINDLIMITS | CTL_NULLCONTEXT };
|
CTL_FINDLIMITS | CTL_NULLCONTEXT };
|
||||||
|
@ -2715,16 +2718,22 @@ return i + 1;
|
||||||
|
|
||||||
#ifdef SUPPORT_PCRE2_16
|
#ifdef SUPPORT_PCRE2_16
|
||||||
/*************************************************
|
/*************************************************
|
||||||
* Convert pattern to 16-bit *
|
* Convert string to 16-bit *
|
||||||
*************************************************/
|
*************************************************/
|
||||||
|
|
||||||
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
|
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
|
||||||
all the input bytes are ASCII, the space needed for a 16-bit string is exactly
|
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
|
||||||
double the 8-bit size. Otherwise, the size needed for a 16-bit string is no
|
code values from 0 to 0x7fffffff. However, values greater than the later UTF
|
||||||
more than double, because up to 0xffff uses no more than 3 bytes in UTF-8 but
|
limit of 0x10ffff cause an error. In non-UTF mode the input is interpreted as
|
||||||
possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes in
|
UTF-8 if the utf8_input modifier is set, but an error is generated for values
|
||||||
UTF-16. The result is always left in pbuffer16. Impose a minimum size to save
|
greater than 0xffff.
|
||||||
repeated re-sizing.
|
|
||||||
|
If all the input bytes are ASCII, the space needed for a 16-bit string is
|
||||||
|
exactly double the 8-bit size. Otherwise, the size needed for a 16-bit string
|
||||||
|
is no more than double, because up to 0xffff uses no more than 3 bytes in UTF-8
|
||||||
|
but possibly 4 in UTF-16. Higher values use 4 bytes in UTF-8 and up to 4 bytes
|
||||||
|
in UTF-16. The result is always left in pbuffer16. Impose a minimum size to
|
||||||
|
save repeated re-sizing.
|
||||||
|
|
||||||
Note that this function does not object to surrogate values. This is
|
Note that this function does not object to surrogate values. This is
|
||||||
deliberate; it makes it possible to construct UTF-16 strings that are invalid,
|
deliberate; it makes it possible to construct UTF-16 strings that are invalid,
|
||||||
|
@ -2732,7 +2741,7 @@ for the purpose of testing that they are correctly faulted.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
p points to a byte string
|
p points to a byte string
|
||||||
utf non-zero if converting to UTF-16
|
utf true in UTF mode
|
||||||
lenptr points to number of bytes in the string (excluding trailing zero)
|
lenptr points to number of bytes in the string (excluding trailing zero)
|
||||||
|
|
||||||
Returns: 0 on success, with the length updated to the number of 16-bit
|
Returns: 0 on success, with the length updated to the number of 16-bit
|
||||||
|
@ -2763,7 +2772,7 @@ if (pbuffer16_size < 2*len + 2)
|
||||||
}
|
}
|
||||||
|
|
||||||
pp = pbuffer16;
|
pp = pbuffer16;
|
||||||
if (!utf)
|
if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
|
||||||
{
|
{
|
||||||
for (; len > 0; len--) *pp++ = *p++;
|
for (; len > 0; len--) *pp++ = *p++;
|
||||||
}
|
}
|
||||||
|
@ -2772,12 +2781,12 @@ else while (len > 0)
|
||||||
uint32_t c;
|
uint32_t c;
|
||||||
int chlen = utf82ord(p, &c);
|
int chlen = utf82ord(p, &c);
|
||||||
if (chlen <= 0) return -1;
|
if (chlen <= 0) return -1;
|
||||||
|
if (!utf && c > 0xffff) return -3;
|
||||||
if (c > 0x10ffff) return -2;
|
if (c > 0x10ffff) return -2;
|
||||||
p += chlen;
|
p += chlen;
|
||||||
len -= chlen;
|
len -= chlen;
|
||||||
if (c < 0x10000) *pp++ = c; else
|
if (c < 0x10000) *pp++ = c; else
|
||||||
{
|
{
|
||||||
if (!utf) return -3;
|
|
||||||
c -= 0x10000;
|
c -= 0x10000;
|
||||||
*pp++ = 0xD800 | (c >> 10);
|
*pp++ = 0xD800 | (c >> 10);
|
||||||
*pp++ = 0xDC00 | (c & 0x3ff);
|
*pp++ = 0xDC00 | (c & 0x3ff);
|
||||||
|
@ -2794,15 +2803,25 @@ return 0;
|
||||||
|
|
||||||
#ifdef SUPPORT_PCRE2_32
|
#ifdef SUPPORT_PCRE2_32
|
||||||
/*************************************************
|
/*************************************************
|
||||||
* Convert pattern to 32-bit *
|
* Convert string to 32-bit *
|
||||||
*************************************************/
|
*************************************************/
|
||||||
|
|
||||||
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes. If
|
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
|
||||||
all the input bytes are ASCII, the space needed for a 32-bit string is exactly
|
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
|
||||||
four times the 8-bit size. Otherwise, the size needed for a 32-bit string is no
|
code values from 0 to 0x7fffffff. However, values greater than the later UTF
|
||||||
more than four times, because the number of characters must be less than the
|
limit of 0x10ffff cause an error.
|
||||||
number of bytes. The result is always left in pbuffer32. Impose a minimum size
|
|
||||||
to save repeated re-sizing.
|
In non-UTF mode the input is interpreted as UTF-8 if the utf8_input modifier
|
||||||
|
is set, and no limit is imposed. There is special interpretation of the 0xff
|
||||||
|
byte (which is illegal in UTF-8) in this case: it causes the top bit of the
|
||||||
|
next character to be set. This provides a way of generating 32-bit characters
|
||||||
|
greater than 0x7fffffff.
|
||||||
|
|
||||||
|
If all the input bytes are ASCII, the space needed for a 32-bit string is
|
||||||
|
exactly four times the 8-bit size. Otherwise, the size needed for a 32-bit
|
||||||
|
string is no more than four times, because the number of characters must be
|
||||||
|
less than the number of bytes. The result is always left in pbuffer32. Impose a
|
||||||
|
minimum size to save repeated re-sizing.
|
||||||
|
|
||||||
Note that this function does not object to surrogate values. This is
|
Note that this function does not object to surrogate values. This is
|
||||||
deliberate; it makes it possible to construct UTF-32 strings that are invalid,
|
deliberate; it makes it possible to construct UTF-32 strings that are invalid,
|
||||||
|
@ -2810,7 +2829,7 @@ for the purpose of testing that they are correctly faulted.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
p points to a byte string
|
p points to a byte string
|
||||||
utf true if UTF-8 (to be converted to UTF-32)
|
utf true in UTF mode
|
||||||
lenptr points to number of bytes in the string (excluding trailing zero)
|
lenptr points to number of bytes in the string (excluding trailing zero)
|
||||||
|
|
||||||
Returns: 0 on success, with the length updated to the number of 32-bit
|
Returns: 0 on success, with the length updated to the number of 32-bit
|
||||||
|
@ -2840,19 +2859,29 @@ if (pbuffer32_size < 4*len + 4)
|
||||||
}
|
}
|
||||||
|
|
||||||
pp = pbuffer32;
|
pp = pbuffer32;
|
||||||
if (!utf)
|
|
||||||
|
if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
|
||||||
{
|
{
|
||||||
for (; len > 0; len--) *pp++ = *p++;
|
for (; len > 0; len--) *pp++ = *p++;
|
||||||
}
|
}
|
||||||
|
|
||||||
else while (len > 0)
|
else while (len > 0)
|
||||||
{
|
{
|
||||||
|
int chlen;
|
||||||
uint32_t c;
|
uint32_t c;
|
||||||
int chlen = utf82ord(p, &c);
|
uint32_t topbit = 0;
|
||||||
|
if (!utf && *p == 0xff && len > 1)
|
||||||
|
{
|
||||||
|
topbit = 0x80000000u;
|
||||||
|
p++;
|
||||||
|
len--;
|
||||||
|
}
|
||||||
|
chlen = utf82ord(p, &c);
|
||||||
if (chlen <= 0) return -1;
|
if (chlen <= 0) return -1;
|
||||||
if (utf && c > 0x10ffff) return -2;
|
if (utf && c > 0x10ffff) return -2;
|
||||||
p += chlen;
|
p += chlen;
|
||||||
len -= chlen;
|
len -= chlen;
|
||||||
*pp++ = c;
|
*pp++ = c | topbit;
|
||||||
}
|
}
|
||||||
|
|
||||||
*pp = 0;
|
*pp = 0;
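
The conversion loop in the hunk above can be illustrated in isolation. What follows is a minimal standalone sketch, not the pcre2test code itself: decode_ext_utf8() and bytes_to_wide() are hypothetical names, the decoder is a simplified stand-in for utf82ord() that does not reject overlong encodings, and the caller is assumed to supply an output buffer with room for one element per input byte. It shows only the non-UTF utf8_input behaviour, in which a 0xff byte sets the top bit of the character that follows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one extended UTF-8 sequence (up to 6 bytes,
   values up to 0x7fffffff). Returns the number of bytes consumed, or -1 if
   the sequence is malformed or truncated. Overlong forms are not rejected. */
static int decode_ext_utf8(const uint8_t *p, size_t len, uint32_t *out)
{
  static const uint8_t first_mask[] = { 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 };
  uint32_t c;
  int extra, i;

  if (len == 0) return -1;
  c = p[0];
  if (c < 0x80) extra = 0;
  else if (c < 0xc0) return -1;        /* stray continuation byte */
  else if (c < 0xe0) extra = 1;
  else if (c < 0xf0) extra = 2;
  else if (c < 0xf8) extra = 3;
  else if (c < 0xfc) extra = 4;
  else if (c < 0xfe) extra = 5;
  else return -1;                      /* 0xfe and 0xff never start a sequence */

  if ((size_t)extra >= len) return -1; /* sequence runs past the end */
  c &= first_mask[extra];
  for (i = 1; i <= extra; i++)
    {
    if ((p[i] & 0xc0) != 0x80) return -1;
    c = (c << 6) | (uint32_t)(p[i] & 0x3f);
    }
  *out = c;
  return extra + 1;
}

/* Hypothetical helper: convert a byte string to 32-bit code units, treating
   a 0xff byte as "set the top bit of the next character". Returns the number
   of code units written, or -1 on malformed input. */
static long bytes_to_wide(const uint8_t *p, size_t len, uint32_t *out)
{
  size_t n = 0;

  assert(p != NULL && out != NULL);
  while (len > 0)
    {
    uint32_t c;
    uint32_t topbit = 0;
    int chlen;

    if (*p == 0xff && len > 1)
      {
      topbit = 0x80000000u;
      p++;
      len--;
      }
    chlen = decode_ext_utf8(p, len, &c);
    if (chlen <= 0) return -1;
    p += chlen;
    len -= (size_t)chlen;
    out[n++] = c | topbit;
    }
  return (long)n;
}
```

With this sketch, the bytes 'a' 'b' 0xff 'A' 'z' yield four code units, the third being 0x80000041, which agrees with the expected values in the test files changed by this commit.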
|
||||||
|
@ -3627,7 +3656,7 @@ Returns: nothing
|
||||||
static void
|
static void
|
||||||
show_controls(uint32_t controls, uint32_t controls2, const char *before)
|
show_controls(uint32_t controls, uint32_t controls2, const char *before)
|
||||||
{
|
{
|
||||||
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||||
before,
|
before,
|
||||||
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
|
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
|
||||||
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
|
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
|
||||||
|
@ -3662,6 +3691,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
|
||||||
((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
|
((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
|
||||||
((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
|
((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
|
||||||
((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
|
((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
|
||||||
|
((controls & CTL_UTF8_INPUT) != 0)? " utf8_input" : "",
|
||||||
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
|
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -3759,13 +3789,13 @@ warning we must initialize cblock_size. */
|
||||||
|
|
||||||
cblock_size = 0;
|
cblock_size = 0;
|
||||||
#ifdef SUPPORT_PCRE2_8
|
#ifdef SUPPORT_PCRE2_8
|
||||||
if (test_mode == 8) cblock_size = sizeof(pcre2_real_code_8);
|
if (test_mode == PCRE8_MODE) cblock_size = sizeof(pcre2_real_code_8);
|
||||||
#endif
|
#endif
|
||||||
#ifdef SUPPORT_PCRE2_16
|
#ifdef SUPPORT_PCRE2_16
|
||||||
if (test_mode == 16) cblock_size = sizeof(pcre2_real_code_16);
|
if (test_mode == PCRE16_MODE) cblock_size = sizeof(pcre2_real_code_16);
|
||||||
#endif
|
#endif
|
||||||
#ifdef SUPPORT_PCRE2_32
|
#ifdef SUPPORT_PCRE2_32
|
||||||
if (test_mode == 32) cblock_size = sizeof(pcre2_real_code_32);
|
if (test_mode == PCRE32_MODE) cblock_size = sizeof(pcre2_real_code_32);
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
(void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
|
(void)pattern_info(PCRE2_INFO_SIZE, &size, FALSE);
|
||||||
|
@ -4507,6 +4537,23 @@ patlen = p - buffer - 2;
|
||||||
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
|
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
|
||||||
utf = (pat_patctl.options & PCRE2_UTF) != 0;
|
utf = (pat_patctl.options & PCRE2_UTF) != 0;
|
||||||
|
|
||||||
|
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
|
||||||
|
exclusive with the utf modifier. */
|
||||||
|
|
||||||
|
if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
|
||||||
|
{
|
||||||
|
if (test_mode == PCRE8_MODE)
|
||||||
|
{
|
||||||
|
fprintf(outfile, "** The utf8_input modifier is not allowed in 8-bit mode\n");
|
||||||
|
return PR_SKIP;
|
||||||
|
}
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
fprintf(outfile, "** The utf and utf8_input modifiers are mutually exclusive\n");
|
||||||
|
return PR_SKIP;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/* Check for mutually exclusive modifiers. At present, these are all in the
|
/* Check for mutually exclusive modifiers. At present, these are all in the
|
||||||
first control word. */
|
first control word. */
|
||||||
|
|
||||||
|
@ -4738,7 +4785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
|
||||||
const char *msg = "** Ignored with POSIX interface:";
|
const char *msg = "** Ignored with POSIX interface:";
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
if (test_mode != 8)
|
if (test_mode != PCRE8_MODE)
|
||||||
{
|
{
|
||||||
fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
|
fprintf(outfile, "** The POSIX interface is available only in 8-bit mode\n");
|
||||||
return PR_SKIP;
|
return PR_SKIP;
|
||||||
|
@ -5622,7 +5669,9 @@ if (dbuffer == NULL || needlen >= dbuffer_size)
|
||||||
SETCASTPTR(q, dbuffer); /* Sets q8, q16, or q32, as appropriate. */
|
SETCASTPTR(q, dbuffer); /* Sets q8, q16, or q32, as appropriate. */
|
||||||
|
|
||||||
/* Scan the data line, interpreting data escapes, and put the result into a
|
/* Scan the data line, interpreting data escapes, and put the result into a
|
||||||
buffer of the appropriate width. In UTF mode, input can be UTF-8. */
|
buffer of the appropriate width. In UTF mode, input is always UTF-8; otherwise,
|
||||||
|
in 16- and 32-bit modes, it can be forced to UTF-8 by the utf8_input modifier.
|
||||||
|
*/
|
||||||
|
|
||||||
while ((c = *p++) != 0)
|
while ((c = *p++) != 0)
|
||||||
{
|
{
|
||||||
|
@ -5691,11 +5740,20 @@ while ((c = *p++) != 0)
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Handle a non-escaped character */
|
/* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input
|
||||||
|
set, do the fudge for setting the top bit. */
|
||||||
|
|
||||||
if (c != '\\')
|
if (c != '\\')
|
||||||
{
|
{
|
||||||
if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
|
uint32_t topbit = 0;
|
||||||
|
if (test_mode == PCRE32_MODE && c == 0xff && *p != 0)
|
||||||
|
{
|
||||||
|
topbit = 0x80000000;
|
||||||
|
c = *p++;
|
||||||
|
}
|
||||||
|
if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) &&
|
||||||
|
HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
|
||||||
|
c |= topbit;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Handle backslash escapes */
|
/* Handle backslash escapes */
|
||||||
|
|
|
@ -353,4 +353,19 @@
|
||||||
|
|
||||||
/(*THEN:\[A]{65501})/expand
|
/(*THEN:\[A]{65501})/expand
|
||||||
|
|
||||||
|
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
|
||||||
|
# even though this test is run when UTF is not supported.
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf8_input
|
||||||
|
abý¿¿¿¿¿z
|
||||||
|
ab\x{7fffffff}z
|
||||||
|
|
||||||
|
/abÿý¿¿¿¿¿z/utf8_input
|
||||||
|
abÿý¿¿¿¿¿z
|
||||||
|
ab\x{ffffffff}z
|
||||||
|
|
||||||
|
/abÿAz/utf8_input
|
||||||
|
abÿAz
|
||||||
|
ab\x{80000041}z
|
||||||
|
|
||||||
# End of testinput11
|
# End of testinput11
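
The byte sequences in the test patterns above can be produced mechanically. The following is a minimal sketch, assuming the encoding described in the pcre2test comments of this commit (extended UTF-8 for values up to 0x7fffffff, with a leading 0xff byte to set the top bit); encode_wide() is a hypothetical helper and is not part of pcre2test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the utf8_input byte sequence for a 32-bit value
   into buf (at least 7 bytes) and return the number of bytes written. */
static size_t encode_wide(uint32_t v, uint8_t *buf)
{
  static const uint8_t lead[] = { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
  uint32_t c = v & 0x7fffffffu;   /* the part that extended UTF-8 can hold */
  size_t n = 0;
  int extra, i;

  assert(buf != NULL);

  /* A set top bit is expressed as a 0xff prefix byte. */
  if ((v & 0x80000000u) != 0) buf[n++] = 0xff;

  if (c < 0x80) extra = 0;
  else if (c < 0x800) extra = 1;
  else if (c < 0x10000) extra = 2;
  else if (c < 0x200000) extra = 3;
  else if (c < 0x4000000) extra = 4;
  else extra = 5;

  if (extra == 0)
    {
    buf[n++] = (uint8_t)c;
    return n;
    }

  /* Lead byte carries the high bits; each following byte carries 6 bits. */
  buf[n++] = (uint8_t)(lead[extra] | (c >> (6 * extra)));
  for (i = extra - 1; i >= 0; i--)
    buf[n++] = (uint8_t)(0x80 | ((c >> (6 * i)) & 0x3f));
  return n;
}
```

For example, 0x7fffffff encodes as the six bytes fd bf bf bf bf bf, which display as ý¿¿¿¿¿ when viewed as Latin-1, and 0x80000041 encodes as ff 41 (ÿA).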
|
||||||
|
|
|
@ -343,4 +343,8 @@
|
||||||
/./utf
|
/./utf
|
||||||
\x{110000}
|
\x{110000}
|
||||||
|
|
||||||
|
/(*UTF)abý¿¿¿¿¿z/B
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf
|
||||||
|
|
||||||
# End of testinput12
|
# End of testinput12
|
||||||
|
|
|
@ -643,4 +643,22 @@ Subject length lower bound = 1
|
||||||
|
|
||||||
/(*THEN:\[A]{65501})/expand
|
/(*THEN:\[A]{65501})/expand
|
||||||
|
|
||||||
|
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
|
||||||
|
# even though this test is run when UTF is not supported.
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf8_input
|
||||||
|
** Failed: character value greater than 0xffff cannot be converted to 16-bit in non-UTF mode
|
||||||
|
abý¿¿¿¿¿z
|
||||||
|
ab\x{7fffffff}z
|
||||||
|
|
||||||
|
/abÿý¿¿¿¿¿z/utf8_input
|
||||||
|
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
|
||||||
|
abÿý¿¿¿¿¿z
|
||||||
|
ab\x{ffffffff}z
|
||||||
|
|
||||||
|
/abÿAz/utf8_input
|
||||||
|
** Failed: invalid UTF-8 string cannot be converted to 16-bit string
|
||||||
|
abÿAz
|
||||||
|
ab\x{80000041}z
|
||||||
|
|
||||||
# End of testinput11
|
# End of testinput11
|
||||||
|
|
|
@ -646,4 +646,25 @@ Subject length lower bound = 1
|
||||||
|
|
||||||
/(*THEN:\[A]{65501})/expand
|
/(*THEN:\[A]{65501})/expand
|
||||||
|
|
||||||
|
# We can use pcre2test's utf8_input modifier to create wide pattern characters,
|
||||||
|
# even though this test is run when UTF is not supported.
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf8_input
|
||||||
|
abý¿¿¿¿¿z
|
||||||
|
0: ab\x{7fffffff}z
|
||||||
|
ab\x{7fffffff}z
|
||||||
|
0: ab\x{7fffffff}z
|
||||||
|
|
||||||
|
/abÿý¿¿¿¿¿z/utf8_input
|
||||||
|
abÿý¿¿¿¿¿z
|
||||||
|
0: ab\x{ffffffff}z
|
||||||
|
ab\x{ffffffff}z
|
||||||
|
0: ab\x{ffffffff}z
|
||||||
|
|
||||||
|
/abÿAz/utf8_input
|
||||||
|
abÿAz
|
||||||
|
0: ab\x{80000041}z
|
||||||
|
ab\x{80000041}z
|
||||||
|
0: ab\x{80000041}z
|
||||||
|
|
||||||
# End of testinput11
|
# End of testinput11
|
||||||
|
|
|
@ -1367,4 +1367,15 @@ Subject length lower bound = 2
|
||||||
\x{110000}
|
\x{110000}
|
||||||
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
|
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
|
||||||
|
|
||||||
|
/(*UTF)abý¿¿¿¿¿z/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf
|
||||||
|
** Failed: character value greater than 0x10ffff cannot be converted to UTF
|
||||||
|
|
||||||
# End of testinput12
|
# End of testinput12
|
||||||
|
|
|
@ -1361,4 +1361,15 @@ Subject length lower bound = 2
|
||||||
\x{110000}
|
\x{110000}
|
||||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
|
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
|
||||||
|
|
||||||
|
/(*UTF)abý¿¿¿¿¿z/B
|
||||||
|
------------------------------------------------------------------
|
||||||
|
Bra
|
||||||
|
ab\x{fd}\x{bf}\x{bf}\x{bf}\x{bf}\x{bf}z
|
||||||
|
Ket
|
||||||
|
End
|
||||||
|
------------------------------------------------------------------
|
||||||
|
|
||||||
|
/abý¿¿¿¿¿z/utf
|
||||||
|
** Failed: character value greater than 0x10ffff cannot be converted to UTF
|
||||||
|
|
||||||
# End of testinput12
|
# End of testinput12
|
||||||
|
|