More documentation and test updates.
parent eb4fffbbf4
commit 91f2e97474
|
@ -148,7 +148,7 @@ listing), and the short pages for individual functions, are concatenated in
|
||||||
pcre2limits details of size and other limits
|
pcre2limits details of size and other limits
|
||||||
pcre2matching discussion of the two matching algorithms
|
pcre2matching discussion of the two matching algorithms
|
||||||
pcre2partial details of the partial matching facility
|
pcre2partial details of the partial matching facility
|
||||||
pcre2pattern syntax and semantics of supported regular expression patterns
|
pcre2pattern syntax and semantics of supported regular expression patterns
|
||||||
pcre2perform discussion of performance issues
|
pcre2perform discussion of performance issues
|
||||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||||
pcre2sample discussion of the pcre2demo program
|
pcre2sample discussion of the pcre2demo program
|
||||||
|
|
|
@ -17,9 +17,9 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
|
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
|
||||||
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
|
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
|
||||||
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
|
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
|
||||||
<li><a name="TOC5" href="#SEC5">Unicode and UTF SUPPORT</a>
|
<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
|
||||||
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
|
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
|
||||||
<li><a name="TOC7" href="#SEC7">CODE VALUE OF NEWLINE</a>
|
<li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
|
||||||
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
|
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
|
||||||
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
|
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
|
||||||
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
|
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
|
||||||
|
@ -91,12 +91,12 @@ respectively. These can be interpreted either as single-unit characters or
|
||||||
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
|
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
|
||||||
the following to the <b>configure</b> command:
|
the following to the <b>configure</b> command:
|
||||||
<pre>
|
<pre>
|
||||||
--enable-pcre16
|
--enable-pcre2-16
|
||||||
--enable-pcre32
|
--enable-pcre2-32
|
||||||
</pre>
|
</pre>
|
||||||
If you do not want the 8-bit library, add
|
If you do not want the 8-bit library, add
|
||||||
<pre>
|
<pre>
|
||||||
--disable-pcre8
|
--disable-pcre2-8
|
||||||
</pre>
|
</pre>
|
||||||
as well. At least one of the three libraries must be built. Note that the POSIX
|
as well. At least one of the three libraries must be built. Note that the POSIX
|
||||||
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
|
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
|
||||||
|
@ -106,14 +106,15 @@ libraries.
|
||||||
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
|
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
|
||||||
<P>
|
<P>
|
||||||
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
|
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
|
||||||
and static libraries by default. You can suppress one of these by adding one of
|
and static libraries by default. You can suppress an unwanted library by adding
|
||||||
|
one of
|
||||||
<pre>
|
<pre>
|
||||||
--disable-shared
|
--disable-shared
|
||||||
--disable-static
|
--disable-static
|
||||||
</pre>
|
</pre>
|
||||||
to the <b>configure</b> command, as required.
|
to the <b>configure</b> command.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">Unicode and UTF SUPPORT</a><br>
|
<br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
By default, PCRE2 is built with support for Unicode and UTF character strings.
|
By default, PCRE2 is built with support for Unicode and UTF character strings.
|
||||||
To build it without Unicode support, add
|
To build it without Unicode support, add
|
||||||
|
@ -126,20 +127,15 @@ in the same configuration.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
||||||
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
|
or UTF-32. To do that, applications that use the library have to set the
|
||||||
<b>pcre2_compile()</b> to compile a pattern.
|
PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
UTF support allows the libraries to process character code points up to
|
||||||
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
0x10ffff in the strings that they handle. It also provides support for
|
||||||
exclusive.
|
accessing the Unicode properties of such characters, using pattern escapes such
|
||||||
</P>
|
as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
|
||||||
<P>
|
<i>Nd</i> are supported. Details are given in the
|
||||||
UTF support allows the libraries to process character codepoints up to 0x10ffff
|
|
||||||
in the strings that they handle. It also provides support for accessing the
|
|
||||||
properties of such characters, using pattern escapes such as \P, \p, and \X.
|
|
||||||
Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
|
|
||||||
supported. Details are given in the
|
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation.
|
documentation.
|
||||||
</P>
|
</P>
|
||||||
|
@ -150,7 +146,7 @@ Just-in-time compiler support is included in the build by specifying
|
||||||
--enable-jit
|
--enable-jit
|
||||||
</pre>
|
</pre>
|
||||||
This support is available only for certain hardware architectures. If this
|
This support is available only for certain hardware architectures. If this
|
||||||
option is set for an unsupported architecture, a compile time error occurs.
|
option is set for an unsupported architecture, a building error occurs.
|
||||||
See the
|
See the
|
||||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||||
documentation for a discussion of JIT usage. When JIT support is enabled,
|
documentation for a discussion of JIT usage. When JIT support is enabled,
|
||||||
|
@ -160,7 +156,7 @@ pcre2grep automatically makes use of it, unless you add
|
||||||
</pre>
|
</pre>
|
||||||
to the "configure" command.
|
to the "configure" command.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC7" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
|
<br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
|
||||||
<P>
|
<P>
|
||||||
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
||||||
of a line. This is the normal newline character on Unix-like systems. You can
|
of a line. This is the normal newline character on Unix-like systems. You can
|
||||||
|
@ -168,12 +164,13 @@ compile PCRE2 to use carriage return (CR) instead, by adding
|
||||||
<pre>
|
<pre>
|
||||||
--enable-newline-is-cr
|
--enable-newline-is-cr
|
||||||
</pre>
|
</pre>
|
||||||
to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
|
to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
|
||||||
which explicitly specifies linefeed as the newline character.
|
which explicitly specifies linefeed as the newline character.
|
||||||
<br>
|
</P>
|
||||||
<br>
|
<P>
|
||||||
Alternatively, you can specify that line endings are to be indicated by the two
|
Alternatively, you can specify that line endings are to be indicated by the
|
||||||
character sequence CRLF. If you want this, add
|
two-character sequence CRLF (CR immediately followed by LF). If you want this,
|
||||||
|
add
|
||||||
<pre>
|
<pre>
|
||||||
--enable-newline-is-crlf
|
--enable-newline-is-crlf
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -186,22 +183,26 @@ indicating a line ending. Finally, a fifth option, specified by
|
||||||
<pre>
|
<pre>
|
||||||
--enable-newline-is-any
|
--enable-newline-is-any
|
||||||
</pre>
|
</pre>
|
||||||
causes PCRE2 to recognize any Unicode newline sequence.
|
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
|
||||||
|
sequences are the three just mentioned, plus the single characters VT (vertical
|
||||||
|
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
|
||||||
|
separator, U+2028), and PS (paragraph separator, U+2029).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Whatever line ending convention is selected when PCRE2 is built can be
|
Whatever default line ending convention is selected when PCRE2 is built can be
|
||||||
overridden when the library functions are called. At build time it is
|
overridden by applications that use the library. At build time it is
|
||||||
conventional to use the standard for your operating system.
|
conventional to use the standard for your operating system.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
|
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||||
<P>
|
<P>
|
||||||
By default, the sequence \R in a pattern matches any Unicode newline sequence,
|
By default, the sequence \R in a pattern matches any Unicode newline sequence,
|
||||||
whatever has been selected as the line ending sequence. If you specify
|
independently of what has been selected as the line ending sequence. If you
|
||||||
|
specify
|
||||||
<pre>
|
<pre>
|
||||||
--enable-bsr-anycrlf
|
--enable-bsr-anycrlf
|
||||||
</pre>
|
</pre>
|
||||||
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
|
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
|
||||||
selected when PCRE2 is built can be overridden when the library functions are
|
selected when PCRE2 is built can be overridden by applications that use the
|
||||||
called.
|
library.
|
||||||
</P>
|
</P>
|
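The two \R behaviours described in this hunk can be illustrated outside PCRE2. The sketch below uses Python's own `re` module, not PCRE2, to model the two sets: "any Unicode newline sequence" (CR, LF, CRLF, VT, FF, NEL, LS, PS) and the restricted set chosen by --enable-bsr-anycrlf (CR, LF, CRLF). The names `ANY_NEWLINE`, `ANYCRLF` and `newline_sequences` are hypothetical and exist only for this illustration.

```python
import re

# Any Unicode newline sequence, as listed in the text: CRLF, CR, LF,
# VT (U+000B), FF (U+000C), NEL (U+0085), LS (U+2028), PS (U+2029).
# CRLF comes first so that it is consumed as one two-character sequence.
ANY_NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

# The restricted set selected by --enable-bsr-anycrlf: CR, LF, or CRLF only.
ANYCRLF = re.compile("\r\n|[\r\n]")


def newline_sequences(text, pattern):
    """Return every newline sequence that `pattern` finds in `text`, in order."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "one\r\ntwo\x0bthree\u2028four\nfive\rsix"
any_hits = newline_sequences(sample, ANY_NEWLINE)
anycrlf_hits = newline_sequences(sample, ANYCRLF)
print(any_hits)      # five sequences, including VT and LS
print(anycrlf_hits)  # only the CRLF, LF and CR occurrences
```

In the restricted set, VT and LS are treated as ordinary characters, so only three of the five separators in the sample are reported.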
||||||
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
||||||
|
@ -210,10 +211,10 @@ Within a compiled pattern, offset values are used to point from one part to
|
||||||
another (for example, from an opening parenthesis to an alternation
|
another (for example, from an opening parenthesis to an alternation
|
||||||
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
|
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
|
||||||
are used for these offsets, leading to a maximum size for a compiled pattern of
|
are used for these offsets, leading to a maximum size for a compiled pattern of
|
||||||
around 64K. This is sufficient to handle all but the most gigantic patterns.
|
around 64K code units. This is sufficient to handle all but the most gigantic
|
||||||
Nevertheless, some people do want to process truly enormous patterns, so it is
|
patterns. Nevertheless, some people do want to process truly enormous patterns,
|
||||||
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
|
so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
|
||||||
setting such as
|
adding a setting such as
|
||||||
<pre>
|
<pre>
|
||||||
--with-link-size=3
|
--with-link-size=3
|
||||||
</pre>
|
</pre>
|
||||||
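The "around 64K code units" figure follows from the width of the offset. As a rough sketch, the following computes how many distinct values an unsigned offset of two, three or four bytes can hold. The function name `offset_range` is hypothetical, and the limit that PCRE2 actually enforces for a given link size may be lower than this arithmetic upper bound.

```python
def offset_range(link_size):
    """Number of distinct values an unsigned offset of `link_size` bytes can hold."""
    if link_size not in (2, 3, 4):
        raise ValueError("link size must be 2, 3, or 4")
    return 256 ** link_size


# Arithmetic upper bounds only; not the limits enforced by the library.
ranges = {n: offset_range(n) for n in (2, 3, 4)}
for size, count in ranges.items():
    print(f"link size {size}: {count} values")
```

With two-byte offsets the range is 65536 values, which is where the 64K limit in the text comes from.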
|
@ -294,16 +295,20 @@ hand".)
|
||||||
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
|
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
|
||||||
<P>
|
<P>
|
||||||
PCRE2 assumes by default that it will run in an environment where the character
|
PCRE2 assumes by default that it will run in an environment where the character
|
||||||
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
|
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
||||||
most computer operating systems. PCRE2 can, however, be compiled to run in an
|
most computer operating systems. PCRE2 can, however, be compiled to run in an
|
||||||
EBCDIC environment by adding
|
8-bit EBCDIC environment by adding
|
||||||
<pre>
|
<pre>
|
||||||
--enable-ebcdic --disable-unicode
|
--enable-ebcdic --disable-unicode
|
||||||
</pre>
|
</pre>
|
||||||
to the <b>configure</b> command. This setting implies
|
to the <b>configure</b> command. This setting implies
|
||||||
--enable-rebuild-chartables. You should only use it if you know that you are in
|
--enable-rebuild-chartables. You should only use it if you know that you are in
|
||||||
an EBCDIC environment (for example, an IBM mainframe operating system). The
|
an EBCDIC environment (for example, an IBM mainframe operating system).
|
||||||
--enable-ebcdic option is incompatible with Unicode support.
|
</P>
|
||||||
|
<P>
|
||||||
|
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
||||||
|
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
||||||
|
exclusive.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
|
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
|
||||||
|
@ -347,8 +352,8 @@ parameter value by adding, for example,
|
||||||
<pre>
|
<pre>
|
||||||
--with-pcre2grep-bufsize=50K
|
--with-pcre2grep-bufsize=50K
|
||||||
</pre>
|
</pre>
|
||||||
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can, however,
|
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can override this
|
||||||
override this value by specifying a run-time option.
|
value by using --buffer-size on the command line.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -362,16 +367,16 @@ to the <b>configure</b> command, <b>pcre2test</b> is linked with the
|
||||||
from a terminal, it reads it using the <b>readline()</b> function. This provides
|
from a terminal, it reads it using the <b>readline()</b> function. This provides
|
||||||
line-editing and history facilities. Note that <b>libreadline</b> is
|
line-editing and history facilities. Note that <b>libreadline</b> is
|
||||||
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
|
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
|
||||||
way, there may be licensing issues. These can be avoided by linking with
|
way, there may be licensing issues. These can be avoided by linking instead
|
||||||
<b>libedit</b> (which has a BSD licence) instead.
|
with <b>libedit</b>, which has a BSD licence.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Setting this option causes the <b>-lreadline</b> option to be added to the
|
Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
|
||||||
<b>pcre2test</b> build. In many operating environments with a system-installed
|
added to the <b>pcre2test</b> build. In many operating environments with a
|
||||||
readline library this is sufficient. However, in some environments (e.g. if an
|
system-installed readline library this is sufficient. However, in some
|
||||||
unmodified distribution version of readline is in use), some extra
|
environments (e.g. if an unmodified distribution version of readline is in
|
||||||
configuration may be necessary. The INSTALL file for <b>libreadline</b> says
|
use), some extra configuration may be necessary. The INSTALL file for
|
||||||
this:
|
<b>libreadline</b> says this:
|
||||||
<pre>
|
<pre>
|
||||||
"Readline uses the termcap functions, but does not link with
|
"Readline uses the termcap functions, but does not link with
|
||||||
the termcap or curses library itself, allowing applications
|
the termcap or curses library itself, allowing applications
|
||||||
|
@ -386,13 +391,13 @@ immediately before the <b>configure</b> command.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
By adding the
|
If you add
|
||||||
<pre>
|
<pre>
|
||||||
--enable-valgrind
|
--enable-valgrind
|
||||||
</pre>
|
</pre>
|
||||||
option to to the <b>configure</b> command, PCRE2 will use valgrind annotations
|
to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
|
||||||
to mark certain memory regions as unaddressable. This allows it to detect
|
certain memory regions as unaddressable. This allows it to detect invalid
|
||||||
invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
|
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -466,7 +471,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 03 November 2014
|
Last updated: 23 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -85,29 +85,27 @@ expect.
|
||||||
<P>
|
<P>
|
||||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||||
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is anchored
|
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
|
||||||
and then applied with automatic callouts to the string "aaaa" is:
|
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
||||||
|
"aaaa" is:
|
||||||
<pre>
|
<pre>
|
||||||
--->aaaa
|
--->aaaa
|
||||||
+0 ^ ^
|
+0 ^ a+
|
||||||
+1 ^ a+
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
|
||||||
No match
|
No match
|
||||||
</pre>
|
</pre>
|
||||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||||
and therefore the callouts that would be taken for the backtracks do not occur.
|
and therefore the callouts that would be taken for the backtracks do not occur.
|
||||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
|
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||||
to <b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). If
|
<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
|
||||||
this is done in <b>pcre2test</b> (using the /no_auto_possess qualifier), the
|
case, the output changes to this:
|
||||||
output changes to this:
|
|
||||||
<pre>
|
<pre>
|
||||||
--->aaaa
|
--->aaaa
|
||||||
+0 ^ ^
|
+0 ^ a+
|
||||||
+1 ^ a+
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^^ [bc]
|
||||||
+3 ^^ [bc]
|
|
||||||
No match
|
No match
|
||||||
</pre>
|
</pre>
|
||||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||||
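The difference between the two callout traces in this hunk (one attempt at [bc] with auto-possessification, four without) can be reproduced with a toy model. The sketch below is not PCRE2's matcher. It is a minimal hand-written simulation of matching a+[bc] anchored at offset 0, and the function name `bc_attempts` is hypothetical.

```python
def bc_attempts(subject, possessive):
    """Offsets at which [bc] is tried when a+[bc] is matched anchored at offset 0.

    Toy model of a backtracking matcher, not PCRE2's implementation.
    """
    # a+ is greedy: consume as many 'a' characters as possible.
    end = 0
    while end < len(subject) and subject[end] == "a":
        end += 1
    if end == 0:
        return []  # a+ itself failed, so [bc] is never reached

    attempts = []
    # Try [bc] after the longest run first, then give back one 'a' at a time.
    for pos in range(end, 0, -1):
        attempts.append(pos)
        if pos < len(subject) and subject[pos] in "bc":
            break  # overall match found
        if possessive:
            break  # a++ never gives characters back
    return attempts


backtracking = bc_attempts("aaaa", possessive=False)
possessified = bc_attempts("aaaa", possessive=True)
print(backtracking)
print(possessified)
```

For the subject "aaaa" the backtracking case tries [bc] at four successively earlier offsets, matching the four [bc] callout lines in the second trace, while the possessive case tries it only once.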
|
@ -137,10 +135,10 @@ callouts such as the example above are obeyed.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
|
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
|
||||||
<P>
|
<P>
|
||||||
During matching, when PCRE2 reaches a callout point, the external function that
|
During matching, when PCRE2 reaches a callout point, if an external function is
|
||||||
is set in the match context is called (if it is set). This applies to both
|
set in the match context, it is called. This applies to both normal and DFA
|
||||||
normal and DFA matching. The only argument to the callout function is a pointer
|
matching. The only argument to the callout function is a pointer to a
|
||||||
to a <b>pcre2_callout</b> block. This structure contains the following fields:
|
<b>pcre2_callout</b> block. This structure contains the following fields:
|
||||||
<pre>
|
<pre>
|
||||||
uint32_t <i>version</i>;
|
uint32_t <i>version</i>;
|
||||||
uint32_t <i>callout_number</i>;
|
uint32_t <i>callout_number</i>;
|
||||||
|
@ -169,7 +167,7 @@ automatically generated callouts).
|
||||||
<P>
|
<P>
|
||||||
The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
|
The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
|
||||||
(the "ovector") that was passed to the matching function in the match data
|
(the "ovector") that was passed to the matching function in the match data
|
||||||
block. When <b>pcre2_match()</b> is used, the contents can be inspected, in
|
block. When <b>pcre2_match()</b> is used, the contents can be inspected in
|
||||||
order to extract substrings that have been matched so far, in the same way as
|
order to extract substrings that have been matched so far, in the same way as
|
||||||
for extracting substrings after a match has completed. For the DFA matching
|
for extracting substrings after a match has completed. For the DFA matching
|
||||||
function, this field is not useful.
|
function, this field is not useful.
|
||||||
|
@ -261,7 +259,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 19 October 2014
|
Last updated: 23 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -467,8 +467,8 @@ used. There is no short form for this option.
|
||||||
Processing some regular expression patterns can require a very large amount of
|
Processing some regular expression patterns can require a very large amount of
|
||||||
memory, leading in some cases to a program crash if not enough is available.
|
memory, leading in some cases to a program crash if not enough is available.
|
||||||
Other patterns may take a very long time to search for all possible matching
|
Other patterns may take a very long time to search for all possible matching
|
||||||
strings. The <b>pcre2_exec()</b> function that is called by <b>pcre2grep</b> to do
|
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
|
||||||
the matching has two parameters that can limit the resources that it uses.
|
do the matching has two parameters that can limit the resources that it uses.
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
The <b>--match-limit</b> option provides a means of limiting resource usage
|
The <b>--match-limit</b> option provides a means of limiting resource usage
|
||||||
|
@ -750,7 +750,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 28 September 2014
|
Last updated: 23 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -31,11 +31,11 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<P>
|
<P>
|
||||||
Just-in-time compiling is a heavyweight optimization that can greatly speed up
|
Just-in-time compiling is a heavyweight optimization that can greatly speed up
|
||||||
pattern matching. However, it comes at the cost of extra processing before the
|
pattern matching. However, it comes at the cost of extra processing before the
|
||||||
match is performed. Therefore, it is of most benefit when the same pattern is
|
match is performed, so it is of most benefit when the same pattern is going to
|
||||||
going to be matched many times. This does not necessarily mean many calls of a
|
be matched many times. This does not necessarily mean many calls of a matching
|
||||||
matching function; if the pattern is not anchored, matching attempts may take
|
function; if the pattern is not anchored, matching attempts may take place many
|
||||||
place many times at various positions in the subject, even for a single call.
|
times at various positions in the subject, even for a single call. Therefore,
|
||||||
Therefore, if the subject string is very long, it may still pay to use JIT for
|
if the subject string is very long, it may still pay to use JIT even for
|
||||||
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
|
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
|
||||||
32-bit PCRE2 libraries.
|
32-bit PCRE2 libraries.
|
||||||
</P>
|
</P>
|
||||||
|
@ -103,7 +103,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
|
||||||
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
||||||
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
||||||
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
|
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
|
||||||
returns zero. This is an alternative way of testing if JIT is available.
|
returns zero. This is an alternative way of testing whether JIT is available.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
At present, it is not possible to free JIT compiled code except when the entire
|
At present, it is not possible to free JIT compiled code except when the entire
|
||||||
|
@ -299,7 +299,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just \fIdo
|
||||||
not\fP call <b>pcre2_match()</b> with a match context pointing to an already
|
not\fP call <b>pcre2_match()</b> with a match context pointing to an already
|
||||||
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
|
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
|
||||||
used by <b>pcre2_match()</b> in another thread). You can also replace the stack
|
used by <b>pcre2_match()</b> in another thread). You can also replace the stack
|
||||||
in a context at any time when it is not in use. You can also free the previous
|
in a context at any time when it is not in use. You should free the previous
|
||||||
stack before assigning a replacement.
|
stack before assigning a replacement.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -418,7 +418,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 12 November 2014
|
Last updated: 23 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -421,7 +421,7 @@ appear.
|
||||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||||
</pre>
|
</pre>
|
||||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||||
limits set by the caller of pcre2_exec(), not increase them.
|
limits set by the caller of pcre2_match(), not increase them.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -553,7 +553,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 14 November 2014
|
Last updated: 23 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -72,7 +72,7 @@ but its use can lead to some strange effects because it breaks up multi-unit
|
||||||
characters (see the description of \C in the
|
characters (see the description of \C in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation). The use of \C is not supported in the alternative matching
|
documentation). The use of \C is not supported in the alternative matching
|
||||||
function <b>pcre2_dfa_exec()</b>, nor is it supported in UTF mode by the JIT
|
function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
|
||||||
optimization. If JIT optimization is requested for a UTF pattern that contains
|
optimization. If JIT optimization is requested for a UTF pattern that contains
|
||||||
\C, it will not succeed, and so the matching will be carried out by the normal
|
\C, it will not succeed, and so the matching will be carried out by the normal
|
||||||
interpretive function.
|
interpretive function.
|
||||||
|
@ -141,15 +141,15 @@ UTF-32.)
|
||||||
In some situations, you may already know that your strings are valid, and
|
In some situations, you may already know that your strings are valid, and
|
||||||
therefore want to skip these checks in order to improve performance, for
|
therefore want to skip these checks in order to improve performance, for
|
||||||
example in the case of a long subject string that is being scanned repeatedly.
|
example in the case of a long subject string that is being scanned repeatedly.
|
||||||
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
|
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||||
assumes that the pattern or subject it is given (respectively) contains only
|
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||||
valid UTF code unit sequences.
|
only valid UTF code unit sequences.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
||||||
the pattern; it does not also apply to subject strings. If you want to disable
|
the pattern; it does not also apply to subject strings. If you want to disable
|
||||||
the check for a subject string you must pass this option to <b>pcre2_exec()</b>
|
the check for a subject string you must pass this option to <b>pcre2_match()</b>
|
||||||
or <b>pcre2_dfa_exec()</b>.
|
or <b>pcre2_dfa_match()</b>.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||||
|
@ -261,7 +261,7 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 03 November 2014
|
Last updated: 23 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
338
doc/pcre2.txt
|
@ -2667,12 +2667,12 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
||||||
or UTF-16/UTF-32 strings. To build these additional libraries, add one
|
or UTF-16/UTF-32 strings. To build these additional libraries, add one
|
||||||
or both of the following to the configure command:
|
or both of the following to the configure command:
|
||||||
|
|
||||||
--enable-pcre16
|
--enable-pcre2-16
|
||||||
--enable-pcre32
|
--enable-pcre2-32
|
||||||
|
|
||||||
If you do not want the 8-bit library, add
|
If you do not want the 8-bit library, add
|
||||||
|
|
||||||
--disable-pcre8
|
--disable-pcre2-8
|
||||||
|
|
||||||
as well. At least one of the three libraries must be built. Note that
|
as well. At least one of the three libraries must be built. Note that
|
||||||
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
|
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
|
||||||
|
@ -2683,16 +2683,16 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
||||||
BUILDING SHARED AND STATIC LIBRARIES
|
BUILDING SHARED AND STATIC LIBRARIES
|
||||||
|
|
||||||
The Autotools PCRE2 building process uses libtool to build both shared
|
The Autotools PCRE2 building process uses libtool to build both shared
|
||||||
and static libraries by default. You can suppress one of these by
|
and static libraries by default. You can suppress an unwanted library
|
||||||
adding one of
|
by adding one of
|
||||||
|
|
||||||
--disable-shared
|
--disable-shared
|
||||||
--disable-static
|
--disable-static
|
||||||
|
|
||||||
to the configure command, as required.
|
to the configure command.
|
||||||
|
|
||||||
|
|
||||||
Unicode and UTF SUPPORT
|
UNICODE AND UTF SUPPORT
|
||||||
|
|
||||||
By default, PCRE2 is built with support for Unicode and UTF character
|
By default, PCRE2 is built with support for Unicode and UTF character
|
||||||
strings. To build it without Unicode support, add
|
strings. To build it without Unicode support, add
|
||||||
|
@ -2704,18 +2704,16 @@ Unicode and UTF SUPPORT
|
||||||
another without, in the same configuration.
|
another without, in the same configuration.
|
||||||
|
|
||||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
|
||||||
UTF-16 or UTF-32. To do that you have have to set the PCRE2_UTF option
|
UTF-16 or UTF-32. To do that, applications that use the library have to
|
||||||
when you call pcre2_compile() to compile a pattern.
|
set the PCRE2_UTF option when they call pcre2_compile() to compile a
|
||||||
|
pattern.
|
||||||
|
|
||||||
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
UTF support allows the libraries to process character code points up to
|
||||||
version of the library. Consequently, --enable-unicode and --enable-
|
0x10ffff in the strings that they handle. It also provides support for
|
||||||
ebcdic are mutually exclusive.
|
accessing the Unicode properties of such characters, using pattern
|
||||||
|
escapes such as \P, \p, and \X. Only the general category properties
|
||||||
UTF support allows the libraries to process character codepoints up to
|
such as Lu and Nd are supported. Details are given in the pcre2pattern
|
||||||
0x10ffff in the strings that they handle. It also provides support for
|
documentation.
|
||||||
accessing the properties of such characters, using pattern escapes such
|
|
||||||
as \P, \p, and \X. Only the general category properties such as Lu and
|
|
||||||
Nd are supported. Details are given in the pcre2pattern documentation.
|
|
||||||
|
|
||||||
|
|
||||||
JUST-IN-TIME COMPILER SUPPORT
|
JUST-IN-TIME COMPILER SUPPORT
|
||||||
|
@ -2725,17 +2723,17 @@ JUST-IN-TIME COMPILER SUPPORT
|
||||||
--enable-jit
|
--enable-jit
|
||||||
|
|
||||||
This support is available only for certain hardware architectures. If
|
This support is available only for certain hardware architectures. If
|
||||||
this option is set for an unsupported architecture, a compile time
|
this option is set for an unsupported architecture, a building error
|
||||||
error occurs. See the pcre2jit documentation for a discussion of JIT
|
occurs. See the pcre2jit documentation for a discussion of JIT usage.
|
||||||
usage. When JIT support is enabled, pcre2grep automatically makes use
|
When JIT support is enabled, pcre2grep automatically makes use of it,
|
||||||
of it, unless you add
|
unless you add
|
||||||
|
|
||||||
--disable-pcre2grep-jit
|
--disable-pcre2grep-jit
|
||||||
|
|
||||||
to the "configure" command.
|
to the "configure" command.
|
||||||
|
|
||||||
|
|
||||||
CODE VALUE OF NEWLINE
|
NEWLINE RECOGNITION
|
||||||
|
|
||||||
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
||||||
the end of a line. This is the normal newline character on Unix-like
|
the end of a line. This is the normal newline character on Unix-like
|
||||||
|
@ -2744,11 +2742,12 @@ CODE VALUE OF NEWLINE
|
||||||
|
|
||||||
--enable-newline-is-cr
|
--enable-newline-is-cr
|
||||||
|
|
||||||
to the configure command. There is also a --enable-newline-is-lf
|
to the configure command. There is also an --enable-newline-is-lf
|
||||||
option, which explicitly specifies linefeed as the newline character.
|
option, which explicitly specifies linefeed as the newline character.
|
||||||
|
|
||||||
Alternatively, you can specify that line endings are to be indicated by
|
Alternatively, you can specify that line endings are to be indicated by
|
||||||
the two character sequence CRLF. If you want this, add
|
the two-character sequence CRLF (CR immediately followed by LF). If you
|
||||||
|
want this, add
|
||||||
|
|
||||||
--enable-newline-is-crlf
|
--enable-newline-is-crlf
|
||||||
|
|
||||||
|
@ -2756,41 +2755,46 @@ CODE VALUE OF NEWLINE
|
||||||
|
|
||||||
--enable-newline-is-anycrlf
|
--enable-newline-is-anycrlf
|
||||||
|
|
||||||
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
||||||
CRLF as indicating a line ending. Finally, a fifth option, specified by
|
CRLF as indicating a line ending. Finally, a fifth option, specified by
|
||||||
|
|
||||||
--enable-newline-is-any
|
--enable-newline-is-any
|
||||||
|
|
||||||
causes PCRE2 to recognize any Unicode newline sequence.
|
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
||||||
|
newline sequences are the three just mentioned, plus the single charac-
|
||||||
|
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
|
||||||
|
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
||||||
|
U+2029).
|
||||||
|
|
||||||
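The list of Unicode newline sequences above can be made concrete with a small recognizer. This is only an illustrative sketch, not PCRE2's own code: the function name `unicode_newline_length` and the choice of 32-bit code points as input are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the length, in code points, of the Unicode newline sequence that
   starts at s[0], or 0 if no newline sequence starts there. Hypothetical
   helper for illustration; it is not part of the PCRE2 API. */
size_t unicode_newline_length(const uint32_t *s, size_t remaining)
{
    if (remaining == 0) return 0;
    switch (s[0]) {
    case 0x000D: /* CR alone, or CRLF when LF follows immediately */
        return (remaining > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A: /* LF  (linefeed) */
    case 0x000B: /* VT  (vertical tab) */
    case 0x000C: /* FF  (form feed) */
    case 0x0085: /* NEL (next line) */
    case 0x2028: /* LS  (line separator) */
    case 0x2029: /* PS  (paragraph separator) */
        return 1;
    default:
        return 0;
    }
}
```

Treating CRLF as one two-character sequence, rather than as CR followed by a separate LF, is what distinguishes this from checking single characters only.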
Whatever line ending convention is selected when PCRE2 is built can be
|
Whatever default line ending convention is selected when PCRE2 is built
|
||||||
overridden when the library functions are called. At build time it is
|
can be overridden by applications that use the library. At build time
|
||||||
conventional to use the standard for your operating system.
|
it is conventional to use the standard for your operating system.
|
||||||
|
|
||||||
|
|
||||||
WHAT \R MATCHES
|
WHAT \R MATCHES
|
||||||
|
|
||||||
By default, the sequence \R in a pattern matches any Unicode newline
|
By default, the sequence \R in a pattern matches any Unicode newline
|
||||||
sequence, whatever has been selected as the line ending sequence. If
|
sequence, independently of what has been selected as the line ending
|
||||||
you specify
|
sequence. If you specify
|
||||||
|
|
||||||
--enable-bsr-anycrlf
|
--enable-bsr-anycrlf
|
||||||
|
|
||||||
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
||||||
ever is selected when PCRE2 is built can be overridden when the library
|
ever is selected when PCRE2 is built can be overridden by applications
|
||||||
functions are called.
|
that use the library.
|
||||||
|
|
||||||
|
|
||||||
HANDLING VERY LARGE PATTERNS
|
HANDLING VERY LARGE PATTERNS
|
||||||
|
|
||||||
Within a compiled pattern, offset values are used to point from one
|
Within a compiled pattern, offset values are used to point from one
|
||||||
part to another (for example, from an opening parenthesis to an alter-
|
part to another (for example, from an opening parenthesis to an alter-
|
||||||
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
||||||
two-byte values are used for these offsets, leading to a maximum size
|
two-byte values are used for these offsets, leading to a maximum size
|
||||||
for a compiled pattern of around 64K. This is sufficient to handle all
|
for a compiled pattern of around 64K code units. This is sufficient to
|
||||||
but the most gigantic patterns. Nevertheless, some people do want to
|
handle all but the most gigantic patterns. Nevertheless, some people do
|
||||||
process truly enormous patterns, so it is possible to compile PCRE2 to
|
want to process truly enormous patterns, so it is possible to compile
|
||||||
use three-byte or four-byte offsets by adding a setting such as
|
PCRE2 to use three-byte or four-byte offsets by adding a setting such
|
||||||
|
as
|
||||||
|
|
||||||
--with-link-size=3
|
--with-link-size=3
|
||||||
|
|
||||||
|
@ -2876,25 +2880,28 @@ CREATING CHARACTER TABLES AT BUILD TIME
|
||||||
USING EBCDIC CODE
|
USING EBCDIC CODE
|
||||||
|
|
||||||
PCRE2 assumes by default that it will run in an environment where the
|
PCRE2 assumes by default that it will run in an environment where the
|
||||||
character code is ASCII (or Unicode, which is a superset of ASCII).
|
character code is ASCII or Unicode, which is a superset of ASCII. This
|
||||||
This is the case for most computer operating systems. PCRE2 can, how-
|
is the case for most computer operating systems. PCRE2 can, however, be
|
||||||
ever, be compiled to run in an EBCDIC environment by adding
|
compiled to run in an 8-bit EBCDIC environment by adding
|
||||||
|
|
||||||
--enable-ebcdic --disable-unicode
|
--enable-ebcdic --disable-unicode
|
||||||
|
|
||||||
to the configure command. This setting implies --enable-rebuild-charta-
|
to the configure command. This setting implies --enable-rebuild-charta-
|
||||||
bles. You should only use it if you know that you are in an EBCDIC
|
bles. You should only use it if you know that you are in an EBCDIC
|
||||||
environment (for example, an IBM mainframe operating system). The
|
environment (for example, an IBM mainframe operating system).
|
||||||
--enable-ebcdic option is incompatible with Unicode support.
|
|
||||||
|
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
||||||
|
version of the library. Consequently, --enable-unicode and --enable-
|
||||||
|
ebcdic are mutually exclusive.
|
||||||
|
|
||||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have
|
The EBCDIC character that corresponds to an ASCII LF is assumed to have
|
||||||
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
||||||
is used. In such an environment you should use
|
is used. In such an environment you should use
|
||||||
|
|
||||||
--enable-ebcdic-nl25
|
--enable-ebcdic-nl25
|
||||||
|
|
||||||
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
|
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
|
||||||
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
||||||
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
|
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
|
||||||
acter (which, in Unicode, is 0x85).
|
acter (which, in Unicode, is 0x85).
|
||||||
|
|
||||||
|
@ -2905,32 +2912,32 @@ USING EBCDIC CODE
|
||||||
|
|
||||||
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
|
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
|
||||||
|
|
||||||
By default, pcre2grep reads all files as plain text. You can build it
|
By default, pcre2grep reads all files as plain text. You can build it
|
||||||
so that it recognizes files whose names end in .gz or .bz2, and reads
|
so that it recognizes files whose names end in .gz or .bz2, and reads
|
||||||
them with libz or libbz2, respectively, by adding one or both of
|
them with libz or libbz2, respectively, by adding one or both of
|
||||||
|
|
||||||
--enable-pcre2grep-libz
|
--enable-pcre2grep-libz
|
||||||
--enable-pcre2grep-libbz2
|
--enable-pcre2grep-libbz2
|
||||||
|
|
||||||
to the configure command. These options naturally require that the rel-
|
to the configure command. These options naturally require that the rel-
|
||||||
evant libraries are installed on your system. Configuration will fail
|
evant libraries are installed on your system. Configuration will fail
|
||||||
if they are not.
|
if they are not.
|
||||||
|
|
||||||
|
|
||||||
PCRE2GREP BUFFER SIZE
|
PCRE2GREP BUFFER SIZE
|
||||||
|
|
||||||
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
||||||
scanning, in order to be able to output "before" and "after" lines when
|
scanning, in order to be able to output "before" and "after" lines when
|
||||||
it finds a match. The size of the buffer is controlled by a parameter
|
it finds a match. The size of the buffer is controlled by a parameter
|
||||||
whose default value is 20K. The buffer itself is three times this size,
|
whose default value is 20K. The buffer itself is three times this size,
|
||||||
but because of the way it is used for holding "before" lines, the long-
|
but because of the way it is used for holding "before" lines, the long-
|
||||||
est line that is guaranteed to be processable is the parameter size.
|
est line that is guaranteed to be processable is the parameter size.
|
||||||
You can change the default parameter value by adding, for example,
|
You can change the default parameter value by adding, for example,
|
||||||
|
|
||||||
--with-pcre2grep-bufsize=50K
|
--with-pcre2grep-bufsize=50K
|
||||||
|
|
||||||
to the configure command. The caller of pcre2grep can, however, over-
|
to the configure command. The caller of pcre2grep can override this
|
||||||
ride this value by specifying a run-time option.
|
value by using --buffer-size on the command line.
|
||||||
|
|
||||||
|
|
||||||
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
||||||
|
@ -2940,26 +2947,26 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
||||||
--enable-pcre2test-libreadline
|
--enable-pcre2test-libreadline
|
||||||
--enable-pcre2test-libedit
|
--enable-pcre2test-libedit
|
||||||
|
|
||||||
to the configure command, pcre2test is linked with the libreadline
|
to the configure command, pcre2test is linked with the libreadline
|
||||||
or libedit library, respectively, and when its input is from a terminal,
|
or libedit library, respectively, and when its input is from a terminal,
|
||||||
it reads it using the readline() function. This provides line-editing
|
it reads it using the readline() function. This provides line-editing
|
||||||
and history facilities. Note that libreadline is GPL-licensed, so if
|
and history facilities. Note that libreadline is GPL-licensed, so if
|
||||||
you distribute a binary of pcre2test linked in this way, there may be
|
you distribute a binary of pcre2test linked in this way, there may be
|
||||||
licensing issues. These can be avoided by linking with libedit (which
|
licensing issues. These can be avoided by linking instead with libedit,
|
||||||
has a BSD licence) instead.
|
which has a BSD licence.
|
||||||
|
|
||||||
Setting this option causes the -lreadline option to be added to the
|
Setting --enable-pcre2test-libreadline causes the -lreadline option to
|
||||||
pcre2test build. In many operating environments with a sytem-installed
|
be added to the pcre2test build. In many operating environments with a
|
||||||
readline library this is sufficient. However, in some environments
|
system-installed readline library this is sufficient. However, in some
|
||||||
(e.g. if an unmodified distribution version of readline is in use),
|
environments (e.g. if an unmodified distribution version of readline is
|
||||||
some extra configuration may be necessary. The INSTALL file for
|
in use), some extra configuration may be necessary. The INSTALL file
|
||||||
libreadline says this:
|
for libreadline says this:
|
||||||
|
|
||||||
"Readline uses the termcap functions, but does not link with
|
"Readline uses the termcap functions, but does not link with
|
||||||
the termcap or curses library itself, allowing applications
|
the termcap or curses library itself, allowing applications
|
||||||
which link with readline the to choose an appropriate library."
|
which link with readline the to choose an appropriate library."
|
||||||
|
|
||||||
If your environment has not been set up so that an appropriate library
|
If your environment has not been set up so that an appropriate library
|
||||||
is automatically included, you may need to add something like
|
is automatically included, you may need to add something like
|
||||||
|
|
||||||
LIBS="-lncurses"
|
LIBS="-lncurses"
|
||||||
|
@ -2969,19 +2976,19 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
||||||
|
|
||||||
DEBUGGING WITH VALGRIND SUPPORT
|
DEBUGGING WITH VALGRIND SUPPORT
|
||||||
|
|
||||||
By adding the
|
If you add
|
||||||
|
|
||||||
--enable-valgrind
|
--enable-valgrind
|
||||||
|
|
||||||
option to to the configure command, PCRE2 will use valgrind annotations
|
to the configure command, PCRE2 will use valgrind annotations to mark
|
||||||
to mark certain memory regions as unaddressable. This allows it to
|
certain memory regions as unaddressable. This allows it to detect
|
||||||
detect invalid memory accesses, and is mostly useful for debugging
|
invalid memory accesses, and is mostly useful for debugging PCRE2
|
||||||
PCRE2 itself.
|
itself.
|
||||||
|
|
||||||
|
|
||||||
CODE COVERAGE REPORTING
|
CODE COVERAGE REPORTING
|
||||||
|
|
||||||
If your C compiler is gcc, you can build a version of PCRE2 that can
|
If your C compiler is gcc, you can build a version of PCRE2 that can
|
||||||
generate a code coverage report for its test suite. To enable this, you
|
generate a code coverage report for its test suite. To enable this, you
|
||||||
must install lcov version 1.6 or above. Then specify
|
must install lcov version 1.6 or above. Then specify
|
||||||
|
|
||||||
|
@ -2990,20 +2997,20 @@ CODE COVERAGE REPORTING
|
||||||
to the configure command and build PCRE2 in the usual way.
|
to the configure command and build PCRE2 in the usual way.
|
||||||
|
|
||||||
Note that using ccache (a caching C compiler) is incompatible with code
|
Note that using ccache (a caching C compiler) is incompatible with code
|
||||||
coverage reporting. If you have configured ccache to run automatically
|
coverage reporting. If you have configured ccache to run automatically
|
||||||
on your system, you must set the environment variable
|
on your system, you must set the environment variable
|
||||||
|
|
||||||
CCACHE_DISABLE=1
|
CCACHE_DISABLE=1
|
||||||
|
|
||||||
before running make to build PCRE2, so that ccache is not used.
|
before running make to build PCRE2, so that ccache is not used.
|
||||||
|
|
||||||
When --enable-coverage is used, the following additional targets are
|
When --enable-coverage is used, the following additional targets are
|
||||||
added to the Makefile:
|
added to the Makefile:
|
||||||
|
|
||||||
make coverage
|
make coverage
|
||||||
|
|
||||||
This creates a fresh coverage report for the PCRE2 test suite. It is
|
This creates a fresh coverage report for the PCRE2 test suite. It is
|
||||||
equivalent to running "make coverage-reset", "make coverage-baseline",
|
equivalent to running "make coverage-reset", "make coverage-baseline",
|
||||||
"make check", and then "make coverage-report".
|
"make check", and then "make coverage-report".
|
||||||
|
|
||||||
make coverage-reset
|
make coverage-reset
|
||||||
|
@ -3020,18 +3027,18 @@ CODE COVERAGE REPORTING
|
||||||
|
|
||||||
make coverage-clean-report
|
make coverage-clean-report
|
||||||
|
|
||||||
This removes the generated coverage report without cleaning the cover-
|
This removes the generated coverage report without cleaning the cover-
|
||||||
age data itself.
|
age data itself.
|
||||||
|
|
||||||
make coverage-clean-data
|
make coverage-clean-data
|
||||||
|
|
||||||
This removes the captured coverage data without removing the coverage
|
This removes the captured coverage data without removing the coverage
|
||||||
files created at compile time (*.gcno).
|
files created at compile time (*.gcno).
|
||||||
|
|
||||||
make coverage-clean
|
make coverage-clean
|
||||||
|
|
||||||
This cleans all coverage data including the generated coverage report.
|
This cleans all coverage data including the generated coverage report.
|
||||||
For more information about code coverage, see the gcov and lcov docu-
|
For more information about code coverage, see the gcov and lcov docu-
|
||||||
mentation.
|
mentation.
|
||||||
|
|
||||||
|
|
||||||
|
@ -3049,7 +3056,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 03 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -3122,62 +3129,59 @@ MISSING CALLOUTS
|
||||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
||||||
that what follows cannot be part of the repeat. For example, a+[bc] is
|
that what follows cannot be part of the repeat. For example, a+[bc] is
|
||||||
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
||||||
is anchored and then applied with automatic callouts to the string
|
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
|
||||||
"aaaa" is:
|
to the string "aaaa" is:
|
||||||
|
|
||||||
--->aaaa
|
--->aaaa
|
||||||
+0 ^ ^
|
+0 ^ a+
|
||||||
+1 ^ a+
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
|
||||||
No match
|
No match
|
||||||
|
|
||||||
This indicates that when matching [bc] fails, there is no backtracking
|
This indicates that when matching [bc] fails, there is no backtracking
|
||||||
into a+ and therefore the callouts that would be taken for the back-
|
into a+ and therefore the callouts that would be taken for the back-
|
||||||
tracks do not occur. You can disable the auto-possessify feature by
|
tracks do not occur. You can disable the auto-possessify feature by
|
||||||
passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
|
passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
|
||||||
tern with (*NO_AUTO_POSSESS). If this is done in pcre2test (using the
|
tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
|
||||||
/no_auto_possess qualifier), the output changes to this:
|
|
||||||
|
|
||||||
--->aaaa
|
--->aaaa
|
||||||
+0 ^ ^
|
+0 ^ a+
|
||||||
+1 ^ a+
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^^ [bc]
|
||||||
+3 ^^ [bc]
|
|
||||||
No match
|
No match
|
||||||
|
|
||||||
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
||||||
tries again, repeatedly, until a+ itself fails.
|
tries again, repeatedly, until a+ itself fails.
|
||||||
|
|
||||||
Other optimizations that provide fast "no match" results also affect
|
Other optimizations that provide fast "no match" results also affect
|
||||||
callouts. For example, if the pattern is
|
callouts. For example, if the pattern is
|
||||||
|
|
||||||
ab(?C4)cd
|
ab(?C4)cd
|
||||||
|
|
||||||
PCRE2 knows that any matching string must contain the letter "d". If
|
PCRE2 knows that any matching string must contain the letter "d". If
|
||||||
the subject string is "abyz", the lack of "d" means that matching
|
the subject string is "abyz", the lack of "d" means that matching
|
||||||
doesn't ever start, and the callout is never reached. However, with
|
doesn't ever start, and the callout is never reached. However, with
|
||||||
"abyd", though the result is still no match, the callout is obeyed.
|
"abyd", though the result is still no match, the callout is obeyed.
|
||||||
|
|
||||||
PCRE2 also knows the minimum length of a matching string, and will
|
PCRE2 also knows the minimum length of a matching string, and will
|
||||||
immediately give a "no match" return without actually running a match
|
immediately give a "no match" return without actually running a match
|
||||||
if the subject is not long enough, or, for unanchored patterns, if it
|
if the subject is not long enough, or, for unanchored patterns, if it
|
||||||
has been scanned far enough.
|
has been scanned far enough.
|
||||||
|
|
||||||
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
|
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
|
||||||
MIZE option to pcre2_compile(), or by starting the pattern with
|
MIZE option to pcre2_compile(), or by starting the pattern with
|
||||||
(*NO_START_OPT). This slows down the matching process, but does ensure
|
(*NO_START_OPT). This slows down the matching process, but does ensure
|
||||||
that callouts such as the example above are obeyed.
|
that callouts such as the example above are obeyed.
|
||||||
|
|
||||||
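The effect of these start-of-match optimizations on callouts can be pictured with a simplified pre-check. The sketch below is an assumption-laden illustration and not how PCRE2 is implemented: the function `worth_matching` and its parameters are hypothetical, and it models only the two pieces of knowledge mentioned above (a required character and a minimum match length).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pre-check, not PCRE2's own code. Returns 0 when no match is
   possible, in which case a matcher would never start and no callout in
   the pattern would be reached; returns 1 otherwise. */
int worth_matching(const char *subject, size_t length,
                   size_t min_length, char required_char)
{
    /* A subject shorter than the minimum match length cannot match. */
    if (length < min_length) return 0;
    /* Every match must contain the required character somewhere. */
    return memchr(subject, required_char, length) != NULL;
}
```

For the pattern ab(?C4)cd, with a minimum length of 4 and the required character "d", the subject "abyz" is rejected by this check, whereas "abyd" passes it and would go on to a real match attempt, during which the callout is reached.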
|
|
||||||
THE CALLOUT INTERFACE
|
THE CALLOUT INTERFACE
|
||||||
|
|
||||||
During matching, when PCRE2 reaches a callout point, the external func-
|
During matching, when PCRE2 reaches a callout point, if an external
|
||||||
tion that is set in the match context is called (if it is set). This
|
function is set in the match context, it is called. This applies to
|
||||||
applies to both normal and DFA matching. The only argument to the call-
|
both normal and DFA matching. The only argument to the callout function
|
||||||
out function is a pointer to a pcre2_callout block. This structure con-
|
is a pointer to a pcre2_callout block. This structure contains the fol-
|
||||||
tains the following fields:
|
lowing fields:
|
||||||
|
|
||||||
uint32_t version;
|
uint32_t version;
|
||||||
uint32_t callout_number;
|
uint32_t callout_number;
|
||||||
|
@ -3193,69 +3197,69 @@ THE CALLOUT INTERFACE
|
||||||
PCRE2_SIZE pattern_position;
|
PCRE2_SIZE pattern_position;
|
||||||
PCRE2_SIZE next_item_length;
|
PCRE2_SIZE next_item_length;
|
||||||
|
|
||||||
The version field contains the version number of the block format. The
|
The version field contains the version number of the block format. The
|
||||||
current version is 0. The version number will change in future if addi-
|
current version is 0. The version number will change in future if addi-
|
||||||
tional fields are added, but the intention is never to remove any of
|
tional fields are added, but the intention is never to remove any of
|
||||||
the existing fields.
|
the existing fields.
|
||||||
|
|
||||||
The callout_number field contains the number of the callout, as com-
|
The callout_number field contains the number of the callout, as com-
|
||||||
piled into the pattern (that is, the number after ?C for manual call-
|
piled into the pattern (that is, the number after ?C for manual call-
|
||||||
outs, and 255 for automatically generated callouts).
|
outs, and 255 for automatically generated callouts).
|
||||||
|
|
||||||
The offset_vector field is a pointer to the vector of capturing offsets
|
The offset_vector field is a pointer to the vector of capturing offsets
|
||||||
(the "ovector") that was passed to the matching function in the match
|
(the "ovector") that was passed to the matching function in the match
|
||||||
data block. When pcre2_match() is used, the contents can be inspected,
|
data block. When pcre2_match() is used, the contents can be inspected
|
||||||
in order to extract substrings that have been matched so far, in the
|
in order to extract substrings that have been matched so far, in the
|
||||||
same way as for extracting substrings after a match has completed. For
|
same way as for extracting substrings after a match has completed. For
|
||||||
the DFA matching function, this field is not useful.
|
the DFA matching function, this field is not useful.
|
||||||
|
|
||||||
The subject and subject_length fields contain copies of the values that
|
The subject and subject_length fields contain copies of the values that
|
||||||
were passed to the matching function.
|
were passed to the matching function.
|
||||||
|
|
||||||
The start_match field normally contains the offset within the subject
|
The start_match field normally contains the offset within the subject
|
||||||
at which the current match attempt started. However, if the escape
|
at which the current match attempt started. However, if the escape
|
||||||
sequence \K has been encountered, this value is changed to reflect the
|
sequence \K has been encountered, this value is changed to reflect the
|
||||||
modified starting point. If the pattern is not anchored, the callout
|
modified starting point. If the pattern is not anchored, the callout
|
||||||
function may be called several times from the same point in the pattern
|
function may be called several times from the same point in the pattern
|
||||||
for different starting points in the subject.
|
for different starting points in the subject.
|
||||||
|
|
||||||
The current_position field contains the offset within the subject of
|
The current_position field contains the offset within the subject of
|
||||||
the current match pointer.
|
the current match pointer.
|
||||||
|
|
||||||
When pcre2_match() is used, the capture_top field contains one more
|
When pcre2_match() is used, the capture_top field contains one more
|
||||||
than the number of the highest numbered captured substring so far. If
|
than the number of the highest numbered captured substring so far. If
|
||||||
no substrings have been captured, the value of capture_top is one. This
|
no substrings have been captured, the value of capture_top is one. This
|
||||||
is always the case when the DFA functions are used, because they do not
|
is always the case when the DFA functions are used, because they do not
|
||||||
support captured substrings.
|
support captured substrings.
|
||||||
|
|
||||||
The capture_last field contains the number of the most recently cap-
|
The capture_last field contains the number of the most recently cap-
|
||||||
tured substring. However, when a recursion exits, the value reverts to
|
tured substring. However, when a recursion exits, the value reverts to
|
||||||
what it was outside the recursion, as do the values of all captured
|
what it was outside the recursion, as do the values of all captured
|
||||||
substrings. If no substrings have been captured, the value of cap-
|
substrings. If no substrings have been captured, the value of cap-
|
||||||
ture_last is 0. This is always the case for the DFA matching functions.
|
ture_last is 0. This is always the case for the DFA matching functions.
|
||||||
|
|
||||||
The callout_data field contains a value that is passed to a matching
|
The callout_data field contains a value that is passed to a matching
|
||||||
function specifically so that it can be passed back in callouts. It is
|
function specifically so that it can be passed back in callouts. It is
|
||||||
set in the match context when the callout is set up by calling
|
set in the match context when the callout is set up by calling
|
||||||
pcre2_set_callout() (see the pcre2api documentation).
|
pcre2_set_callout() (see the pcre2api documentation).
|
||||||
|
|
||||||
The pattern_position field contains the offset to the next item to be
|
The pattern_position field contains the offset to the next item to be
|
||||||
matched in the pattern string.
|
matched in the pattern string.
|
||||||
|
|
||||||
The next_item_length field contains the length of the next item to be
|
The next_item_length field contains the length of the next item to be
|
||||||
matched in the pattern string. When the callout immediately precedes an
|
matched in the pattern string. When the callout immediately precedes an
|
||||||
alternation bar, a closing parenthesis, or the end of the pattern, the
|
alternation bar, a closing parenthesis, or the end of the pattern, the
|
||||||
length is zero. When the callout precedes an opening parenthesis, the
|
length is zero. When the callout precedes an opening parenthesis, the
|
||||||
length is that of the entire subpattern.
|
length is that of the entire subpattern.
|
||||||
|
|
||||||
The pattern_position and next_item_length fields are intended to help
|
The pattern_position and next_item_length fields are intended to help
|
||||||
in distinguishing between different automatic callouts, which all have
|
in distinguishing between different automatic callouts, which all have
|
||||||
the same callout number. However, they are set for all callouts.
|
the same callout number. However, they are set for all callouts.
|
||||||
|
|
||||||
In callouts from pcre2_match() the mark field contains a pointer to the
|
In callouts from pcre2_match() the mark field contains a pointer to the
|
||||||
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
|
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
|
||||||
(*THEN) item in the match, or NULL if no such items have been passed.
|
(*THEN) item in the match, or NULL if no such items have been passed.
|
||||||
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
|
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
|
||||||
previous (*MARK). In callouts from the DFA matching function this field
|
previous (*MARK). In callouts from the DFA matching function this field
|
||||||
always contains NULL.
|
always contains NULL.
|
||||||
|
|
||||||
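The relationship between the fields described in this hunk can be sketched in C. This is only an illustration: demo_callout_block and both helper functions are hypothetical stand-ins that model just the four fields named above, not the real callout block type from pcre2.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the callout block: only the fields discussed
   above are modelled. Real code would use the type declared in pcre2.h. */
typedef struct {
    size_t   start_match;       /* offset where this match attempt started */
    size_t   current_position;  /* offset of the current match pointer */
    uint32_t capture_top;       /* one more than the highest captured group */
    uint32_t capture_last;      /* most recently captured group, 0 if none */
} demo_callout_block;

/* Number of the highest captured substring so far; 0 means none yet,
   because capture_top is one when nothing has been captured. */
uint32_t demo_highest_capture(const demo_callout_block *cb)
{
    return cb->capture_top - 1;
}

/* Count of code units between the start of the match attempt and the
   current match pointer. Because \K can move start_match, the guard
   avoids unsigned wrap-around if the start lies beyond the pointer. */
size_t demo_consumed_length(const demo_callout_block *cb)
{
    if (cb->current_position < cb->start_match)
        return 0;
    return cb->current_position - cb->start_match;
}
```

With the DFA functions, capture_top is always one and capture_last is always 0, so demo_highest_capture() would always return 0 there.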
|
@ -3263,16 +3267,16 @@ THE CALLOUT INTERFACE
|
||||||
RETURN VALUES
|
RETURN VALUES
|
||||||
|
|
||||||
The external callout function returns an integer to PCRE2. If the value
|
The external callout function returns an integer to PCRE2. If the value
|
||||||
is zero, matching proceeds as normal. If the value is greater than
|
is zero, matching proceeds as normal. If the value is greater than
|
||||||
zero, matching fails at the current point, but the testing of other
|
zero, matching fails at the current point, but the testing of other
|
||||||
matching possibilities goes ahead, just as if a lookahead assertion had
|
matching possibilities goes ahead, just as if a lookahead assertion had
|
||||||
failed. If the value is less than zero, the match is abandoned, and the
|
failed. If the value is less than zero, the match is abandoned, and the
|
||||||
matching function returns the negative value.
|
matching function returns the negative value.
|
||||||
|
|
||||||
Negative values should normally be chosen from the set of
|
Negative values should normally be chosen from the set of
|
||||||
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
|
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
|
||||||
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
|
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
|
||||||
reserved for use by callout functions; it will never be used by PCRE2
|
reserved for use by callout functions; it will never be used by PCRE2
|
||||||
itself.
|
itself.
|
||||||
|
|
||||||
|
|
||||||
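The three classes of callout return value described under RETURN VALUES can be summarised in a short C sketch. All names here (demo_callout_effect, demo_classify_callout_result, demo_budget, demo_budget_callout) are hypothetical and exist only for illustration; they are not part of the PCRE2 API, and the function signature is deliberately simpler than that of a real callout.

```c
/* How the matcher reacts to a callout's return value, per the text above. */
typedef enum {
    DEMO_CONTINUE,   /* zero: matching proceeds as normal */
    DEMO_FAIL_HERE,  /* positive: fail at this point, try other possibilities */
    DEMO_ABANDON     /* negative: abandon the match and return the value */
} demo_callout_effect;

demo_callout_effect demo_classify_callout_result(int rc)
{
    if (rc == 0)
        return DEMO_CONTINUE;
    return (rc > 0) ? DEMO_FAIL_HERE : DEMO_ABANDON;
}

/* Hypothetical callout logic that abandons the match after a fixed number
   of calls. In real code, abandon_code would normally be one of the
   negative PCRE2_ERROR_xxx values, for example PCRE2_ERROR_CALLOUT. */
typedef struct {
    unsigned calls;   /* callouts seen so far */
    unsigned budget;  /* maximum callouts allowed */
} demo_budget;

int demo_budget_callout(demo_budget *b, int abandon_code)
{
    if (++b->calls > b->budget)
        return abandon_code;  /* negative: the match is abandoned */
    return 0;                 /* zero: carry on matching */
}
```

Returning a positive value instead of abandon_code would only reject the current path, so the matcher would keep testing other possibilities.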
|
@ -3285,7 +3289,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 19 October 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -3487,12 +3491,12 @@ PCRE2 JUST-IN-TIME COMPILER SUPPORT
|
||||||
|
|
||||||
Just-in-time compiling is a heavyweight optimization that can greatly
|
Just-in-time compiling is a heavyweight optimization that can greatly
|
||||||
speed up pattern matching. However, it comes at the cost of extra pro-
|
speed up pattern matching. However, it comes at the cost of extra pro-
|
||||||
cessing before the match is performed. Therefore, it is of most benefit
|
cessing before the match is performed, so it is of most benefit when
|
||||||
when the same pattern is going to be matched many times. This does not
|
the same pattern is going to be matched many times. This does not nec-
|
||||||
necessarily mean many calls of a matching function; if the pattern is
|
essarily mean many calls of a matching function; if the pattern is not
|
||||||
not anchored, matching attempts may take place many times at various
|
anchored, matching attempts may take place many times at various posi-
|
||||||
positions in the subject, even for a single call. Therefore, if the
|
tions in the subject, even for a single call. Therefore, if the subject
|
||||||
subject string is very long, it may still pay to use JIT for one-off
|
string is very long, it may still pay to use JIT even for one-off
|
||||||
matches. JIT support is available for all of the 8-bit, 16-bit and
|
matches. JIT support is available for all of the 8-bit, 16-bit and
|
||||||
32-bit PCRE2 libraries.
|
32-bit PCRE2 libraries.
|
||||||
|
|
||||||
|
@ -3558,8 +3562,8 @@ SIMPLE USE OF JIT
|
||||||
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
|
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
|
||||||
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
|
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
|
||||||
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
|
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
|
||||||
diately returns zero. This is an alternative way of testing if JIT is
|
diately returns zero. This is an alternative way of testing whether JIT
|
||||||
available.
|
is available.
|
||||||
|
|
||||||
At present, it is not possible to free JIT compiled code except when
|
At present, it is not possible to free JIT compiled code except when
|
||||||
the entire compiled pattern is freed by calling pcre2_free_code().
|
the entire compiled pattern is freed by calling pcre2_free_code().
|
||||||
|
@ -3745,7 +3749,7 @@ JIT STACK FAQ
|
||||||
an already freed stack, as that will cause SEGFAULT. (Also, do not free
|
an already freed stack, as that will cause SEGFAULT. (Also, do not free
|
||||||
a stack currently used by pcre2_match() in another thread). You can
|
a stack currently used by pcre2_match() in another thread). You can
|
||||||
also replace the stack in a context at any time when it is not in use.
|
also replace the stack in a context at any time when it is not in use.
|
||||||
You can also free the previous stack before assigning a replacement.
|
You should free the previous stack before assigning a replacement.
|
||||||
|
|
||||||
(5) Should I allocate/free a stack every time before/after calling
|
(5) Should I allocate/free a stack every time before/after calling
|
||||||
pcre2_match()?
|
pcre2_match()?
|
||||||
|
@ -3855,7 +3859,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 12 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -4642,7 +4646,7 @@ WIDE CHARACTERS AND UTF MODES
|
||||||
UTF mode, but its use can lead to some strange effects because it
|
UTF mode, but its use can lead to some strange effects because it
|
||||||
breaks up multi-unit characters (see the description of \C in the
|
breaks up multi-unit characters (see the description of \C in the
|
||||||
pcre2pattern documentation). The use of \C is not supported in the
|
pcre2pattern documentation). The use of \C is not supported in the
|
||||||
alternative matching function pcre2_dfa_exec(), nor is it supported in
|
alternative matching function pcre2_dfa_match(), nor is it supported in
|
||||||
UTF mode by the JIT optimization. If JIT optimization is requested for
|
UTF mode by the JIT optimization. If JIT optimization is requested for
|
||||||
a UTF pattern that contains \C, it will not succeed, and so the match-
|
a UTF pattern that contains \C, it will not succeed, and so the match-
|
||||||
ing will be carried out by the normal interpretive function.
|
ing will be carried out by the normal interpretive function.
|
||||||
|
@ -4701,14 +4705,14 @@ VALIDITY OF UTF STRINGS
|
||||||
In some situations, you may already know that your strings are valid,
|
In some situations, you may already know that your strings are valid,
|
||||||
and therefore want to skip these checks in order to improve perfor-
|
and therefore want to skip these checks in order to improve perfor-
|
||||||
mance, for example in the case of a long subject string that is being
|
mance, for example in the case of a long subject string that is being
|
||||||
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK flag at compile
|
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
|
||||||
time or at run time, PCRE2 assumes that the pattern or subject it is
|
pile time or at match time, PCRE2 assumes that the pattern or subject
|
||||||
given (respectively) contains only valid UTF code unit sequences.
|
it is given (respectively) contains only valid UTF code unit sequences.
|
||||||
|
|
||||||
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
|
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
|
||||||
for the pattern; it does not also apply to subject strings. If you want
|
for the pattern; it does not also apply to subject strings. If you want
|
||||||
to disable the check for a subject string you must pass this option to
|
to disable the check for a subject string you must pass this option to
|
||||||
pcre2_exec() or pcre2_dfa_exec().
|
pcre2_match() or pcre2_dfa_match().
|
||||||
|
|
||||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
||||||
result is undefined and your program may crash or loop indefinitely.
|
result is undefined and your program may crash or loop indefinitely.
|
||||||
|
@ -4807,7 +4811,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 03 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
|
.TH PCRE2API 3 "23 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -2090,6 +2090,13 @@ returned by \fBpcre2_get_startchar()\fP. For a non-partial match, this can be
|
||||||
different to the value of \fIovector[0]\fP if the pattern contains the \eK
|
different to the value of \fIovector[0]\fP if the pattern contains the \eK
|
||||||
escape sequence. After a partial match, however, this value is always the same
|
escape sequence. After a partial match, however, this value is always the same
|
||||||
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
||||||
|
.P
|
||||||
|
The \fBstartchar\fP field is also used to return the offset of an invalid
|
||||||
|
UTF character when UTF checking fails. Details are given in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2unicode\fP
|
||||||
|
.\"
|
||||||
|
page.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="errorlist"></a>
|
.\" HTML <a name="errorlist"></a>
|
||||||
|
@ -2707,6 +2714,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 21 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
109
doc/pcre2build.3
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2BUILD 3 "03 November 2014" "PCRE2 10.00"
|
.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.
|
.
|
||||||
|
@ -74,12 +74,12 @@ respectively. These can be interpreted either as single-unit characters or
|
||||||
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
|
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
|
||||||
the following to the \fBconfigure\fP command:
|
the following to the \fBconfigure\fP command:
|
||||||
.sp
|
.sp
|
||||||
--enable-pcre16
|
--enable-pcre2-16
|
||||||
--enable-pcre32
|
--enable-pcre2-32
|
||||||
.sp
|
.sp
|
||||||
If you do not want the 8-bit library, add
|
If you do not want the 8-bit library, add
|
||||||
.sp
|
.sp
|
||||||
--disable-pcre8
|
--disable-pcre2-8
|
||||||
.sp
|
.sp
|
||||||
as well. At least one of the three libraries must be built. Note that the POSIX
|
as well. At least one of the three libraries must be built. Note that the POSIX
|
||||||
wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit
|
wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit
|
||||||
|
@ -91,15 +91,16 @@ libraries.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared
|
The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared
|
||||||
and static libraries by default. You can suppress one of these by adding one of
|
and static libraries by default. You can suppress an unwanted library by adding
|
||||||
|
one of
|
||||||
.sp
|
.sp
|
||||||
--disable-shared
|
--disable-shared
|
||||||
--disable-static
|
--disable-static
|
||||||
.sp
|
.sp
|
||||||
to the \fBconfigure\fP command, as required.
|
to the \fBconfigure\fP command.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "Unicode and UTF SUPPORT"
|
.SH "UNICODE AND UTF SUPPORT"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
By default, PCRE2 is built with support for Unicode and UTF character strings.
|
By default, PCRE2 is built with support for Unicode and UTF character strings.
|
||||||
|
@ -112,18 +113,14 @@ is not possible to build one library with Unicode support, and another without,
|
||||||
in the same configuration.
|
in the same configuration.
|
||||||
.P
|
.P
|
||||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
||||||
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
|
or UTF-32. To do that, applications that use the library have to set the
|
||||||
\fBpcre2_compile()\fP to compile a pattern.
|
PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
|
||||||
.P
|
.P
|
||||||
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
UTF support allows the libraries to process character code points up to
|
||||||
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
0x10ffff in the strings that they handle. It also provides support for
|
||||||
exclusive.
|
accessing the Unicode properties of such characters, using pattern escapes such
|
||||||
.P
|
as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
|
||||||
UTF support allows the libraries to process character codepoints up to 0x10ffff
|
\fINd\fP are supported. Details are given in the
|
||||||
in the strings that they handle. It also provides support for accessing the
|
|
||||||
properties of such characters, using pattern escapes such as \eP, \ep, and \eX.
|
|
||||||
Only the general category properties such as \fILu\fP and \fINd\fP are
|
|
||||||
supported. Details are given in the
|
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -138,7 +135,7 @@ Just-in-time compiler support is included in the build by specifying
|
||||||
--enable-jit
|
--enable-jit
|
||||||
.sp
|
.sp
|
||||||
This support is available only for certain hardware architectures. If this
|
This support is available only for certain hardware architectures. If this
|
||||||
option is set for an unsupported architecture, a compile time error occurs.
|
option is set for an unsupported architecture, a build-time error occurs.
|
||||||
See the
|
See the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2jit\fP
|
\fBpcre2jit\fP
|
||||||
|
@ -151,7 +148,7 @@ pcre2grep automatically makes use of it, unless you add
|
||||||
to the "configure" command.
|
to the "configure" command.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "CODE VALUE OF NEWLINE"
|
.SH "NEWLINE RECOGNITION"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
||||||
|
@ -160,11 +157,12 @@ compile PCRE2 to use carriage return (CR) instead, by adding
|
||||||
.sp
|
.sp
|
||||||
--enable-newline-is-cr
|
--enable-newline-is-cr
|
||||||
.sp
|
.sp
|
||||||
to the \fBconfigure\fP command. There is also a --enable-newline-is-lf option,
|
to the \fBconfigure\fP command. There is also an --enable-newline-is-lf option,
|
||||||
which explicitly specifies linefeed as the newline character.
|
which explicitly specifies linefeed as the newline character.
|
||||||
.sp
|
.P
|
||||||
Alternatively, you can specify that line endings are to be indicated by the two
|
Alternatively, you can specify that line endings are to be indicated by the
|
||||||
character sequence CRLF. If you want this, add
|
two-character sequence CRLF (CR immediately followed by LF). If you want this,
|
||||||
|
add
|
||||||
.sp
|
.sp
|
||||||
--enable-newline-is-crlf
|
--enable-newline-is-crlf
|
||||||
.sp
|
.sp
|
||||||
|
@ -177,10 +175,13 @@ indicating a line ending. Finally, a fifth option, specified by
|
||||||
.sp
|
.sp
|
||||||
--enable-newline-is-any
|
--enable-newline-is-any
|
||||||
.sp
|
.sp
|
||||||
causes PCRE2 to recognize any Unicode newline sequence.
|
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
|
||||||
|
sequences are the three just mentioned, plus the single characters VT (vertical
|
||||||
|
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
|
||||||
|
separator, U+2028), and PS (paragraph separator, U+2029).
|
||||||
.P
|
.P
|
||||||
Whatever line ending convention is selected when PCRE2 is built can be
|
Whatever default line ending convention is selected when PCRE2 is built can be
|
||||||
overridden when the library functions are called. At build time it is
|
overridden by applications that use the library. At build time it is
|
||||||
conventional to use the standard for your operating system.
|
conventional to use the standard for your operating system.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@ -188,12 +189,13 @@ conventional to use the standard for your operating system.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
By default, the sequence \eR in a pattern matches any Unicode newline sequence,
|
By default, the sequence \eR in a pattern matches any Unicode newline sequence,
|
||||||
whatever has been selected as the line ending sequence. If you specify
|
independently of what has been selected as the line ending sequence. If you
|
||||||
|
specify
|
||||||
.sp
|
.sp
|
||||||
--enable-bsr-anycrlf
|
--enable-bsr-anycrlf
|
||||||
.sp
|
.sp
|
||||||
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
|
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
|
||||||
selected when PCRE2 is built can be overridden when the library functions are
|
selected when PCRE2 is built can be overridden by applications that use the
|
||||||
called.
|
library.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
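The difference between the default \eR behaviour (any Unicode newline sequence) and the --enable-bsr-anycrlf behaviour (only CR, LF, or CRLF) can be sketched in C. This is an illustration of the rules stated in the text, not PCRE2's implementation; demo_newline_length and its parameters are hypothetical names, and the subject is assumed to be an array of already-decoded code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of code points forming a newline sequence at the
   start of s (length n), or 0 if none starts there. When anycrlf_only is
   non-zero, only CR, LF, and CRLF are recognized. */
size_t demo_newline_length(const uint32_t *s, size_t n, int anycrlf_only)
{
    if (n == 0)
        return 0;
    switch (s[0]) {
    case 0x000D:  /* CR, which forms CRLF when immediately followed by LF */
        return (n > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A:  /* LF */
        return 1;
    case 0x000B:  /* VT, vertical tab */
    case 0x000C:  /* FF, form feed */
    case 0x0085:  /* NEL, next line */
    case 0x2028:  /* LS, line separator */
    case 0x2029:  /* PS, paragraph separator */
        return anycrlf_only ? 0 : 1;
    default:
        return 0;
    }
}
```

CRLF is checked first so that the two-character sequence counts as a single line ending rather than two.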
|
@ -204,10 +206,10 @@ Within a compiled pattern, offset values are used to point from one part to
|
||||||
another (for example, from an opening parenthesis to an alternation
|
another (for example, from an opening parenthesis to an alternation
|
||||||
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
|
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
|
||||||
are used for these offsets, leading to a maximum size for a compiled pattern of
|
are used for these offsets, leading to a maximum size for a compiled pattern of
|
||||||
around 64K. This is sufficient to handle all but the most gigantic patterns.
|
around 64K code units. This is sufficient to handle all but the most gigantic
|
||||||
Nevertheless, some people do want to process truly enormous patterns, so it is
|
patterns. Nevertheless, some people do want to process truly enormous patterns,
|
||||||
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
|
so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
|
||||||
setting such as
|
adding a setting such as
|
||||||
.sp
|
.sp
|
||||||
--with-link-size=3
|
--with-link-size=3
|
||||||
.sp
|
.sp
|
||||||
|
@ -299,16 +301,19 @@ hand".)
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
PCRE2 assumes by default that it will run in an environment where the character
|
PCRE2 assumes by default that it will run in an environment where the character
|
||||||
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
|
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
||||||
most computer operating systems. PCRE2 can, however, be compiled to run in an
|
most computer operating systems. PCRE2 can, however, be compiled to run in an
|
||||||
EBCDIC environment by adding
|
8-bit EBCDIC environment by adding
|
||||||
.sp
|
.sp
|
||||||
--enable-ebcdic --disable-unicode
|
--enable-ebcdic --disable-unicode
|
||||||
.sp
|
.sp
|
||||||
to the \fBconfigure\fP command. This setting implies
|
to the \fBconfigure\fP command. This setting implies
|
||||||
--enable-rebuild-chartables. You should only use it if you know that you are in
|
--enable-rebuild-chartables. You should only use it if you know that you are in
|
||||||
an EBCDIC environment (for example, an IBM mainframe operating system). The
|
an EBCDIC environment (for example, an IBM mainframe operating system).
|
||||||
--enable-ebcdic option is incompatible with Unicode support.
|
.P
|
||||||
|
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
||||||
|
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
||||||
|
exclusive.
|
||||||
.P
|
.P
|
||||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
|
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
|
||||||
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
|
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
|
||||||
|
@ -354,8 +359,8 @@ parameter value by adding, for example,
|
||||||
.sp
|
.sp
|
||||||
--with-pcre2grep-bufsize=50K
|
--with-pcre2grep-bufsize=50K
|
||||||
.sp
|
.sp
|
||||||
to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can, however,
|
to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override this
|
||||||
override this value by specifying a run-time option.
|
value by using --buffer-size on the command line.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
|
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
|
||||||
|
@ -371,15 +376,15 @@ to the \fBconfigure\fP command, \fBpcre2test\fP is linked with the
|
||||||
from a terminal, it reads it using the \fBreadline()\fP function. This provides
|
from a terminal, it reads it using the \fBreadline()\fP function. This provides
|
||||||
line-editing and history facilities. Note that \fBlibreadline\fP is
|
line-editing and history facilities. Note that \fBlibreadline\fP is
|
||||||
GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this
|
GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this
|
||||||
way, there may be licensing issues. These can be avoided by linking with
|
way, there may be licensing issues. These can be avoided by linking instead
|
||||||
\fBlibedit\fP (which has a BSD licence) instead.
|
with \fBlibedit\fP, which has a BSD licence.
|
||||||
.P
|
.P
|
||||||
Setting this option causes the \fB-lreadline\fP option to be added to the
|
Setting --enable-pcre2test-libreadline causes the \fB-lreadline\fP option to be
|
||||||
\fBpcre2test\fP build. In many operating environments with a sytem-installed
|
added to the \fBpcre2test\fP build. In many operating environments with a
|
||||||
readline library this is sufficient. However, in some environments (e.g. if an
|
system-installed readline library this is sufficient. However, in some
|
||||||
unmodified distribution version of readline is in use), some extra
|
environments (e.g. if an unmodified distribution version of readline is in
|
||||||
configuration may be necessary. The INSTALL file for \fBlibreadline\fP says
|
use), some extra configuration may be necessary. The INSTALL file for
|
||||||
this:
|
\fBlibreadline\fP says this:
|
||||||
.sp
|
.sp
|
||||||
"Readline uses the termcap functions, but does not link with
|
"Readline uses the termcap functions, but does not link with
|
||||||
the termcap or curses library itself, allowing applications
|
the termcap or curses library itself, allowing applications
|
||||||
|
@ -396,13 +401,13 @@ immediately before the \fBconfigure\fP command.
|
||||||
.SH "DEBUGGING WITH VALGRIND SUPPORT"
|
.SH "DEBUGGING WITH VALGRIND SUPPORT"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
By adding the
|
If you add
|
||||||
.sp
|
.sp
|
||||||
--enable-valgrind
|
--enable-valgrind
|
||||||
.sp
|
.sp
|
||||||
option to to the \fBconfigure\fP command, PCRE2 will use valgrind annotations
|
to the \fBconfigure\fP command, PCRE2 will use valgrind annotations to mark
|
||||||
to mark certain memory regions as unaddressable. This allows it to detect
|
certain memory regions as unaddressable. This allows it to detect invalid
|
||||||
invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
|
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "CODE COVERAGE REPORTING"
|
.SH "CODE COVERAGE REPORTING"
|
||||||
|
@ -482,6 +487,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 03 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2CALLOUT 3 "19 October 2014" "PCRE2 10.00"
|
.TH PCRE2CALLOUT 3 "23 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -68,29 +68,27 @@ expect.
|
||||||
.P
|
.P
|
||||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||||
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is anchored
|
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
|
||||||
and then applied with automatic callouts to the string "aaaa" is:
|
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
||||||
|
"aaaa" is:
|
||||||
.sp
|
.sp
|
||||||
--->aaaa
|
--->aaaa
|
||||||
+0 ^ ^
|
+0 ^ a+
|
||||||
+1 ^ a+
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
|
||||||
No match
|
No match
|
||||||
.sp
|
.sp
|
||||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||||
and therefore the callouts that would be taken for the backtracks do not occur.
|
and therefore the callouts that would be taken for the backtracks do not occur.
|
||||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
|
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||||
to \fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). If
|
\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
|
||||||
this is done in \fBpcre2test\fP (using the /no_auto_possess qualifier), the
|
case, the output changes to this:
|
||||||
output changes to this:
|
|
||||||
.sp
|
.sp
|
||||||
--->aaaa
|
--->aaaa
|
||||||
+0 ^ ^
|
+0 ^ a+
|
||||||
+1 ^ a+
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^ ^ [bc]
|
||||||
+3 ^ ^ [bc]
|
+2 ^^ [bc]
|
||||||
+3 ^^ [bc]
|
|
||||||
No match
|
No match
|
||||||
.sp
|
.sp
|
||||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||||
|
@ -119,10 +117,10 @@ callouts such as the example above are obeyed.
|
||||||
.SH "THE CALLOUT INTERFACE"
|
.SH "THE CALLOUT INTERFACE"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
During matching, when PCRE2 reaches a callout point, the external function that
|
During matching, when PCRE2 reaches a callout point, if an external function is
|
||||||
is set in the match context is called (if it is set). This applies to both
|
set in the match context, it is called. This applies to both normal and DFA
|
||||||
normal and DFA matching. The only argument to the callout function is a pointer
|
matching. The only argument to the callout function is a pointer to a
|
||||||
to a \fBpcre2_callout\fP block. This structure contains the following fields:
|
\fBpcre2_callout\fP block. This structure contains the following fields:
|
||||||
.sp
|
.sp
|
||||||
uint32_t \fIversion\fP;
|
uint32_t \fIversion\fP;
|
||||||
uint32_t \fIcallout_number\fP;
|
uint32_t \fIcallout_number\fP;
|
||||||
|
@ -149,7 +147,7 @@ automatically generated callouts).
|
||||||
.P
|
.P
|
||||||
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
|
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
|
||||||
(the "ovector") that was passed to the matching function in the match data
|
(the "ovector") that was passed to the matching function in the match data
|
||||||
block. When \fBpcre2_match()\fP is used, the contents can be inspected, in
|
block. When \fBpcre2_match()\fP is used, the contents can be inspected in
|
||||||
order to extract substrings that have been matched so far, in the same way as
|
order to extract substrings that have been matched so far, in the same way as
|
||||||
for extracting substrings after a match has completed. For the DFA matching
|
for extracting substrings after a match has completed. For the DFA matching
|
||||||
function, this field is not useful.
|
function, this field is not useful.
|
||||||
|
@ -238,6 +236,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 19 October 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2GREP 1 "28 September 2014" "PCRE2 10.00"
|
.TH PCRE2GREP 1 "23 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
pcre2grep - a grep with Perl-compatible regular expressions.
|
pcre2grep - a grep with Perl-compatible regular expressions.
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -403,8 +403,8 @@ used. There is no short form for this option.
|
||||||
Processing some regular expression patterns can require a very large amount of
|
Processing some regular expression patterns can require a very large amount of
|
||||||
memory, leading in some cases to a program crash if not enough is available.
|
memory, leading in some cases to a program crash if not enough is available.
|
||||||
Other patterns may take a very long time to search for all possible matching
|
Other patterns may take a very long time to search for all possible matching
|
||||||
strings. The \fBpcre2_exec()\fP function that is called by \fBpcre2grep\fP to do
|
strings. The \fBpcre2_match()\fP function that is called by \fBpcre2grep\fP to
|
||||||
the matching has two parameters that can limit the resources that it uses.
|
do the matching has two parameters that can limit the resources that it uses.
|
||||||
.sp
|
.sp
|
||||||
The \fB--match-limit\fP option provides a means of limiting resource usage
|
The \fB--match-limit\fP option provides a means of limiting resource usage
|
||||||
when processing patterns that are not going to match, but which have a very
|
when processing patterns that are not going to match, but which have a very
|
||||||
|
@ -678,6 +678,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 28 September 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -446,7 +446,7 @@ OPTIONS
|
||||||
very large amount of memory, leading in some cases to a pro-
|
very large amount of memory, leading in some cases to a pro-
|
||||||
gram crash if not enough is available. Other patterns may
|
gram crash if not enough is available. Other patterns may
|
||||||
take a very long time to search for all possible matching
|
take a very long time to search for all possible matching
|
||||||
strings. The pcre2_exec() function that is called by
|
strings. The pcre2_match() function that is called by
|
||||||
pcre2grep to do the matching has two parameters that can
|
pcre2grep to do the matching has two parameters that can
|
||||||
limit the resources that it uses.
|
limit the resources that it uses.
|
||||||
|
|
||||||
|
@ -737,5 +737,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 28 September 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2JIT 3 "12 November 2014" "PCRE2 10.00"
|
.TH PCRE2JIT 3 "23 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
||||||
|
@ -6,11 +6,11 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
Just-in-time compiling is a heavyweight optimization that can greatly speed up
|
Just-in-time compiling is a heavyweight optimization that can greatly speed up
|
||||||
pattern matching. However, it comes at the cost of extra processing before the
|
pattern matching. However, it comes at the cost of extra processing before the
|
||||||
match is performed. Therefore, it is of most benefit when the same pattern is
|
match is performed, so it is of most benefit when the same pattern is going to
|
||||||
going to be matched many times. This does not necessarily mean many calls of a
|
be matched many times. This does not necessarily mean many calls of a matching
|
||||||
matching function; if the pattern is not anchored, matching attempts may take
|
function; if the pattern is not anchored, matching attempts may take place many
|
||||||
place many times at various positions in the subject, even for a single call.
|
times at various positions in the subject, even for a single call. Therefore,
|
||||||
Therefore, if the subject string is very long, it may still pay to use JIT for
|
if the subject string is very long, it may still pay to use JIT even for
|
||||||
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
|
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
|
||||||
32-bit PCRE2 libraries.
|
32-bit PCRE2 libraries.
|
||||||
.P
|
.P
|
||||||
|
@ -77,7 +77,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
|
||||||
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
||||||
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
||||||
\fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
|
\fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
|
||||||
returns zero. This is an alternative way of testing if JIT is available.
|
returns zero. This is an alternative way of testing whether JIT is available.
|
||||||
.P
|
.P
|
||||||
At present, it is not possible to free JIT compiled code except when the entire
|
At present, it is not possible to free JIT compiled code except when the entire
|
||||||
compiled pattern is freed by calling \fBpcre2_free_code()\fP.
|
compiled pattern is freed by calling \fBpcre2_free_code()\fP.
|
||||||
|
@ -276,7 +276,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just \fIdo
|
||||||
not\fP call \fBpcre2_match()\fP with a match context pointing to an already
|
not\fP call \fBpcre2_match()\fP with a match context pointing to an already
|
||||||
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
|
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
|
||||||
used by \fBpcre2_match()\fP in another thread). You can also replace the stack
|
used by \fBpcre2_match()\fP in another thread). You can also replace the stack
|
||||||
in a context at any time when it is not in use. You can also free the previous
|
in a context at any time when it is not in use. You should free the previous
|
||||||
stack before assigning a replacement.
|
stack before assigning a replacement.
|
||||||
.P
|
.P
|
||||||
(5) Should I allocate/free a stack every time before/after calling
|
(5) Should I allocate/free a stack every time before/after calling
|
||||||
|
@ -398,6 +398,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 12 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
|
.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -394,7 +394,7 @@ appear.
|
||||||
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
||||||
.sp
|
.sp
|
||||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||||
limits set by the caller of pcre2_exec(), not increase them.
|
limits set by the caller of pcre2_match(), not increase them.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "NEWLINE CONVENTION"
|
.SH "NEWLINE CONVENTION"
|
||||||
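Reviewer note on the hunk above: the rule that LIMIT_MATCH and LIMIT_RECURSION can only reduce the caller's limits amounts to taking the smaller of the two values. Below is a minimal sketch of that rule, not PCRE2's actual code. The function and parameter names are hypothetical and chosen only for illustration.

```python
from typing import Optional


def effective_limit(caller_limit: int, pattern_limit: Optional[int]) -> int:
    """Return the limit that would be in force for a match.

    caller_limit  -- hypothetical name for the limit set by the caller
    pattern_limit -- value given at the start of the pattern, e.g. by
                     (*LIMIT_MATCH=n), or None if the pattern sets none
    """
    if pattern_limit is None:
        return caller_limit
    # The pattern may lower the caller's limit but never raise it.
    return min(caller_limit, pattern_limit)
```

With a caller limit of 1000, a pattern value of 50 yields 50, while a pattern value of 5000 leaves the limit at 1000.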
|
@ -536,6 +536,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 14 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
119
doc/pcre2test.1
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
|
.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -200,7 +200,7 @@ input lines. Each set starts with a regular expression pattern, followed by any
|
||||||
number of subject lines to be matched against that pattern. In between sets of
|
number of subject lines to be matched against that pattern. In between sets of
|
||||||
test data, command lines that begin with a hash (#) character may appear. This
|
test data, command lines that begin with a hash (#) character may appear. This
|
||||||
file format, with some restrictions, can also be processed by the
|
file format, with some restrictions, can also be processed by the
|
||||||
\fBperltest.pl\fP script that is distributed with PCRE2 as a means of checking
|
\fBperltest.sh\fP script that is distributed with PCRE2 as a means of checking
|
||||||
that the behaviour of PCRE2 and Perl is the same.
|
that the behaviour of PCRE2 and Perl is the same.
|
||||||
.P
|
.P
|
||||||
Each subject line is matched separately and independently. If you want to do
|
Each subject line is matched separately and independently. If you want to do
|
||||||
|
@ -243,11 +243,11 @@ patterns. Modifiers on a pattern can change these settings.
|
||||||
#perltest
|
#perltest
|
||||||
.sp
|
.sp
|
||||||
The appearance of this line causes all subsequent modifier settings to be
|
The appearance of this line causes all subsequent modifier settings to be
|
||||||
checked for compatibility with the \fBperltest.pl\fP script, which is used to
|
checked for compatibility with the \fBperltest.sh\fP script, which is used to
|
||||||
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
||||||
lines, none of the other command lines are permitted, because they and many
|
lines, none of the other command lines are permitted, because they and many
|
||||||
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
|
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
|
||||||
test files that are also processed by \fBperltest.pl\fP. The \fP#perltest\fB
|
test files that are also processed by \fBperltest.sh\fP. The \fP#perltest\fB
|
||||||
command helps detect tests that are accidentally put in the wrong file.
|
command helps detect tests that are accidentally put in the wrong file.
|
||||||
.sp
|
.sp
|
||||||
#subject <modifier-list>
|
#subject <modifier-list>
|
||||||
|
@ -265,7 +265,7 @@ for both patterns and subject lines, whereas others are valid for one or the
|
||||||
other only. Each modifier has a long name, for example "anchored", and some of
|
other only. Each modifier has a long name, for example "anchored", and some of
|
||||||
them must be followed by an equals sign and a value, for example, "offset=12".
|
them must be followed by an equals sign and a value, for example, "offset=12".
|
||||||
Modifiers that do not take values may be preceded by a minus sign to turn off a
|
Modifiers that do not take values may be preceded by a minus sign to turn off a
|
||||||
previous default setting.
|
previous setting.
|
||||||
.P
|
.P
|
||||||
A few of the more common modifiers can also be specified as single letters, for
|
A few of the more common modifiers can also be specified as single letters, for
|
||||||
example "i" for "caseless". In documentation, following the Perl convention,
|
example "i" for "caseless". In documentation, following the Perl convention,
|
||||||
|
@ -336,7 +336,7 @@ encoding non-printing characters in a visible way:
|
||||||
\exhh hexadecimal byte (up to 2 hex digits)
|
\exhh hexadecimal byte (up to 2 hex digits)
|
||||||
\ex{hh...} hexadecimal character (any number of hex digits)
|
\ex{hh...} hexadecimal character (any number of hex digits)
|
||||||
.sp
|
.sp
|
||||||
The use of \ex{hh...} is not dependent on the use of the utf modifier on
|
The use of \ex{hh...} is not dependent on the use of the \fButf\fP modifier on
|
||||||
the pattern. It is recognized always. There may be any number of hexadecimal
|
the pattern. It is recognized always. There may be any number of hexadecimal
|
||||||
digits inside the braces; invalid values provoke error messages.
|
digits inside the braces; invalid values provoke error messages.
|
||||||
.P
|
.P
|
||||||
|
@ -366,7 +366,7 @@ part of the file. For example:
|
||||||
is converted to "abcabcabcabc". This feature does not support nesting. To
|
is converted to "abcabcabcabc". This feature does not support nesting. To
|
||||||
include a closing square bracket in the characters, code it as \ex5D.
|
include a closing square bracket in the characters, code it as \ex5D.
|
||||||
.P
|
.P
|
||||||
A backslash followed by an equals sign marke the end of the subject string and
|
A backslash followed by an equals sign marks the end of the subject string and
|
||||||
the start of a modifier list. For example:
|
the start of a modifier list. For example:
|
||||||
.sp
|
.sp
|
||||||
abc\e=notbol,notempty
|
abc\e=notbol,notempty
|
||||||
|
@ -461,8 +461,8 @@ set to "anycrlf", \eR matches CR, LF, or CRLF only. If it is set to "unicode",
|
||||||
is built, with the default default being Unicode.
|
is built, with the default default being Unicode.
|
||||||
.P
|
.P
|
||||||
The \fBnewline\fP modifier specifies which characters are to be interpreted as
|
The \fBnewline\fP modifier specifies which characters are to be interpreted as
|
||||||
newlines, both in the pattern and (by default) in subject lines. The type must
|
newlines, both in the pattern and in subject lines. The type must be one of CR,
|
||||||
be one of CR, LF, CRLF, ANYCRLF, or ANY.
|
LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Information about a pattern"
|
.SS "Information about a pattern"
|
||||||
|
@ -478,8 +478,8 @@ link sizes and different code unit widths. By using \fBbincode\fP, the same
|
||||||
regression tests can be used in different environments.
|
regression tests can be used in different environments.
|
||||||
.P
|
.P
|
||||||
The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and
|
The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and
|
||||||
offset values. This is used in a few special tests and is also useful for
|
offset values. This is used in a few special tests that run only for specific
|
||||||
one-off tests.
|
code unit widths and link sizes, and is also useful for one-off tests.
|
||||||
.P
|
.P
|
||||||
The \fBinfo\fP modifier requests information about the compiled pattern
|
The \fBinfo\fP modifier requests information about the compiled pattern
|
||||||
(whether it is anchored, has a fixed first character, and so on). The
|
(whether it is anchored, has a fixed first character, and so on). The
|
||||||
|
@ -501,13 +501,14 @@ some typical examples:
|
||||||
Last code unit = 'c' (caseless)
|
Last code unit = 'c' (caseless)
|
||||||
Subject length lower bound = 3
|
Subject length lower bound = 3
|
||||||
.sp
|
.sp
|
||||||
"Compile options" are those specified to the compile function; "overall
|
"Compile options" are those specified by modifiers; "overall options" have
|
||||||
options" have added options that are taken or deduced from the pattern. If both
|
added options that are taken or deduced from the pattern. If both sets of
|
||||||
sets of options are the same, just a single "options" line is output. "First
|
options are the same, just a single "options" line is output; if there are no
|
||||||
code unit" is where any match must start; if there is more than one they are
|
options, the line is omitted. "First code unit" is where any match must start;
|
||||||
listed as "starting code units". "Last code unit" is the last literal code unit
|
if there is more than one they are listed as "starting code units". "Last code
|
||||||
that must be present in any match. This is not necessarily the last character.
|
unit" is the last literal code unit that must be present in any match. This is
|
||||||
These lines are omitted if no starting or ending code units are recorded.
|
not necessarily the last character. These lines are omitted if no starting or
|
||||||
|
ending code units are recorded.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Specifying a pattern in hex"
|
.SS "Specifying a pattern in hex"
|
||||||
|
@ -520,16 +521,16 @@ pairs. For example:
|
||||||
/ab 32 59/hex
|
/ab 32 59/hex
|
||||||
.sp
|
.sp
|
||||||
This feature is provided as a way of creating patterns that contain binary zero
|
This feature is provided as a way of creating patterns that contain binary zero
|
||||||
characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
|
and other non-printing characters. By default, \fBpcre2test\fP passes patterns
|
||||||
strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
|
as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
|
||||||
However, for patterns specified in hexadecimal, the actual length of the
|
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
|
||||||
pattern is passed.
|
actual length of the pattern is passed.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "JIT compilation"
|
.SS "JIT compilation"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The \fB/jit\fP modifier may optionally be followed by and equals sign and a
|
The \fB/jit\fP modifier may optionally be followed by an equals sign and a
|
||||||
number in the range 0 to 7:
|
number in the range 0 to 7:
|
||||||
.sp
|
.sp
|
||||||
0 disable JIT
|
0 disable JIT
|
||||||
|
@ -561,7 +562,7 @@ pattern shows whether JIT compilation was or was not successful. If
|
||||||
\fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT
|
\fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT
|
||||||
compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is
|
compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is
|
||||||
added to the first output line after a match or non match when JIT-compiled
|
added to the first output line after a match or non match when JIT-compiled
|
||||||
code was actually used.
|
code was actually used in the match.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Setting a locale"
|
.SS "Setting a locale"
|
||||||
|
@ -645,8 +646,8 @@ be aborted.
|
||||||
.SS "Using alternative character tables"
|
.SS "Using alternative character tables"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The \fB/tables\fP modifier must be followed by a single digit. It causes a
|
The value specified for the \fB/tables\fP modifier must be one of the digits 0,
|
||||||
specific set of built-in character tables to be passed to
|
1, or 2. It causes a specific set of built-in character tables to be passed to
|
||||||
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
|
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
|
||||||
different character tables. The digit specifies the tables as follows:
|
different character tables. The digit specifies the tables as follows:
|
||||||
.sp
|
.sp
|
||||||
|
@ -759,13 +760,13 @@ The effects of these modifiers are described in the following sections.
|
||||||
.SS "Showing more text"
|
.SS "Showing more text"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The \fBaftertext\fP modifier requests that as well as outputting the substring
|
The \fBaftertext\fP modifier requests that as well as outputting the part of
|
||||||
that matched the entire pattern, \fBpcre2test\fP should in addition output the
|
the subject string that matched the entire pattern, \fBpcre2test\fP should in
|
||||||
remainder of the subject string. This is useful for tests where the subject
|
addition output the remainder of the subject string. This is useful for tests
|
||||||
contains multiple copies of the same substring. The \fBallaftertext\fP modifier
|
where the subject contains multiple copies of the same substring. The
|
||||||
requests the same action for captured substrings as well as the main matched
|
\fBallaftertext\fP modifier requests the same action for captured substrings as
|
||||||
substring. In each case the remainder is output on the following line with a
|
well as the main matched substring. In each case the remainder is output on the
|
||||||
plus character following the capture number.
|
following line with a plus character following the capture number.
|
||||||
.P
|
.P
|
||||||
The \fBallusedtext\fP modifier requests that all the text that was consulted
|
The \fBallusedtext\fP modifier requests that all the text that was consulted
|
||||||
during a successful pattern match by the interpreter should be shown. This
|
during a successful pattern match by the interpreter should be shown. This
|
||||||
|
@ -782,7 +783,8 @@ underneath them. Here is an example:
|
||||||
<<< >>>
|
<<< >>>
|
||||||
.sp
|
.sp
|
||||||
This shows that the matched string is "abc", with the preceding and following
|
This shows that the matched string is "abc", with the preceding and following
|
||||||
strings "pqr" and "xyz" also consulted during the match.
|
strings "pqr" and "xyz" having been consulted during the match (when processing
|
||||||
|
the assertions).
|
||||||
.P
|
.P
|
||||||
The \fBstartchar\fP modifier requests that the starting character for the match
|
The \fBstartchar\fP modifier requests that the starting character for the match
|
||||||
be indicated, if it is different to the start of the matched string. The only
|
be indicated, if it is different to the start of the matched string. The only
|
||||||
|
@ -836,7 +838,7 @@ function is called again to search the remainder of the subject. The difference
|
||||||
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
|
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
|
||||||
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
|
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
|
||||||
to start searching at a new point within the entire string (which is what Perl
|
to start searching at a new point within the entire string (which is what Perl
|
||||||
does), whereas the latter passes over a shortened substring. This makes a
|
does), whereas the latter passes over a shortened subject. This makes a
|
||||||
difference to the matching process if the pattern begins with a lookbehind
|
difference to the matching process if the pattern begins with a lookbehind
|
||||||
assertion (including \eb or \eB).
|
assertion (including \eb or \eB).
|
||||||
.P
|
.P
|
||||||
|
@ -847,7 +849,7 @@ fails, the start offset is advanced, and the normal match is retried. This
|
||||||
imitates the way Perl handles such cases when using the \fB/g\fP modifier or
|
imitates the way Perl handles such cases when using the \fB/g\fP modifier or
|
||||||
the \fBsplit()\fP function. Normally, the start offset is advanced by one
|
the \fBsplit()\fP function. Normally, the start offset is advanced by one
|
||||||
character, but if the newline convention recognizes CRLF as a newline, and the
|
character, but if the newline convention recognizes CRLF as a newline, and the
|
||||||
current character is CR followed by LF, an advance of two is used.
|
current character is CR followed by LF, an advance of two characters occurs.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Testing substring extraction functions"
|
.SS "Testing substring extraction functions"
|
||||||
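Reviewer note on the hunk above: the retry rule after an empty match (advance by one character, or by two when the newline convention recognizes CRLF and the current character is CR followed by LF) can be modelled in a few lines. This is an illustrative sketch only, not pcre2test's implementation. The function name and the boolean flag are assumptions, and it counts characters rather than code units.

```python
def next_start_offset(subject: str, offset: int, crlf_is_newline: bool) -> int:
    """Return the offset at which to retry after an empty match at `offset`.

    crlf_is_newline -- hypothetical flag: True when the newline convention
                       in force recognizes CRLF as a newline
    """
    # Step over a whole CRLF pair so the retry does not start between
    # the CR and the LF.
    if crlf_is_newline and subject.startswith("\r\n", offset):
        return offset + 2
    # Otherwise advance by a single character.
    return offset + 1
```

For the subject "a\r\nb" and an empty match at offset 1, the sketch returns 3 when CRLF is recognized as a newline and 2 when it is not.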
|
@ -860,9 +862,9 @@ for example:
|
||||||
.sp
|
.sp
|
||||||
abcd\e=copy=1,copy=3,get=G1
|
abcd\e=copy=1,copy=3,get=G1
|
||||||
.sp
|
.sp
|
||||||
If the \fB#subject\fP command is used to set default copy and get lists, these
|
If the \fB#subject\fP command is used to set default copy and/or get lists,
|
||||||
can be unset by specifying a negative number for numbered groups and an empty
|
these can be unset by specifying a negative number to cancel all numbered
|
||||||
name for named groups.
|
groups and an empty name to cancel all named groups.
|
||||||
.P
|
.P
|
||||||
The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which
|
The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which
|
||||||
extracts all captured substrings.
|
extracts all captured substrings.
|
||||||
|
@ -871,7 +873,8 @@ If the subject line is successfully matched, the substrings extracted by the
|
||||||
convenience functions are output with C, G, or L after the string number
|
convenience functions are output with C, G, or L after the string number
|
||||||
instead of a colon. This is in addition to the normal full list. The string
|
instead of a colon. This is in addition to the normal full list. The string
|
||||||
length (that is, the return from the extraction function) is given in
|
length (that is, the return from the extraction function) is given in
|
||||||
parentheses after each substring.
|
parentheses after each substring, followed by the name when the extraction was
|
||||||
|
by name.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Testing the substitution function"
|
.SS "Testing the substitution function"
|
||||||
|
@ -1044,11 +1047,10 @@ entire substring that was inspected during the partial match; it may include
|
||||||
characters before the actual match start if a lookbehind assertion, \eK, \eb,
|
characters before the actual match start if a lookbehind assertion, \eK, \eb,
|
||||||
or \eB was involved.)
|
or \eB was involved.)
|
||||||
.P
|
.P
|
||||||
For any other return, \fBpcre2test\fP outputs the PCRE2
|
For any other return, \fBpcre2test\fP outputs the PCRE2 negative error number
|
||||||
negative error number and a short descriptive phrase. If the error is a failed
|
and a short descriptive phrase. If the error is a failed UTF string check, the
|
||||||
UTF string check, the offset of the start of the failing character and the
|
code unit offset of the start of the failing character is also output. Here is
|
||||||
reason code are also output. Here is an example of an interactive
|
an example of an interactive \fBpcre2test\fP run.
|
||||||
\fBpcre2test\fP run.
|
|
||||||
.sp
|
.sp
|
||||||
$ pcre2test
|
$ pcre2test
|
||||||
PCRE2 version 9.00 2014-05-10
|
PCRE2 version 9.00 2014-05-10
|
||||||
|
@ -1061,10 +1063,10 @@ reason code are also output. Here is an example of an interactive
|
||||||
No match
|
No match
|
||||||
.sp
|
.sp
|
||||||
Unset capturing substrings that are not followed by one that is set are not
|
Unset capturing substrings that are not followed by one that is set are not
|
||||||
returned by \fBpcre2_match()\fP, and are not shown by \fBpcre2test\fP. In the
|
shown by \fBpcre2test\fP unless the \fBallcaptures\fP modifier is specified. In
|
||||||
following example, there are two capturing substrings, but when the first data
|
the following example, there are two capturing substrings, but when the first
|
||||||
line is matched, the second, unset substring is not shown. An "internal" unset
|
data line is matched, the second, unset substring is not shown. An "internal"
|
||||||
substring is shown as "<unset>", as for the second data line.
|
unset substring is shown as "<unset>", as for the second data line.
|
||||||
.sp
|
.sp
|
||||||
re> /(a)|(b)/
|
re> /(a)|(b)/
|
||||||
data> a
|
data> a
|
||||||
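Reviewer note on the hunk above: the display rule for unset capturing substrings (trailing unset groups are omitted, while an "internal" unset group is shown as "<unset>") can be sketched as follows. This models only the default behaviour, without the allcaptures modifier. The function name and the two-column number layout are assumptions made for illustration, not taken from pcre2test's source.

```python
from typing import List, Optional


def format_captures(groups: List[Optional[str]]) -> List[str]:
    """Format captured substrings for display.

    groups[0] is the whole match and is always set after a successful
    match; None marks an unset capturing group.
    """
    # Index of the last group that is set; everything after it is dropped.
    last_set = max(i for i, g in enumerate(groups) if g is not None)
    lines = []
    for i, g in enumerate(groups[: last_set + 1]):
        # An unset group that precedes a set one is shown explicitly.
        lines.append(f"{i:2d}: {g if g is not None else '<unset>'}")
    return lines
```

For the pattern /(a)|(b)/, the subject "a" corresponds to the groups ["a", "a", None] and gives two lines, whereas the subject "b" corresponds to ["b", None, "b"] and gives three lines, with group 1 shown as "<unset>".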
|
@ -1100,8 +1102,8 @@ are output in sequence, like this:
|
||||||
1: pp
|
1: pp
|
||||||
.sp
|
.sp
|
||||||
"No match" is output only if the first match attempt fails. Here is an example
|
"No match" is output only if the first match attempt fails. Here is an example
|
||||||
of a failure message (the offset 4 that is specified by \e>4 is past the end of
|
of a failure message (the offset 4 that is specified by the \fBoffset\fP
|
||||||
the subject string):
|
modifier is past the end of the subject string):
|
||||||
.sp
|
.sp
|
||||||
re> /xyz/
|
re> /xyz/
|
||||||
data> xyz\e=offset=4
|
data> xyz\e=offset=4
|
||||||
|
@ -1127,12 +1129,13 @@ the subject where there is at least one match. For example:
|
||||||
1: tang
|
1: tang
|
||||||
2: tan
|
2: tan
|
||||||
.sp
|
.sp
|
||||||
(Using the normal matching function on this data finds only "tang".) The
|
Using the normal matching function on this data finds only "tang". The
|
||||||
longest matching string is always given first (and numbered zero). After a
|
longest matching string is always given first (and numbered zero). After a
|
||||||
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
|
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
|
||||||
partially matching substring. (Note that this is the entire substring that was
|
partially matching substring. Note that this is the entire substring that was
|
||||||
inspected during the partial match; it may include characters before the actual
|
inspected during the partial match; it may include characters before the actual
|
||||||
match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
|
match start if a lookbehind assertion, \eb, or \eB was involved. (\eK is not
|
||||||
|
supported for DFA matching.)
|
||||||
.P
|
.P
|
||||||
If global matching is requested, the search for further matches resumes
|
If global matching is requested, the search for further matches resumes
|
||||||
at the end of the longest match. For example:
|
at the end of the longest match. For example:
|
||||||
|
@ -1174,9 +1177,9 @@ documentation.
|
||||||
.SH CALLOUTS
|
.SH CALLOUTS
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
If the pattern contains any callout requests, \fBpcre2test\fP's callout function
|
If the pattern contains any callout requests, \fBpcre2test\fP's callout
|
||||||
is called during matching. This works with both matching functions. By default,
|
function is called during matching. This works with both matching functions. By
|
||||||
the called function displays the callout number, the start and current
|
default, the called function displays the callout number, the start and current
|
||||||
positions in the text at the callout time, and the next pattern item to be
|
positions in the text at the callout time, and the next pattern item to be
|
||||||
tested. For example:
|
tested. For example:
|
||||||
.sp
|
.sp
|
||||||
|
@ -1271,6 +1274,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 14 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2UNICODE 3 "03 November 2014" "PCRE2 10.00"
|
.TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE - Perl-compatible regular expressions (revised API)
|
PCRE - Perl-compatible regular expressions (revised API)
|
||||||
.SH "UNICODE AND UTF SUPPORT"
|
.SH "UNICODE AND UTF SUPPORT"
|
||||||
|
@ -64,7 +64,7 @@ characters (see the description of \eC in the
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
documentation). The use of \eC is not supported in the alternative matching
|
documentation). The use of \eC is not supported in the alternative matching
|
||||||
function \fBpcre2_dfa_exec()\fP, nor is it supported in UTF mode by the JIT
|
function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
|
||||||
optimization. If JIT optimization is requested for a UTF pattern that contains
|
optimization. If JIT optimization is requested for a UTF pattern that contains
|
||||||
\eC, it will not succeed, and so the matching will be carried out by the normal
|
\eC, it will not succeed, and so the matching will be carried out by the normal
|
||||||
interpretive function.
|
interpretive function.
|
||||||
|
@ -108,7 +108,10 @@ case-equivalent, and these are treated as such.
|
||||||
.sp
|
.sp
|
||||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||||
are (by default) checked for validity on entry to the relevant functions.
|
are (by default) checked for validity on entry to the relevant functions.
|
||||||
If an invalid UTF string is passed, an error return is given.
|
If an invalid UTF string is passed, an negative error code is returned. The
|
||||||
|
code unit offset to the offending character can be extracted from the match
|
||||||
|
data block by calling \fBpcre2_get_startchar()\fP, which is used for this
|
||||||
|
purpose after a UTF error.
|
||||||
.P
|
.P
|
||||||
UTF-16 and UTF-32 strings can indicate their endianness by special code knows
|
UTF-16 and UTF-32 strings can indicate their endianness by special code knows
|
||||||
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
||||||
|
@ -130,14 +133,14 @@ UTF-32.)
|
||||||
In some situations, you may already know that your strings are valid, and
|
In some situations, you may already know that your strings are valid, and
|
||||||
therefore want to skip these checks in order to improve performance, for
|
therefore want to skip these checks in order to improve performance, for
|
||||||
example in the case of a long subject string that is being scanned repeatedly.
|
example in the case of a long subject string that is being scanned repeatedly.
|
||||||
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
|
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||||
assumes that the pattern or subject it is given (respectively) contains only
|
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||||
valid UTF code unit sequences.
|
only valid UTF code unit sequences.
|
||||||
.P
|
.P
|
||||||
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
|
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
|
||||||
the pattern; it does not also apply to subject strings. If you want to disable
|
the pattern; it does not also apply to subject strings. If you want to disable
|
||||||
the check for a subject string you must pass this option to \fBpcre2_exec()\fP
|
the check for a subject string you must pass this option to \fBpcre2_match()\fP
|
||||||
or \fBpcre2_dfa_exec()\fP.
|
or \fBpcre2_dfa_match()\fP.
|
||||||
.P
|
.P
|
||||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||||
is undefined and your program may crash or loop indefinitely.
|
is undefined and your program may crash or loop indefinitely.
|
||||||
|
@ -249,6 +252,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 03 November 2014
|
Last updated: 23 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -3224,12 +3224,8 @@ multiunit character. */
|
||||||
#ifdef SUPPORT_UNICODE
|
#ifdef SUPPORT_UNICODE
|
||||||
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
||||||
{
|
{
|
||||||
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
|
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
|
||||||
if (match_data->rc != 0)
|
if (match_data->rc != 0) return match_data->rc;
|
||||||
{
|
|
||||||
match_data->leftchar = 0;
|
|
||||||
return match_data->rc;
|
|
||||||
}
|
|
||||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||||
if (start_offset > 0 && start_offset < length &&
|
if (start_offset > 0 && start_offset < length &&
|
||||||
NOT_FIRSTCHAR(subject[start_offset]))
|
NOT_FIRSTCHAR(subject[start_offset]))
|
||||||
|
|
|
@ -6459,12 +6459,8 @@ multiunit character. */
|
||||||
#ifdef SUPPORT_UNICODE
|
#ifdef SUPPORT_UNICODE
|
||||||
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
||||||
{
|
{
|
||||||
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
|
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
|
||||||
if (match_data->rc != 0)
|
if (match_data->rc != 0) return match_data->rc;
|
||||||
{
|
|
||||||
match_data->leftchar = 0;
|
|
||||||
return match_data->rc;
|
|
||||||
}
|
|
||||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||||
if (start_offset > 0 && start_offset < length &&
|
if (start_offset > 0 && start_offset < length &&
|
||||||
NOT_FIRSTCHAR(subject[start_offset]))
|
NOT_FIRSTCHAR(subject[start_offset]))
|
||||||
|
|
|
@ -5570,6 +5570,13 @@ else for (gmatched = 0;; gmatched++)
|
||||||
fprintf(outfile, "Failed: error %d: ", capcount);
|
fprintf(outfile, "Failed: error %d: ", capcount);
|
||||||
PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer);
|
PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer);
|
||||||
PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile);
|
PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile);
|
||||||
|
if (capcount <= PCRE2_ERROR_UTF8_ERR1 &&
|
||||||
|
capcount >= PCRE2_ERROR_UTF32_ERR2)
|
||||||
|
{
|
||||||
|
PCRE2_SIZE startchar;
|
||||||
|
PCRE2_GET_STARTCHAR(startchar, match_data);
|
||||||
|
fprintf(outfile, " at offset %ld", startchar);
|
||||||
|
}
|
||||||
fprintf(outfile, "\n");
|
fprintf(outfile, "\n");
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
|
@ -48,12 +48,12 @@
|
||||||
/テテテxxx/utf
|
/テテテxxx/utf
|
||||||
|
|
||||||
/badutf/utf
|
/badutf/utf
|
||||||
\xdf
|
X\xdf
|
||||||
\xef
|
XX\xef
|
||||||
\xef\x80
|
XXX\xef\x80
|
||||||
\xf7
|
X\xf7
|
||||||
\xf7\x80
|
XX\xf7\x80
|
||||||
\xf7\x80\x80
|
XXX\xf7\x80\x80
|
||||||
\xfb
|
\xfb
|
||||||
\xfb\x80
|
\xfb\x80
|
||||||
\xfb\x80\x80
|
\xfb\x80\x80
|
||||||
|
@ -89,14 +89,14 @@
|
||||||
\xff
|
\xff
|
||||||
|
|
||||||
/badutf/utf
|
/badutf/utf
|
||||||
\xfb\x80\x80\x80\x80
|
XX\xfb\x80\x80\x80\x80
|
||||||
\xfd\x80\x80\x80\x80\x80
|
XX\xfd\x80\x80\x80\x80\x80
|
||||||
\xf7\xbf\xbf\xbf
|
XX\xf7\xbf\xbf\xbf
|
||||||
|
|
||||||
/shortutf/utf
|
/shortutf/utf
|
||||||
\xdf\=ph
|
XX\xdf\=ph
|
||||||
\xef\=ph
|
XX\xef\=ph
|
||||||
\xef\x80\=ph
|
XX\xef\x80\=ph
|
||||||
\xf7\=ph
|
\xf7\=ph
|
||||||
\xf7\x80\=ph
|
\xf7\x80\=ph
|
||||||
\xf7\x80\x80\=ph
|
\xf7\x80\x80\=ph
|
||||||
|
@ -111,9 +111,9 @@
|
||||||
\xfd\x80\x80\x80\x80\=ph
|
\xfd\x80\x80\x80\x80\=ph
|
||||||
|
|
||||||
/anything/utf
|
/anything/utf
|
||||||
\xc0\x80
|
X\xc0\x80
|
||||||
\xc1\x8f
|
XX\xc1\x8f
|
||||||
\xe0\x9f\x80
|
XXX\xe0\x9f\x80
|
||||||
\xf0\x8f\x80\x80
|
\xf0\x8f\x80\x80
|
||||||
\xf8\x87\x80\x80\x80
|
\xf8\x87\x80\x80\x80
|
||||||
\xfc\x83\x80\x80\x80\x80
|
\xfc\x83\x80\x80\x80\x80
|
||||||
|
|
|
@ -157,18 +157,18 @@
|
||||||
/^[\QĀ\E-\QŐ\E/B,utf
|
/^[\QĀ\E-\QŐ\E/B,utf
|
||||||
|
|
||||||
/X/utf
|
/X/utf
|
||||||
\x{d800}
|
XX\x{d800}
|
||||||
\x{d800}\=no_utf_check
|
XX\x{d800}\=no_utf_check
|
||||||
\x{da00}
|
XX\x{da00}
|
||||||
\x{da00}\=no_utf_check
|
XX\x{da00}\=no_utf_check
|
||||||
\x{dc00}
|
XX\x{dc00}
|
||||||
\x{dc00}\=no_utf_check
|
XX\x{dc00}\=no_utf_check
|
||||||
\x{de00}
|
XX\x{de00}
|
||||||
\x{de00}\=no_utf_check
|
XX\x{de00}\=no_utf_check
|
||||||
\x{dfff}
|
XX\x{dfff}
|
||||||
\x{dfff}\=no_utf_check
|
XX\x{dfff}\=no_utf_check
|
||||||
\x{110000}
|
XX\x{110000}
|
||||||
\x{d800}\x{1234}
|
XX\x{d800}\x{1234}
|
||||||
|
|
||||||
/(*UTF16)\x{11234}/
|
/(*UTF16)\x{11234}/
|
||||||
abcd\x{11234}pqr
|
abcd\x{11234}pqr
|
||||||
|
|
|
@ -73,142 +73,142 @@ Failed: error -3 at offset 0: UTF-8 error: 1 byte missing at end
|
||||||
Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
|
Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
|
||||||
|
|
||||||
/badutf/utf
|
/badutf/utf
|
||||||
\xdf
|
X\xdf
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 1
|
||||||
\xef
|
XX\xef
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
|
||||||
\xef\x80
|
XXX\xef\x80
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
|
||||||
\xf7
|
X\xf7
|
||||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 1
|
||||||
\xf7\x80
|
XX\xf7\x80
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
|
||||||
\xf7\x80\x80
|
XXX\xf7\x80\x80
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
|
||||||
\xfb
|
\xfb
|
||||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||||
\xfb\x80
|
\xfb\x80
|
||||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||||
\xfb\x80\x80
|
\xfb\x80\x80
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||||
\xfb\x80\x80\x80
|
\xfb\x80\x80\x80
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||||
\xfd
|
\xfd
|
||||||
Failed: error -7: UTF-8 error: 5 bytes missing at end
|
Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
|
||||||
\xfd\x80
|
\xfd\x80
|
||||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||||
\xfd\x80\x80
|
\xfd\x80\x80
|
||||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||||
\xfd\x80\x80\x80
|
\xfd\x80\x80\x80
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||||
\xfd\x80\x80\x80\x80
|
\xfd\x80\x80\x80\x80
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||||
\xdf\x7f
|
\xdf\x7f
|
||||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||||
\xef\x7f\x80
|
\xef\x7f\x80
|
||||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||||
\xef\x80\x7f
|
\xef\x80\x7f
|
||||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||||
\xf7\x7f\x80\x80
|
\xf7\x7f\x80\x80
|
||||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||||
\xf7\x80\x7f\x80
|
\xf7\x80\x7f\x80
|
||||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||||
\xf7\x80\x80\x7f
|
\xf7\x80\x80\x7f
|
||||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
|
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
|
||||||
\xfb\x7f\x80\x80\x80
|
\xfb\x7f\x80\x80\x80
|
||||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||||
\xfb\x80\x7f\x80\x80
|
\xfb\x80\x7f\x80\x80
|
||||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||||
\xfb\x80\x80\x7f\x80
|
\xfb\x80\x80\x7f\x80
|
||||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
|
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
|
||||||
\xfb\x80\x80\x80\x7f
|
\xfb\x80\x80\x80\x7f
|
||||||
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
|
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
|
||||||
\xfd\x7f\x80\x80\x80\x80
|
\xfd\x7f\x80\x80\x80\x80
|
||||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||||
\xfd\x80\x7f\x80\x80\x80
|
\xfd\x80\x7f\x80\x80\x80
|
||||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||||
\xfd\x80\x80\x7f\x80\x80
|
\xfd\x80\x80\x7f\x80\x80
|
||||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
|
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
|
||||||
\xfd\x80\x80\x80\x7f\x80
|
\xfd\x80\x80\x80\x7f\x80
|
||||||
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
|
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
|
||||||
\xfd\x80\x80\x80\x80\x7f
|
\xfd\x80\x80\x80\x80\x7f
|
||||||
Failed: error -12: UTF-8 error: byte 6 top bits not 0x80
|
Failed: error -12: UTF-8 error: byte 6 top bits not 0x80 at offset 0
|
||||||
\xed\xa0\x80
|
\xed\xa0\x80
|
||||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||||
\xc0\x8f
|
\xc0\x8f
|
||||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence
|
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 0
|
||||||
\xe0\x80\x8f
|
\xe0\x80\x8f
|
||||||
Failed: error -18: UTF-8 error: overlong 3-byte sequence
|
Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 0
|
||||||
\xf0\x80\x80\x8f
|
\xf0\x80\x80\x8f
|
||||||
Failed: error -19: UTF-8 error: overlong 4-byte sequence
|
Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
|
||||||
\xf8\x80\x80\x80\x8f
|
\xf8\x80\x80\x80\x8f
|
||||||
Failed: error -20: UTF-8 error: overlong 5-byte sequence
|
Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
|
||||||
\xfc\x80\x80\x80\x80\x8f
|
\xfc\x80\x80\x80\x80\x8f
|
||||||
Failed: error -21: UTF-8 error: overlong 6-byte sequence
|
Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
|
||||||
\x80
|
\x80
|
||||||
Failed: error -22: UTF-8 error: isolated 0x80 byte
|
Failed: error -22: UTF-8 error: isolated 0x80 byte at offset 0
|
||||||
\xfe
|
\xfe
|
||||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||||
\xff
|
\xff
|
||||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||||
|
|
||||||
/badutf/utf
|
/badutf/utf
|
||||||
\xfb\x80\x80\x80\x80
|
XX\xfb\x80\x80\x80\x80
|
||||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 2
|
||||||
\xfd\x80\x80\x80\x80\x80
|
XX\xfd\x80\x80\x80\x80\x80
|
||||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 2
|
||||||
\xf7\xbf\xbf\xbf
|
XX\xf7\xbf\xbf\xbf
|
||||||
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
|
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 2
|
||||||
|
|
||||||
/shortutf/utf
|
/shortutf/utf
|
||||||
\xdf\=ph
|
XX\xdf\=ph
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
|
||||||
\xef\=ph
|
XX\xef\=ph
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
|
||||||
\xef\x80\=ph
|
XX\xef\x80\=ph
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
|
||||||
\xf7\=ph
|
\xf7\=ph
|
||||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||||
\xf7\x80\=ph
|
\xf7\x80\=ph
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||||
\xf7\x80\x80\=ph
|
\xf7\x80\x80\=ph
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||||
\xfb\=ph
|
\xfb\=ph
|
||||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||||
\xfb\x80\=ph
|
\xfb\x80\=ph
|
||||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||||
\xfb\x80\x80\=ph
|
\xfb\x80\x80\=ph
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||||
\xfb\x80\x80\x80\=ph
|
\xfb\x80\x80\x80\=ph
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||||
\xfd\=ph
|
\xfd\=ph
|
||||||
Failed: error -7: UTF-8 error: 5 bytes missing at end
|
Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
|
||||||
\xfd\x80\=ph
|
\xfd\x80\=ph
|
||||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||||
\xfd\x80\x80\=ph
|
\xfd\x80\x80\=ph
|
||||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||||
\xfd\x80\x80\x80\=ph
|
\xfd\x80\x80\x80\=ph
|
||||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||||
\xfd\x80\x80\x80\x80\=ph
|
\xfd\x80\x80\x80\x80\=ph
|
||||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||||
|
|
||||||
/anything/utf
|
/anything/utf
|
||||||
\xc0\x80
|
X\xc0\x80
|
||||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence
|
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 1
|
||||||
\xc1\x8f
|
XX\xc1\x8f
|
||||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence
|
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 2
|
||||||
\xe0\x9f\x80
|
XXX\xe0\x9f\x80
|
||||||
Failed: error -18: UTF-8 error: overlong 3-byte sequence
|
Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 3
|
||||||
\xf0\x8f\x80\x80
|
\xf0\x8f\x80\x80
|
||||||
Failed: error -19: UTF-8 error: overlong 4-byte sequence
|
Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
|
||||||
\xf8\x87\x80\x80\x80
|
\xf8\x87\x80\x80\x80
|
||||||
Failed: error -20: UTF-8 error: overlong 5-byte sequence
|
Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
|
||||||
\xfc\x83\x80\x80\x80\x80
|
\xfc\x83\x80\x80\x80\x80
|
||||||
Failed: error -21: UTF-8 error: overlong 6-byte sequence
|
Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
|
||||||
\xfe\x80\x80\x80\x80\x80
|
\xfe\x80\x80\x80\x80\x80
|
||||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||||
\xff\x80\x80\x80\x80\x80
|
\xff\x80\x80\x80\x80\x80
|
||||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||||
\xc3\x8f
|
\xc3\x8f
|
||||||
No match
|
No match
|
||||||
\xe0\xaf\x80
|
\xe0\xaf\x80
|
||||||
|
@ -220,13 +220,13 @@ No match
|
||||||
\xf1\x8f\x80\x80
|
\xf1\x8f\x80\x80
|
||||||
No match
|
No match
|
||||||
\xf8\x88\x80\x80\x80
|
\xf8\x88\x80\x80\x80
|
||||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
|
||||||
\xf9\x87\x80\x80\x80
|
\xf9\x87\x80\x80\x80
|
||||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
|
||||||
\xfc\x84\x80\x80\x80\x80
|
\xfc\x84\x80\x80\x80\x80
|
||||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||||
\xfd\x83\x80\x80\x80\x80
|
\xfd\x83\x80\x80\x80\x80
|
||||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||||
\xf8\x88\x80\x80\x80\=no_utf_check
|
\xf8\x88\x80\x80\x80\=no_utf_check
|
||||||
No match
|
No match
|
||||||
\xf9\x87\x80\x80\x80\=no_utf_check
|
\xf9\x87\x80\x80\x80\=no_utf_check
|
||||||
|
@ -751,27 +751,27 @@ Failed: error 106 at offset 15: missing terminating ] for character class
|
||||||
|
|
||||||
/X/utf
|
/X/utf
|
||||||
\x{d800}
|
\x{d800}
|
||||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||||
\x{d800}\=no_utf_check
|
\x{d800}\=no_utf_check
|
||||||
No match
|
No match
|
||||||
\x{da00}
|
\x{da00}
|
||||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||||
\x{da00}\=no_utf_check
|
\x{da00}\=no_utf_check
|
||||||
No match
|
No match
|
||||||
\x{dfff}
|
\x{dfff}
|
||||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||||
\x{dfff}\=no_utf_check
|
\x{dfff}\=no_utf_check
|
||||||
No match
|
No match
|
||||||
\x{110000}
|
\x{110000}
|
||||||
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
|
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 0
|
||||||
\x{110000}\=no_utf_check
|
\x{110000}\=no_utf_check
|
||||||
No match
|
No match
|
||||||
\x{2000000}
|
\x{2000000}
|
||||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
|
||||||
\x{2000000}\=no_utf_check
|
\x{2000000}\=no_utf_check
|
||||||
No match
|
No match
|
||||||
\x{7fffffff}
|
\x{7fffffff}
|
||||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||||
\x{7fffffff}\=no_utf_check
|
\x{7fffffff}\=no_utf_check
|
||||||
No match
|
No match
|
||||||
|
|
||||||
|
@ -1106,7 +1106,7 @@ Subject length lower bound = 1
|
||||||
\x{ff000041}
|
\x{ff000041}
|
||||||
** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8
|
** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8
|
||||||
\x{7f000041}
|
\x{7f000041}
|
||||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||||
|
|
||||||
/(*UTF8)abc/never_utf
|
/(*UTF8)abc/never_utf
|
||||||
Failed: error 174 at offset 7: using UTF is disabled by the application
|
Failed: error 174 at offset 7: using UTF is disabled by the application
|
||||||
|
|
|
@ -607,30 +607,30 @@ Subject length lower bound = 2
|
||||||
Failed: error 106 at offset 13: missing terminating ] for character class
|
Failed: error 106 at offset 13: missing terminating ] for character class
|
||||||
|
|
||||||
/X/utf
|
/X/utf
|
||||||
\x{d800}
|
XX\x{d800}
|
||||||
Failed: error -24: UTF-16 error: missing low surrogate at end
|
Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
|
||||||
\x{d800}\=no_utf_check
|
XX\x{d800}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{da00}
|
XX\x{da00}
|
||||||
Failed: error -24: UTF-16 error: missing low surrogate at end
|
Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
|
||||||
\x{da00}\=no_utf_check
|
XX\x{da00}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{dc00}
|
XX\x{dc00}
|
||||||
Failed: error -26: UTF-16 error: isolated low surrogate
|
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
|
||||||
\x{dc00}\=no_utf_check
|
XX\x{dc00}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{de00}
|
XX\x{de00}
|
||||||
Failed: error -26: UTF-16 error: isolated low surrogate
|
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
|
||||||
\x{de00}\=no_utf_check
|
XX\x{de00}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{dfff}
|
XX\x{dfff}
|
||||||
Failed: error -26: UTF-16 error: isolated low surrogate
|
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
|
||||||
\x{dfff}\=no_utf_check
|
XX\x{dfff}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{110000}
|
XX\x{110000}
|
||||||
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
|
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
|
||||||
\x{d800}\x{1234}
|
XX\x{d800}\x{1234}
|
||||||
Failed: error -25: UTF-16 error: invalid low surrogate
|
Failed: error -25: UTF-16 error: invalid low surrogate at offset 3
|
||||||
|
|
||||||
/(*UTF16)\x{11234}/
|
/(*UTF16)\x{11234}/
|
||||||
abcd\x{11234}pqr
|
abcd\x{11234}pqr
|
||||||
|
|
|
@ -600,30 +600,30 @@ Subject length lower bound = 2
|
||||||
Failed: error 106 at offset 13: missing terminating ] for character class
|
Failed: error 106 at offset 13: missing terminating ] for character class
|
||||||
|
|
||||||
/X/utf
|
/X/utf
|
||||||
\x{d800}
|
XX\x{d800}
|
||||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||||
\x{d800}\=no_utf_check
|
XX\x{d800}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{da00}
|
XX\x{da00}
|
||||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||||
\x{da00}\=no_utf_check
|
XX\x{da00}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{dc00}
|
XX\x{dc00}
|
||||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||||
\x{dc00}\=no_utf_check
|
XX\x{dc00}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{de00}
|
XX\x{de00}
|
||||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||||
\x{de00}\=no_utf_check
|
XX\x{de00}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{dfff}
|
XX\x{dfff}
|
||||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||||
\x{dfff}\=no_utf_check
|
XX\x{dfff}\=no_utf_check
|
||||||
No match
|
0: X
|
||||||
\x{110000}
|
XX\x{110000}
|
||||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
|
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 2
|
||||||
\x{d800}\x{1234}
|
XX\x{d800}\x{1234}
|
||||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||||
|
|
||||||
/(*UTF16)\x{11234}/
|
/(*UTF16)\x{11234}/
|
||||||
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
|
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
|
||||||
|
@ -1113,7 +1113,7 @@ Failed: error 134 at offset 10: character code point value in \x{} or \o{} is to
|
||||||
|
|
||||||
/\C/utf
|
/\C/utf
|
||||||
\x{110000}
|
\x{110000}
|
||||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
|
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
|
||||||
|
|
||||||
/\x{100}*A/IB,utf
|
/\x{100}*A/IB,utf
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
|
|
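The updated test output above pairs each UTF failure with a negative error code and an "at offset N" suffix naming the code unit where the offending character starts (for example, `X\xdf` yields error -3 at offset 1). The following is a minimal, self-contained sketch of that reporting convention, not PCRE2's actual validator: the function name `sketch_valid_utf8` and the catch-all code `SKETCH_ERR_OTHER` are hypothetical, only the "N bytes missing at end" codes (-3, -4, -5) are taken from the output shown in this diff, and checks such as overlong forms, surrogates, and code points above 0x10ffff are deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical catch-all for this sketch; it is not a PCRE2 error code. */
#define SKETCH_ERR_OTHER (-100)

/* Simplified UTF-8 check over subject[0..length).
   Returns 0 on success. On failure returns a negative code and stores in
   *erroroffset the offset of the first byte of the offending character.
   Truncated characters at the end return -3, -4 or -5 for 1, 2 or 3 missing
   bytes, mirroring the numbers in the test output of this commit. */
static int sketch_valid_utf8(const unsigned char *subject, size_t length,
                             size_t *erroroffset)
{
  assert(subject != NULL && erroroffset != NULL);
  size_t i = 0;

  while (i < length)
    {
    unsigned char c = subject[i];
    size_t extra;                          /* continuation bytes expected */

    if (c < 0x80) { i++; continue; }       /* plain ASCII */
    if (c >= 0xc0 && c <= 0xdf) extra = 1;
    else if (c >= 0xe0 && c <= 0xef) extra = 2;
    else if (c >= 0xf0 && c <= 0xf7) extra = 3;
    else { *erroroffset = i; return SKETCH_ERR_OTHER; }

    /* Truncation is checked first: how many bytes follow the lead byte? */
    size_t available = length - i - 1;
    if (available < extra)
      {
      *erroroffset = i;
      return -(int)(2 + (extra - available));
      }

    /* Every continuation byte must have the top bits 10xxxxxx. */
    for (size_t k = 1; k <= extra; k++)
      {
      if ((subject[i + k] & 0xc0) != 0x80)
        {
        *erroroffset = i;
        return SKETCH_ERR_OTHER;
        }
      }

    i += extra + 1;
    }

  return 0;
}
```

A caller would test the return value first and read the offset only when it is negative, which is the same order in which the diff's test program prints the message and then the " at offset" suffix.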