More documentation and test updates.

Philip.Hazel 2014-11-23 18:38:38 +00:00
parent eb4fffbbf4
commit 91f2e97474
25 changed files with 625 additions and 603 deletions

@ -148,7 +148,7 @@ listing), and the short pages for individual functions, are concatenated in
pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility
pcre2pattern syntax and semantics of supported regular expression patterns
pcre2pattern syntax and semantics of supported regular expression patterns
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program

@ -17,9 +17,9 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
<li><a name="TOC5" href="#SEC5">Unicode and UTF SUPPORT</a>
<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC7" href="#SEC7">CODE VALUE OF NEWLINE</a>
<li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
@ -91,12 +91,12 @@ respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the <b>configure</b> command:
<pre>
--enable-pcre16
--enable-pcre32
--enable-pcre2-16
--enable-pcre2-32
</pre>
If you do not want the 8-bit library, add
<pre>
--disable-pcre8
--disable-pcre2-8
</pre>
as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
@ -106,14 +106,15 @@ libraries.
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
<P>
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
and static libraries by default. You can suppress one of these by adding one of
and static libraries by default. You can suppress an unwanted library by adding
one of
<pre>
--disable-shared
--disable-static
</pre>
to the <b>configure</b> command, as required.
to the <b>configure</b> command.
</P>
<br><a name="SEC5" href="#TOC1">Unicode and UTF SUPPORT</a><br>
<br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
<P>
By default, PCRE2 is built with support for Unicode and UTF character strings.
To build it without Unicode support, add
@ -126,20 +127,15 @@ in the same configuration.
</P>
<P>
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
<b>pcre2_compile()</b> to compile a pattern.
or UTF-32. To do that, applications that use the library have to set the
PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
</P>
<P>
It is not possible to support both EBCDIC and UTF-8 codes in the same version
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
exclusive.
</P>
<P>
UTF support allows the libraries to process character codepoints up to 0x10ffff
in the strings that they handle. It also provides support for accessing the
properties of such characters, using pattern escapes such as \P, \p, and \X.
Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
supported. Details are given in the
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for
accessing the Unicode properties of such characters, using pattern escapes such
as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
<i>Nd</i> are supported. Details are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
@ -150,7 +146,7 @@ Just-in-time compiler support is included in the build by specifying
--enable-jit
</pre>
This support is available only for certain hardware architectures. If this
option is set for an unsupported architecture, a compile time error occurs.
option is set for an unsupported architecture, a building error occurs.
See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for a discussion of JIT usage. When JIT support is enabled,
@ -160,7 +156,7 @@ pcre2grep automatically makes use of it, unless you add
</pre>
to the "configure" command.
</P>
<br><a name="SEC7" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
<br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
<P>
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
of a line. This is the normal newline character on Unix-like systems. You can
@ -168,12 +164,13 @@ compile PCRE2 to use carriage return (CR) instead, by adding
<pre>
--enable-newline-is-cr
</pre>
to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character.
<br>
<br>
Alternatively, you can specify that line endings are to be indicated by the two
character sequence CRLF. If you want this, add
</P>
<P>
Alternatively, you can specify that line endings are to be indicated by the
two-character sequence CRLF (CR immediately followed by LF). If you want this,
add
<pre>
--enable-newline-is-crlf
</pre>
@ -186,22 +183,26 @@ indicating a line ending. Finally, a fifth option, specified by
<pre>
--enable-newline-is-any
</pre>
causes PCRE2 to recognize any Unicode newline sequence.
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
</P>
<P>
Whatever line ending convention is selected when PCRE2 is built can be
overridden when the library functions are called. At build time it is
Whatever default line ending convention is selected when PCRE2 is built can be
overridden by applications that use the library. At build time it is
conventional to use the standard for your operating system.
</P>
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
By default, the sequence \R in a pattern matches any Unicode newline sequence,
whatever has been selected as the line ending sequence. If you specify
independently of what has been selected as the line ending sequence. If you
specify
<pre>
--enable-bsr-anycrlf
</pre>
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden when the library functions are
selected when PCRE2 is built can be overridden by applications that use the
called.
</P>
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
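The list of Unicode newline sequences added in the hunk above (CR, LF, CRLF, VT, FF, NEL, LS, PS) can be sketched as a small classifier. This is only an illustration of what the documentation describes for --enable-newline-is-any, not PCRE2's own code; the function name `newline_any_length` and the use of an array of code points as input are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2.
   Returns the length in code points (1 or 2) of the Unicode newline
   sequence that starts at s[i], or 0 if no newline starts there. */
size_t newline_any_length(const uint32_t *s, size_t len, size_t i)
{
    if (i >= len) return 0;
    switch (s[i]) {
    case 0x000D: /* CR, which forms CRLF when immediately followed by LF */
        return (i + 1 < len && s[i + 1] == 0x000A) ? 2 : 1;
    case 0x000A: /* LF  (linefeed) */
    case 0x000B: /* VT  (vertical tab) */
    case 0x000C: /* FF  (form feed) */
    case 0x0085: /* NEL (next line) */
    case 0x2028: /* LS  (line separator) */
    case 0x2029: /* PS  (paragraph separator) */
        return 1;
    default:
        return 0;
    }
}
```

Checking CR before LF is what lets CRLF count as a single two-character line ending rather than two separate ones.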
@ -210,10 +211,10 @@ Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic patterns.
Nevertheless, some people do want to process truly enormous patterns, so it is
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
setting such as
around 64K code units. This is sufficient to handle all but the most gigantic
patterns. Nevertheless, some people do want to process truly enormous patterns,
so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
adding a setting such as
<pre>
--with-link-size=3
</pre>
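The "around 64K code units" figure in the hunk above follows from the width of the offset. A minimal sketch of that arithmetic, assuming only that an offset of N bytes can hold values up to 2^(8N) - 1; the function name `max_link_offset` is hypothetical, and the real limit enforced by the library may be somewhat lower than this upper bound.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2.
   Largest offset value representable in link_size bytes (2, 3 or 4). */
uint64_t max_link_offset(unsigned link_size)
{
    return (UINT64_C(1) << (8 * link_size)) - 1;
}
```

With a link size of 2 this gives 65535, which is the source of the 64K figure; a link size of 3 gives 16777215, and 4 gives 4294967295.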
@ -294,16 +295,20 @@ hand".)
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
<P>
PCRE2 assumes by default that it will run in an environment where the character
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an
EBCDIC environment by adding
8-bit EBCDIC environment by adding
<pre>
--enable-ebcdic --disable-unicode
</pre>
to the <b>configure</b> command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in
an EBCDIC environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with Unicode support.
an EBCDIC environment (for example, an IBM mainframe operating system).
</P>
<P>
It is not possible to support both EBCDIC and UTF-8 codes in the same version
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
exclusive.
</P>
<P>
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
@ -347,8 +352,8 @@ parameter value by adding, for example,
<pre>
--with-pcre2grep-bufsize=50K
</pre>
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can, however,
override this value by specifying a run-time option.
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can override this
value by using --buffer-size on the command line..
</P>
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P>
@ -362,16 +367,16 @@ to the <b>configure</b> command, <b>pcre2test</b> is linked with the
from a terminal, it reads it using the <b>readline()</b> function. This provides
line-editing and history facilities. Note that <b>libreadline</b> is
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
way, there may be licensing issues. These can be avoided by linking with
<b>libedit</b> (which has a BSD licence) instead.
way, there may be licensing issues. These can be avoided by linking instead
with <b>libedit</b>, which has a BSD licence.
</P>
<P>
Setting this option causes the <b>-lreadline</b> option to be added to the
<b>pcre2test</b> build. In many operating environments with a sytem-installed
readline library this is sufficient. However, in some environments (e.g. if an
unmodified distribution version of readline is in use), some extra
configuration may be necessary. The INSTALL file for <b>libreadline</b> says
this:
Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
added to the <b>pcre2test</b> build. In many operating environments with a
sytem-installed readline library this is sufficient. However, in some
environments (e.g. if an unmodified distribution version of readline is in
use), some extra configuration may be necessary. The INSTALL file for
<b>libreadline</b> says this:
<pre>
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
@ -386,13 +391,13 @@ immediately before the <b>configure</b> command.
</P>
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
By adding the
If you add
<pre>
--enable-valgrind
</pre>
option to to the <b>configure</b> command, PCRE2 will use valgrind annotations
to mark certain memory regions as unaddressable. This allows it to detect
invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect invalid
memory accesses, and is mostly useful for debugging PCRE2 itself.
</P>
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
@ -466,7 +471,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 03 November 2014
Last updated: 23 November 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>

@ -85,29 +85,27 @@ expect.
<P>
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is anchored
and then applied with automatic callouts to the string "aaaa" is:
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
"aaaa" is:
<pre>
---&#62;aaaa
+0 ^ ^
+1 ^ a+
+3 ^ ^ [bc]
+0 ^ a+
+2 ^ ^ [bc]
No match
</pre>
This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur.
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
to <b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). If
this is done in <b>pcre2test</b> (using the /no_auto_possess qualifier), the
output changes to this:
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
case, the output changes to this:
<pre>
---&#62;aaaa
+0 ^ ^
+1 ^ a+
+3 ^ ^ [bc]
+3 ^ ^ [bc]
+3 ^ ^ [bc]
+3 ^^ [bc]
+0 ^ a+
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^^ [bc]
No match
</pre>
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@ -137,10 +135,10 @@ callouts such as the example above are obeyed.
</P>
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
<P>
During matching, when PCRE2 reaches a callout point, the external function that
is set in the match context is called (if it is set). This applies to both
normal and DFA matching. The only argument to the callout function is a pointer
to a <b>pcre2_callout</b> block. This structure contains the following fields:
During matching, when PCRE2 reaches a callout point, if an external function is
set in the match context, it is called. This applies to both normal and DFA
matching. The only argument to the callout function is a pointer to a
<b>pcre2_callout</b> block. This structure contains the following fields:
<pre>
uint32_t <i>version</i>;
uint32_t <i>callout_number</i>;
@ -169,7 +167,7 @@ automatically generated callouts).
<P>
The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match data
block. When <b>pcre2_match()</b> is used, the contents can be inspected, in
block. When <b>pcre2_match()</b> is used, the contents can be inspected in
order to extract substrings that have been matched so far, in the same way as
for extracting substrings after a match has completed. For the DFA matching
function, this field is not useful.
@ -261,7 +259,7 @@ Cambridge, England.
</P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
<P>
Last updated: 19 October 2014
Last updated: 23 November 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>

@ -467,8 +467,8 @@ used. There is no short form for this option.
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
strings. The <b>pcre2_exec()</b> function that is called by <b>pcre2grep</b> to do
the matching has two parameters that can limit the resources that it uses.
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
do the matching has two parameters that can limit the resources that it uses.
<br>
<br>
The <b>--match-limit</b> option provides a means of limiting resource usage
@ -750,7 +750,7 @@ Cambridge, England.
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 September 2014
Last updated: 23 November 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>

@ -31,11 +31,11 @@ please consult the man page, in case the conversion went wrong.
<P>
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is
going to be matched many times. This does not necessarily mean many calls of a
matching function; if the pattern is not anchored, matching attempts may take
place many times at various positions in the subject, even for a single call.
Therefore, if the subject string is very long, it may still pay to use JIT for
match is performed, so it is of most benefit when the same pattern is going to
be matched many times. This does not necessarily mean many calls of a matching
function; if the pattern is not anchored, matching attempts may take place many
times at various positions in the subject, even for a single call. Therefore,
if the subject string is very long, it may still pay to use JIT even for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
</P>
@ -103,7 +103,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
returns zero. This is an alternative way of testing if JIT is available.
returns zero. This is an alternative way of testing whether JIT is available.
</P>
<P>
At present, it is not possible to free JIT compiled code except when the entire
@ -299,7 +299,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just \fIdo
not\fP call <b>pcre2_match()</b> with a match context pointing to an already
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
used by <b>pcre2_match()</b> in another thread). You can also replace the stack
in a context at any time when it is not in use. You can also free the previous
in a context at any time when it is not in use. You should free the previous
stack before assigning a replacement.
</P>
<P>
@ -418,7 +418,7 @@ Cambridge, England.
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
Last updated: 12 November 2014
Last updated: 23 November 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>

@ -421,7 +421,7 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_exec(), not increase them.
limits set by the caller of pcre2_match(), not increase them.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
@ -553,7 +553,7 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 14 November 2014
Last updated: 23 November 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>

@ -72,7 +72,7 @@ but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \C in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation). The use of \C is not supported in the alternative matching
function <b>pcre2_dfa_exec()</b>, nor is it supported in UTF mode by the JIT
function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains
\C, it will not succeed, and so the matching will be carried out by the normal
interpretive function.
@ -141,15 +141,15 @@ UTF-32.)
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
assumes that the pattern or subject it is given (respectively) contains only
valid UTF code unit sequences.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
</P>
<P>
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this option to <b>pcre2_exec()</b>
or <b>pcre2_dfa_exec()</b>.
the check for a subject string you must pass this option to <b>pcre2_match()</b>
or <b>pcre2_dfa_match()</b>.
</P>
<P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
@ -261,7 +261,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 03 November 2014
Last updated: 23 November 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>

@ -2667,12 +2667,12 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
or UTF-16/UTF-32 strings. To build these additional libraries, add one
or both of the following to the configure command:
--enable-pcre16
--enable-pcre32
--enable-pcre2-16
--enable-pcre2-32
If you do not want the 8-bit library, add
--disable-pcre8
--disable-pcre2-8
as well. At least one of the three libraries must be built. Note that
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
@ -2683,16 +2683,16 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
BUILDING SHARED AND STATIC LIBRARIES
The Autotools PCRE2 building process uses libtool to build both shared
and static libraries by default. You can suppress one of these by
adding one of
and static libraries by default. You can suppress an unwanted library
by adding one of
--disable-shared
--disable-static
to the configure command, as required.
to the configure command.
Unicode and UTF SUPPORT
UNICODE AND UTF SUPPORT
By default, PCRE2 is built with support for Unicode and UTF character
strings. To build it without Unicode support, add
@ -2704,18 +2704,16 @@ Unicode and UTF SUPPORT
another without, in the same configuration.
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
UTF-16 or UTF-32. To do that you have have to set the PCRE2_UTF option
when you call pcre2_compile() to compile a pattern.
UTF-16 or UTF-32. To do that, applications that use the library have to
set the PCRE2_UTF option when they call pcre2_compile() to compile a
pattern.
It is not possible to support both EBCDIC and UTF-8 codes in the same
version of the library. Consequently, --enable-unicode and --enable-
ebcdic are mutually exclusive.
UTF support allows the libraries to process character codepoints up to
0x10ffff in the strings that they handle. It also provides support for
accessing the properties of such characters, using pattern escapes such
as \P, \p, and \X. Only the general category properties such as Lu and
Nd are supported. Details are given in the pcre2pattern documentation.
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for
accessing the Unicode properties of such characters, using pattern
escapes such as \P, \p, and \X. Only the general category properties
such as Lu and Nd are supported. Details are given in the pcre2pattern
documentation.
JUST-IN-TIME COMPILER SUPPORT
@ -2725,17 +2723,17 @@ JUST-IN-TIME COMPILER SUPPORT
--enable-jit
This support is available only for certain hardware architectures. If
this option is set for an unsupported architecture, a compile time
error occurs. See the pcre2jit documentation for a discussion of JIT
usage. When JIT support is enabled, pcre2grep automatically makes use
of it, unless you add
this option is set for an unsupported architecture, a building error
occurs. See the pcre2jit documentation for a discussion of JIT usage.
When JIT support is enabled, pcre2grep automatically makes use of it,
unless you add
--disable-pcre2grep-jit
to the "configure" command.
CODE VALUE OF NEWLINE
NEWLINE RECOGNITION
By default, PCRE2 interprets the linefeed (LF) character as indicating
the end of a line. This is the normal newline character on Unix-like
@ -2744,11 +2742,12 @@ CODE VALUE OF NEWLINE
--enable-newline-is-cr
to the configure command. There is also a --enable-newline-is-lf
to the configure command. There is also an --enable-newline-is-lf
option, which explicitly specifies linefeed as the newline character.
Alternatively, you can specify that line endings are to be indicated by
the two character sequence CRLF. If you want this, add
the two-character sequence CRLF (CR immediately followed by LF). If you
want this, add
--enable-newline-is-crlf
@ -2756,41 +2755,46 @@ CODE VALUE OF NEWLINE
--enable-newline-is-anycrlf
which causes PCRE2 to recognize any of the three sequences CR, LF, or
which causes PCRE2 to recognize any of the three sequences CR, LF, or
CRLF as indicating a line ending. Finally, a fifth option, specified by
--enable-newline-is-any
causes PCRE2 to recognize any Unicode newline sequence.
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
newline sequences are the three just mentioned, plus the single charac-
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
U+2029).
Whatever line ending convention is selected when PCRE2 is built can be
overridden when the library functions are called. At build time it is
conventional to use the standard for your operating system.
Whatever default line ending convention is selected when PCRE2 is built
can be overridden by applications that use the library. At build time
it is conventional to use the standard for your operating system.
WHAT \R MATCHES
By default, the sequence \R in a pattern matches any Unicode newline
sequence, whatever has been selected as the line ending sequence. If
you specify
By default, the sequence \R in a pattern matches any Unicode newline
sequence, independently of what has been selected as the line ending
sequence. If you specify
--enable-bsr-anycrlf
the default is changed so that \R matches only CR, LF, or CRLF. What-
ever is selected when PCRE2 is built can be overridden when the library
functions are called.
the default is changed so that \R matches only CR, LF, or CRLF. What-
ever is selected when PCRE2 is built can be overridden by applications
that use the called.
HANDLING VERY LARGE PATTERNS
Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an alter-
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
two-byte values are used for these offsets, leading to a maximum size
for a compiled pattern of around 64K. This is sufficient to handle all
but the most gigantic patterns. Nevertheless, some people do want to
process truly enormous patterns, so it is possible to compile PCRE2 to
use three-byte or four-byte offsets by adding a setting such as
Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an alter-
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
two-byte values are used for these offsets, leading to a maximum size
for a compiled pattern of around 64K code units. This is sufficient to
handle all but the most gigantic patterns. Nevertheless, some people do
want to process truly enormous patterns, so it is possible to compile
PCRE2 to use three-byte or four-byte offsets by adding a setting such
as
--with-link-size=3
@ -2876,25 +2880,28 @@ CREATING CHARACTER TABLES AT BUILD TIME
USING EBCDIC CODE
PCRE2 assumes by default that it will run in an environment where the
character code is ASCII (or Unicode, which is a superset of ASCII).
This is the case for most computer operating systems. PCRE2 can, how-
ever, be compiled to run in an EBCDIC environment by adding
character code is ASCII or Unicode, which is a superset of ASCII. This
is the case for most computer operating systems. PCRE2 can, however, be
compiled to run in an 8-bit EBCDIC environment by adding
--enable-ebcdic --disable-unicode
to the configure command. This setting implies --enable-rebuild-charta-
bles. You should only use it if you know that you are in an EBCDIC
environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with Unicode support.
environment (for example, an IBM mainframe operating system).
It is not possible to support both EBCDIC and UTF-8 codes in the same
version of the library. Consequently, --enable-unicode and --enable-
ebcdic are mutually exclusive.
The EBCDIC character that corresponds to an ASCII LF is assumed to have
the value 0x15 by default. However, in some EBCDIC environments, 0x25
the value 0x15 by default. However, in some EBCDIC environments, 0x25
is used. In such an environment you should use
--enable-ebcdic-nl25
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
acter (which, in Unicode, is 0x85).
@ -2905,32 +2912,32 @@ USING EBCDIC CODE
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
By default, pcre2grep reads all files as plain text. You can build it
so that it recognizes files whose names end in .gz or .bz2, and reads
By default, pcre2grep reads all files as plain text. You can build it
so that it recognizes files whose names end in .gz or .bz2, and reads
them with libz or libbz2, respectively, by adding one or both of
--enable-pcre2grep-libz
--enable-pcre2grep-libbz2
to the configure command. These options naturally require that the rel-
evant libraries are installed on your system. Configuration will fail
evant libraries are installed on your system. Configuration will fail
if they are not.
PCRE2GREP BUFFER SIZE
pcre2grep uses an internal buffer to hold a "window" on the file it is
pcre2grep uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when
it finds a match. The size of the buffer is controlled by a parameter
it finds a match. The size of the buffer is controlled by a parameter
whose default value is 20K. The buffer itself is three times this size,
but because of the way it is used for holding "before" lines, the long-
est line that is guaranteed to be processable is the parameter size.
est line that is guaranteed to be processable is the parameter size.
You can change the default parameter value by adding, for example,
--with-pcre2grep-bufsize=50K
to the configure command. The caller of pcre2grep can, however, over-
ride this value by specifying a run-time option.
to the configure command. The caller of pcre2grep can override this
value by using --buffer-size on the command line..
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
@ -2940,26 +2947,26 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
--enable-pcre2test-libreadline
--enable-pcre2test-libedit
to the configure command, pcre2test is linked with the libreadline
to the configure command, pcre2test is linked with the libreadline
orlibedit library, respectively, and when its input is from a terminal,
it reads it using the readline() function. This provides line-editing
and history facilities. Note that libreadline is GPL-licensed, so if
you distribute a binary of pcre2test linked in this way, there may be
licensing issues. These can be avoided by linking with libedit (which
has a BSD licence) instead.
it reads it using the readline() function. This provides line-editing
and history facilities. Note that libreadline is GPL-licensed, so if
you distribute a binary of pcre2test linked in this way, there may be
licensing issues. These can be avoided by linking instead with libedit,
which has a BSD licence.
Setting this option causes the -lreadline option to be added to the
pcre2test build. In many operating environments with a system-installed
readline library this is sufficient. However, in some environments
(e.g. if an unmodified distribution version of readline is in use),
some extra configuration may be necessary. The INSTALL file for
libreadline says this:
Setting --enable-pcre2test-libreadline causes the -lreadline option to
be added to the pcre2test build. In many operating environments with a
system-installed readline library this is sufficient. However, in some
environments (e.g. if an unmodified distribution version of readline is
in use), some extra configuration may be necessary. The INSTALL file
for libreadline says this:
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
which link with readline the to choose an appropriate library."
If your environment has not been set up so that an appropriate library
If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like
LIBS="-lncurses"
@ -2969,19 +2976,19 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
DEBUGGING WITH VALGRIND SUPPORT
By adding the
If you add
--enable-valgrind
option to to the configure command, PCRE2 will use valgrind annotations
to mark certain memory regions as unaddressable. This allows it to
detect invalid memory accesses, and is mostly useful for debugging
PCRE2 itself.
to the configure command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect
invalid memory accesses, and is mostly useful for debugging PCRE2
itself.
CODE COVERAGE REPORTING
If your C compiler is gcc, you can build a version of PCRE2 that can
If your C compiler is gcc, you can build a version of PCRE2 that can
generate a code coverage report for its test suite. To enable this, you
must install lcov version 1.6 or above. Then specify
@ -2990,20 +2997,20 @@ CODE COVERAGE REPORTING
to the configure command and build PCRE2 in the usual way.
Note that using ccache (a caching C compiler) is incompatible with code
coverage reporting. If you have configured ccache to run automatically
coverage reporting. If you have configured ccache to run automatically
on your system, you must set the environment variable
CCACHE_DISABLE=1
before running make to build PCRE2, so that ccache is not used.
When --enable-coverage is used, the following additional targets are
When --enable-coverage is used, the following additional targets are
added to the Makefile:
make coverage
This creates a fresh coverage report for the PCRE2 test suite. It is
equivalent to running "make coverage-reset", "make coverage-baseline",
This creates a fresh coverage report for the PCRE2 test suite. It is
equivalent to running "make coverage-reset", "make coverage-baseline",
"make check", and then "make coverage-report".
make coverage-reset
@ -3020,18 +3027,18 @@ CODE COVERAGE REPORTING
make coverage-clean-report
This removes the generated coverage report without cleaning the cover-
This removes the generated coverage report without cleaning the cover-
age data itself.
make coverage-clean-data
This removes the captured coverage data without removing the coverage
This removes the captured coverage data without removing the coverage
files created at compile time (*.gcno).
make coverage-clean
This cleans all coverage data including the generated coverage report.
For more information about code coverage, see the gcov and lcov docu-
This cleans all coverage data including the generated coverage report.
For more information about code coverage, see the gcov and lcov docu-
mentation.
@ -3049,7 +3056,7 @@ AUTHOR
REVISION
Last updated: 03 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@ -3122,62 +3129,59 @@ MISSING CALLOUTS
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
that what follows cannot be part of the repeat. For example, a+[bc] is
compiled as if it were a++[bc]. The pcre2test output when this pattern
is anchored and then applied with automatic callouts to the string
"aaaa" is:
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
to the string "aaaa" is:
--->aaaa
+0 ^ ^
+1 ^ a+
+3 ^ ^ [bc]
+0 ^ a+
+2 ^ ^ [bc]
No match
This indicates that when matching [bc] fails, there is no backtracking
into a+ and therefore the callouts that would be taken for the back-
tracks do not occur. You can disable the auto-possessify feature by
passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
tern with (*NO_AUTO_POSSESS). If this is done in pcre2test (using the
/no_auto_possess qualifier), the output changes to this:
tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
--->aaaa
+0 ^ ^
+1 ^ a+
+3 ^ ^ [bc]
+3 ^ ^ [bc]
+3 ^ ^ [bc]
+3 ^^ [bc]
+0 ^ a+
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^^ [bc]
No match
This time, when matching [bc] fails, the matcher backtracks into a+ and
tries again, repeatedly, until a+ itself fails.
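The difference between the two outputs can be reproduced with a toy
model. The sketch below is not PCRE2 code; the function name is
hypothetical, and it only models an anchored match of a+[bc], counting
how many times the [bc] item would be tried (one callout line each).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an anchored match of a+[bc]. Returns the number of
   times [bc] is tried. If possessive is non-zero, a+ never gives
   back characters, as with the auto-possessified a++[bc]. */
int count_class_attempts(const char *subject, int possessive)
{
  size_t n = 0;
  int attempts = 0;

  assert(subject != NULL);
  while (subject[n] == 'a') n++;      /* greedy a+ takes every 'a' */
  if (n == 0) return 0;               /* a+ itself fails */

  for (size_t taken = n; taken >= 1; taken--)
    {
    attempts++;                       /* [bc] is tried here */
    if (subject[taken] == 'b' || subject[taken] == 'c') break;
    if (possessive) break;            /* no backtracking into a+ */
    }
  return attempts;
}
```

For the subject "aaaa" the model tries [bc] once in the possessive case
and four times otherwise, which agrees with the number of [bc] lines in
the two outputs shown above.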
Other optimizations that provide fast "no match" results also affect
Other optimizations that provide fast "no match" results also affect
callouts. For example, if the pattern is
ab(?C4)cd
PCRE2 knows that any matching string must contain the letter "d". If
the subject string is "abyz", the lack of "d" means that matching
doesn't ever start, and the callout is never reached. However, with
PCRE2 knows that any matching string must contain the letter "d". If
the subject string is "abyz", the lack of "d" means that matching
doesn't ever start, and the callout is never reached. However, with
"abyd", though the result is still no match, the callout is obeyed.
PCRE2 also knows the minimum length of a matching string, and will
immediately give a "no match" return without actually running a match
if the subject is not long enough, or, for unanchored patterns, if it
PCRE2 also knows the minimum length of a matching string, and will
immediately give a "no match" return without actually running a match
if the subject is not long enough, or, for unanchored patterns, if it
has been scanned far enough.
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
MIZE option to pcre2_compile(), or by starting the pattern with
(*NO_START_OPT). This slows down the matching process, but does ensure
MIZE option to pcre2_compile(), or by starting the pattern with
(*NO_START_OPT). This slows down the matching process, but does ensure
that callouts such as the example above are obeyed.
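As an illustration of why the callout in ab(?C4)cd may never be reached,
the two checks mentioned above can be modelled like this. It is a
simplified sketch with a hypothetical function name, not the code that
PCRE2 uses; the required character 'd' and minimum length 4 are the
values implied by the example pattern.

```c
#include <assert.h>
#include <string.h>

/* Toy model of two start-of-match checks: returns 0 when a match
   attempt (and so any callout) would be skipped altogether. */
int worth_starting(const char *subject, char required_char,
                   size_t min_length)
{
  assert(subject != NULL);
  if (strlen(subject) < min_length) return 0;           /* too short */
  if (strchr(subject, required_char) == NULL) return 0; /* no 'd' */
  return 1;
}
```

With these values, "abyz" is rejected without matching starting, whereas
"abyd" passes both checks, so matching starts and the callout can be
reached even though the final result is still no match.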
THE CALLOUT INTERFACE
During matching, when PCRE2 reaches a callout point, the external func-
tion that is set in the match context is called (if it is set). This
applies to both normal and DFA matching. The only argument to the call-
out function is a pointer to a pcre2_callout block. This structure con-
tains the following fields:
During matching, when PCRE2 reaches a callout point, if an external
function is set in the match context, it is called. This applies to
both normal and DFA matching. The only argument to the callout function
is a pointer to a pcre2_callout block. This structure contains the fol-
lowing fields:
uint32_t version;
uint32_t callout_number;
@ -3193,69 +3197,69 @@ THE CALLOUT INTERFACE
PCRE2_SIZE pattern_position;
PCRE2_SIZE next_item_length;
The version field contains the version number of the block format. The
The version field contains the version number of the block format. The
current version is 0. The version number will change in future if addi-
tional fields are added, but the intention is never to remove any of
tional fields are added, but the intention is never to remove any of
the existing fields.
The callout_number field contains the number of the callout, as com-
piled into the pattern (that is, the number after ?C for manual call-
The callout_number field contains the number of the callout, as com-
piled into the pattern (that is, the number after ?C for manual call-
outs, and 255 for automatically generated callouts).
The offset_vector field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match
data block. When pcre2_match() is used, the contents can be inspected,
in order to extract substrings that have been matched so far, in the
same way as for extracting substrings after a match has completed. For
(the "ovector") that was passed to the matching function in the match
data block. When pcre2_match() is used, the contents can be inspected
in order to extract substrings that have been matched so far, in the
same way as for extracting substrings after a match has completed. For
the DFA matching function, this field is not useful.
The subject and subject_length fields contain copies of the values that
were passed to the matching function.
The start_match field normally contains the offset within the subject
at which the current match attempt started. However, if the escape
sequence \K has been encountered, this value is changed to reflect the
modified starting point. If the pattern is not anchored, the callout
The start_match field normally contains the offset within the subject
at which the current match attempt started. However, if the escape
sequence \K has been encountered, this value is changed to reflect the
modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.
The current_position field contains the offset within the subject of
The current_position field contains the offset within the subject of
the current match pointer.
When pcre2_match() is used, the capture_top field contains one more
than the number of the highest numbered captured substring so far. If
than the number of the highest numbered captured substring so far. If
no substrings have been captured, the value of capture_top is one. This
is always the case when the DFA functions are used, because they do not
support captured substrings.
The capture_last field contains the number of the most recently cap-
tured substring. However, when a recursion exits, the value reverts to
what it was outside the recursion, as do the values of all captured
substrings. If no substrings have been captured, the value of cap-
The capture_last field contains the number of the most recently cap-
tured substring. However, when a recursion exits, the value reverts to
what it was outside the recursion, as do the values of all captured
substrings. If no substrings have been captured, the value of cap-
ture_last is 0. This is always the case for the DFA matching functions.
The callout_data field contains a value that is passed to a matching
function specifically so that it can be passed back in callouts. It is
set in the match context when the callout is set up by calling
The callout_data field contains a value that is passed to a matching
function specifically so that it can be passed back in callouts. It is
set in the match context when the callout is set up by calling
pcre2_set_callout() (see the pcre2api documentation).
The pattern_position field contains the offset to the next item to be
The pattern_position field contains the offset to the next item to be
matched in the pattern string.
The next_item_length field contains the length of the next item to be
The next_item_length field contains the length of the next item to be
matched in the pattern string. When the callout immediately precedes an
alternation bar, a closing parenthesis, or the end of the pattern, the
length is zero. When the callout precedes an opening parenthesis, the
alternation bar, a closing parenthesis, or the end of the pattern, the
length is zero. When the callout precedes an opening parenthesis, the
length is that of the entire subpattern.
The pattern_position and next_item_length fields are intended to help
in distinguishing between different automatic callouts, which all have
The pattern_position and next_item_length fields are intended to help
in distinguishing between different automatic callouts, which all have
the same callout number. However, they are set for all callouts.
In callouts from pcre2_match() the mark field contains a pointer to the
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
(*THEN) item in the match, or NULL if no such items have been passed.
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
(*THEN) item in the match, or NULL if no such items have been passed.
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
previous (*MARK). In callouts from the DFA matching function this field
always contains NULL.
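The way offset_vector and capture_top work together can be sketched as
follows. To stay self-contained, the sketch uses a hypothetical
callout_view structure that stands in for the relevant pcre2_callout
fields, and UNSET_OFFSET plays the role of PCRE2_UNSET. It assumes the
usual ovector layout in which group n occupies entries 2n and 2n+1, and
it deliberately handles only groups numbered 1 and above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a few fields of a pcre2_callout block. */
typedef struct {
  const size_t *offset_vector;  /* [2n] = start, [2n+1] = end */
  const char   *subject;
  uint32_t      capture_top;    /* one more than highest group so far */
} callout_view;

#define UNSET_OFFSET SIZE_MAX   /* plays the role of PCRE2_UNSET */

/* Returns 1 and sets *start and *length if group n has been captured
   so far; returns 0 if it has not. */
int group_so_far(const callout_view *cb, uint32_t n,
                 const char **start, size_t *length)
{
  size_t s, e;

  assert(cb != NULL && start != NULL && length != NULL);
  if (n == 0 || n >= cb->capture_top) return 0;
  s = cb->offset_vector[2*n];
  e = cb->offset_vector[2*n + 1];
  if (s == UNSET_OFFSET || e == UNSET_OFFSET || e < s) return 0;
  *start = cb->subject + s;
  *length = e - s;
  return 1;
}
```

A real callout function would read the same information from the
pcre2_callout block it is given instead of from this stand-in structure.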
@ -3263,16 +3267,16 @@ THE CALLOUT INTERFACE
RETURN VALUES
The external callout function returns an integer to PCRE2. If the value
is zero, matching proceeds as normal. If the value is greater than
zero, matching fails at the current point, but the testing of other
is zero, matching proceeds as normal. If the value is greater than
zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
failed. If the value is less than zero, the match is abandoned, and the
matching function returns the negative value.
Negative values should normally be chosen from the set of
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
reserved for use by callout functions; it will never be used by PCRE2
Negative values should normally be chosen from the set of
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
reserved for use by callout functions; it will never be used by PCRE2
itself.
@ -3285,7 +3289,7 @@ AUTHOR
REVISION
Last updated: 19 October 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@ -3487,12 +3491,12 @@ PCRE2 JUST-IN-TIME COMPILER SUPPORT
Just-in-time compiling is a heavyweight optimization that can greatly
speed up pattern matching. However, it comes at the cost of extra pro-
cessing before the match is performed. Therefore, it is of most benefit
when the same pattern is going to be matched many times. This does not
necessarily mean many calls of a matching function; if the pattern is
not anchored, matching attempts may take place many times at various
positions in the subject, even for a single call. Therefore, if the
subject string is very long, it may still pay to use JIT for one-off
cessing before the match is performed, so it is of most benefit when
the same pattern is going to be matched many times. This does not nec-
essarily mean many calls of a matching function; if the pattern is not
anchored, matching attempts may take place many times at various posi-
tions in the subject, even for a single call. Therefore, if the subject
string is very long, it may still pay to use JIT even for one-off
matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
@ -3558,8 +3562,8 @@ SIMPLE USE OF JIT
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
diately returns zero. This is an alternative way of testing if JIT is
available.
diately returns zero. This is an alternative way of testing whether JIT
is available.
At present, it is not possible to free JIT compiled code except when
the entire compiled pattern is freed by calling pcre2_free_code().
@ -3745,7 +3749,7 @@ JIT STACK FAQ
an already freed stack, as that will cause SEGFAULT. (Also, do not free
a stack currently used by pcre2_match() in another thread). You can
also replace the stack in a context at any time when it is not in use.
You can also free the previous stack before assigning a replacement.
You should free the previous stack before assigning a replacement.
(5) Should I allocate/free a stack every time before/after calling
pcre2_match()?
@ -3855,7 +3859,7 @@ AUTHOR
REVISION
Last updated: 12 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@ -4642,7 +4646,7 @@ WIDE CHARACTERS AND UTF MODES
UTF mode, but its use can lead to some strange effects because it
breaks up multi-unit characters (see the description of \C in the
pcre2pattern documentation). The use of \C is not supported in the
alternative matching function pcre2_dfa_exec(), nor is it supported in
alternative matching function pcre2_dfa_match(), nor is it supported in
UTF mode by the JIT optimization. If JIT optimization is requested for
a UTF pattern that contains \C, it will not succeed, and so the match-
ing will be carried out by the normal interpretive function.
@ -4701,14 +4705,14 @@ VALIDITY OF UTF STRINGS
In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor-
mance, for example in the case of a long subject string that is being
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK flag at compile
time or at run time, PCRE2 assumes that the pattern or subject it is
given (respectively) contains only valid UTF code unit sequences.
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
pile time or at match time, PCRE2 assumes that the pattern or subject
it is given (respectively) contains only valid UTF code unit sequences.
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
for the pattern; it does not also apply to subject strings. If you want
to disable the check for a subject string you must pass this option to
pcre2_exec() or pcre2_dfa_exec().
pcre2_match() or pcre2_dfa_match().
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
result is undefined and your program may crash or loop indefinitely.
@ -4807,7 +4811,7 @@ AUTHOR
REVISION
Last updated: 03 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------

View File

@ -1,4 +1,4 @@
.TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
.TH PCRE2API 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -2090,6 +2090,13 @@ returned by \fBpcre2_get_startchar()\fP. For a non-partial match, this can be
different to the value of \fIovector[0]\fP if the pattern contains the \eK
escape sequence. After a partial match, however, this value is always the same
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
.P
The \fBstartchar\fP field is also used to return the offset of an invalid
UTF character when UTF checking fails. Details are given in the
.\" HREF
\fBpcre2unicode\fP
.\"
page.
.
.
.\" HTML <a name="errorlist"></a>
@ -2707,6 +2714,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 21 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2BUILD 3 "03 November 2014" "PCRE2 10.00"
.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@ -74,12 +74,12 @@ respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the \fBconfigure\fP command:
.sp
--enable-pcre16
--enable-pcre32
--enable-pcre2-16
--enable-pcre2-32
.sp
If you do not want the 8-bit library, add
.sp
--disable-pcre8
--disable-pcre2-8
.sp
as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit
@ -91,15 +91,16 @@ libraries.
.rs
.sp
The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared
and static libraries by default. You can suppress one of these by adding one of
and static libraries by default. You can suppress an unwanted library by adding
one of
.sp
--disable-shared
--disable-static
.sp
to the \fBconfigure\fP command, as required.
to the \fBconfigure\fP command.
.
.
.SH "Unicode and UTF SUPPORT"
.SH "UNICODE AND UTF SUPPORT"
.rs
.sp
By default, PCRE2 is built with support for Unicode and UTF character strings.
@ -112,18 +113,14 @@ is not possible to build one library with Unicode support, and another without,
in the same configuration.
.P
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
\fBpcre2_compile()\fP to compile a pattern.
or UTF-32. To do that, applications that use the library have to set the
PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
.P
It is not possible to support both EBCDIC and UTF-8 codes in the same version
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
exclusive.
.P
UTF support allows the libraries to process character codepoints up to 0x10ffff
in the strings that they handle. It also provides support for accessing the
properties of such characters, using pattern escapes such as \eP, \ep, and \eX.
Only the general category properties such as \fILu\fP and \fINd\fP are
supported. Details are given in the
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for
accessing the Unicode properties of such characters, using pattern escapes such
as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
\fINd\fP are supported. Details are given in the
.\" HREF
\fBpcre2pattern\fP
.\"
@ -138,7 +135,7 @@ Just-in-time compiler support is included in the build by specifying
--enable-jit
.sp
This support is available only for certain hardware architectures. If this
option is set for an unsupported architecture, a compile time error occurs.
option is set for an unsupported architecture, a build error occurs.
See the
.\" HREF
\fBpcre2jit\fP
@ -151,7 +148,7 @@ pcre2grep automatically makes use of it, unless you add
to the "configure" command.
.
.
.SH "CODE VALUE OF NEWLINE"
.SH "NEWLINE RECOGNITION"
.rs
.sp
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
@ -160,11 +157,12 @@ compile PCRE2 to use carriage return (CR) instead, by adding
.sp
--enable-newline-is-cr
.sp
to the \fBconfigure\fP command. There is also a --enable-newline-is-lf option,
to the \fBconfigure\fP command. There is also an --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character.
.sp
Alternatively, you can specify that line endings are to be indicated by the two
character sequence CRLF. If you want this, add
.P
Alternatively, you can specify that line endings are to be indicated by the
two-character sequence CRLF (CR immediately followed by LF). If you want this,
add
.sp
--enable-newline-is-crlf
.sp
@ -177,10 +175,13 @@ indicating a line ending. Finally, a fifth option, specified by
.sp
--enable-newline-is-any
.sp
causes PCRE2 to recognize any Unicode newline sequence.
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
.P
Whatever line ending convention is selected when PCRE2 is built can be
overridden when the library functions are called. At build time it is
Whatever default line ending convention is selected when PCRE2 is built can be
overridden by applications that use the library. At build time it is
conventional to use the standard for your operating system.
.
.
@ -188,12 +189,13 @@ conventional to use the standard for your operating system.
.rs
.sp
By default, the sequence \eR in a pattern matches any Unicode newline sequence,
whatever has been selected as the line ending sequence. If you specify
independently of what has been selected as the line ending sequence. If you
specify
.sp
--enable-bsr-anycrlf
.sp
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden when the library functions are
selected when PCRE2 is built can be overridden by applications that use the
library.
.
.
@ -204,10 +206,10 @@ Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic patterns.
Nevertheless, some people do want to process truly enormous patterns, so it is
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
setting such as
around 64K code units. This is sufficient to handle all but the most gigantic
patterns. Nevertheless, some people do want to process truly enormous patterns,
so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
adding a setting such as
.sp
--with-link-size=3
.sp
@ -299,16 +301,19 @@ hand".)
.rs
.sp
PCRE2 assumes by default that it will run in an environment where the character
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an
EBCDIC environment by adding
8-bit EBCDIC environment by adding
.sp
--enable-ebcdic --disable-unicode
.sp
to the \fBconfigure\fP command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in
an EBCDIC environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with Unicode support.
an EBCDIC environment (for example, an IBM mainframe operating system).
.P
It is not possible to support both EBCDIC and UTF-8 codes in the same version
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
exclusive.
.P
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
@ -354,8 +359,8 @@ parameter value by adding, for example,
.sp
--with-pcre2grep-bufsize=50K
.sp
to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can, however,
override this value by specifying a run-time option.
to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override this
value by using --buffer-size on the command line.
.
.
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
@ -371,15 +376,15 @@ to the \fBconfigure\fP command, \fBpcre2test\fP is linked with the
from a terminal, it reads it using the \fBreadline()\fP function. This provides
line-editing and history facilities. Note that \fBlibreadline\fP is
GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this
way, there may be licensing issues. These can be avoided by linking with
\fBlibedit\fP (which has a BSD licence) instead.
way, there may be licensing issues. These can be avoided by linking instead
with \fBlibedit\fP, which has a BSD licence.
.P
Setting this option causes the \fB-lreadline\fP option to be added to the
\fBpcre2test\fP build. In many operating environments with a system-installed
readline library this is sufficient. However, in some environments (e.g. if an
unmodified distribution version of readline is in use), some extra
configuration may be necessary. The INSTALL file for \fBlibreadline\fP says
this:
Setting --enable-pcre2test-libreadline causes the \fB-lreadline\fP option to be
added to the \fBpcre2test\fP build. In many operating environments with a
system-installed readline library this is sufficient. However, in some
environments (e.g. if an unmodified distribution version of readline is in
use), some extra configuration may be necessary. The INSTALL file for
\fBlibreadline\fP says this:
.sp
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
@ -396,13 +401,13 @@ immediately before the \fBconfigure\fP command.
.SH "DEBUGGING WITH VALGRIND SUPPORT"
.rs
.sp
By adding the
If you add
.sp
--enable-valgrind
.sp
option to to the \fBconfigure\fP command, PCRE2 will use valgrind annotations
to mark certain memory regions as unaddressable. This allows it to detect
invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
to the \fBconfigure\fP command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect invalid
memory accesses, and is mostly useful for debugging PCRE2 itself.
.
.
.SH "CODE COVERAGE REPORTING"
@ -482,6 +487,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "19 October 2014" "PCRE2 10.00"
.TH PCRE2CALLOUT 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -68,29 +68,27 @@ expect.
.P
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is anchored
and then applied with automatic callouts to the string "aaaa" is:
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
"aaaa" is:
.sp
--->aaaa
+0 ^ ^
+1 ^ a+
+3 ^ ^ [bc]
+0 ^ a+
+2 ^ ^ [bc]
No match
.sp
This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur.
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
to \fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). If
this is done in \fBpcre2test\fP (using the /no_auto_possess qualifier), the
output changes to this:
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
case, the output changes to this:
.sp
--->aaaa
+0 ^ ^
+1 ^ a+
+3 ^ ^ [bc]
+3 ^ ^ [bc]
+3 ^ ^ [bc]
+3 ^^ [bc]
+0 ^ a+
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^^ [bc]
No match
.sp
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@ -119,10 +117,10 @@ callouts such as the example above are obeyed.
.SH "THE CALLOUT INTERFACE"
.rs
.sp
During matching, when PCRE2 reaches a callout point, the external function that
is set in the match context is called (if it is set). This applies to both
normal and DFA matching. The only argument to the callout function is a pointer
to a \fBpcre2_callout\fP block. This structure contains the following fields:
During matching, when PCRE2 reaches a callout point, if an external function is
set in the match context, it is called. This applies to both normal and DFA
matching. The only argument to the callout function is a pointer to a
\fBpcre2_callout\fP block. This structure contains the following fields:
.sp
uint32_t \fIversion\fP;
uint32_t \fIcallout_number\fP;
@ -149,7 +147,7 @@ automatically generated callouts).
.P
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match data
block. When \fBpcre2_match()\fP is used, the contents can be inspected, in
block. When \fBpcre2_match()\fP is used, the contents can be inspected in
order to extract substrings that have been matched so far, in the same way as
for extracting substrings after a match has completed. For the DFA matching
function, this field is not useful.
@ -238,6 +236,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 19 October 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2GREP 1 "28 September 2014" "PCRE2 10.00"
.TH PCRE2GREP 1 "23 November 2014" "PCRE2 10.00"
.SH NAME
pcre2grep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS
@ -403,8 +403,8 @@ used. There is no short form for this option.
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
strings. The \fBpcre2_exec()\fP function that is called by \fBpcre2grep\fP to do
the matching has two parameters that can limit the resources that it uses.
strings. The \fBpcre2_match()\fP function that is called by \fBpcre2grep\fP to
do the matching has two parameters that can limit the resources that it uses.
.sp
The \fB--match-limit\fP option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a very
@ -678,6 +678,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 28 September 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

@ -446,7 +446,7 @@ OPTIONS
very large amount of memory, leading in some cases to a pro-
gram crash if not enough is available. Other patterns may
take a very long time to search for all possible matching
strings. The pcre2_exec() function that is called by
strings. The pcre2_match() function that is called by
pcre2grep to do the matching has two parameters that can
limit the resources that it uses.
@ -737,5 +737,5 @@ AUTHOR
REVISION
Last updated: 28 September 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.

@ -1,4 +1,4 @@
.TH PCRE2JIT 3 "12 November 2014" "PCRE2 10.00"
.TH PCRE2JIT 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -6,11 +6,11 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is
going to be matched many times. This does not necessarily mean many calls of a
matching function; if the pattern is not anchored, matching attempts may take
place many times at various positions in the subject, even for a single call.
Therefore, if the subject string is very long, it may still pay to use JIT for
match is performed, so it is of most benefit when the same pattern is going to
be matched many times. This does not necessarily mean many calls of a matching
function; if the pattern is not anchored, matching attempts may take place many
times at various positions in the subject, even for a single call. Therefore,
if the subject string is very long, it may still pay to use JIT even for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
.P
@ -77,7 +77,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
\fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
returns zero. This is an alternative way of testing if JIT is available.
returns zero. This is an alternative way of testing whether JIT is available.
.P
At present, it is not possible to free JIT compiled code except when the entire
compiled pattern is freed by calling \fBpcre2_free_code()\fP.
@ -276,7 +276,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just \fIdo
not\fP call \fBpcre2_match()\fP with a match context pointing to an already
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
used by \fBpcre2_match()\fP in another thread). You can also replace the stack
in a context at any time when it is not in use. You can also free the previous
in a context at any time when it is not in use. You should free the previous
stack before assigning a replacement.
.P
(5) Should I allocate/free a stack every time before/after calling
@ -398,6 +398,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 12 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -394,7 +394,7 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_exec(), not increase them.
limits set by the caller of pcre2_match(), not increase them.
.
.
.SH "NEWLINE CONVENTION"
@ -536,6 +536,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 14 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -200,7 +200,7 @@ input lines. Each set starts with a regular expression pattern, followed by any
number of subject lines to be matched against that pattern. In between sets of
test data, command lines that begin with a hash (#) character may appear. This
file format, with some restrictions, can also be processed by the
\fBperltest.pl\fP script that is distributed with PCRE2 as a means of checking
\fBperltest.sh\fP script that is distributed with PCRE2 as a means of checking
that the behaviour of PCRE2 and Perl is the same.
.P
Each subject line is matched separately and independently. If you want to do
@ -243,11 +243,11 @@ patterns. Modifiers on a pattern can change these settings.
#perltest
.sp
The appearance of this line causes all subsequent modifier settings to be
checked for compatibility with the \fBperltest.pl\fP script, which is used to
checked for compatibility with the \fBperltest.sh\fP script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
test files that are also processed by \fBperltest.pl\fP. The \fB#perltest\fP
test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
command helps detect tests that are accidentally put in the wrong file.
.sp
#subject <modifier-list>
@ -265,7 +265,7 @@ for both patterns and subject lines, whereas others are valid for one or the
other only. Each modifier has a long name, for example "anchored", and some of
them must be followed by an equals sign and a value, for example, "offset=12".
Modifiers that do not take values may be preceded by a minus sign to turn off a
previous default setting.
previous setting.
.P
A few of the more common modifiers can also be specified as single letters, for
example "i" for "caseless". In documentation, following the Perl convention,
@ -336,7 +336,7 @@ encoding non-printing characters in a visible way:
\exhh hexadecimal byte (up to 2 hex digits)
\ex{hh...} hexadecimal character (any number of hex digits)
.sp
The use of \ex{hh...} is not dependent on the use of the utf modifier on
The use of \ex{hh...} is not dependent on the use of the \fButf\fP modifier on
the pattern. It is recognized always. There may be any number of hexadecimal
digits inside the braces; invalid values provoke error messages.
.P
@ -366,7 +366,7 @@ part of the file. For example:
is converted to "abcabcabcabc". This feature does not support nesting. To
include a closing square bracket in the characters, code it as \ex5D.
.P
A backslash followed by an equals sign marke the end of the subject string and
A backslash followed by an equals sign marks the end of the subject string and
the start of a modifier list. For example:
.sp
abc\e=notbol,notempty
@ -461,8 +461,8 @@ set to "anycrlf", \eR matches CR, LF, or CRLF only. If it is set to "unicode",
is built, with the default default being Unicode.
.P
The \fBnewline\fP modifier specifies which characters are to be interpreted as
newlines, both in the pattern and (by default) in subject lines. The type must
be one of CR, LF, CRLF, ANYCRLF, or ANY.
newlines, both in the pattern and in subject lines. The type must be one of CR,
LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
.
.
.SS "Information about a pattern"
@ -478,8 +478,8 @@ link sizes and different code unit widths. By using \fBbincode\fP, the same
regression tests can be used in different environments.
.P
The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and
offset values. This is used in a few special tests and is also useful for
one-off tests.
offset values. This is used in a few special tests that run only for specific
code unit widths and link sizes, and is also useful for one-off tests.
.P
The \fBinfo\fP modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The
@ -501,13 +501,14 @@ some typical examples:
Last code unit = 'c' (caseless)
Subject length lower bound = 3
.sp
"Compile options" are those specified to the compile function; "overall
options" have added options that are taken or deduced from the pattern. If both
sets of options are the same, just a single "options" line is output. "First
code unit" is where any match must start; if there is more than one they are
listed as "starting code units". "Last code unit" is the last literal code unit
that must be present in any match. This is not necessarily the last character.
These lines are omitted if no starting or ending code units are recorded.
"Compile options" are those specified by modifiers; "overall options" have
added options that are taken or deduced from the pattern. If both sets of
options are the same, just a single "options" line is output; if there are no
options, the line is omitted. "First code unit" is where any match must start;
if there is more than one they are listed as "starting code units". "Last code
unit" is the last literal code unit that must be present in any match. This is
not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
.
.
.SS "Specifying a pattern in hex"
@ -520,16 +521,16 @@ pairs. For example:
/ab 32 59/hex
.sp
This feature is provided as a way of creating patterns that contain binary zero
characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
However, for patterns specified in hexadecimal, the actual length of the
pattern is passed.
and other non-printing characters. By default, \fBpcre2test\fP passes patterns
as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
actual length of the pattern is passed.
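
The conversion described above can be sketched in C. This is only an illustration under stated assumptions: `decode_hex_pattern()` and `hex_value()` are hypothetical names, not functions from \fBpcre2test\fP, and the sketch handles nothing beyond hexadecimal pairs separated by optional white space. The point it shows is that the decoded pattern carries an explicit byte count, so a binary zero inside it does not end the pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: value of one hexadecimal digit, or -1 if c is not one. */
static int hex_value(int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  c = tolower(c);
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

/* Hypothetical helper: decode text such as "ab 32 59" into raw bytes.
   White space between pairs is skipped. Returns the number of bytes
   written, or -1 for an odd digit, a non-hex character, or a full buffer. */
long decode_hex_pattern(const char *text, unsigned char *out, size_t outsize)
{
  size_t n = 0;
  assert(text != NULL && out != NULL);
  while (*text != '\0')
  {
    int hi, lo;
    if (isspace((unsigned char)*text)) { text++; continue; }
    hi = hex_value((unsigned char)text[0]);
    if (hi < 0) return -1;
    /* text[1] may be the terminating zero, which is rejected as non-hex. */
    lo = hex_value((unsigned char)text[1]);
    if (lo < 0) return -1;
    if (n >= outsize) return -1;
    out[n++] = (unsigned char)((hi << 4) | lo);
    text += 2;
  }
  return (long)n;
}
```

For the input "61 00 62" the helper yields three bytes with a zero in the middle. Treated as a zero-terminated string, the same buffer would appear to be one byte long, which is why the actual length has to be passed for hex patterns.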
.
.
.SS "JIT compilation"
.rs
.sp
The \fB/jit\fP modifier may optionally be followed by and equals sign and a
The \fB/jit\fP modifier may optionally be followed by an equals sign and a
number in the range 0 to 7:
.sp
0 disable JIT
@ -561,7 +562,7 @@ pattern shows whether JIT compilation was or was not successful. If
\fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT
compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is
added to the first output line after a match or non match when JIT-compiled
code was actually used.
code was actually used in the match.
.
.
.SS "Setting a locale"
@ -645,8 +646,8 @@ be aborted.
.SS "Using alternative character tables"
.rs
.sp
The \fB/tables\fP modifier must be followed by a single digit. It causes a
specific set of built-in character tables to be passed to
The value specified for the \fB/tables\fP modifier must be one of the digits 0,
1, or 2. It causes a specific set of built-in character tables to be passed to
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
.sp
@ -759,13 +760,13 @@ The effects of these modifiers are described in the following sections.
.SS "Showing more text"
.rs
.sp
The \fBaftertext\fP modifier requests that as well as outputting the substring
that matched the entire pattern, \fBpcre2test\fP should in addition output the
remainder of the subject string. This is useful for tests where the subject
contains multiple copies of the same substring. The \fBallaftertext\fP modifier
requests the same action for captured substrings as well as the main matched
substring. In each case the remainder is output on the following line with a
plus character following the capture number.
The \fBaftertext\fP modifier requests that as well as outputting the part of
the subject string that matched the entire pattern, \fBpcre2test\fP should in
addition output the remainder of the subject string. This is useful for tests
where the subject contains multiple copies of the same substring. The
\fBallaftertext\fP modifier requests the same action for captured substrings as
well as the main matched substring. In each case the remainder is output on the
following line with a plus character following the capture number.
.P
The \fBallusedtext\fP modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown. This
@ -782,7 +783,8 @@ underneath them. Here is an example:
<<< >>>
.sp
This shows that the matched string is "abc", with the preceding and following
strings "pqr" and "xyz" also consulted during the match.
strings "pqr" and "xyz" having been consulted during the match (when processing
the assertions).
.P
The \fBstartchar\fP modifier requests that the starting character for the match
be indicated, if it is different to the start of the matched string. The only
@ -836,7 +838,7 @@ function is called again to search the remainder of the subject. The difference
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
to start searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened substring. This makes a
does), whereas the latter passes over a shortened subject. This makes a
difference to the matching process if the pattern begins with a lookbehind
assertion (including \eb or \eB).
.P
@ -847,7 +849,7 @@ fails, the start offset is advanced, and the normal match is retried. This
imitates the way Perl handles such cases when using the \fB/g\fP modifier or
the \fBsplit()\fP function. Normally, the start offset is advanced by one
character, but if the newline convention recognizes CRLF as a newline, and the
current character is CR followed by LF, an advance of two is used.
current character is CR followed by LF, an advance of two characters occurs.
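
The advance rule can be sketched as a small C function. This is a minimal illustration, not code from \fBpcre2test\fP: the name `advance_after_empty_match()` and its parameters are assumptions, and it assumes an 8-bit, non-UTF subject, so that one character is one code unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the offset at which the retry after an empty
   match failed, return the offset for the next normal match attempt.
   crlf_is_newline is non-zero when the newline convention recognizes CRLF. */
size_t advance_after_empty_match(const unsigned char *subject, size_t length,
  size_t offset, int crlf_is_newline)
{
  assert(subject != NULL && offset < length);
  if (crlf_is_newline && subject[offset] == '\r' &&
      offset + 1 < length && subject[offset + 1] == '\n')
    return offset + 2;   /* step over CR and LF together */
  return offset + 1;     /* step over a single character */
}
```

Stepping over CRLF as a unit avoids starting a match attempt between the CR and the LF. In UTF mode the single-character step would have to cover every code unit of the character, which this sketch does not attempt.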
.
.
.SS "Testing substring extraction functions"
@ -860,9 +862,9 @@ for example:
.sp
abcd\e=copy=1,copy=3,get=G1
.sp
If the \fB#subject\fP command is used to set default copy and get lists, these
can be unset by specifying a negative number for numbered groups and an empty
name for named groups.
If the \fB#subject\fP command is used to set default copy and/or get lists,
these can be unset by specifying a negative number to cancel all numbered
groups and an empty name to cancel all named groups.
.P
The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which
extracts all captured substrings.
@ -871,7 +873,8 @@ If the subject line is successfully matched, the substrings extracted by the
convenience functions are output with C, G, or L after the string number
instead of a colon. This is in addition to the normal full list. The string
length (that is, the return from the extraction function) is given in
parentheses after each substring.
parentheses after each substring, followed by the name when the extraction was
by name.
.
.
.SS "Testing the substitution function"
@ -1044,11 +1047,10 @@ entire substring that was inspected during the partial match; it may include
characters before the actual match start if a lookbehind assertion, \eK, \eb,
or \eB was involved.)
.P
For any other return, \fBpcre2test\fP outputs the PCRE2
negative error number and a short descriptive phrase. If the error is a failed
UTF string check, the offset of the start of the failing character and the
reason code are also output. Here is an example of an interactive
\fBpcre2test\fP run.
For any other return, \fBpcre2test\fP outputs the PCRE2 negative error number
and a short descriptive phrase. If the error is a failed UTF string check, the
code unit offset of the start of the failing character is also output. Here is
an example of an interactive \fBpcre2test\fP run.
.sp
$ pcre2test
PCRE2 version 9.00 2014-05-10
@ -1061,10 +1063,10 @@ reason code are also output. Here is an example of an interactive
No match
.sp
Unset capturing substrings that are not followed by one that is set are not
returned by \fBpcre2_match()\fP, and are not shown by \fBpcre2test\fP. In the
following example, there are two capturing substrings, but when the first data
line is matched, the second, unset substring is not shown. An "internal" unset
substring is shown as "<unset>", as for the second data line.
shown by \fBpcre2test\fP unless the \fBallcaptures\fP modifier is specified. In
the following example, there are two capturing substrings, but when the first
data line is matched, the second, unset substring is not shown. An "internal"
unset substring is shown as "<unset>", as for the second data line.
.sp
re> /(a)|(b)/
data> a
@ -1100,8 +1102,8 @@ are output in sequence, like this:
1: pp
.sp
"No match" is output only if the first match attempt fails. Here is an example
of a failure message (the offset 4 that is specified by \e>4 is past the end of
the subject string):
of a failure message (the offset 4 that is specified by the \fBoffset\fP
modifier is past the end of the subject string):
.sp
re> /xyz/
data> xyz\e=offset=4
@ -1127,12 +1129,13 @@ the subject where there is at least one match. For example:
1: tang
2: tan
.sp
(Using the normal matching function on this data finds only "tang".) The
Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero). After a
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
partially matching substring. (Note that this is the entire substring that was
partially matching substring. Note that this is the entire substring that was
inspected during the partial match; it may include characters before the actual
match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
match start if a lookbehind assertion, \eb, or \eB was involved. (\eK is not
supported for DFA matching.)
.P
If global matching is requested, the search for further matches resumes
at the end of the longest match. For example:
@ -1174,9 +1177,9 @@ documentation.
.SH CALLOUTS
.rs
.sp
If the pattern contains any callout requests, \fBpcre2test\fP's callout function
is called during matching. This works with both matching functions. By default,
the called function displays the callout number, the start and current
If the pattern contains any callout requests, \fBpcre2test\fP's callout
function is called during matching. This works with both matching functions. By
default, the called function displays the callout number, the start and current
positions in the text at the callout time, and the next pattern item to be
tested. For example:
.sp
@ -1271,6 +1274,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 14 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "03 November 2014" "PCRE2 10.00"
.TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@ -64,7 +64,7 @@ characters (see the description of \eC in the
\fBpcre2pattern\fP
.\"
documentation). The use of \eC is not supported in the alternative matching
function \fBpcre2_dfa_exec()\fP, nor is it supported in UTF mode by the JIT
function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains
\eC, it will not succeed, and so the matching will be carried out by the normal
interpretive function.
@ -108,7 +108,10 @@ case-equivalent, and these are treated as such.
.sp
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions.
If an invalid UTF string is passed, an error return is given.
If an invalid UTF string is passed, a negative error code is returned. The
code unit offset to the offending character can be extracted from the match
data block by calling \fBpcre2_get_startchar()\fP, which is used for this
purpose after a UTF error.
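
What "code unit offset to the offending character" means can be shown with a deliberately simplified UTF-8 check. The following is a sketch under stated assumptions and is not PCRE2's validity checker: the function name and the `SK_` error values are hypothetical, and it detects only truncated sequences, bad continuation bytes, and bytes that cannot start a character (overlong forms, surrogates, and code points above 0x10ffff are not checked). The reported offset is that of the first code unit of the failing character, not of the byte where the problem was noticed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch only. */
enum { SK_OK = 0, SK_TRUNCATED = -1, SK_BAD_CONTINUATION = -2, SK_BAD_LEAD = -3 };

/* Simplified UTF-8 check. On failure, *erroroffset is set to the offset of
   the start of the failing character. */
int sketch_check_utf8(const unsigned char *s, size_t length, size_t *erroroffset)
{
  size_t i = 0;
  assert(s != NULL && erroroffset != NULL);
  while (i < length)
  {
    unsigned char c = s[i];
    size_t extra, k;
    if (c < 0x80) { i++; continue; }          /* one-byte character */
    if (c < 0xc0 || c >= 0xf8)                /* cannot start a character */
    {
      *erroroffset = i;
      return SK_BAD_LEAD;
    }
    extra = (c < 0xe0) ? 1 : (c < 0xf0) ? 2 : 3;
    if (length - i - 1 < extra)               /* subject ends too soon */
    {
      *erroroffset = i;
      return SK_TRUNCATED;
    }
    for (k = 1; k <= extra; k++)
      if ((s[i + k] & 0xc0) != 0x80)          /* not of the form 10xxxxxx */
      {
        *erroroffset = i;
        return SK_BAD_CONTINUATION;
      }
    i += extra + 1;
  }
  return SK_OK;
}
```

For the subject "XXX" followed by the bytes 0xef 0x80, the sketch reports a truncated character at offset 3. After a real UTF error from a PCRE2 matching function, a caller would obtain the corresponding offset from the match data block with \fBpcre2_get_startchar()\fP, as described above.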
.P
UTF-16 and UTF-32 strings can indicate their endianness by a special code known
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
@ -130,14 +133,14 @@ UTF-32.)
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
assumes that the pattern or subject it is given (respectively) contains only
valid UTF code unit sequences.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this option to \fBpcre2_exec()\fP
or \fBpcre2_dfa_exec()\fP.
the check for a subject string you must pass this option to \fBpcre2_match()\fP
or \fBpcre2_dfa_match()\fP.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely.
@ -249,6 +252,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 November 2014
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

@ -3224,12 +3224,8 @@ multiunit character. */
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
if (match_data->rc != 0)
{
match_data->leftchar = 0;
return match_data->rc;
}
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
if (match_data->rc != 0) return match_data->rc;
#if PCRE2_CODE_UNIT_WIDTH != 32
if (start_offset > 0 && start_offset < length &&
NOT_FIRSTCHAR(subject[start_offset]))

@ -6459,12 +6459,8 @@ multiunit character. */
#ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
if (match_data->rc != 0)
{
match_data->leftchar = 0;
return match_data->rc;
}
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
if (match_data->rc != 0) return match_data->rc;
#if PCRE2_CODE_UNIT_WIDTH != 32
if (start_offset > 0 && start_offset < length &&
NOT_FIRSTCHAR(subject[start_offset]))

@ -5570,6 +5570,13 @@ else for (gmatched = 0;; gmatched++)
fprintf(outfile, "Failed: error %d: ", capcount);
PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile);
if (capcount <= PCRE2_ERROR_UTF8_ERR1 &&
capcount >= PCRE2_ERROR_UTF32_ERR2)
{
PCRE2_SIZE startchar;
PCRE2_GET_STARTCHAR(startchar, match_data);
fprintf(outfile, " at offset %lu", (unsigned long)startchar);
}
fprintf(outfile, "\n");
break;
}

testdata/testinput10

@ -48,12 +48,12 @@
/テテテxxx/utf
/badutf/utf
\xdf
\xef
\xef\x80
\xf7
\xf7\x80
\xf7\x80\x80
X\xdf
XX\xef
XXX\xef\x80
X\xf7
XX\xf7\x80
XXX\xf7\x80\x80
\xfb
\xfb\x80
\xfb\x80\x80
@ -89,14 +89,14 @@
\xff
/badutf/utf
\xfb\x80\x80\x80\x80
\xfd\x80\x80\x80\x80\x80
\xf7\xbf\xbf\xbf
XX\xfb\x80\x80\x80\x80
XX\xfd\x80\x80\x80\x80\x80
XX\xf7\xbf\xbf\xbf
/shortutf/utf
\xdf\=ph
\xef\=ph
\xef\x80\=ph
XX\xdf\=ph
XX\xef\=ph
XX\xef\x80\=ph
\xf7\=ph
\xf7\x80\=ph
\xf7\x80\x80\=ph
@ -111,9 +111,9 @@
\xfd\x80\x80\x80\x80\=ph
/anything/utf
\xc0\x80
\xc1\x8f
\xe0\x9f\x80
X\xc0\x80
XX\xc1\x8f
XXX\xe0\x9f\x80
\xf0\x8f\x80\x80
\xf8\x87\x80\x80\x80
\xfc\x83\x80\x80\x80\x80

testdata/testinput12

@ -157,18 +157,18 @@
/^[\QĀ\E-\QŐ\E/B,utf
/X/utf
\x{d800}
\x{d800}\=no_utf_check
\x{da00}
\x{da00}\=no_utf_check
\x{dc00}
\x{dc00}\=no_utf_check
\x{de00}
\x{de00}\=no_utf_check
\x{dfff}
\x{dfff}\=no_utf_check
\x{110000}
\x{d800}\x{1234}
XX\x{d800}
XX\x{d800}\=no_utf_check
XX\x{da00}
XX\x{da00}\=no_utf_check
XX\x{dc00}
XX\x{dc00}\=no_utf_check
XX\x{de00}
XX\x{de00}\=no_utf_check
XX\x{dfff}
XX\x{dfff}\=no_utf_check
XX\x{110000}
XX\x{d800}\x{1234}
/(*UTF16)\x{11234}/
abcd\x{11234}pqr

testdata/testoutput10

@ -73,142 +73,142 @@ Failed: error -3 at offset 0: UTF-8 error: 1 byte missing at end
Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
/badutf/utf
\xdf
Failed: error -3: UTF-8 error: 1 byte missing at end
\xef
Failed: error -4: UTF-8 error: 2 bytes missing at end
\xef\x80
Failed: error -3: UTF-8 error: 1 byte missing at end
\xf7
Failed: error -5: UTF-8 error: 3 bytes missing at end
\xf7\x80
Failed: error -4: UTF-8 error: 2 bytes missing at end
\xf7\x80\x80
Failed: error -3: UTF-8 error: 1 byte missing at end
X\xdf
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 1
XX\xef
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
XXX\xef\x80
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
X\xf7
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 1
XX\xf7\x80
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
XXX\xf7\x80\x80
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
\xfb
Failed: error -6: UTF-8 error: 4 bytes missing at end
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfb\x80
Failed: error -5: UTF-8 error: 3 bytes missing at end
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfb\x80\x80
Failed: error -4: UTF-8 error: 2 bytes missing at end
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfb\x80\x80\x80
Failed: error -3: UTF-8 error: 1 byte missing at end
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfd
Failed: error -7: UTF-8 error: 5 bytes missing at end
Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
\xfd\x80
Failed: error -6: UTF-8 error: 4 bytes missing at end
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfd\x80\x80
Failed: error -5: UTF-8 error: 3 bytes missing at end
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfd\x80\x80\x80
Failed: error -4: UTF-8 error: 2 bytes missing at end
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfd\x80\x80\x80\x80
Failed: error -3: UTF-8 error: 1 byte missing at end
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xdf\x7f
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xef\x7f\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xef\x80\x7f
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xf7\x7f\x80\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xf7\x80\x7f\x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xf7\x80\x80\x7f
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfb\x7f\x80\x80\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xfb\x80\x7f\x80\x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xfb\x80\x80\x7f\x80
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfb\x80\x80\x80\x7f
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
\xfd\x7f\x80\x80\x80\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xfd\x80\x7f\x80\x80\x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xfd\x80\x80\x7f\x80\x80
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfd\x80\x80\x80\x7f\x80
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
\xfd\x80\x80\x80\x80\x7f
Failed: error -12: UTF-8 error: byte 6 top bits not 0x80
Failed: error -12: UTF-8 error: byte 6 top bits not 0x80 at offset 0
\xed\xa0\x80
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\xc0\x8f
Failed: error -17: UTF-8 error: overlong 2-byte sequence
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 0
\xe0\x80\x8f
Failed: error -18: UTF-8 error: overlong 3-byte sequence
Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 0
\xf0\x80\x80\x8f
Failed: error -19: UTF-8 error: overlong 4-byte sequence
Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
\xf8\x80\x80\x80\x8f
Failed: error -20: UTF-8 error: overlong 5-byte sequence
Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
\xfc\x80\x80\x80\x80\x8f
Failed: error -21: UTF-8 error: overlong 6-byte sequence
Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
\x80
Failed: error -22: UTF-8 error: isolated 0x80 byte
Failed: error -22: UTF-8 error: isolated 0x80 byte at offset 0
\xfe
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xff
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
/badutf/utf
\xfb\x80\x80\x80\x80
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
\xfd\x80\x80\x80\x80\x80
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
\xf7\xbf\xbf\xbf
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
XX\xfb\x80\x80\x80\x80
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 2
XX\xfd\x80\x80\x80\x80\x80
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 2
XX\xf7\xbf\xbf\xbf
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 2
/shortutf/utf
\xdf\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end
\xef\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end
\xef\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end
XX\xdf\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
XX\xef\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
XX\xef\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
\xf7\=ph
Failed: error -5: UTF-8 error: 3 bytes missing at end
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xf7\x80\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xf7\x80\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfb\=ph
Failed: error -6: UTF-8 error: 4 bytes missing at end
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfb\x80\=ph
Failed: error -5: UTF-8 error: 3 bytes missing at end
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfb\x80\x80\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfb\x80\x80\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfd\=ph
Failed: error -7: UTF-8 error: 5 bytes missing at end
Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
\xfd\x80\=ph
Failed: error -6: UTF-8 error: 4 bytes missing at end
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfd\x80\x80\=ph
Failed: error -5: UTF-8 error: 3 bytes missing at end
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfd\x80\x80\x80\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfd\x80\x80\x80\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
/anything/utf
\xc0\x80
Failed: error -17: UTF-8 error: overlong 2-byte sequence
\xc1\x8f
Failed: error -17: UTF-8 error: overlong 2-byte sequence
\xe0\x9f\x80
Failed: error -18: UTF-8 error: overlong 3-byte sequence
X\xc0\x80
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 1
XX\xc1\x8f
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 2
XXX\xe0\x9f\x80
Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 3
\xf0\x8f\x80\x80
Failed: error -19: UTF-8 error: overlong 4-byte sequence
Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
\xf8\x87\x80\x80\x80
Failed: error -20: UTF-8 error: overlong 5-byte sequence
Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
\xfc\x83\x80\x80\x80\x80
Failed: error -21: UTF-8 error: overlong 6-byte sequence
Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
\xfe\x80\x80\x80\x80\x80
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xff\x80\x80\x80\x80\x80
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xc3\x8f
No match
\xe0\xaf\x80
@ -220,13 +220,13 @@ No match
\xf1\x8f\x80\x80
No match
\xf8\x88\x80\x80\x80
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\xf9\x87\x80\x80\x80
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\xfc\x84\x80\x80\x80\x80
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\xfd\x83\x80\x80\x80\x80
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\xf8\x88\x80\x80\x80\=no_utf_check
No match
\xf9\x87\x80\x80\x80\=no_utf_check
@ -751,27 +751,27 @@ Failed: error 106 at offset 15: missing terminating ] for character class
/X/utf
\x{d800}
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{d800}\=no_utf_check
No match
\x{da00}
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{da00}\=no_utf_check
No match
\x{dfff}
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{dfff}\=no_utf_check
No match
\x{110000}
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 0
\x{110000}\=no_utf_check
No match
\x{2000000}
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\x{2000000}\=no_utf_check
No match
\x{7fffffff}
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\x{7fffffff}\=no_utf_check
No match
@ -1106,7 +1106,7 @@ Subject length lower bound = 1
\x{ff000041}
** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8
\x{7f000041}
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
/(*UTF8)abc/never_utf
Failed: error 174 at offset 7: using UTF is disabled by the application


@ -607,30 +607,30 @@ Subject length lower bound = 2
Failed: error 106 at offset 13: missing terminating ] for character class
/X/utf
\x{d800}
Failed: error -24: UTF-16 error: missing low surrogate at end
\x{d800}\=no_utf_check
No match
\x{da00}
Failed: error -24: UTF-16 error: missing low surrogate at end
\x{da00}\=no_utf_check
No match
\x{dc00}
Failed: error -26: UTF-16 error: isolated low surrogate
\x{dc00}\=no_utf_check
No match
\x{de00}
Failed: error -26: UTF-16 error: isolated low surrogate
\x{de00}\=no_utf_check
No match
\x{dfff}
Failed: error -26: UTF-16 error: isolated low surrogate
\x{dfff}\=no_utf_check
No match
\x{110000}
XX\x{d800}
Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
XX\x{d800}\=no_utf_check
0: X
XX\x{da00}
Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
XX\x{da00}\=no_utf_check
0: X
XX\x{dc00}
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
XX\x{dc00}\=no_utf_check
0: X
XX\x{de00}
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
XX\x{de00}\=no_utf_check
0: X
XX\x{dfff}
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
XX\x{dfff}\=no_utf_check
0: X
XX\x{110000}
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
\x{d800}\x{1234}
Failed: error -25: UTF-16 error: invalid low surrogate
XX\x{d800}\x{1234}
Failed: error -25: UTF-16 error: invalid low surrogate at offset 3
/(*UTF16)\x{11234}/
abcd\x{11234}pqr


@ -600,30 +600,30 @@ Subject length lower bound = 2
Failed: error 106 at offset 13: missing terminating ] for character class
/X/utf
\x{d800}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
\x{d800}\=no_utf_check
No match
\x{da00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
\x{da00}\=no_utf_check
No match
\x{dc00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
\x{dc00}\=no_utf_check
No match
\x{de00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
\x{de00}\=no_utf_check
No match
\x{dfff}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
\x{dfff}\=no_utf_check
No match
\x{110000}
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
\x{d800}\x{1234}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
XX\x{d800}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
XX\x{d800}\=no_utf_check
0: X
XX\x{da00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
XX\x{da00}\=no_utf_check
0: X
XX\x{dc00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
XX\x{dc00}\=no_utf_check
0: X
XX\x{de00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
XX\x{de00}\=no_utf_check
0: X
XX\x{dfff}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
XX\x{dfff}\=no_utf_check
0: X
XX\x{110000}
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 2
XX\x{d800}\x{1234}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
/(*UTF16)\x{11234}/
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
@ -1113,7 +1113,7 @@ Failed: error 134 at offset 10: character code point value in \x{} or \o{} is to
/\C/utf
\x{110000}
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
/\x{100}*A/IB,utf
------------------------------------------------------------------