More documentation and test updates.
parent
eb4fffbbf4
commit
91f2e97474
|
@ -148,7 +148,7 @@ listing), and the short pages for individual functions, are concatenated in
|
|||
pcre2limits details of size and other limits
|
||||
pcre2matching discussion of the two matching algorithms
|
||||
pcre2partial details of the partial matching facility
|
||||
pcre2pattern syntax and semantics of supported regular expression patterns
|
||||
pcre2pattern syntax and semantics of supported regular expression patterns
|
||||
pcre2perform discussion of performance issues
|
||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||
pcre2sample discussion of the pcre2demo program
|
||||
|
|
|
@ -17,9 +17,9 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
|
||||
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
|
||||
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
|
||||
<li><a name="TOC5" href="#SEC5">Unicode and UTF SUPPORT</a>
|
||||
<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
|
||||
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
|
||||
<li><a name="TOC7" href="#SEC7">CODE VALUE OF NEWLINE</a>
|
||||
<li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
|
||||
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
|
||||
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
|
||||
|
@ -91,12 +91,12 @@ respectively. These can be interpreted either as single-unit characters or
|
|||
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
|
||||
the following to the <b>configure</b> command:
|
||||
<pre>
|
||||
--enable-pcre16
|
||||
--enable-pcre32
|
||||
--enable-pcre2-16
|
||||
--enable-pcre2-32
|
||||
</pre>
|
||||
If you do not want the 8-bit library, add
|
||||
<pre>
|
||||
--disable-pcre8
|
||||
--disable-pcre2-8
|
||||
</pre>
|
||||
as well. At least one of the three libraries must be built. Note that the POSIX
|
||||
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
|
||||
|
@ -106,14 +106,15 @@ libraries.
|
|||
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
|
||||
<P>
|
||||
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
|
||||
and static libraries by default. You can suppress one of these by adding one of
|
||||
and static libraries by default. You can suppress an unwanted library by adding
|
||||
one of
|
||||
<pre>
|
||||
--disable-shared
|
||||
--disable-static
|
||||
</pre>
|
||||
to the <b>configure</b> command, as required.
|
||||
to the <b>configure</b> command.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">Unicode and UTF SUPPORT</a><br>
|
||||
<br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
|
||||
<P>
|
||||
By default, PCRE2 is built with support for Unicode and UTF character strings.
|
||||
To build it without Unicode support, add
|
||||
|
@ -126,20 +127,15 @@ in the same configuration.
|
|||
</P>
|
||||
<P>
|
||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
||||
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
|
||||
<b>pcre2_compile()</b> to compile a pattern.
|
||||
or UTF-32. To do that, applications that use the library have to set the
|
||||
PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
||||
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
||||
exclusive.
|
||||
</P>
|
||||
<P>
|
||||
UTF support allows the libraries to process character codepoints up to 0x10ffff
|
||||
in the strings that they handle. It also provides support for accessing the
|
||||
properties of such characters, using pattern escapes such as \P, \p, and \X.
|
||||
Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
|
||||
supported. Details are given in the
|
||||
UTF support allows the libraries to process character code points up to
|
||||
0x10ffff in the strings that they handle. It also provides support for
|
||||
accessing the Unicode properties of such characters, using pattern escapes such
|
||||
as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
|
||||
<i>Nd</i> are supported. Details are given in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
|
@ -150,7 +146,7 @@ Just-in-time compiler support is included in the build by specifying
|
|||
--enable-jit
|
||||
</pre>
|
||||
This support is available only for certain hardware architectures. If this
|
||||
option is set for an unsupported architecture, a compile time error occurs.
|
||||
option is set for an unsupported architecture, a building error occurs.
|
||||
See the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for a discussion of JIT usage. When JIT support is enabled,
|
||||
|
@ -160,7 +156,7 @@ pcre2grep automatically makes use of it, unless you add
|
|||
</pre>
|
||||
to the "configure" command.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
|
||||
<br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
|
||||
<P>
|
||||
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
||||
of a line. This is the normal newline character on Unix-like systems. You can
|
||||
|
@ -168,12 +164,13 @@ compile PCRE2 to use carriage return (CR) instead, by adding
|
|||
<pre>
|
||||
--enable-newline-is-cr
|
||||
</pre>
|
||||
to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
|
||||
to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
|
||||
which explicitly specifies linefeed as the newline character.
|
||||
<br>
|
||||
<br>
|
||||
Alternatively, you can specify that line endings are to be indicated by the two
|
||||
character sequence CRLF. If you want this, add
|
||||
</P>
|
||||
<P>
|
||||
Alternatively, you can specify that line endings are to be indicated by the
|
||||
two-character sequence CRLF (CR immediately followed by LF). If you want this,
|
||||
add
|
||||
<pre>
|
||||
--enable-newline-is-crlf
|
||||
</pre>
|
||||
|
@ -186,22 +183,26 @@ indicating a line ending. Finally, a fifth option, specified by
|
|||
<pre>
|
||||
--enable-newline-is-any
|
||||
</pre>
|
||||
causes PCRE2 to recognize any Unicode newline sequence.
|
||||
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
|
||||
sequences are the three just mentioned, plus the single characters VT (vertical
|
||||
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
|
||||
separator, U+2028), and PS (paragraph separator, U+2029).
|
||||
</P>
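The list of Unicode newline sequences above can be expressed as a small classifier. The following is a minimal sketch only: the function name `any_newline_length` is hypothetical, it is not part of the PCRE2 API, and it is not how PCRE2 itself is implemented. It assumes the subject is already available as an array of code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): returns the length, in code
   points, of the newline sequence that starts at s[0] under the "any Unicode
   newline" convention, or 0 if s[0] does not start a newline sequence. */
size_t any_newline_length(const uint32_t *s, size_t remaining)
{
    if (remaining == 0) return 0;
    switch (s[0]) {
    case 0x000D:  /* CR, or CRLF when immediately followed by LF */
        return (remaining > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A:  /* LF */
    case 0x000B:  /* VT, vertical tab */
    case 0x000C:  /* FF, form feed */
    case 0x0085:  /* NEL, next line */
    case 0x2028:  /* LS, line separator */
    case 0x2029:  /* PS, paragraph separator */
        return 1;
    default:
        return 0;
    }
}
```

CR is checked for a following LF first so that CRLF counts as one two-character line ending rather than two separate ones.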
|
||||
<P>
|
||||
Whatever line ending convention is selected when PCRE2 is built can be
|
||||
overridden when the library functions are called. At build time it is
|
||||
Whatever default line ending convention is selected when PCRE2 is built can be
|
||||
overridden by applications that use the library. At build time it is
|
||||
conventional to use the standard for your operating system.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||
<P>
|
||||
By default, the sequence \R in a pattern matches any Unicode newline sequence,
|
||||
whatever has been selected as the line ending sequence. If you specify
|
||||
independently of what has been selected as the line ending sequence. If you
|
||||
specify
|
||||
<pre>
|
||||
--enable-bsr-anycrlf
|
||||
</pre>
|
||||
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
|
||||
selected when PCRE2 is built can be overridden when the library functions are
|
||||
selected when PCRE2 is built can be overridden by applications when the library functions are
|
||||
called.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
||||
|
@ -210,10 +211,10 @@ Within a compiled pattern, offset values are used to point from one part to
|
|||
another (for example, from an opening parenthesis to an alternation
|
||||
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
|
||||
are used for these offsets, leading to a maximum size for a compiled pattern of
|
||||
around 64K. This is sufficient to handle all but the most gigantic patterns.
|
||||
Nevertheless, some people do want to process truly enormous patterns, so it is
|
||||
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
|
||||
setting such as
|
||||
around 64K code units. This is sufficient to handle all but the most gigantic
|
||||
patterns. Nevertheless, some people do want to process truly enormous patterns,
|
||||
so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
|
||||
adding a setting such as
|
||||
<pre>
|
||||
--with-link-size=3
|
||||
</pre>
|
||||
|
@ -294,16 +295,20 @@ hand".)
|
|||
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
|
||||
<P>
|
||||
PCRE2 assumes by default that it will run in an environment where the character
|
||||
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
|
||||
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
||||
most computer operating systems. PCRE2 can, however, be compiled to run in an
|
||||
EBCDIC environment by adding
|
||||
8-bit EBCDIC environment by adding
|
||||
<pre>
|
||||
--enable-ebcdic --disable-unicode
|
||||
</pre>
|
||||
to the <b>configure</b> command. This setting implies
|
||||
--enable-rebuild-chartables. You should only use it if you know that you are in
|
||||
an EBCDIC environment (for example, an IBM mainframe operating system). The
|
||||
--enable-ebcdic option is incompatible with Unicode support.
|
||||
an EBCDIC environment (for example, an IBM mainframe operating system).
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
||||
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
||||
exclusive.
|
||||
</P>
|
||||
<P>
|
||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
|
||||
|
@ -347,8 +352,8 @@ parameter value by adding, for example,
|
|||
<pre>
|
||||
--with-pcre2grep-bufsize=50K
|
||||
</pre>
|
||||
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can, however,
|
||||
override this value by specifying a run-time option.
|
||||
to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override this
|
||||
value by using --buffer-size on the command line.
|
||||
</P>
|
||||
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
||||
<P>
|
||||
|
@ -362,16 +367,16 @@ to the <b>configure</b> command, <b>pcre2test</b> is linked with the
|
|||
from a terminal, it reads it using the <b>readline()</b> function. This provides
|
||||
line-editing and history facilities. Note that <b>libreadline</b> is
|
||||
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
|
||||
way, there may be licensing issues. These can be avoided by linking with
|
||||
<b>libedit</b> (which has a BSD licence) instead.
|
||||
way, there may be licensing issues. These can be avoided by linking instead
|
||||
with <b>libedit</b>, which has a BSD licence.
|
||||
</P>
|
||||
<P>
|
||||
Setting this option causes the <b>-lreadline</b> option to be added to the
|
||||
<b>pcre2test</b> build. In many operating environments with a system-installed
|
||||
readline library this is sufficient. However, in some environments (e.g. if an
|
||||
unmodified distribution version of readline is in use), some extra
|
||||
configuration may be necessary. The INSTALL file for <b>libreadline</b> says
|
||||
this:
|
||||
Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
|
||||
added to the <b>pcre2test</b> build. In many operating environments with a
|
||||
system-installed readline library this is sufficient. However, in some
|
||||
environments (e.g. if an unmodified distribution version of readline is in
|
||||
use), some extra configuration may be necessary. The INSTALL file for
|
||||
<b>libreadline</b> says this:
|
||||
<pre>
|
||||
"Readline uses the termcap functions, but does not link with
|
||||
the termcap or curses library itself, allowing applications
|
||||
|
@ -386,13 +391,13 @@ immediately before the <b>configure</b> command.
|
|||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<P>
|
||||
By adding the
|
||||
If you add
|
||||
<pre>
|
||||
--enable-valgrind
|
||||
</pre>
|
||||
option to to the <b>configure</b> command, PCRE2 will use valgrind annotations
|
||||
to mark certain memory regions as unaddressable. This allows it to detect
|
||||
invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
|
||||
certain memory regions as unaddressable. This allows it to detect invalid
|
||||
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<P>
|
||||
|
@ -466,7 +471,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 23 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -85,29 +85,27 @@ expect.
|
|||
<P>
|
||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is anchored
|
||||
and then applied with automatic callouts to the string "aaaa" is:
|
||||
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
|
||||
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
||||
"aaaa" is:
|
||||
<pre>
|
||||
--->aaaa
|
||||
+0 ^ ^
|
||||
+1 ^ a+
|
||||
+3 ^ ^ [bc]
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
No match
|
||||
</pre>
|
||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||
and therefore the callouts that would be taken for the backtracks do not occur.
|
||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
|
||||
to <b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). If
|
||||
this is done in <b>pcre2test</b> (using the /no_auto_possess qualifier), the
|
||||
output changes to this:
|
||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||
<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
|
||||
case, the output changes to this:
|
||||
<pre>
|
||||
--->aaaa
|
||||
+0 ^ ^
|
||||
+1 ^ a+
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^^ [bc]
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^^ [bc]
|
||||
No match
|
||||
</pre>
|
||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||
|
@ -137,10 +135,10 @@ callouts such as the example above are obeyed.
|
|||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
|
||||
<P>
|
||||
During matching, when PCRE2 reaches a callout point, the external function that
|
||||
is set in the match context is called (if it is set). This applies to both
|
||||
normal and DFA matching. The only argument to the callout function is a pointer
|
||||
to a <b>pcre2_callout</b> block. This structure contains the following fields:
|
||||
During matching, when PCRE2 reaches a callout point, if an external function is
|
||||
set in the match context, it is called. This applies to both normal and DFA
|
||||
matching. The only argument to the callout function is a pointer to a
|
||||
<b>pcre2_callout</b> block. This structure contains the following fields:
|
||||
<pre>
|
||||
uint32_t <i>version</i>;
|
||||
uint32_t <i>callout_number</i>;
|
||||
|
@ -169,7 +167,7 @@ automatically generated callouts).
|
|||
<P>
|
||||
The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
|
||||
(the "ovector") that was passed to the matching function in the match data
|
||||
block. When <b>pcre2_match()</b> is used, the contents can be inspected, in
|
||||
block. When <b>pcre2_match()</b> is used, the contents can be inspected in
|
||||
order to extract substrings that have been matched so far, in the same way as
|
||||
for extracting substrings after a match has completed. For the DFA matching
|
||||
function, this field is not useful.
|
||||
|
@ -261,7 +259,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 19 October 2014
|
||||
Last updated: 23 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -467,8 +467,8 @@ used. There is no short form for this option.
|
|||
Processing some regular expression patterns can require a very large amount of
|
||||
memory, leading in some cases to a program crash if not enough is available.
|
||||
Other patterns may take a very long time to search for all possible matching
|
||||
strings. The <b>pcre2_exec()</b> function that is called by <b>pcre2grep</b> to do
|
||||
the matching has two parameters that can limit the resources that it uses.
|
||||
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
|
||||
do the matching has two parameters that can limit the resources that it uses.
|
||||
<br>
|
||||
<br>
|
||||
The <b>--match-limit</b> option provides a means of limiting resource usage
|
||||
|
@ -750,7 +750,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 September 2014
|
||||
Last updated: 23 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -31,11 +31,11 @@ please consult the man page, in case the conversion went wrong.
|
|||
<P>
|
||||
Just-in-time compiling is a heavyweight optimization that can greatly speed up
|
||||
pattern matching. However, it comes at the cost of extra processing before the
|
||||
match is performed. Therefore, it is of most benefit when the same pattern is
|
||||
going to be matched many times. This does not necessarily mean many calls of a
|
||||
matching function; if the pattern is not anchored, matching attempts may take
|
||||
place many times at various positions in the subject, even for a single call.
|
||||
Therefore, if the subject string is very long, it may still pay to use JIT for
|
||||
match is performed, so it is of most benefit when the same pattern is going to
|
||||
be matched many times. This does not necessarily mean many calls of a matching
|
||||
function; if the pattern is not anchored, matching attempts may take place many
|
||||
times at various positions in the subject, even for a single call. Therefore,
|
||||
if the subject string is very long, it may still pay to use JIT even for
|
||||
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
|
||||
32-bit PCRE2 libraries.
|
||||
</P>
|
||||
|
@ -103,7 +103,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
|
|||
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
||||
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
||||
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
|
||||
returns zero. This is an alternative way of testing if JIT is available.
|
||||
returns zero. This is an alternative way of testing whether JIT is available.
|
||||
</P>
|
||||
<P>
|
||||
At present, it is not possible to free JIT compiled code except when the entire
|
||||
|
@ -299,7 +299,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just \fIdo
|
|||
not\fP call <b>pcre2_match()</b> with a match context pointing to an already
|
||||
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
|
||||
used by <b>pcre2_match()</b> in another thread). You can also replace the stack
|
||||
in a context at any time when it is not in use. You can also free the previous
|
||||
in a context at any time when it is not in use. You should free the previous
|
||||
stack before assigning a replacement.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -418,7 +418,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 12 November 2014
|
||||
Last updated: 23 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -421,7 +421,7 @@ appear.
|
|||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
</pre>
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_exec(), not increase them.
|
||||
limits set by the caller of pcre2_match(), not increase them.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||
<P>
|
||||
|
@ -553,7 +553,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 14 November 2014
|
||||
Last updated: 23 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -72,7 +72,7 @@ but its use can lead to some strange effects because it breaks up multi-unit
|
|||
characters (see the description of \C in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation). The use of \C is not supported in the alternative matching
|
||||
function <b>pcre2_dfa_exec()</b>, nor is it supported in UTF mode by the JIT
|
||||
function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
|
||||
optimization. If JIT optimization is requested for a UTF pattern that contains
|
||||
\C, it will not succeed, and so the matching will be carried out by the normal
|
||||
interpretive function.
|
||||
|
@ -141,15 +141,15 @@ UTF-32.)
|
|||
In some situations, you may already know that your strings are valid, and
|
||||
therefore want to skip these checks in order to improve performance, for
|
||||
example in the case of a long subject string that is being scanned repeatedly.
|
||||
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
|
||||
assumes that the pattern or subject it is given (respectively) contains only
|
||||
valid UTF code unit sequences.
|
||||
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||
only valid UTF code unit sequences.
|
||||
</P>
|
||||
<P>
|
||||
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
||||
the pattern; it does not also apply to subject strings. If you want to disable
|
||||
the check for a subject string you must pass this option to <b>pcre2_exec()</b>
|
||||
or <b>pcre2_dfa_exec()</b>.
|
||||
the check for a subject string you must pass this option to <b>pcre2_match()</b>
|
||||
or <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||
|
@ -261,7 +261,7 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 23 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
338
doc/pcre2.txt
|
@ -2667,12 +2667,12 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
|||
or UTF-16/UTF-32 strings. To build these additional libraries, add one
|
||||
or both of the following to the configure command:
|
||||
|
||||
--enable-pcre16
|
||||
--enable-pcre32
|
||||
--enable-pcre2-16
|
||||
--enable-pcre2-32
|
||||
|
||||
If you do not want the 8-bit library, add
|
||||
|
||||
--disable-pcre8
|
||||
--disable-pcre2-8
|
||||
|
||||
as well. At least one of the three libraries must be built. Note that
|
||||
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
|
||||
|
@ -2683,16 +2683,16 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
|||
BUILDING SHARED AND STATIC LIBRARIES
|
||||
|
||||
The Autotools PCRE2 building process uses libtool to build both shared
|
||||
and static libraries by default. You can suppress one of these by
|
||||
adding one of
|
||||
and static libraries by default. You can suppress an unwanted library
|
||||
by adding one of
|
||||
|
||||
--disable-shared
|
||||
--disable-static
|
||||
|
||||
to the configure command, as required.
|
||||
to the configure command.
|
||||
|
||||
|
||||
Unicode and UTF SUPPORT
|
||||
UNICODE AND UTF SUPPORT
|
||||
|
||||
By default, PCRE2 is built with support for Unicode and UTF character
|
||||
strings. To build it without Unicode support, add
|
||||
|
@ -2704,18 +2704,16 @@ Unicode and UTF SUPPORT
|
|||
another without, in the same configuration.
|
||||
|
||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
|
||||
UTF-16 or UTF-32. To do that you have have to set the PCRE2_UTF option
|
||||
when you call pcre2_compile() to compile a pattern.
|
||||
UTF-16 or UTF-32. To do that, applications that use the library have to
|
||||
set the PCRE2_UTF option when they call pcre2_compile() to compile a
|
||||
pattern.
|
||||
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
||||
version of the library. Consequently, --enable-unicode and --enable-
|
||||
ebcdic are mutually exclusive.
|
||||
|
||||
UTF support allows the libraries to process character codepoints up to
|
||||
0x10ffff in the strings that they handle. It also provides support for
|
||||
accessing the properties of such characters, using pattern escapes such
|
||||
as \P, \p, and \X. Only the general category properties such as Lu and
|
||||
Nd are supported. Details are given in the pcre2pattern documentation.
|
||||
UTF support allows the libraries to process character code points up to
|
||||
0x10ffff in the strings that they handle. It also provides support for
|
||||
accessing the Unicode properties of such characters, using pattern
|
||||
escapes such as \P, \p, and \X. Only the general category properties
|
||||
such as Lu and Nd are supported. Details are given in the pcre2pattern
|
||||
documentation.
|
||||
|
||||
|
||||
JUST-IN-TIME COMPILER SUPPORT
|
||||
|
@ -2725,17 +2723,17 @@ JUST-IN-TIME COMPILER SUPPORT
|
|||
--enable-jit
|
||||
|
||||
This support is available only for certain hardware architectures. If
|
||||
this option is set for an unsupported architecture, a compile time
|
||||
error occurs. See the pcre2jit documentation for a discussion of JIT
|
||||
usage. When JIT support is enabled, pcre2grep automatically makes use
|
||||
of it, unless you add
|
||||
this option is set for an unsupported architecture, a building error
|
||||
occurs. See the pcre2jit documentation for a discussion of JIT usage.
|
||||
When JIT support is enabled, pcre2grep automatically makes use of it,
|
||||
unless you add
|
||||
|
||||
--disable-pcre2grep-jit
|
||||
|
||||
to the "configure" command.
|
||||
|
||||
|
||||
CODE VALUE OF NEWLINE
|
||||
NEWLINE RECOGNITION
|
||||
|
||||
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
||||
the end of a line. This is the normal newline character on Unix-like
|
||||
|
@ -2744,11 +2742,12 @@ CODE VALUE OF NEWLINE
|
|||
|
||||
--enable-newline-is-cr
|
||||
|
||||
to the configure command. There is also a --enable-newline-is-lf
|
||||
to the configure command. There is also an --enable-newline-is-lf
|
||||
option, which explicitly specifies linefeed as the newline character.
|
||||
|
||||
Alternatively, you can specify that line endings are to be indicated by
|
||||
the two character sequence CRLF. If you want this, add
|
||||
the two-character sequence CRLF (CR immediately followed by LF). If you
|
||||
want this, add
|
||||
|
||||
--enable-newline-is-crlf
|
||||
|
||||
|
@ -2756,41 +2755,46 @@ CODE VALUE OF NEWLINE
|
|||
|
||||
--enable-newline-is-anycrlf
|
||||
|
||||
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
||||
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
||||
CRLF as indicating a line ending. Finally, a fifth option, specified by
|
||||
|
||||
--enable-newline-is-any
|
||||
|
||||
causes PCRE2 to recognize any Unicode newline sequence.
|
||||
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
||||
newline sequences are the three just mentioned, plus the single charac-
|
||||
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
|
||||
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
||||
U+2029).
|
||||
|
||||
Whatever line ending convention is selected when PCRE2 is built can be
|
||||
overridden when the library functions are called. At build time it is
|
||||
conventional to use the standard for your operating system.
|
||||
Whatever default line ending convention is selected when PCRE2 is built
|
||||
can be overridden by applications that use the library. At build time
|
||||
it is conventional to use the standard for your operating system.
|
||||
|
||||
|
||||
WHAT \R MATCHES
|
||||
|
||||
By default, the sequence \R in a pattern matches any Unicode newline
|
||||
sequence, whatever has been selected as the line ending sequence. If
|
||||
you specify
|
||||
By default, the sequence \R in a pattern matches any Unicode newline
|
||||
sequence, independently of what has been selected as the line ending
|
||||
sequence. If you specify
|
||||
|
||||
--enable-bsr-anycrlf
|
||||
|
||||
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
||||
ever is selected when PCRE2 is built can be overridden when the library
|
||||
functions are called.
|
||||
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
||||
ever is selected when PCRE2 is built can be overridden by applications
|
||||
that use the library.
|
||||
|
||||
|
||||
HANDLING VERY LARGE PATTERNS
|
||||
|
||||
Within a compiled pattern, offset values are used to point from one
|
||||
part to another (for example, from an opening parenthesis to an alter-
|
||||
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
||||
two-byte values are used for these offsets, leading to a maximum size
|
||||
for a compiled pattern of around 64K. This is sufficient to handle all
|
||||
but the most gigantic patterns. Nevertheless, some people do want to
|
||||
process truly enormous patterns, so it is possible to compile PCRE2 to
|
||||
use three-byte or four-byte offsets by adding a setting such as
|
||||
Within a compiled pattern, offset values are used to point from one
|
||||
part to another (for example, from an opening parenthesis to an alter-
|
||||
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
||||
two-byte values are used for these offsets, leading to a maximum size
|
||||
for a compiled pattern of around 64K code units. This is sufficient to
|
||||
handle all but the most gigantic patterns. Nevertheless, some people do
|
||||
want to process truly enormous patterns, so it is possible to compile
|
||||
PCRE2 to use three-byte or four-byte offsets by adding a setting such
|
||||
as
|
||||
|
||||
--with-link-size=3
|
||||
|
||||
|
@ -2876,25 +2880,28 @@ CREATING CHARACTER TABLES AT BUILD TIME
|
|||
USING EBCDIC CODE
|
||||
|
||||
PCRE2 assumes by default that it will run in an environment where the
|
||||
character code is ASCII (or Unicode, which is a superset of ASCII).
|
||||
This is the case for most computer operating systems. PCRE2 can, how-
|
||||
ever, be compiled to run in an EBCDIC environment by adding
|
||||
character code is ASCII or Unicode, which is a superset of ASCII. This
|
||||
is the case for most computer operating systems. PCRE2 can, however, be
|
||||
compiled to run in an 8-bit EBCDIC environment by adding
|
||||
|
||||
--enable-ebcdic --disable-unicode
|
||||
|
||||
to the configure command. This setting implies --enable-rebuild-charta-
|
||||
bles. You should only use it if you know that you are in an EBCDIC
|
||||
environment (for example, an IBM mainframe operating system). The
|
||||
--enable-ebcdic option is incompatible with Unicode support.
|
||||
environment (for example, an IBM mainframe operating system).
|
||||
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
||||
version of the library. Consequently, --enable-unicode and --enable-
|
||||
ebcdic are mutually exclusive.
|
||||
|
||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have
|
||||
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
||||
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
||||
is used. In such an environment you should use
|
||||
|
||||
--enable-ebcdic-nl25
|
||||
|
||||
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
|
||||
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
||||
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
||||
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
|
||||
acter (which, in Unicode, is 0x85).
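The choice between the two LF values can be summarized in a small sketch. The structure and function names here are hypothetical and exist only to restate the paragraph above; the nl25 argument stands for whether --enable-ebcdic-nl25 was given.

```c
/* Hypothetical summary of the EBCDIC newline values described above.
   None of these names exist in PCRE2. */
struct demo_ebcdic_newlines {
    unsigned char cr;   /* carriage return: same value as in ASCII */
    unsigned char lf;   /* value treated as corresponding to ASCII LF */
    unsigned char nel;  /* value treated as corresponding to Unicode NEL */
};

struct demo_ebcdic_newlines demo_ebcdic_newline_values(int nl25)
{
    struct demo_ebcdic_newlines v;
    v.cr = 0x0d;
    v.lf = nl25 ? 0x25 : 0x15;   /* 0x15 is the default LF */
    v.nel = nl25 ? 0x15 : 0x25;  /* whichever value is not LF becomes NEL */
    return v;
}
```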
|
||||
|
||||
|
@ -2905,32 +2912,32 @@ USING EBCDIC CODE
|
|||
|
||||
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
|
||||
|
||||
By default, pcre2grep reads all files as plain text. You can build it
|
||||
so that it recognizes files whose names end in .gz or .bz2, and reads
|
||||
By default, pcre2grep reads all files as plain text. You can build it
|
||||
so that it recognizes files whose names end in .gz or .bz2, and reads
|
||||
them with libz or libbz2, respectively, by adding one or both of
|
||||
|
||||
--enable-pcre2grep-libz
|
||||
--enable-pcre2grep-libbz2
|
||||
|
||||
to the configure command. These options naturally require that the rel-
|
||||
evant libraries are installed on your system. Configuration will fail
|
||||
evant libraries are installed on your system. Configuration will fail
|
||||
if they are not.
|
||||
|
||||
|
||||
PCRE2GREP BUFFER SIZE
|
||||
|
||||
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
||||
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
||||
scanning, in order to be able to output "before" and "after" lines when
|
||||
it finds a match. The size of the buffer is controlled by a parameter
|
||||
it finds a match. The size of the buffer is controlled by a parameter
|
||||
whose default value is 20K. The buffer itself is three times this size,
|
||||
but because of the way it is used for holding "before" lines, the long-
|
||||
est line that is guaranteed to be processable is the parameter size.
|
||||
est line that is guaranteed to be processable is the parameter size.
|
||||
You can change the default parameter value by adding, for example,
|
||||
|
||||
--with-pcre2grep-bufsize=50K
|
||||
|
||||
to the configure command. The caller of pcre2grep can, however, over-
|
||||
ride this value by specifying a run-time option.
|
||||
to the configure command. The caller of pcre2grep can override this
|
||||
value by using --buffer-size on the command line.
|
||||
|
||||
|
||||
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
||||
|
@ -2940,26 +2947,26 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
|||
--enable-pcre2test-libreadline
|
||||
--enable-pcre2test-libedit
|
||||
|
||||
to the configure command, pcre2test is linked with the libreadline
|
||||
to the configure command, pcre2test is linked with the libreadline
|
||||
or libedit library, respectively, and when its input is from a terminal,
|
||||
it reads it using the readline() function. This provides line-editing
|
||||
and history facilities. Note that libreadline is GPL-licensed, so if
|
||||
you distribute a binary of pcre2test linked in this way, there may be
|
||||
licensing issues. These can be avoided by linking with libedit (which
|
||||
has a BSD licence) instead.
|
||||
it reads it using the readline() function. This provides line-editing
|
||||
and history facilities. Note that libreadline is GPL-licensed, so if
|
||||
you distribute a binary of pcre2test linked in this way, there may be
|
||||
licensing issues. These can be avoided by linking instead with libedit,
|
||||
which has a BSD licence.
|
||||
|
||||
Setting this option causes the -lreadline option to be added to the
|
||||
pcre2test build. In many operating environments with a system-installed
|
||||
readline library this is sufficient. However, in some environments
|
||||
(e.g. if an unmodified distribution version of readline is in use),
|
||||
some extra configuration may be necessary. The INSTALL file for
|
||||
libreadline says this:
|
||||
Setting --enable-pcre2test-libreadline causes the -lreadline option to
|
||||
be added to the pcre2test build. In many operating environments with a
|
||||
system-installed readline library this is sufficient. However, in some
|
||||
environments (e.g. if an unmodified distribution version of readline is
|
||||
in use), some extra configuration may be necessary. The INSTALL file
|
||||
for libreadline says this:
|
||||
|
||||
"Readline uses the termcap functions, but does not link with
|
||||
the termcap or curses library itself, allowing applications
|
||||
which link with readline the to choose an appropriate library."
|
||||
|
||||
If your environment has not been set up so that an appropriate library
|
||||
If your environment has not been set up so that an appropriate library
|
||||
is automatically included, you may need to add something like
|
||||
|
||||
LIBS="-lncurses"
|
||||
|
@ -2969,19 +2976,19 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
|||
|
||||
DEBUGGING WITH VALGRIND SUPPORT
|
||||
|
||||
By adding the
|
||||
If you add
|
||||
|
||||
--enable-valgrind
|
||||
|
||||
option to to the configure command, PCRE2 will use valgrind annotations
|
||||
to mark certain memory regions as unaddressable. This allows it to
|
||||
detect invalid memory accesses, and is mostly useful for debugging
|
||||
PCRE2 itself.
|
||||
to the configure command, PCRE2 will use valgrind annotations to mark
|
||||
certain memory regions as unaddressable. This allows it to detect
|
||||
invalid memory accesses, and is mostly useful for debugging PCRE2
|
||||
itself.
|
||||
|
||||
|
||||
CODE COVERAGE REPORTING
|
||||
|
||||
If your C compiler is gcc, you can build a version of PCRE2 that can
|
||||
If your C compiler is gcc, you can build a version of PCRE2 that can
|
||||
generate a code coverage report for its test suite. To enable this, you
|
||||
must install lcov version 1.6 or above. Then specify
|
||||
|
||||
|
@ -2990,20 +2997,20 @@ CODE COVERAGE REPORTING
|
|||
to the configure command and build PCRE2 in the usual way.
|
||||
|
||||
Note that using ccache (a caching C compiler) is incompatible with code
|
||||
coverage reporting. If you have configured ccache to run automatically
|
||||
coverage reporting. If you have configured ccache to run automatically
|
||||
on your system, you must set the environment variable
|
||||
|
||||
CCACHE_DISABLE=1
|
||||
|
||||
before running make to build PCRE2, so that ccache is not used.
|
||||
|
||||
When --enable-coverage is used, the following additional targets are
|
||||
When --enable-coverage is used, the following additional targets are
|
||||
added to the Makefile:
|
||||
|
||||
make coverage
|
||||
|
||||
This creates a fresh coverage report for the PCRE2 test suite. It is
|
||||
equivalent to running "make coverage-reset", "make coverage-baseline",
|
||||
This creates a fresh coverage report for the PCRE2 test suite. It is
|
||||
equivalent to running "make coverage-reset", "make coverage-baseline",
|
||||
"make check", and then "make coverage-report".
|
||||
|
||||
make coverage-reset
|
||||
|
@ -3020,18 +3027,18 @@ CODE COVERAGE REPORTING
|
|||
|
||||
make coverage-clean-report
|
||||
|
||||
This removes the generated coverage report without cleaning the cover-
|
||||
This removes the generated coverage report without cleaning the cover-
|
||||
age data itself.
|
||||
|
||||
make coverage-clean-data
|
||||
|
||||
This removes the captured coverage data without removing the coverage
|
||||
This removes the captured coverage data without removing the coverage
|
||||
files created at compile time (*.gcno).
|
||||
|
||||
make coverage-clean
|
||||
|
||||
This cleans all coverage data including the generated coverage report.
|
||||
For more information about code coverage, see the gcov and lcov docu-
|
||||
This cleans all coverage data including the generated coverage report.
|
||||
For more information about code coverage, see the gcov and lcov docu-
|
||||
mentation.
|
||||
|
||||
|
||||
|
@ -3049,7 +3056,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -3122,62 +3129,59 @@ MISSING CALLOUTS
|
|||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
||||
that what follows cannot be part of the repeat. For example, a+[bc] is
|
||||
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
||||
is anchored and then applied with automatic callouts to the string
|
||||
"aaaa" is:
|
||||
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
|
||||
to the string "aaaa" is:
|
||||
|
||||
--->aaaa
|
||||
+0 ^ ^
|
||||
+1 ^ a+
|
||||
+3 ^ ^ [bc]
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
No match
|
||||
|
||||
This indicates that when matching [bc] fails, there is no backtracking
|
||||
into a+ and therefore the callouts that would be taken for the back-
|
||||
tracks do not occur. You can disable the auto-possessify feature by
|
||||
passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
|
||||
tern with (*NO_AUTO_POSSESS). If this is done in pcre2test (using the
|
||||
/no_auto_possess qualifier), the output changes to this:
|
||||
tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
|
||||
|
||||
--->aaaa
|
||||
+0 ^ ^
|
||||
+1 ^ a+
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^^ [bc]
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^^ [bc]
|
||||
No match
|
||||
|
||||
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
||||
tries again, repeatedly, until a+ itself fails.
|
||||
|
||||
Other optimizations that provide fast "no match" results also affect
|
||||
Other optimizations that provide fast "no match" results also affect
|
||||
callouts. For example, if the pattern is
|
||||
|
||||
ab(?C4)cd
|
||||
|
||||
PCRE2 knows that any matching string must contain the letter "d". If
|
||||
the subject string is "abyz", the lack of "d" means that matching
|
||||
doesn't ever start, and the callout is never reached. However, with
|
||||
PCRE2 knows that any matching string must contain the letter "d". If
|
||||
the subject string is "abyz", the lack of "d" means that matching
|
||||
doesn't ever start, and the callout is never reached. However, with
|
||||
"abyd", though the result is still no match, the callout is obeyed.
|
||||
|
||||
PCRE2 also knows the minimum length of a matching string, and will
|
||||
immediately give a "no match" return without actually running a match
|
||||
if the subject is not long enough, or, for unanchored patterns, if it
|
||||
PCRE2 also knows the minimum length of a matching string, and will
|
||||
immediately give a "no match" return without actually running a match
|
||||
if the subject is not long enough, or, for unanchored patterns, if it
|
||||
has been scanned far enough.
|
||||
|
||||
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
|
||||
MIZE option to pcre2_compile(), or by starting the pattern with
|
||||
(*NO_START_OPT). This slows down the matching process, but does ensure
|
||||
MIZE option to pcre2_compile(), or by starting the pattern with
|
||||
(*NO_START_OPT). This slows down the matching process, but does ensure
|
||||
that callouts such as the example above are obeyed.
|
||||
|
||||
|
||||
THE CALLOUT INTERFACE
|
||||
|
||||
During matching, when PCRE2 reaches a callout point, the external func-
|
||||
tion that is set in the match context is called (if it is set). This
|
||||
applies to both normal and DFA matching. The only argument to the call-
|
||||
out function is a pointer to a pcre2_callout block. This structure con-
|
||||
tains the following fields:
|
||||
During matching, when PCRE2 reaches a callout point, if an external
|
||||
function is set in the match context, it is called. This applies to
|
||||
both normal and DFA matching. The only argument to the callout function
|
||||
is a pointer to a pcre2_callout block. This structure contains the fol-
|
||||
lowing fields:
|
||||
|
||||
uint32_t version;
|
||||
uint32_t callout_number;
|
||||
|
@ -3193,69 +3197,69 @@ THE CALLOUT INTERFACE
|
|||
PCRE2_SIZE pattern_position;
|
||||
PCRE2_SIZE next_item_length;
|
||||
|
||||
The version field contains the version number of the block format. The
|
||||
The version field contains the version number of the block format. The
|
||||
current version is 0. The version number will change in future if addi-
|
||||
tional fields are added, but the intention is never to remove any of
|
||||
tional fields are added, but the intention is never to remove any of
|
||||
the existing fields.
|
||||
|
||||
The callout_number field contains the number of the callout, as com-
|
||||
piled into the pattern (that is, the number after ?C for manual call-
|
||||
The callout_number field contains the number of the callout, as com-
|
||||
piled into the pattern (that is, the number after ?C for manual call-
|
||||
outs, and 255 for automatically generated callouts).
|
||||
|
||||
The offset_vector field is a pointer to the vector of capturing offsets
|
||||
(the "ovector") that was passed to the matching function in the match
|
||||
data block. When pcre2_match() is used, the contents can be inspected,
|
||||
in order to extract substrings that have been matched so far, in the
|
||||
same way as for extracting substrings after a match has completed. For
|
||||
(the "ovector") that was passed to the matching function in the match
|
||||
data block. When pcre2_match() is used, the contents can be inspected
|
||||
in order to extract substrings that have been matched so far, in the
|
||||
same way as for extracting substrings after a match has completed. For
|
||||
the DFA matching function, this field is not useful.
|
||||
|
||||
The subject and subject_length fields contain copies of the values that
|
||||
were passed to the matching function.
|
||||
|
||||
The start_match field normally contains the offset within the subject
|
||||
at which the current match attempt started. However, if the escape
|
||||
sequence \K has been encountered, this value is changed to reflect the
|
||||
modified starting point. If the pattern is not anchored, the callout
|
||||
The start_match field normally contains the offset within the subject
|
||||
at which the current match attempt started. However, if the escape
|
||||
sequence \K has been encountered, this value is changed to reflect the
|
||||
modified starting point. If the pattern is not anchored, the callout
|
||||
function may be called several times from the same point in the pattern
|
||||
for different starting points in the subject.
|
||||
|
||||
The current_position field contains the offset within the subject of
|
||||
The current_position field contains the offset within the subject of
|
||||
the current match pointer.
|
||||
|
||||
When pcre2_match() is used, the capture_top field contains one more
|
||||
than the number of the highest numbered captured substring so far. If
|
||||
than the number of the highest numbered captured substring so far. If
|
||||
no substrings have been captured, the value of capture_top is one. This
|
||||
is always the case when the DFA functions are used, because they do not
|
||||
support captured substrings.
|
||||
|
||||
The capture_last field contains the number of the most recently cap-
|
||||
tured substring. However, when a recursion exits, the value reverts to
|
||||
what it was outside the recursion, as do the values of all captured
|
||||
substrings. If no substrings have been captured, the value of cap-
|
||||
The capture_last field contains the number of the most recently cap-
|
||||
tured substring. However, when a recursion exits, the value reverts to
|
||||
what it was outside the recursion, as do the values of all captured
|
||||
substrings. If no substrings have been captured, the value of cap-
|
||||
ture_last is 0. This is always the case for the DFA matching functions.
|
||||
|
||||
The callout_data field contains a value that is passed to a matching
|
||||
function specifically so that it can be passed back in callouts. It is
|
||||
set in the match context when the callout is set up by calling
|
||||
The callout_data field contains a value that is passed to a matching
|
||||
function specifically so that it can be passed back in callouts. It is
|
||||
set in the match context when the callout is set up by calling
|
||||
pcre2_set_callout() (see the pcre2api documentation).
|
||||
|
||||
The pattern_position field contains the offset to the next item to be
|
||||
The pattern_position field contains the offset to the next item to be
|
||||
matched in the pattern string.
|
||||
|
||||
The next_item_length field contains the length of the next item to be
|
||||
The next_item_length field contains the length of the next item to be
|
||||
matched in the pattern string. When the callout immediately precedes an
|
||||
alternation bar, a closing parenthesis, or the end of the pattern, the
|
||||
length is zero. When the callout precedes an opening parenthesis, the
|
||||
alternation bar, a closing parenthesis, or the end of the pattern, the
|
||||
length is zero. When the callout precedes an opening parenthesis, the
|
||||
length is that of the entire subpattern.
|
||||
|
||||
The pattern_position and next_item_length fields are intended to help
|
||||
in distinguishing between different automatic callouts, which all have
|
||||
The pattern_position and next_item_length fields are intended to help
|
||||
in distinguishing between different automatic callouts, which all have
|
||||
the same callout number. However, they are set for all callouts.
|
||||
|
||||
In callouts from pcre2_match() the mark field contains a pointer to the
|
||||
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
|
||||
(*THEN) item in the match, or NULL if no such items have been passed.
|
||||
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
|
||||
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
|
||||
(*THEN) item in the match, or NULL if no such items have been passed.
|
||||
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
|
||||
previous (*MARK). In callouts from the DFA matching function this field
|
||||
always contains NULL.
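The way offset_vector and capture_top work together can be sketched without the library. The structure below is a hypothetical stand-in that mirrors only some of the fields described above; a real callout function receives a pointer to a pcre2_callout block instead. DEMO_UNSET plays the role of PCRE2_UNSET, and the sketch assumes that the offsets for group n are held in elements 2n and 2n+1 of the vector, as they are after a completed match.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET, which marks an offset that has not been set. */
#define DEMO_UNSET (~(size_t)0)

/* Hypothetical mirror of some callout block fields (not the real type). */
typedef struct demo_callout_block {
    uint32_t callout_number;
    uint32_t capture_top;       /* one more than highest captured group */
    size_t *offset_vector;      /* pairs of start/end offsets */
    const char *subject;
    size_t subject_length;
    size_t start_match;
    size_t current_position;
} demo_callout_block;

/* Copy captured group n (n >= 1) into buf as a zero-terminated string.
   Returns the length copied, or -1 if the group is not available or
   does not fit. Group 0 is rejected because the overall match is not
   complete while a callout is running. */
int demo_copy_capture(const demo_callout_block *cb, uint32_t n,
                      char *buf, size_t bufsize)
{
    if (n == 0 || n >= cb->capture_top) return -1;
    size_t start = cb->offset_vector[2 * n];
    size_t end = cb->offset_vector[2 * n + 1];
    if (start == DEMO_UNSET || end == DEMO_UNSET || end < start) return -1;
    size_t len = end - start;
    if (len + 1 > bufsize) return -1;
    memcpy(buf, cb->subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

Checking n against capture_top before reading the vector avoids inspecting groups that have not been captured at the point where the callout is taken.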
|
||||
|
||||
|
@ -3263,16 +3267,16 @@ THE CALLOUT INTERFACE
|
|||
RETURN VALUES
|
||||
|
||||
The external callout function returns an integer to PCRE2. If the value
|
||||
is zero, matching proceeds as normal. If the value is greater than
|
||||
zero, matching fails at the current point, but the testing of other
|
||||
is zero, matching proceeds as normal. If the value is greater than
|
||||
zero, matching fails at the current point, but the testing of other
|
||||
matching possibilities goes ahead, just as if a lookahead assertion had
|
||||
failed. If the value is less than zero, the match is abandoned, and the
|
||||
matching function returns the negative value.
|
||||
|
||||
Negative values should normally be chosen from the set of
|
||||
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
|
||||
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
|
||||
reserved for use by callout functions; it will never be used by PCRE2
|
||||
Negative values should normally be chosen from the set of
|
||||
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
|
||||
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
|
||||
reserved for use by callout functions; it will never be used by PCRE2
|
||||
itself.
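The three cases for the return value can be restated as a small sketch. The enumeration and function below are hypothetical names used only for illustration; they describe how the matching function reacts to a callout's return value and are not part of the PCRE2 API.

```c
/* Hypothetical classification of a callout function's return value. */
typedef enum {
    DEMO_PROCEED,    /* zero: matching proceeds as normal */
    DEMO_FAIL_HERE,  /* positive: fail at this point, try other paths */
    DEMO_ABANDON     /* negative: abandon the match, return the value */
} demo_callout_action;

demo_callout_action demo_classify_callout_return(int rc)
{
    if (rc == 0) return DEMO_PROCEED;
    if (rc > 0) return DEMO_FAIL_HERE;
    return DEMO_ABANDON;
}
```

A callout that wants to stop matching altogether would therefore return a negative value such as PCRE2_ERROR_CALLOUT, whereas one that only wants to reject the current path would return a positive value.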
|
||||
|
||||
|
||||
|
@ -3285,7 +3289,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 19 October 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -3487,12 +3491,12 @@ PCRE2 JUST-IN-TIME COMPILER SUPPORT
|
|||
|
||||
Just-in-time compiling is a heavyweight optimization that can greatly
|
||||
speed up pattern matching. However, it comes at the cost of extra pro-
|
||||
cessing before the match is performed. Therefore, it is of most benefit
|
||||
when the same pattern is going to be matched many times. This does not
|
||||
necessarily mean many calls of a matching function; if the pattern is
|
||||
not anchored, matching attempts may take place many times at various
|
||||
positions in the subject, even for a single call. Therefore, if the
|
||||
subject string is very long, it may still pay to use JIT for one-off
|
||||
cessing before the match is performed, so it is of most benefit when
|
||||
the same pattern is going to be matched many times. This does not nec-
|
||||
essarily mean many calls of a matching function; if the pattern is not
|
||||
anchored, matching attempts may take place many times at various posi-
|
||||
tions in the subject, even for a single call. Therefore, if the subject
|
||||
string is very long, it may still pay to use JIT even for one-off
|
||||
matches. JIT support is available for all of the 8-bit, 16-bit and
|
||||
32-bit PCRE2 libraries.
|
||||
|
||||
|
@ -3558,8 +3562,8 @@ SIMPLE USE OF JIT
|
|||
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
|
||||
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
|
||||
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
|
||||
diately returns zero. This is an alternative way of testing if JIT is
|
||||
available.
|
||||
diately returns zero. This is an alternative way of testing whether JIT
|
||||
is available.
|
||||
|
||||
At present, it is not possible to free JIT compiled code except when
|
||||
the entire compiled pattern is freed by calling pcre2_free_code().
|
||||
|
@ -3745,7 +3749,7 @@ JIT STACK FAQ
|
|||
an already freed stack, as that will cause SEGFAULT. (Also, do not free
|
||||
a stack currently used by pcre2_match() in another thread). You can
|
||||
also replace the stack in a context at any time when it is not in use.
|
||||
You can also free the previous stack before assigning a replacement.
|
||||
You should free the previous stack before assigning a replacement.
|
||||
|
||||
(5) Should I allocate/free a stack every time before/after calling
|
||||
pcre2_match()?
|
||||
|
@ -3855,7 +3859,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 12 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -4642,7 +4646,7 @@ WIDE CHARACTERS AND UTF MODES
|
|||
UTF mode, but its use can lead to some strange effects because it
|
||||
breaks up multi-unit characters (see the description of \C in the
|
||||
pcre2pattern documentation). The use of \C is not supported in the
|
||||
alternative matching function pcre2_dfa_exec(), nor is it supported in
|
||||
alternative matching function pcre2_dfa_match(), nor is it supported in
|
||||
UTF mode by the JIT optimization. If JIT optimization is requested for
|
||||
a UTF pattern that contains \C, it will not succeed, and so the match-
|
||||
ing will be carried out by the normal interpretive function.
|
||||
|
@ -4701,14 +4705,14 @@ VALIDITY OF UTF STRINGS
|
|||
In some situations, you may already know that your strings are valid,
|
||||
and therefore want to skip these checks in order to improve perfor-
|
||||
mance, for example in the case of a long subject string that is being
|
||||
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK flag at compile
|
||||
time or at run time, PCRE2 assumes that the pattern or subject it is
|
||||
given (respectively) contains only valid UTF code unit sequences.
|
||||
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
|
||||
pile time or at match time, PCRE2 assumes that the pattern or subject
|
||||
it is given (respectively) contains only valid UTF code unit sequences.
|
||||
|
||||
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
|
||||
for the pattern; it does not also apply to subject strings. If you want
|
||||
to disable the check for a subject string you must pass this option to
|
||||
pcre2_exec() or pcre2_dfa_exec().
|
||||
pcre2_match() or pcre2_dfa_match().
|
||||
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
||||
result is undefined and your program may crash or loop indefinitely.
|
||||
|
@ -4807,7 +4811,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2API 3 "23 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -2090,6 +2090,13 @@ returned by \fBpcre2_get_startchar()\fP. For a non-partial match, this can be
|
|||
different to the value of \fIovector[0]\fP if the pattern contains the \eK
|
||||
escape sequence. After a partial match, however, this value is always the same
|
||||
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
||||
.P
|
||||
The \fBstartchar\fP field is also used to return the offset of an invalid
|
||||
UTF character when UTF checking fails. Details are given in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
page.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="errorlist"></a>
|
||||
|
@ -2707,6 +2714,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 21 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
109
doc/pcre2build.3
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2BUILD 3 "03 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.
|
||||
|
@ -74,12 +74,12 @@ respectively. These can be interpreted either as single-unit characters or
|
|||
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
|
||||
the following to the \fBconfigure\fP command:
|
||||
.sp
|
||||
--enable-pcre16
|
||||
--enable-pcre32
|
||||
--enable-pcre2-16
|
||||
--enable-pcre2-32
|
||||
.sp
|
||||
If you do not want the 8-bit library, add
|
||||
.sp
|
||||
--disable-pcre8
|
||||
--disable-pcre2-8
|
||||
.sp
|
||||
as well. At least one of the three libraries must be built. Note that the POSIX
|
||||
wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit
|
||||
|
@ -91,15 +91,16 @@ libraries.
|
|||
.rs
|
||||
.sp
|
||||
The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared
|
||||
and static libraries by default. You can suppress one of these by adding one of
|
||||
and static libraries by default. You can suppress an unwanted library by adding
|
||||
one of
|
||||
.sp
|
||||
--disable-shared
|
||||
--disable-static
|
||||
.sp
|
||||
to the \fBconfigure\fP command, as required.
|
||||
to the \fBconfigure\fP command.
|
||||
.
|
||||
.
|
||||
.SH "Unicode and UTF SUPPORT"
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
.rs
|
||||
.sp
|
||||
By default, PCRE2 is built with support for Unicode and UTF character strings.
|
||||
|
@ -112,18 +113,14 @@ is not possible to build one library with Unicode support, and another without,
|
|||
in the same configuration.
|
||||
.P
|
||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
||||
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
|
||||
\fBpcre2_compile()\fP to compile a pattern.
|
||||
or UTF-32. To do that, applications that use the library have to set the
|
||||
PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
|
||||
.P
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
||||
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
||||
exclusive.
|
||||
.P
|
||||
UTF support allows the libraries to process character codepoints up to 0x10ffff
|
||||
in the strings that they handle. It also provides support for accessing the
|
||||
properties of such characters, using pattern escapes such as \eP, \ep, and \eX.
|
||||
Only the general category properties such as \fILu\fP and \fINd\fP are
|
||||
supported. Details are given in the
|
||||
UTF support allows the libraries to process character code points up to
|
||||
0x10ffff in the strings that they handle. It also provides support for
|
||||
accessing the Unicode properties of such characters, using pattern escapes such
|
||||
as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
|
||||
\fINd\fP are supported. Details are given in the
|
||||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
|
@ -138,7 +135,7 @@ Just-in-time compiler support is included in the build by specifying
|
|||
--enable-jit
|
||||
.sp
|
||||
This support is available only for certain hardware architectures. If this
|
||||
option is set for an unsupported architecture, a compile time error occurs.
|
||||
option is set for an unsupported architecture, a build error occurs.
|
||||
See the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
|
@ -151,7 +148,7 @@ pcre2grep automatically makes use of it, unless you add
|
|||
to the "configure" command.
|
||||
.
|
||||
.
|
||||
.SH "CODE VALUE OF NEWLINE"
|
||||
.SH "NEWLINE RECOGNITION"
|
||||
.rs
|
||||
.sp
|
||||
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
||||
|
@ -160,11 +157,12 @@ compile PCRE2 to use carriage return (CR) instead, by adding
|
|||
.sp
|
||||
--enable-newline-is-cr
|
||||
.sp
|
||||
to the \fBconfigure\fP command. There is also a --enable-newline-is-lf option,
|
||||
to the \fBconfigure\fP command. There is also an --enable-newline-is-lf option,
|
||||
which explicitly specifies linefeed as the newline character.
|
||||
.sp
|
||||
Alternatively, you can specify that line endings are to be indicated by the two
|
||||
character sequence CRLF. If you want this, add
|
||||
.P
|
||||
Alternatively, you can specify that line endings are to be indicated by the
|
||||
two-character sequence CRLF (CR immediately followed by LF). If you want this,
|
||||
add
|
||||
.sp
|
||||
--enable-newline-is-crlf
|
||||
.sp
|
||||
|
@ -177,10 +175,13 @@ indicating a line ending. Finally, a fifth option, specified by
|
|||
.sp
|
||||
--enable-newline-is-any
|
||||
.sp
|
||||
causes PCRE2 to recognize any Unicode newline sequence.
|
||||
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
|
||||
sequences are the three just mentioned, plus the single characters VT (vertical
|
||||
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
|
||||
separator, U+2028), and PS (paragraph separator, U+2029).
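
As a rough illustration of the list above (not PCRE2's own code), the "any" convention can be sketched in Python as a function that reports how long a newline sequence is at a given position. The names `UNICODE_NEWLINES` and `newline_length` are hypothetical and exist only for this sketch.

```python
# Single characters treated as newlines under the "any" convention,
# taken from the list in the text: CR, LF, VT, FF, NEL, LS, PS.
UNICODE_NEWLINES = {"\r", "\n", "\x0b", "\x0c", "\x85", "\u2028", "\u2029"}


def newline_length(subject: str, pos: int) -> int:
    """Return the length of the newline sequence starting at pos, or 0."""
    if pos < 0 or pos >= len(subject):
        return 0
    # CRLF is tested first so that it counts as one two-character sequence.
    if subject.startswith("\r\n", pos):
        return 2
    return 1 if subject[pos] in UNICODE_NEWLINES else 0
```

Checking CRLF before the single characters is the design choice that matters here: otherwise CR and LF would be reported as two separate line endings.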
|
||||
.P
|
||||
Whatever line ending convention is selected when PCRE2 is built can be
|
||||
overridden when the library functions are called. At build time it is
|
||||
Whatever default line ending convention is selected when PCRE2 is built can be
|
||||
overridden by applications that use the library. At build time it is
|
||||
conventional to use the standard for your operating system.
|
||||
.
|
||||
.
|
||||
|
@ -188,12 +189,13 @@ conventional to use the standard for your operating system.
|
|||
.rs
|
||||
.sp
|
||||
By default, the sequence \eR in a pattern matches any Unicode newline sequence,
|
||||
whatever has been selected as the line ending sequence. If you specify
|
||||
independently of what has been selected as the line ending sequence. If you
|
||||
specify
|
||||
.sp
|
||||
--enable-bsr-anycrlf
|
||||
.sp
|
||||
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
|
||||
selected when PCRE2 is built can be overridden when the library functions are
|
||||
selected when PCRE2 is built can be overridden by applications that use the
|
||||
called.
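
The two possible defaults for \eR can be approximated with ordinary Python regular expressions. This is only a sketch of the character sets described above, written for Python's `re` module rather than PCRE2; the names `BSR_ANYCRLF` and `BSR_UNICODE` are made up for the example.

```python
import re

# \R restricted to CR, LF, or CRLF (the --enable-bsr-anycrlf default).
BSR_ANYCRLF = r"(?:\r\n|\r|\n)"

# \R matching any Unicode newline sequence (the normal default).
BSR_UNICODE = r"(?:\r\n|[\n\x0b\x0c\r\x85\u2028\u2029])"


def matches_bsr(pattern: str, text: str) -> bool:
    """Return True if text as a whole is one newline sequence for pattern."""
    return re.fullmatch(pattern, text) is not None
```

With these definitions, a line separator (U+2028) is accepted by `BSR_UNICODE` but rejected by `BSR_ANYCRLF`, while CRLF is accepted by both.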
|
||||
.
|
||||
.
|
||||
|
@ -204,10 +206,10 @@ Within a compiled pattern, offset values are used to point from one part to
|
|||
another (for example, from an opening parenthesis to an alternation
|
||||
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
|
||||
are used for these offsets, leading to a maximum size for a compiled pattern of
|
||||
around 64K. This is sufficient to handle all but the most gigantic patterns.
|
||||
Nevertheless, some people do want to process truly enormous patterns, so it is
|
||||
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
|
||||
setting such as
|
||||
around 64K code units. This is sufficient to handle all but the most gigantic
|
||||
patterns. Nevertheless, some people do want to process truly enormous patterns,
|
||||
so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
|
||||
adding a setting such as
|
||||
.sp
|
||||
--with-link-size=3
|
||||
.sp
|
||||
|
@ -299,16 +301,19 @@ hand".)
|
|||
.rs
|
||||
.sp
|
||||
PCRE2 assumes by default that it will run in an environment where the character
|
||||
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
|
||||
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
||||
most computer operating systems. PCRE2 can, however, be compiled to run in an
|
||||
EBCDIC environment by adding
|
||||
8-bit EBCDIC environment by adding
|
||||
.sp
|
||||
--enable-ebcdic --disable-unicode
|
||||
.sp
|
||||
to the \fBconfigure\fP command. This setting implies
|
||||
--enable-rebuild-chartables. You should only use it if you know that you are in
|
||||
an EBCDIC environment (for example, an IBM mainframe operating system). The
|
||||
--enable-ebcdic option is incompatible with Unicode support.
|
||||
an EBCDIC environment (for example, an IBM mainframe operating system).
|
||||
.P
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
||||
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
||||
exclusive.
|
||||
.P
|
||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
|
||||
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
|
||||
|
@ -354,8 +359,8 @@ parameter value by adding, for example,
|
|||
.sp
|
||||
--with-pcre2grep-bufsize=50K
|
||||
.sp
|
||||
to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can, however,
|
||||
override this value by specifying a run-time option.
|
||||
to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override this
|
||||
value by using --buffer-size on the command line.
|
||||
.
|
||||
.
|
||||
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
|
||||
|
@ -371,15 +376,15 @@ to the \fBconfigure\fP command, \fBpcre2test\fP is linked with the
|
|||
from a terminal, it reads it using the \fBreadline()\fP function. This provides
|
||||
line-editing and history facilities. Note that \fBlibreadline\fP is
|
||||
GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this
|
||||
way, there may be licensing issues. These can be avoided by linking with
|
||||
\fBlibedit\fP (which has a BSD licence) instead.
|
||||
way, there may be licensing issues. These can be avoided by linking instead
|
||||
with \fBlibedit\fP, which has a BSD licence.
|
||||
.P
|
||||
Setting this option causes the \fB-lreadline\fP option to be added to the
|
||||
\fBpcre2test\fP build. In many operating environments with a sytem-installed
|
||||
readline library this is sufficient. However, in some environments (e.g. if an
|
||||
unmodified distribution version of readline is in use), some extra
|
||||
configuration may be necessary. The INSTALL file for \fBlibreadline\fP says
|
||||
this:
|
||||
Setting --enable-pcre2test-libreadline causes the \fB-lreadline\fP option to be
|
||||
added to the \fBpcre2test\fP build. In many operating environments with a
|
||||
system-installed readline library this is sufficient. However, in some
|
||||
environments (e.g. if an unmodified distribution version of readline is in
|
||||
use), some extra configuration may be necessary. The INSTALL file for
|
||||
\fBlibreadline\fP says this:
|
||||
.sp
|
||||
"Readline uses the termcap functions, but does not link with
|
||||
the termcap or curses library itself, allowing applications
|
||||
|
@ -396,13 +401,13 @@ immediately before the \fBconfigure\fP command.
|
|||
.SH "DEBUGGING WITH VALGRIND SUPPORT"
|
||||
.rs
|
||||
.sp
|
||||
By adding the
|
||||
If you add
|
||||
.sp
|
||||
--enable-valgrind
|
||||
.sp
|
||||
option to to the \fBconfigure\fP command, PCRE2 will use valgrind annotations
|
||||
to mark certain memory regions as unaddressable. This allows it to detect
|
||||
invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
to the \fBconfigure\fP command, PCRE2 will use valgrind annotations to mark
|
||||
certain memory regions as unaddressable. This allows it to detect invalid
|
||||
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
.
|
||||
.
|
||||
.SH "CODE COVERAGE REPORTING"
|
||||
|
@ -482,6 +487,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2CALLOUT 3 "19 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2CALLOUT 3 "23 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -68,29 +68,27 @@ expect.
|
|||
.P
|
||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is anchored
|
||||
and then applied with automatic callouts to the string "aaaa" is:
|
||||
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
|
||||
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
||||
"aaaa" is:
|
||||
.sp
|
||||
--->aaaa
|
||||
+0 ^ ^
|
||||
+1 ^ a+
|
||||
+3 ^ ^ [bc]
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
No match
|
||||
.sp
|
||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||
and therefore the callouts that would be taken for the backtracks do not occur.
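
The effect on the number of callouts can be modelled with a toy backtracking loop. This is a Python sketch of the idea only, assuming an anchored match of a+[bc] at offset 0; `count_bc_attempts` is a hypothetical helper, not part of PCRE2 or \fBpcre2test\fP.

```python
def count_bc_attempts(subject: str, possessive: bool) -> int:
    """Count how often [bc] is tried for an anchored a+[bc] match."""
    # Length of the leading run of 'a' characters that a+ can consume.
    run = 0
    while run < len(subject) and subject[run] == "a":
        run += 1
    if run == 0:
        return 0  # a+ itself fails, so [bc] is never reached

    attempts = 0
    # Greedy a+ takes the longest run first, then gives back one 'a' at a time.
    for end in range(run, 0, -1):
        attempts += 1
        if end < len(subject) and subject[end] in "bc":
            break  # [bc] matched, no further backtracking needed
        if possessive:
            break  # a++ never gives characters back
    return attempts
```

For the subject "aaaa" the possessive case tries [bc] once, which mirrors the single [bc] callout in the output above; without possessification the same loop tries it four times.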
|
||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS
|
||||
to \fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). If
|
||||
this is done in \fBpcre2test\fP (using the /no_auto_possess qualifier), the
|
||||
output changes to this:
|
||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||
\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
|
||||
case, the output changes to this:
|
||||
.sp
|
||||
--->aaaa
|
||||
+0 ^ ^
|
||||
+1 ^ a+
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^ ^ [bc]
|
||||
+3 ^^ [bc]
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^^ [bc]
|
||||
No match
|
||||
.sp
|
||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||
|
@ -119,10 +117,10 @@ callouts such as the example above are obeyed.
|
|||
.SH "THE CALLOUT INTERFACE"
|
||||
.rs
|
||||
.sp
|
||||
During matching, when PCRE2 reaches a callout point, the external function that
|
||||
is set in the match context is called (if it is set). This applies to both
|
||||
normal and DFA matching. The only argument to the callout function is a pointer
|
||||
to a \fBpcre2_callout\fP block. This structure contains the following fields:
|
||||
During matching, when PCRE2 reaches a callout point, if an external function is
|
||||
set in the match context, it is called. This applies to both normal and DFA
|
||||
matching. The only argument to the callout function is a pointer to a
|
||||
\fBpcre2_callout\fP block. This structure contains the following fields:
|
||||
.sp
|
||||
uint32_t \fIversion\fP;
|
||||
uint32_t \fIcallout_number\fP;
|
||||
|
@ -149,7 +147,7 @@ automatically generated callouts).
|
|||
.P
|
||||
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
|
||||
(the "ovector") that was passed to the matching function in the match data
|
||||
block. When \fBpcre2_match()\fP is used, the contents can be inspected, in
|
||||
block. When \fBpcre2_match()\fP is used, the contents can be inspected in
|
||||
order to extract substrings that have been matched so far, in the same way as
|
||||
for extracting substrings after a match has completed. For the DFA matching
|
||||
function, this field is not useful.
|
||||
|
@ -238,6 +236,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 19 October 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2GREP 1 "28 September 2014" "PCRE2 10.00"
|
||||
.TH PCRE2GREP 1 "23 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
pcre2grep - a grep with Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
|
@ -403,8 +403,8 @@ used. There is no short form for this option.
|
|||
Processing some regular expression patterns can require a very large amount of
|
||||
memory, leading in some cases to a program crash if not enough is available.
|
||||
Other patterns may take a very long time to search for all possible matching
|
||||
strings. The \fBpcre2_exec()\fP function that is called by \fBpcre2grep\fP to do
|
||||
the matching has two parameters that can limit the resources that it uses.
|
||||
strings. The \fBpcre2_match()\fP function that is called by \fBpcre2grep\fP to
|
||||
do the matching has two parameters that can limit the resources that it uses.
|
||||
.sp
|
||||
The \fB--match-limit\fP option provides a means of limiting resource usage
|
||||
when processing patterns that are not going to match, but which have a very
|
||||
|
@ -678,6 +678,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 28 September 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -446,7 +446,7 @@ OPTIONS
|
|||
very large amount of memory, leading in some cases to a pro-
|
||||
gram crash if not enough is available. Other patterns may
|
||||
take a very long time to search for all possible matching
|
||||
strings. The pcre2_exec() function that is called by
|
||||
strings. The pcre2_match() function that is called by
|
||||
pcre2grep to do the matching has two parameters that can
|
||||
limit the resources that it uses.
|
||||
|
||||
|
@ -737,5 +737,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 28 September 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2JIT 3 "12 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2JIT 3 "23 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
||||
|
@ -6,11 +6,11 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
|||
.sp
|
||||
Just-in-time compiling is a heavyweight optimization that can greatly speed up
|
||||
pattern matching. However, it comes at the cost of extra processing before the
|
||||
match is performed. Therefore, it is of most benefit when the same pattern is
|
||||
going to be matched many times. This does not necessarily mean many calls of a
|
||||
matching function; if the pattern is not anchored, matching attempts may take
|
||||
place many times at various positions in the subject, even for a single call.
|
||||
Therefore, if the subject string is very long, it may still pay to use JIT for
|
||||
match is performed, so it is of most benefit when the same pattern is going to
|
||||
be matched many times. This does not necessarily mean many calls of a matching
|
||||
function; if the pattern is not anchored, matching attempts may take place many
|
||||
times at various positions in the subject, even for a single call. Therefore,
|
||||
if the subject string is very long, it may still pay to use JIT even for
|
||||
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
|
||||
32-bit PCRE2 libraries.
|
||||
.P
|
||||
|
@ -77,7 +77,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
|
|||
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
||||
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
||||
\fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
|
||||
returns zero. This is an alternative way of testing if JIT is available.
|
||||
returns zero. This is an alternative way of testing whether JIT is available.
|
||||
.P
|
||||
At present, it is not possible to free JIT compiled code except when the entire
|
||||
compiled pattern is freed by calling \fBpcre2_free_code()\fP.
|
||||
|
@ -276,7 +276,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just \fIdo
|
|||
not\fP call \fBpcre2_match()\fP with a match context pointing to an already
|
||||
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
|
||||
used by \fBpcre2_match()\fP in another thread). You can also replace the stack
|
||||
in a context at any time when it is not in use. You can also free the previous
|
||||
in a context at any time when it is not in use. You should free the previous
|
||||
stack before assigning a replacement.
|
||||
.P
|
||||
(5) Should I allocate/free a stack every time before/after calling
|
||||
|
@ -398,6 +398,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 12 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -394,7 +394,7 @@ appear.
|
|||
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
||||
.sp
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_exec(), not increase them.
|
||||
limits set by the caller of pcre2_match(), not increase them.
|
||||
.
|
||||
.
|
||||
.SH "NEWLINE CONVENTION"
|
||||
|
@ -536,6 +536,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 14 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
119
doc/pcre2test.1
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
|
||||
.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00"
|
||||
.SH NAME
|
||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
|
@ -200,7 +200,7 @@ input lines. Each set starts with a regular expression pattern, followed by any
|
|||
number of subject lines to be matched against that pattern. In between sets of
|
||||
test data, command lines that begin with a hash (#) character may appear. This
|
||||
file format, with some restrictions, can also be processed by the
|
||||
\fBperltest.pl\fP script that is distributed with PCRE2 as a means of checking
|
||||
\fBperltest.sh\fP script that is distributed with PCRE2 as a means of checking
|
||||
that the behaviour of PCRE2 and Perl is the same.
|
||||
.P
|
||||
Each subject line is matched separately and independently. If you want to do
|
||||
|
@ -243,11 +243,11 @@ patterns. Modifiers on a pattern can change these settings.
|
|||
#perltest
|
||||
.sp
|
||||
The appearance of this line causes all subsequent modifier settings to be
|
||||
checked for compatibility with the \fBperltest.pl\fP script, which is used to
|
||||
checked for compatibility with the \fBperltest.sh\fP script, which is used to
|
||||
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
||||
lines, none of the other command lines are permitted, because they and many
|
||||
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
|
||||
test files that are also processed by \fBperltest.pl\fP. The \fP#perltest\fB
|
||||
test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
|
||||
command helps detect tests that are accidentally put in the wrong file.
|
||||
.sp
|
||||
#subject <modifier-list>
|
||||
|
@ -265,7 +265,7 @@ for both patterns and subject lines, whereas others are valid for one or the
|
|||
other only. Each modifier has a long name, for example "anchored", and some of
|
||||
them must be followed by an equals sign and a value, for example, "offset=12".
|
||||
Modifiers that do not take values may be preceded by a minus sign to turn off a
|
||||
previous default setting.
|
||||
previous setting.
|
||||
.P
|
||||
A few of the more common modifiers can also be specified as single letters, for
|
||||
example "i" for "caseless". In documentation, following the Perl convention,
|
||||
|
@ -336,7 +336,7 @@ encoding non-printing characters in a visible way:
|
|||
\exhh hexadecimal byte (up to 2 hex digits)
|
||||
\ex{hh...} hexadecimal character (any number of hex digits)
|
||||
.sp
|
||||
The use of \ex{hh...} is not dependent on the use of the utf modifier on
|
||||
The use of \ex{hh...} is not dependent on the use of the \fButf\fP modifier on
|
||||
the pattern. It is recognized always. There may be any number of hexadecimal
|
||||
digits inside the braces; invalid values provoke error messages.
|
||||
.P
|
||||
|
@ -366,7 +366,7 @@ part of the file. For example:
|
|||
is converted to "abcabcabcabc". This feature does not support nesting. To
|
||||
include a closing square bracket in the characters, code it as \ex5D.
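
The expansion can be pictured with a short Python sketch. It assumes the repetition is written as a backslash, a bracketed run of characters, and a count in braces (for example, a backslash followed by [abc]{4}); `expand_repeats` is a hypothetical name, and this is not how \fBpcre2test\fP itself is implemented.

```python
import re

# A backslash, then [characters], then {count}. The characters may not
# contain a closing bracket, which matches the "no nesting" restriction.
_REPEAT = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_repeats(line: str) -> str:
    """Replace each repetition item with its characters repeated count times."""
    return _REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), line)
```

The sketch leaves every other escape untouched, so a sequence such as \ex5D inside the brackets would still need to be decoded by a later step.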
|
||||
.P
|
||||
A backslash followed by an equals sign marke the end of the subject string and
|
||||
A backslash followed by an equals sign marks the end of the subject string and
|
||||
the start of a modifier list. For example:
|
||||
.sp
|
||||
abc\e=notbol,notempty
|
||||
|
@ -461,8 +461,8 @@ set to "anycrlf", \eR matches CR, LF, or CRLF only. If it is set to "unicode",
|
|||
is built, with the default default being Unicode.
|
||||
.P
|
||||
The \fBnewline\fP modifier specifies which characters are to be interpreted as
|
||||
newlines, both in the pattern and (by default) in subject lines. The type must
|
||||
be one of CR, LF, CRLF, ANYCRLF, or ANY.
|
||||
newlines, both in the pattern and in subject lines. The type must be one of CR,
|
||||
LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
|
||||
.
|
||||
.
|
||||
.SS "Information about a pattern"
|
||||
|
@ -478,8 +478,8 @@ link sizes and different code unit widths. By using \fBbincode\fP, the same
|
|||
regression tests can be used in different environments.
|
||||
.P
|
||||
The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and
|
||||
offset values. This is used in a few special tests and is also useful for
|
||||
one-off tests.
|
||||
offset values. This is used in a few special tests that run only for specific
|
||||
code unit widths and link sizes, and is also useful for one-off tests.
|
||||
.P
|
||||
The \fBinfo\fP modifier requests information about the compiled pattern
|
||||
(whether it is anchored, has a fixed first character, and so on). The
|
||||
|
@ -501,13 +501,14 @@ some typical examples:
|
|||
Last code unit = 'c' (caseless)
|
||||
Subject length lower bound = 3
|
||||
.sp
|
||||
"Compile options" are those specified to the compile function; "overall
|
||||
options" have added options that are taken or deduced from the pattern. If both
|
||||
sets of options are the same, just a single "options" line is output. "First
|
||||
code unit" is where any match must start; if there is more than one they are
|
||||
listed as "starting code units". "Last code unit" is the last literal code unit
|
||||
that must be present in any match. This is not necessarily the last character.
|
||||
These lines are omitted if no starting or ending code units are recorded.
|
||||
"Compile options" are those specified by modifiers; "overall options" have
|
||||
added options that are taken or deduced from the pattern. If both sets of
|
||||
options are the same, just a single "options" line is output; if there are no
|
||||
options, the line is omitted. "First code unit" is where any match must start;
|
||||
if there is more than one they are listed as "starting code units". "Last code
|
||||
unit" is the last literal code unit that must be present in any match. This is
|
||||
not necessarily the last character. These lines are omitted if no starting or
|
||||
ending code units are recorded.
|
||||
.
|
||||
.
|
||||
.SS "Specifying a pattern in hex"
|
||||
|
@ -520,16 +521,16 @@ pairs. For example:
|
|||
/ab 32 59/hex
|
||||
.sp
|
||||
This feature is provided as a way of creating patterns that contain binary zero
|
||||
characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
|
||||
strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
|
||||
However, for patterns specified in hexadecimal, the actual length of the
|
||||
pattern is passed.
|
||||
and other non-printing characters. By default, \fBpcre2test\fP passes patterns
|
||||
as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
|
||||
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
|
||||
actual length of the pattern is passed.
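
To see why an explicit length is needed, here is a small Python sketch that turns a string of hexadecimal pairs into bytes. It is an illustration under the assumption that white space between pairs is ignored; `hex_pattern_to_bytes` is a made-up helper, not a \fBpcre2test\fP function.

```python
def hex_pattern_to_bytes(text: str) -> bytes:
    """Convert hex digit pairs, optionally separated by spaces, to bytes."""
    # bytes.fromhex() skips ASCII white space between pairs.
    return bytes.fromhex(text)


pattern = hex_pattern_to_bytes("61 00 62")
# The pattern contains a binary zero, so a zero-terminated string would
# stop after the first byte; the true length has to be passed instead.
true_length = len(pattern)
length_if_zero_terminated = pattern.index(0)
```

Here `true_length` is 3, whereas treating the same data as a zero-terminated string would see only one byte.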
|
||||
.
|
||||
.
|
||||
.SS "JIT compilation"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/jit\fP modifier may optionally be followed by and equals sign and a
|
||||
The \fB/jit\fP modifier may optionally be followed by an equals sign and a
|
||||
number in the range 0 to 7:
|
||||
.sp
|
||||
0 disable JIT
|
||||
|
@ -561,7 +562,7 @@ pattern shows whether JIT compilation was or was not successful. If
|
|||
\fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT
|
||||
compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is
|
||||
added to the first output line after a match or non match when JIT-compiled
|
||||
code was actually used.
|
||||
code was actually used in the match.
|
||||
.
|
||||
.
|
||||
.SS "Setting a locale"
|
||||
|
@ -645,8 +646,8 @@ be aborted.
|
|||
.SS "Using alternative character tables"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/tables\fP modifier must be followed by a single digit. It causes a
|
||||
specific set of built-in character tables to be passed to
|
||||
The value specified for the \fB/tables\fP modifier must be one of the digits 0,
|
||||
1, or 2. It causes a specific set of built-in character tables to be passed to
|
||||
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
|
||||
different character tables. The digit specifies the tables as follows:
|
||||
.sp
|
||||
|
@ -759,13 +760,13 @@ The effects of these modifiers are described in the following sections.
|
|||
.SS "Showing more text"
|
||||
.rs
|
||||
.sp
|
||||
The \fBaftertext\fP modifier requests that as well as outputting the substring
|
||||
that matched the entire pattern, \fBpcre2test\fP should in addition output the
|
||||
remainder of the subject string. This is useful for tests where the subject
|
||||
contains multiple copies of the same substring. The \fBallaftertext\fP modifier
|
||||
requests the same action for captured substrings as well as the main matched
|
||||
substring. In each case the remainder is output on the following line with a
|
||||
plus character following the capture number.
|
||||
The \fBaftertext\fP modifier requests that as well as outputting the part of
|
||||
the subject string that matched the entire pattern, \fBpcre2test\fP should in
|
||||
addition output the remainder of the subject string. This is useful for tests
|
||||
where the subject contains multiple copies of the same substring. The
|
||||
\fBallaftertext\fP modifier requests the same action for captured substrings as
|
||||
well as the main matched substring. In each case the remainder is output on the
|
||||
following line with a plus character following the capture number.
|
||||
.P
|
||||
The \fBallusedtext\fP modifier requests that all the text that was consulted
|
||||
during a successful pattern match by the interpreter should be shown. This
|
||||
|
@ -782,7 +783,8 @@ underneath them. Here is an example:
|
|||
<<< >>>
|
||||
.sp
|
||||
This shows that the matched string is "abc", with the preceding and following
|
||||
strings "pqr" and "xyz" also consulted during the match.
|
||||
strings "pqr" and "xyz" having been consulted during the match (when processing
|
||||
the assertions).
|
||||
.P
|
||||
The \fBstartchar\fP modifier requests that the starting character for the match
|
||||
be indicated, if it is different to the start of the matched string. The only
|
||||
|
@ -836,7 +838,7 @@ function is called again to search the remainder of the subject. The difference
|
|||
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
|
||||
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
|
||||
to start searching at a new point within the entire string (which is what Perl
|
||||
does), whereas the latter passes over a shortened substring. This makes a
|
||||
does), whereas the latter passes over a shortened subject. This makes a
|
||||
difference to the matching process if the pattern begins with a lookbehind
|
||||
assertion (including \eb or \eB).
|
||||
.P
|
||||
|
@ -847,7 +849,7 @@ fails, the start offset is advanced, and the normal match is retried. This
|
|||
imitates the way Perl handles such cases when using the \fB/g\fP modifier or
|
||||
the \fBsplit()\fP function. Normally, the start offset is advanced by one
|
||||
character, but if the newline convention recognizes CRLF as a newline, and the
|
||||
current character is CR followed by LF, an advance of two is used.
|
||||
current character is CR followed by LF, an advance of two characters occurs.
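
The advance rule can be sketched in a few lines of Python. This is a simplified model that indexes the subject by character, and `next_start_offset` and its `crlf_is_newline` flag are hypothetical names introduced for the example.

```python
def next_start_offset(subject: str, offset: int, crlf_is_newline: bool) -> int:
    """Return the offset at which to retry after a failed non-empty match."""
    # When CRLF counts as a newline, step over both characters so that the
    # retry does not start between the CR and the LF.
    if crlf_is_newline and subject.startswith("\r\n", offset):
        return offset + 2
    return offset + 1
```

Stepping over CRLF as a unit avoids retrying in the middle of a line ending, which is what the two-character advance described above achieves.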
|
||||
.
|
||||
.
|
||||
.SS "Testing substring extraction functions"
|
||||
|
@ -860,9 +862,9 @@ for example:
|
|||
.sp
|
||||
abcd\e=copy=1,copy=3,get=G1
|
||||
.sp
|
||||
If the \fB#subject\fP command is used to set default copy and get lists, these
|
||||
can be unset by specifying a negative number for numbered groups and an empty
|
||||
name for named groups.
|
||||
If the \fB#subject\fP command is used to set default copy and/or get lists,
|
||||
these can be unset by specifying a negative number to cancel all numbered
|
||||
groups and an empty name to cancel all named groups.
|
||||
.P
|
||||
The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which
|
||||
extracts all captured substrings.
|
||||
|
@ -871,7 +873,8 @@ If the subject line is successfully matched, the substrings extracted by the
|
|||
convenience functions are output with C, G, or L after the string number
|
||||
instead of a colon. This is in addition to the normal full list. The string
|
||||
length (that is, the return from the extraction function) is given in
|
||||
parentheses after each substring.
|
||||
parentheses after each substring, followed by the name when the extraction was
|
||||
by name.
|
||||
.
|
||||
.
|
||||
.SS "Testing the substitution function"
|
||||
|
@ -1044,11 +1047,10 @@ entire substring that was inspected during the partial match; it may include
|
|||
characters before the actual match start if a lookbehind assertion, \eK, \eb,
|
||||
or \eB was involved.)
|
||||
.P
|
||||
For any other return, \fBpcre2test\fP outputs the PCRE2
|
||||
negative error number and a short descriptive phrase. If the error is a failed
|
||||
UTF string check, the offset of the start of the failing character and the
|
||||
reason code are also output. Here is an example of an interactive
|
||||
\fBpcre2test\fP run.
|
||||
For any other return, \fBpcre2test\fP outputs the PCRE2 negative error number
|
||||
and a short descriptive phrase. If the error is a failed UTF string check, the
|
||||
code unit offset of the start of the failing character is also output. Here is
|
||||
an example of an interactive \fBpcre2test\fP run.
|
||||
.sp
|
||||
$ pcre2test
|
||||
PCRE2 version 9.00 2014-05-10
|
||||
|
@ -1061,10 +1063,10 @@ reason code are also output. Here is an example of an interactive
|
|||
No match
|
||||
.sp
|
||||
Unset capturing substrings that are not followed by one that is set are not
|
||||
returned by \fBpcre2_match()\fP, and are not shown by \fBpcre2test\fP. In the
|
||||
following example, there are two capturing substrings, but when the first data
|
||||
line is matched, the second, unset substring is not shown. An "internal" unset
|
||||
substring is shown as "<unset>", as for the second data line.
|
||||
shown by \fBpcre2test\fP unless the \fBallcaptures\fP modifier is specified. In
|
||||
the following example, there are two capturing substrings, but when the first
|
||||
data line is matched, the second, unset substring is not shown. An "internal"
|
||||
unset substring is shown as "<unset>", as for the second data line.
|
||||
.sp
|
||||
re> /(a)|(b)/
|
||||
data> a
|
||||
|
@ -1100,8 +1102,8 @@ are output in sequence, like this:
|
|||
1: pp
|
||||
.sp
|
||||
"No match" is output only if the first match attempt fails. Here is an example
|
||||
of a failure message (the offset 4 that is specified by \e>4 is past the end of
|
||||
the subject string):
|
||||
of a failure message (the offset 4 that is specified by the \fBoffset\fP
|
||||
modifier is past the end of the subject string):
|
||||
.sp
|
||||
re> /xyz/
|
||||
data> xyz\e=offset=4
|
||||
|
@ -1127,12 +1129,13 @@ the subject where there is at least one match. For example:
|
|||
1: tang
|
||||
2: tan
|
||||
.sp
|
||||
(Using the normal matching function on this data finds only "tang".) The
|
||||
Using the normal matching function on this data finds only "tang". The
|
||||
longest matching string is always given first (and numbered zero). After a
|
||||
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
|
||||
partially matching substring. (Note that this is the entire substring that was
|
||||
partially matching substring. Note that this is the entire substring that was
|
||||
inspected during the partial match; it may include characters before the actual
|
||||
match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
|
||||
match start if a lookbehind assertion, \eb, or \eB was involved. (\eK is not
|
||||
supported for DFA matching.)
|
||||
.P
|
||||
If global matching is requested, the search for further matches resumes
|
||||
at the end of the longest match. For example:
|
||||
|
@ -1174,9 +1177,9 @@ documentation.
|
|||
.SH CALLOUTS
|
||||
.rs
|
||||
.sp
|
||||
If the pattern contains any callout requests, \fBpcre2test\fP's callout function
|
||||
is called during matching. This works with both matching functions. By default,
|
||||
the called function displays the callout number, the start and current
|
||||
If the pattern contains any callout requests, \fBpcre2test\fP's callout
|
||||
function is called during matching. This works with both matching functions. By
|
||||
default, the called function displays the callout number, the start and current
|
||||
positions in the text at the callout time, and the next pattern item to be
|
||||
tested. For example:
|
||||
.sp
|
||||
|
@ -1271,6 +1274,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 14 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2UNICODE 3 "03 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE - Perl-compatible regular expressions (revised API)
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
|
@ -64,7 +64,7 @@ characters (see the description of \eC in the
|
|||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
documentation). The use of \eC is not supported in the alternative matching
|
||||
function \fBpcre2_dfa_exec()\fP, nor is it supported in UTF mode by the JIT
|
||||
function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
|
||||
optimization. If JIT optimization is requested for a UTF pattern that contains
|
||||
\eC, it will not succeed, and so the matching will be carried out by the normal
|
||||
interpretive function.
|
||||
|
@ -108,7 +108,10 @@ case-equivalent, and these are treated as such.
|
|||
.sp
|
||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||
are (by default) checked for validity on entry to the relevant functions.
|
||||
If an invalid UTF string is passed, an error return is given.
|
||||
If an invalid UTF string is passed, a negative error code is returned. The
|
||||
code unit offset to the offending character can be extracted from the match
|
||||
data block by calling \fBpcre2_get_startchar()\fP, which is used for this
|
||||
purpose after a UTF error.
|
||||
.P
|
||||
UTF-16 and UTF-32 strings can indicate their endianness by special code known
|
||||
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
||||
|
@ -130,14 +133,14 @@ UTF-32.)
|
|||
In some situations, you may already know that your strings are valid, and
|
||||
therefore want to skip these checks in order to improve performance, for
|
||||
example in the case of a long subject string that is being scanned repeatedly.
|
||||
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2
|
||||
assumes that the pattern or subject it is given (respectively) contains only
|
||||
valid UTF code unit sequences.
|
||||
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||
only valid UTF code unit sequences.
|
||||
.P
|
||||
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
|
||||
the pattern; it does not also apply to subject strings. If you want to disable
|
||||
the check for a subject string you must pass this option to \fBpcre2_exec()\fP
|
||||
or \fBpcre2_dfa_exec()\fP.
|
||||
the check for a subject string you must pass this option to \fBpcre2_match()\fP
|
||||
or \fBpcre2_dfa_match()\fP.
|
||||
.P
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||
is undefined and your program may crash or loop indefinitely.
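The validity check and error offset described in this hunk can be illustrated with a small, self-contained sketch. It is not PCRE2's internal checker: the function name `check_utf8`, the `UTF8_*` error codes, and the `erroroffset` parameter are all hypothetical names chosen for this example. It only shows the general idea of returning a negative code and recording the code unit offset of the start of the failing character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch; they are NOT PCRE2's values. */
enum {
  UTF8_OK = 0,
  UTF8_TRUNCATED = -1,  /* sequence runs past the end of the string */
  UTF8_BAD_CONT  = -2,  /* continuation byte is not of the form 10xxxxxx */
  UTF8_BAD_LEAD  = -3,  /* isolated 0x80-0xbf byte, or lead byte 0xf8-0xff */
  UTF8_OVERLONG  = -4,  /* value would fit in a shorter encoding */
  UTF8_SURROGATE = -5,  /* code point in 0xd800-0xdfff */
  UTF8_TOO_BIG   = -6   /* code point above 0x10ffff */
};

/* Returns UTF8_OK if s[0..length) is valid UTF-8. Otherwise returns a
   negative code and stores the offset of the first byte of the failing
   character in *erroroffset. */
int check_utf8(const unsigned char *s, size_t length, size_t *erroroffset)
{
  /* Smallest code point that legitimately needs n bytes, indexed by n. */
  static const unsigned long min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t i = 0;

  assert(s != NULL && erroroffset != NULL);

  while (i < length) {
    size_t start = i, n = 0, k;
    unsigned long c = s[i];
    int rc = UTF8_OK;

    if (c < 0x80) { i++; continue; }           /* ASCII: one code unit */

    if (c < 0xc0 || c >= 0xf8) {
      rc = UTF8_BAD_LEAD;
    } else {
      n = (c >= 0xf0) ? 4 : (c >= 0xe0) ? 3 : 2;
      if (length - start < n) {
        rc = UTF8_TRUNCATED;
      } else {
        c &= 0xffu >> (n + 1);                 /* payload bits of lead byte */
        for (k = 1; k < n && rc == UTF8_OK; k++) {
          if ((s[start + k] & 0xc0) != 0x80) rc = UTF8_BAD_CONT;
          else c = (c << 6) | (s[start + k] & 0x3f);
        }
        if (rc == UTF8_OK) {
          if (c < min_value[n]) rc = UTF8_OVERLONG;
          else if (c >= 0xd800 && c <= 0xdfff) rc = UTF8_SURROGATE;
          else if (c > 0x10ffff) rc = UTF8_TOO_BIG;
        }
      }
    }

    if (rc != UTF8_OK) { *erroroffset = start; return rc; }
    i = start + n;
  }
  return UTF8_OK;
}
```

A caller would pass the subject bytes and a `size_t` to receive the offset. For the subject `XX\xef\x80`, for instance, the sketch reports a truncated sequence with the offset set to 2, which is the same position the updated test output in this commit shows for that data line. A caller that already knows its strings are valid would skip a check like this, which is the trade-off PCRE2_NO_UTF_CHECK offers.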
|
||||
|
@ -249,6 +252,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -3224,12 +3224,8 @@ multiunit character. */
|
|||
#ifdef SUPPORT_UNICODE
|
||||
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
||||
{
|
||||
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
|
||||
if (match_data->rc != 0)
|
||||
{
|
||||
match_data->leftchar = 0;
|
||||
return match_data->rc;
|
||||
}
|
||||
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
|
||||
if (match_data->rc != 0) return match_data->rc;
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
if (start_offset > 0 && start_offset < length &&
|
||||
NOT_FIRSTCHAR(subject[start_offset]))
|
||||
|
|
|
@ -6459,12 +6459,8 @@ multiunit character. */
|
|||
#ifdef SUPPORT_UNICODE
|
||||
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
||||
{
|
||||
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar));
|
||||
if (match_data->rc != 0)
|
||||
{
|
||||
match_data->leftchar = 0;
|
||||
return match_data->rc;
|
||||
}
|
||||
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
|
||||
if (match_data->rc != 0) return match_data->rc;
|
||||
#if PCRE2_CODE_UNIT_WIDTH != 32
|
||||
if (start_offset > 0 && start_offset < length &&
|
||||
NOT_FIRSTCHAR(subject[start_offset]))
|
||||
|
|
|
@ -5570,6 +5570,13 @@ else for (gmatched = 0;; gmatched++)
|
|||
fprintf(outfile, "Failed: error %d: ", capcount);
|
||||
PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer);
|
||||
PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile);
|
||||
if (capcount <= PCRE2_ERROR_UTF8_ERR1 &&
|
||||
capcount >= PCRE2_ERROR_UTF32_ERR2)
|
||||
{
|
||||
PCRE2_SIZE startchar;
|
||||
PCRE2_GET_STARTCHAR(startchar, match_data);
|
||||
fprintf(outfile, " at offset %lu", (unsigned long)startchar);
|
||||
}
|
||||
fprintf(outfile, "\n");
|
||||
break;
|
||||
}
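The range test added in this hunk relies on the UTF error codes forming one contiguous block of negative numbers. A minimal standalone sketch of that logic follows. The constants `SKETCH_UTF8_ERR1` and `SKETCH_UTF32_ERR2` are assumptions inferred from the test output elsewhere in this diff (error -3 is the first UTF-8 error shown and error -28 the last UTF-32 error shown), and `is_utf_error` and `format_failure` are hypothetical helpers, not part of pcre2test. The offset is cast to `unsigned long` before printing so that the format specifier matches an unsigned size type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed bounds of the UTF error block, inferred from the test output in
   this diff; the real names are PCRE2_ERROR_UTF8_ERR1 and
   PCRE2_ERROR_UTF32_ERR2. */
enum { SKETCH_UTF8_ERR1 = -3, SKETCH_UTF32_ERR2 = -28 };

/* Nonzero if rc lies inside the contiguous block of UTF error codes. */
int is_utf_error(int rc)
{
  return rc <= SKETCH_UTF8_ERR1 && rc >= SKETCH_UTF32_ERR2;
}

/* Builds a "Failed: error N: message" line in buf, appending the offset of
   the failing character only when rc is a UTF error. Returns the value
   returned by snprintf. */
int format_failure(char *buf, size_t size, int rc, const char *message,
                   size_t startchar)
{
  assert(buf != NULL && message != NULL);
  if (is_utf_error(rc))
    return snprintf(buf, size, "Failed: error %d: %s at offset %lu",
                    rc, message, (unsigned long)startchar);
  return snprintf(buf, size, "Failed: error %d: %s", rc, message);
}
```

Calling `format_failure` with code -3, the text "UTF-8 error: 1 byte missing at end", and an offset of 2 would produce a line of the same shape as the updated expected output in the test files. For a code outside the UTF block, the offset is omitted.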
|
||||
|
|
|
@ -48,12 +48,12 @@
|
|||
/テテテxxx/utf
|
||||
|
||||
/badutf/utf
|
||||
\xdf
|
||||
\xef
|
||||
\xef\x80
|
||||
\xf7
|
||||
\xf7\x80
|
||||
\xf7\x80\x80
|
||||
X\xdf
|
||||
XX\xef
|
||||
XXX\xef\x80
|
||||
X\xf7
|
||||
XX\xf7\x80
|
||||
XXX\xf7\x80\x80
|
||||
\xfb
|
||||
\xfb\x80
|
||||
\xfb\x80\x80
|
||||
|
@ -89,14 +89,14 @@
|
|||
\xff
|
||||
|
||||
/badutf/utf
|
||||
\xfb\x80\x80\x80\x80
|
||||
\xfd\x80\x80\x80\x80\x80
|
||||
\xf7\xbf\xbf\xbf
|
||||
XX\xfb\x80\x80\x80\x80
|
||||
XX\xfd\x80\x80\x80\x80\x80
|
||||
XX\xf7\xbf\xbf\xbf
|
||||
|
||||
/shortutf/utf
|
||||
\xdf\=ph
|
||||
\xef\=ph
|
||||
\xef\x80\=ph
|
||||
XX\xdf\=ph
|
||||
XX\xef\=ph
|
||||
XX\xef\x80\=ph
|
||||
\xf7\=ph
|
||||
\xf7\x80\=ph
|
||||
\xf7\x80\x80\=ph
|
||||
|
@ -111,9 +111,9 @@
|
|||
\xfd\x80\x80\x80\x80\=ph
|
||||
|
||||
/anything/utf
|
||||
\xc0\x80
|
||||
\xc1\x8f
|
||||
\xe0\x9f\x80
|
||||
X\xc0\x80
|
||||
XX\xc1\x8f
|
||||
XXX\xe0\x9f\x80
|
||||
\xf0\x8f\x80\x80
|
||||
\xf8\x87\x80\x80\x80
|
||||
\xfc\x83\x80\x80\x80\x80
|
||||
|
|
|
@ -157,18 +157,18 @@
|
|||
/^[\QĀ\E-\QŐ\E/B,utf
|
||||
|
||||
/X/utf
|
||||
\x{d800}
|
||||
\x{d800}\=no_utf_check
|
||||
\x{da00}
|
||||
\x{da00}\=no_utf_check
|
||||
\x{dc00}
|
||||
\x{dc00}\=no_utf_check
|
||||
\x{de00}
|
||||
\x{de00}\=no_utf_check
|
||||
\x{dfff}
|
||||
\x{dfff}\=no_utf_check
|
||||
\x{110000}
|
||||
\x{d800}\x{1234}
|
||||
XX\x{d800}
|
||||
XX\x{d800}\=no_utf_check
|
||||
XX\x{da00}
|
||||
XX\x{da00}\=no_utf_check
|
||||
XX\x{dc00}
|
||||
XX\x{dc00}\=no_utf_check
|
||||
XX\x{de00}
|
||||
XX\x{de00}\=no_utf_check
|
||||
XX\x{dfff}
|
||||
XX\x{dfff}\=no_utf_check
|
||||
XX\x{110000}
|
||||
XX\x{d800}\x{1234}
|
||||
|
||||
/(*UTF16)\x{11234}/
|
||||
abcd\x{11234}pqr
|
||||
|
|
|
@ -73,142 +73,142 @@ Failed: error -3 at offset 0: UTF-8 error: 1 byte missing at end
|
|||
Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
|
||||
|
||||
/badutf/utf
|
||||
\xdf
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
\xef
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
\xef\x80
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
\xf7
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
||||
\xf7\x80
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
\xf7\x80\x80
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
X\xdf
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 1
|
||||
XX\xef
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
|
||||
XXX\xef\x80
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
|
||||
X\xf7
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 1
|
||||
XX\xf7\x80
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
|
||||
XXX\xf7\x80\x80
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
|
||||
\xfb
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||
\xfb\x80
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||
\xfb\x80\x80
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||
\xfb\x80\x80\x80
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||
\xfd
|
||||
Failed: error -7: UTF-8 error: 5 bytes missing at end
|
||||
Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
|
||||
\xfd\x80
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||
\xfd\x80\x80
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||
\xfd\x80\x80\x80
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||
\xfd\x80\x80\x80\x80
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||
\xdf\x7f
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||
\xef\x7f\x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||
\xef\x80\x7f
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||
\xf7\x7f\x80\x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||
\xf7\x80\x7f\x80
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||
\xf7\x80\x80\x7f
|
||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
|
||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
|
||||
\xfb\x7f\x80\x80\x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||
\xfb\x80\x7f\x80\x80
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||
\xfb\x80\x80\x7f\x80
|
||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
|
||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
|
||||
\xfb\x80\x80\x80\x7f
|
||||
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
|
||||
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
|
||||
\xfd\x7f\x80\x80\x80\x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80
|
||||
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
|
||||
\xfd\x80\x7f\x80\x80\x80
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80
|
||||
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
|
||||
\xfd\x80\x80\x7f\x80\x80
|
||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80
|
||||
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
|
||||
\xfd\x80\x80\x80\x7f\x80
|
||||
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80
|
||||
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
|
||||
\xfd\x80\x80\x80\x80\x7f
|
||||
Failed: error -12: UTF-8 error: byte 6 top bits not 0x80
|
||||
Failed: error -12: UTF-8 error: byte 6 top bits not 0x80 at offset 0
|
||||
\xed\xa0\x80
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||
\xc0\x8f
|
||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence
|
||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 0
|
||||
\xe0\x80\x8f
|
||||
Failed: error -18: UTF-8 error: overlong 3-byte sequence
|
||||
Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 0
|
||||
\xf0\x80\x80\x8f
|
||||
Failed: error -19: UTF-8 error: overlong 4-byte sequence
|
||||
Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
|
||||
\xf8\x80\x80\x80\x8f
|
||||
Failed: error -20: UTF-8 error: overlong 5-byte sequence
|
||||
Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
|
||||
\xfc\x80\x80\x80\x80\x8f
|
||||
Failed: error -21: UTF-8 error: overlong 6-byte sequence
|
||||
Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
|
||||
\x80
|
||||
Failed: error -22: UTF-8 error: isolated 0x80 byte
|
||||
Failed: error -22: UTF-8 error: isolated 0x80 byte at offset 0
|
||||
\xfe
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||
\xff
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||
|
||||
/badutf/utf
|
||||
\xfb\x80\x80\x80\x80
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
||||
\xfd\x80\x80\x80\x80\x80
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
||||
\xf7\xbf\xbf\xbf
|
||||
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
|
||||
XX\xfb\x80\x80\x80\x80
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 2
|
||||
XX\xfd\x80\x80\x80\x80\x80
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 2
|
||||
XX\xf7\xbf\xbf\xbf
|
||||
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 2
|
||||
|
||||
/shortutf/utf
|
||||
\xdf\=ph
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
\xef\=ph
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
\xef\x80\=ph
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
XX\xdf\=ph
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
|
||||
XX\xef\=ph
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
|
||||
XX\xef\x80\=ph
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
|
||||
\xf7\=ph
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||
\xf7\x80\=ph
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||
\xf7\x80\x80\=ph
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||
\xfb\=ph
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||
\xfb\x80\=ph
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||
\xfb\x80\x80\=ph
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||
\xfb\x80\x80\x80\=ph
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||
\xfd\=ph
|
||||
Failed: error -7: UTF-8 error: 5 bytes missing at end
|
||||
Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
|
||||
\xfd\x80\=ph
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end
|
||||
Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
|
||||
\xfd\x80\x80\=ph
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end
|
||||
Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
|
||||
\xfd\x80\x80\x80\=ph
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end
|
||||
Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
|
||||
\xfd\x80\x80\x80\x80\=ph
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end
|
||||
Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
|
||||
|
||||
/anything/utf
|
||||
\xc0\x80
|
||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence
|
||||
\xc1\x8f
|
||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence
|
||||
\xe0\x9f\x80
|
||||
Failed: error -18: UTF-8 error: overlong 3-byte sequence
|
||||
X\xc0\x80
|
||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 1
|
||||
XX\xc1\x8f
|
||||
Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 2
|
||||
XXX\xe0\x9f\x80
|
||||
Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 3
|
||||
\xf0\x8f\x80\x80
|
||||
Failed: error -19: UTF-8 error: overlong 4-byte sequence
|
||||
Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
|
||||
\xf8\x87\x80\x80\x80
|
||||
Failed: error -20: UTF-8 error: overlong 5-byte sequence
|
||||
Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
|
||||
\xfc\x83\x80\x80\x80\x80
|
||||
Failed: error -21: UTF-8 error: overlong 6-byte sequence
|
||||
Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
|
||||
\xfe\x80\x80\x80\x80\x80
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||
\xff\x80\x80\x80\x80\x80
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff)
|
||||
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
|
||||
\xc3\x8f
|
||||
No match
|
||||
\xe0\xaf\x80
|
||||
|
@ -220,13 +220,13 @@ No match
|
|||
\xf1\x8f\x80\x80
|
||||
No match
|
||||
\xf8\x88\x80\x80\x80
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
|
||||
\xf9\x87\x80\x80\x80
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
|
||||
\xfc\x84\x80\x80\x80\x80
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||
\xfd\x83\x80\x80\x80\x80
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||
\xf8\x88\x80\x80\x80\=no_utf_check
|
||||
No match
|
||||
\xf9\x87\x80\x80\x80\=no_utf_check
|
||||
|
@ -751,27 +751,27 @@ Failed: error 106 at offset 15: missing terminating ] for character class
|
|||
|
||||
/X/utf
|
||||
\x{d800}
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||
\x{d800}\=no_utf_check
|
||||
No match
|
||||
\x{da00}
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||
\x{da00}\=no_utf_check
|
||||
No match
|
||||
\x{dfff}
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined
|
||||
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
|
||||
\x{dfff}\=no_utf_check
|
||||
No match
|
||||
\x{110000}
|
||||
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined
|
||||
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 0
|
||||
\x{110000}\=no_utf_check
|
||||
No match
|
||||
\x{2000000}
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629)
|
||||
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
|
||||
\x{2000000}\=no_utf_check
|
||||
No match
|
||||
\x{7fffffff}
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||
\x{7fffffff}\=no_utf_check
|
||||
No match
|
||||
|
||||
|
@ -1106,7 +1106,7 @@ Subject length lower bound = 1
|
|||
\x{ff000041}
|
||||
** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8
|
||||
\x{7f000041}
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629)
|
||||
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
|
||||
|
||||
/(*UTF8)abc/never_utf
|
||||
Failed: error 174 at offset 7: using UTF is disabled by the application
|
||||
|
|
|
@ -607,30 +607,30 @@ Subject length lower bound = 2
|
|||
Failed: error 106 at offset 13: missing terminating ] for character class
|
||||
|
||||
/X/utf
|
||||
\x{d800}
|
||||
Failed: error -24: UTF-16 error: missing low surrogate at end
|
||||
\x{d800}\=no_utf_check
|
||||
No match
|
||||
\x{da00}
|
||||
Failed: error -24: UTF-16 error: missing low surrogate at end
|
||||
\x{da00}\=no_utf_check
|
||||
No match
|
||||
\x{dc00}
|
||||
Failed: error -26: UTF-16 error: isolated low surrogate
|
||||
\x{dc00}\=no_utf_check
|
||||
No match
|
||||
\x{de00}
|
||||
Failed: error -26: UTF-16 error: isolated low surrogate
|
||||
\x{de00}\=no_utf_check
|
||||
No match
|
||||
\x{dfff}
|
||||
Failed: error -26: UTF-16 error: isolated low surrogate
|
||||
\x{dfff}\=no_utf_check
|
||||
No match
|
||||
\x{110000}
|
||||
XX\x{d800}
|
||||
Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
|
||||
XX\x{d800}\=no_utf_check
|
||||
0: X
|
||||
XX\x{da00}
|
||||
Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
|
||||
XX\x{da00}\=no_utf_check
|
||||
0: X
|
||||
XX\x{dc00}
|
||||
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
|
||||
XX\x{dc00}\=no_utf_check
|
||||
0: X
|
||||
XX\x{de00}
|
||||
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
|
||||
XX\x{de00}\=no_utf_check
|
||||
0: X
|
||||
XX\x{dfff}
|
||||
Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
|
||||
XX\x{dfff}\=no_utf_check
|
||||
0: X
|
||||
XX\x{110000}
|
||||
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
|
||||
\x{d800}\x{1234}
|
||||
Failed: error -25: UTF-16 error: invalid low surrogate
|
||||
XX\x{d800}\x{1234}
|
||||
Failed: error -25: UTF-16 error: invalid low surrogate at offset 3
|
||||
|
||||
/(*UTF16)\x{11234}/
|
||||
abcd\x{11234}pqr
|
||||
|
|
|
@ -600,30 +600,30 @@ Subject length lower bound = 2
|
|||
Failed: error 106 at offset 13: missing terminating ] for character class
|
||||
|
||||
/X/utf
|
||||
\x{d800}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
||||
\x{d800}\=no_utf_check
|
||||
No match
|
||||
\x{da00}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
||||
\x{da00}\=no_utf_check
|
||||
No match
|
||||
\x{dc00}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
||||
\x{dc00}\=no_utf_check
|
||||
No match
|
||||
\x{de00}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
||||
\x{de00}\=no_utf_check
|
||||
No match
|
||||
\x{dfff}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
||||
\x{dfff}\=no_utf_check
|
||||
No match
|
||||
\x{110000}
|
||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
|
||||
\x{d800}\x{1234}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined
|
||||
XX\x{d800}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||
XX\x{d800}\=no_utf_check
|
||||
0: X
|
||||
XX\x{da00}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||
XX\x{da00}\=no_utf_check
|
||||
0: X
|
||||
XX\x{dc00}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||
XX\x{dc00}\=no_utf_check
|
||||
0: X
|
||||
XX\x{de00}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||
XX\x{de00}\=no_utf_check
|
||||
0: X
|
||||
XX\x{dfff}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||
XX\x{dfff}\=no_utf_check
|
||||
0: X
|
||||
XX\x{110000}
|
||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 2
|
||||
XX\x{d800}\x{1234}
|
||||
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
|
||||
|
||||
/(*UTF16)\x{11234}/
|
||||
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
|
||||
|
@ -1113,7 +1113,7 @@ Failed: error 134 at offset 10: character code point value in \x{} or \o{} is to
|
|||
|
||||
/\C/utf
|
||||
\x{110000}
|
||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined
|
||||
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
|
||||
|
||||
/\x{100}*A/IB,utf
|
||||
------------------------------------------------------------------
|
||||
|
|