More documentation and test updates.

Philip.Hazel 2014-11-23 18:38:38 +00:00
parent eb4fffbbf4
commit 91f2e97474
25 changed files with 625 additions and 603 deletions


@ -17,9 +17,9 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a> <li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a> <li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a> <li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
<li><a name="TOC5" href="#SEC5">Unicode and UTF SUPPORT</a> <li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a> <li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC7" href="#SEC7">CODE VALUE OF NEWLINE</a> <li><a name="TOC7" href="#SEC7">NEWLINE RECOGNITION</a>
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a> <li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a> <li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a> <li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
@ -91,12 +91,12 @@ respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the <b>configure</b> command: the following to the <b>configure</b> command:
<pre> <pre>
--enable-pcre16 --enable-pcre2-16
--enable-pcre32 --enable-pcre2-32
</pre> </pre>
If you do not want the 8-bit library, add If you do not want the 8-bit library, add
<pre> <pre>
--disable-pcre8 --disable-pcre2-8
</pre> </pre>
as well. At least one of the three libraries must be built. Note that the POSIX as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
@ -106,14 +106,15 @@ libraries.
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br> <br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
<P> <P>
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
and static libraries by default. You can suppress one of these by adding one of and static libraries by default. You can suppress an unwanted library by adding
one of
<pre> <pre>
--disable-shared --disable-shared
--disable-static --disable-static
</pre> </pre>
to the <b>configure</b> command, as required. to the <b>configure</b> command.
</P> </P>
<br><a name="SEC5" href="#TOC1">Unicode and UTF SUPPORT</a><br> <br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
<P> <P>
By default, PCRE2 is built with support for Unicode and UTF character strings. By default, PCRE2 is built with support for Unicode and UTF character strings.
To build it without Unicode support, add To build it without Unicode support, add
@ -126,20 +127,15 @@ in the same configuration.
</P> </P>
<P> <P>
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call or UTF-32. To do that, applications that use the library have to set the
<b>pcre2_compile()</b> to compile a pattern. PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
</P> </P>
<P> <P>
It is not possible to support both EBCDIC and UTF-8 codes in the same version UTF support allows the libraries to process character code points up to
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually 0x10ffff in the strings that they handle. It also provides support for
exclusive. accessing the Unicode properties of such characters, using pattern escapes such
</P> as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
<P> <i>Nd</i> are supported. Details are given in the
UTF support allows the libraries to process character codepoints up to 0x10ffff
in the strings that they handle. It also provides support for accessing the
properties of such characters, using pattern escapes such as \P, \p, and \X.
Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
supported. Details are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. documentation.
</P> </P>
@ -150,7 +146,7 @@ Just-in-time compiler support is included in the build by specifying
--enable-jit --enable-jit
</pre> </pre>
This support is available only for certain hardware architectures. If this This support is available only for certain hardware architectures. If this
option is set for an unsupported architecture, a compile time error occurs. option is set for an unsupported architecture, a building error occurs.
See the See the
<a href="pcre2jit.html"><b>pcre2jit</b></a> <a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for a discussion of JIT usage. When JIT support is enabled, documentation for a discussion of JIT usage. When JIT support is enabled,
@ -160,7 +156,7 @@ pcre2grep automatically makes use of it, unless you add
</pre> </pre>
to the "configure" command. to the "configure" command.
</P> </P>
<br><a name="SEC7" href="#TOC1">CODE VALUE OF NEWLINE</a><br> <br><a name="SEC7" href="#TOC1">NEWLINE RECOGNITION</a><br>
<P> <P>
By default, PCRE2 interprets the linefeed (LF) character as indicating the end By default, PCRE2 interprets the linefeed (LF) character as indicating the end
of a line. This is the normal newline character on Unix-like systems. You can of a line. This is the normal newline character on Unix-like systems. You can
@ -168,12 +164,13 @@ compile PCRE2 to use carriage return (CR) instead, by adding
<pre> <pre>
--enable-newline-is-cr --enable-newline-is-cr
</pre> </pre>
to the <b>configure</b> command. There is also a --enable-newline-is-lf option, to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character. which explicitly specifies linefeed as the newline character.
<br> </P>
<br> <P>
Alternatively, you can specify that line endings are to be indicated by the two Alternatively, you can specify that line endings are to be indicated by the
character sequence CRLF. If you want this, add two-character sequence CRLF (CR immediately followed by LF). If you want this,
add
<pre> <pre>
--enable-newline-is-crlf --enable-newline-is-crlf
</pre> </pre>
@ -186,22 +183,26 @@ indicating a line ending. Finally, a fifth option, specified by
<pre> <pre>
--enable-newline-is-any --enable-newline-is-any
</pre> </pre>
causes PCRE2 to recognize any Unicode newline sequence. causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
</P> </P>
<P> <P>
Whatever line ending convention is selected when PCRE2 is built can be Whatever default line ending convention is selected when PCRE2 is built can be
overridden when the library functions are called. At build time it is overridden by applications that use the library. At build time it is
conventional to use the standard for your operating system. conventional to use the standard for your operating system.
</P> </P>
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br> <br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
<P> <P>
By default, the sequence \R in a pattern matches any Unicode newline sequence, By default, the sequence \R in a pattern matches any Unicode newline sequence,
whatever has been selected as the line ending sequence. If you specify independently of what has been selected as the line ending sequence. If you
specify
<pre> <pre>
--enable-bsr-anycrlf --enable-bsr-anycrlf
</pre> </pre>
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden when the library functions are selected when PCRE2 is built can be overridden by applications that use the
called. library.
</P> </P>
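The difference between the "any Unicode newline" set and the CR/LF/CRLF subset described above can be sketched in plain C. This is an illustration only, not PCRE2 code: the function names `newline_len_any` and `newline_len_anycrlf` are hypothetical, and the subject is assumed to be an array of already-decoded code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Length, in code points, of a newline sequence starting at s[pos] under
   the "any Unicode newline" convention; 0 if no newline starts there. */
size_t newline_len_any(const uint32_t *s, size_t len, size_t pos)
{
    if (pos >= len) return 0;
    switch (s[pos]) {
    case 0x000D:                        /* CR alone, or CR followed by LF */
        return (pos + 1 < len && s[pos + 1] == 0x000A) ? 2 : 1;
    case 0x000A:                        /* LF  */
    case 0x000B:                        /* VT  */
    case 0x000C:                        /* FF  */
    case 0x0085:                        /* NEL */
    case 0x2028:                        /* LS  */
    case 0x2029:                        /* PS  */
        return 1;
    default:
        return 0;
    }
}

/* The same test restricted to CR, LF, and CRLF. */
size_t newline_len_anycrlf(const uint32_t *s, size_t len, size_t pos)
{
    if (pos >= len) return 0;
    if (s[pos] == 0x000D)
        return (pos + 1 < len && s[pos + 1] == 0x000A) ? 2 : 1;
    return (s[pos] == 0x000A) ? 1 : 0;
}
```

In this sketch CRLF is reported as a single two-unit sequence rather than as two separate newlines, which is the behaviour the text describes for the two-character sequence.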
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br> <br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
@ -210,10 +211,10 @@ Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of are used for these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic patterns. around 64K code units. This is sufficient to handle all but the most gigantic
Nevertheless, some people do want to process truly enormous patterns, so it is patterns. Nevertheless, some people do want to process truly enormous patterns,
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
setting such as adding a setting such as
<pre> <pre>
--with-link-size=3 --with-link-size=3
</pre> </pre>
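Why the offset width bounds the pattern size can be shown with a small sketch. It assumes the 8-bit library, where one code unit is one byte, and it stores offsets most significant byte first; that byte order, and the names `max_link_offset`, `put_link`, `get_link`, and `roundtrip`, are assumptions made for illustration and are not taken from the PCRE2 source. The real library may impose lower limits than the raw arithmetic suggests.

```c
#include <stdint.h>

/* Largest offset that fits in link_size bytes (2, 3, or 4). */
uint32_t max_link_offset(unsigned link_size)
{
    if (link_size >= 4) return UINT32_MAX;
    return ((uint32_t)1 << (8 * link_size)) - 1;
}

/* Store value in link_size bytes, most significant byte first.
   Bits that do not fit are silently lost. */
void put_link(unsigned char *p, unsigned link_size, uint32_t value)
{
    for (unsigned i = 0; i < link_size; i++)
        p[i] = (unsigned char)(value >> (8 * (link_size - 1 - i)));
}

/* Read back an offset stored by put_link(). */
uint32_t get_link(const unsigned char *p, unsigned link_size)
{
    uint32_t value = 0;
    for (unsigned i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Convenience helper: store and reload a value at the given width. */
uint32_t roundtrip(unsigned link_size, uint32_t value)
{
    unsigned char buf[4] = {0, 0, 0, 0};
    put_link(buf, link_size, value);
    return get_link(buf, link_size);
}
```

With a two-byte link an offset of 70000 cannot be represented and comes back truncated, whereas a three-byte link preserves it. That is the reason a larger link size is needed for very large patterns.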
@ -294,16 +295,20 @@ hand".)
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br> <br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
<P> <P>
PCRE2 assumes by default that it will run in an environment where the character PCRE2 assumes by default that it will run in an environment where the character
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for code is ASCII or Unicode, which is a superset of ASCII. This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an most computer operating systems. PCRE2 can, however, be compiled to run in an
EBCDIC environment by adding 8-bit EBCDIC environment by adding
<pre> <pre>
--enable-ebcdic --disable-unicode --enable-ebcdic --disable-unicode
</pre> </pre>
to the <b>configure</b> command. This setting implies to the <b>configure</b> command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in --enable-rebuild-chartables. You should only use it if you know that you are in
an EBCDIC environment (for example, an IBM mainframe operating system). The an EBCDIC environment (for example, an IBM mainframe operating system).
--enable-ebcdic option is incompatible with Unicode support. </P>
<P>
It is not possible to support both EBCDIC and UTF-8 codes in the same version
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
exclusive.
</P> </P>
<P> <P>
The EBCDIC character that corresponds to an ASCII LF is assumed to have the The EBCDIC character that corresponds to an ASCII LF is assumed to have the
@ -347,8 +352,8 @@ parameter value by adding, for example,
<pre> <pre>
--with-pcre2grep-bufsize=50K --with-pcre2grep-bufsize=50K
</pre> </pre>
to the <b>configure</b> command. The caller of <b>pcre2grep</b> can, however, to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override this
override this value by specifying a run-time option. value by using --buffer-size on the command line.
</P> </P>
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br> <br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P> <P>
@ -362,16 +367,16 @@ to the <b>configure</b> command, <b>pcre2test</b> is linked with the
from a terminal, it reads it using the <b>readline()</b> function. This provides from a terminal, it reads it using the <b>readline()</b> function. This provides
line-editing and history facilities. Note that <b>libreadline</b> is line-editing and history facilities. Note that <b>libreadline</b> is
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
way, there may be licensing issues. These can be avoided by linking with way, there may be licensing issues. These can be avoided by linking instead
<b>libedit</b> (which has a BSD licence) instead. with <b>libedit</b>, which has a BSD licence.
</P> </P>
<P> <P>
Setting this option causes the <b>-lreadline</b> option to be added to the Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
<b>pcre2test</b> build. In many operating environments with a system-installed added to the <b>pcre2test</b> build. In many operating environments with a
readline library this is sufficient. However, in some environments (e.g. if an system-installed readline library this is sufficient. However, in some
unmodified distribution version of readline is in use), some extra environments (e.g. if an unmodified distribution version of readline is in
configuration may be necessary. The INSTALL file for <b>libreadline</b> says use), some extra configuration may be necessary. The INSTALL file for
this: <b>libreadline</b> says this:
<pre> <pre>
"Readline uses the termcap functions, but does not link with "Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications the termcap or curses library itself, allowing applications
@ -386,13 +391,13 @@ immediately before the <b>configure</b> command.
</P> </P>
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br> <br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P> <P>
By adding the If you add
<pre> <pre>
--enable-valgrind --enable-valgrind
</pre> </pre>
option to to the <b>configure</b> command, PCRE2 will use valgrind annotations to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
to mark certain memory regions as unaddressable. This allows it to detect certain memory regions as unaddressable. This allows it to detect invalid
invalid memory accesses, and is mostly useful for debugging PCRE2 itself. memory accesses, and is mostly useful for debugging PCRE2 itself.
</P> </P>
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br> <br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P> <P>
@ -466,7 +471,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 November 2014 Last updated: 23 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -85,29 +85,27 @@ expect.
<P> <P>
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as what follows cannot be part of the repeat. For example, a+[bc] is compiled as
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is anchored if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
and then applied with automatic callouts to the string "aaaa" is: with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
"aaaa" is:
<pre> <pre>
---&#62;aaaa ---&#62;aaaa
+0 ^ ^ +0 ^ a+
+1 ^ a+ +2 ^ ^ [bc]
+3 ^ ^ [bc]
No match No match
</pre> </pre>
This indicates that when matching [bc] fails, there is no backtracking into a+ This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur. and therefore the callouts that would be taken for the backtracks do not occur.
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
to <b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). If <b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
this is done in <b>pcre2test</b> (using the /no_auto_possess qualifier), the case, the output changes to this:
output changes to this:
<pre> <pre>
---&#62;aaaa ---&#62;aaaa
+0 ^ ^ +0 ^ a+
+1 ^ a+ +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^^ [bc]
+3 ^^ [bc]
No match No match
</pre> </pre>
This time, when matching [bc] fails, the matcher backtracks into a+ and tries This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@ -137,10 +135,10 @@ callouts such as the example above are obeyed.
</P> </P>
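The effect of auto-possessification on the number of callouts in the a+[bc] example can be reproduced with a toy model. The sketch below is not PCRE2's matcher: `count_bc_attempts` is a hypothetical function that hard-codes the anchored pattern a+[bc] and counts how often the [bc] part is tried, with and without backtracking into a+.

```c
#include <stddef.h>

/* Count how many times [bc] is tried when the anchored pattern a+[bc]
   is applied to subject. If possessive is non-zero, a+ never gives back
   characters, as if the pattern were a++[bc]. */
size_t count_bc_attempts(const char *subject, int possessive)
{
    size_t n = 0;
    size_t attempts = 0;

    while (subject[n] == 'a') n++;      /* greedy a+ takes every leading 'a' */
    if (n == 0) return 0;               /* a+ needs at least one 'a' */

    /* Try [bc] after a+ has consumed 'taken' characters, giving back one
       character at a time unless the repeat is possessive. */
    for (size_t taken = n; taken >= 1; taken--) {
        char c = subject[taken];        /* may be the terminating '\0' */
        attempts++;
        if (c == 'b' || c == 'c') break;   /* overall match found */
        if (possessive) break;             /* no backtracking allowed */
    }
    return attempts;
}
```

For the subject "aaaa" the model tries [bc] once in the possessive case and four times otherwise, which corresponds to the one and four [bc] callout lines in the two outputs shown earlier.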
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br> <br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
<P> <P>
During matching, when PCRE2 reaches a callout point, the external function that During matching, when PCRE2 reaches a callout point, if an external function is
is set in the match context is called (if it is set). This applies to both set in the match context, it is called. This applies to both normal and DFA
normal and DFA matching. The only argument to the callout function is a pointer matching. The only argument to the callout function is a pointer to a
to a <b>pcre2_callout</b> block. This structure contains the following fields: <b>pcre2_callout</b> block. This structure contains the following fields:
<pre> <pre>
uint32_t <i>version</i>; uint32_t <i>version</i>;
uint32_t <i>callout_number</i>; uint32_t <i>callout_number</i>;
@ -169,7 +167,7 @@ automatically generated callouts).
<P> <P>
The <i>offset_vector</i> field is a pointer to the vector of capturing offsets The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match data (the "ovector") that was passed to the matching function in the match data
block. When <b>pcre2_match()</b> is used, the contents can be inspected, in block. When <b>pcre2_match()</b> is used, the contents can be inspected in
order to extract substrings that have been matched so far, in the same way as order to extract substrings that have been matched so far, in the same way as
for extracting substrings after a match has completed. For the DFA matching for extracting substrings after a match has completed. For the DFA matching
function, this field is not useful. function, this field is not useful.
@ -261,7 +259,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br> <br><a name="SEC7" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 19 October 2014 Last updated: 23 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -467,8 +467,8 @@ used. There is no short form for this option.
Processing some regular expression patterns can require a very large amount of Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available. memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching Other patterns may take a very long time to search for all possible matching
strings. The <b>pcre2_exec()</b> function that is called by <b>pcre2grep</b> to do strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
the matching has two parameters that can limit the resources that it uses. do the matching has two parameters that can limit the resources that it uses.
<br> <br>
<br> <br>
The <b>--match-limit</b> option provides a means of limiting resource usage The <b>--match-limit</b> option provides a means of limiting resource usage
@ -750,7 +750,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br> <br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 September 2014 Last updated: 23 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -31,11 +31,11 @@ please consult the man page, in case the conversion went wrong.
<P> <P>
Just-in-time compiling is a heavyweight optimization that can greatly speed up Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is match is performed, so it is of most benefit when the same pattern is going to
going to be matched many times. This does not necessarily mean many calls of a be matched many times. This does not necessarily mean many calls of a matching
matching function; if the pattern is not anchored, matching attempts may take function; if the pattern is not anchored, matching attempts may take place many
place many times at various positions in the subject, even for a single call. times at various positions in the subject, even for a single call. Therefore,
Therefore, if the subject string is very long, it may still pay to use JIT for if the subject string is very long, it may still pay to use JIT even for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries. 32-bit PCRE2 libraries.
</P> </P>
@ -103,7 +103,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If PCRE2_JIT_COMPLETE and just compile code for partial matching. If
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately <b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
returns zero. This is an alternative way of testing if JIT is available. returns zero. This is an alternative way of testing whether JIT is available.
</P> </P>
<P> <P>
At present, it is not possible to free JIT compiled code except when the entire At present, it is not possible to free JIT compiled code except when the entire
@ -299,7 +299,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just <i>do
not</i> call <b>pcre2_match()</b> with a match context pointing to an already not</i> call <b>pcre2_match()</b> with a match context pointing to an already
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
used by <b>pcre2_match()</b> in another thread). You can also replace the stack used by <b>pcre2_match()</b> in another thread). You can also replace the stack
in a context at any time when it is not in use. You can also free the previous in a context at any time when it is not in use. You should free the previous
stack before assigning a replacement. stack before assigning a replacement.
</P> </P>
<P> <P>
@ -418,7 +418,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br> <br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 12 November 2014 Last updated: 23 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -421,7 +421,7 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc) (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre> </pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_exec(), not increase them. limits set by the caller of pcre2_match(), not increase them.
</P> </P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br> <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P> <P>
@ -553,7 +553,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 14 November 2014 Last updated: 23 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -72,7 +72,7 @@ but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \C in the characters (see the description of \C in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation). The use of \C is not supported in the alternative matching documentation). The use of \C is not supported in the alternative matching
function <b>pcre2_dfa_exec()</b>, nor is it supported in UTF mode by the JIT function <b>pcre2_dfa_match()</b>, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains optimization. If JIT optimization is requested for a UTF pattern that contains
\C, it will not succeed, and so the matching will be carried out by the normal \C, it will not succeed, and so the matching will be carried out by the normal
interpretive function. interpretive function.
@ -141,15 +141,15 @@ UTF-32.)
In some situations, you may already know that your strings are valid, and In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly. example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2 If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
assumes that the pattern or subject it is given (respectively) contains only PCRE2 assumes that the pattern or subject it is given (respectively) contains
valid UTF code unit sequences. only valid UTF code unit sequences.
</P> </P>
<P> <P>
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this option to <b>pcre2_exec()</b> the check for a subject string you must pass this option to <b>pcre2_match()</b>
or <b>pcre2_dfa_exec()</b>. or <b>pcre2_dfa_match()</b>.
</P> </P>
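An application that wants to set PCRE2_NO_UTF_CHECK safely needs some other assurance that its strings are valid. As a rough idea of what a UTF-8 validity check involves, here is a minimal sketch. It is not PCRE2's own checker: `utf8_is_valid` is a hypothetical helper, and it applies only the basic well-formedness rules (correct continuation bytes, no overlong forms, no surrogates, nothing above 0x10ffff).

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the len bytes at s form well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;                   /* number of continuation bytes */
        uint32_t cp, min;               /* code point and smallest legal value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                  /* stray continuation or invalid lead byte */

        if (len - i <= extra) return 0; /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (t & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate code point */

        i += extra + 1;
    }
    return 1;
}
```

A check of this kind scans the whole string, which is why skipping it can matter for a long subject that is matched repeatedly.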
<P> <P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
@ -261,7 +261,7 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 03 November 2014 Last updated: 23 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -2667,12 +2667,12 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
or UTF-16/UTF-32 strings. To build these additional libraries, add one or UTF-16/UTF-32 strings. To build these additional libraries, add one
or both of the following to the configure command: or both of the following to the configure command:
--enable-pcre16 --enable-pcre2-16
--enable-pcre32 --enable-pcre2-32
If you do not want the 8-bit library, add If you do not want the 8-bit library, add
--disable-pcre8 --disable-pcre2-8
as well. At least one of the three libraries must be built. Note that as well. At least one of the three libraries must be built. Note that
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
@ -2683,16 +2683,16 @@ BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
BUILDING SHARED AND STATIC LIBRARIES BUILDING SHARED AND STATIC LIBRARIES
The Autotools PCRE2 building process uses libtool to build both shared The Autotools PCRE2 building process uses libtool to build both shared
and static libraries by default. You can suppress one of these by and static libraries by default. You can suppress an unwanted library
adding one of by adding one of
--disable-shared --disable-shared
--disable-static --disable-static
to the configure command, as required. to the configure command.
Unicode and UTF SUPPORT UNICODE AND UTF SUPPORT
By default, PCRE2 is built with support for Unicode and UTF character By default, PCRE2 is built with support for Unicode and UTF character
strings. To build it without Unicode support, add strings. To build it without Unicode support, add
@ -2704,18 +2704,16 @@ Unicode and UTF SUPPORT
another without, in the same configuration. another without, in the same configuration.
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
UTF-16 or UTF-32. To do that you have have to set the PCRE2_UTF option UTF-16 or UTF-32. To do that, applications that use the library have to
when you call pcre2_compile() to compile a pattern. set the PCRE2_UTF option when they call pcre2_compile() to compile a
pattern.
It is not possible to support both EBCDIC and UTF-8 codes in the same
version of the library. Consequently, --enable-unicode and --enable-
ebcdic are mutually exclusive.
UTF support allows the libraries to process character code points up to UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for 0x10ffff in the strings that they handle. It also provides support for
accessing the properties of such characters, using pattern escapes such accessing the Unicode properties of such characters, using pattern
as \P, \p, and \X. Only the general category properties such as Lu and escapes such as \P, \p, and \X. Only the general category properties
Nd are supported. Details are given in the pcre2pattern documentation. such as Lu and Nd are supported. Details are given in the pcre2pattern
documentation.
JUST-IN-TIME COMPILER SUPPORT JUST-IN-TIME COMPILER SUPPORT
@ -2725,17 +2723,17 @@ JUST-IN-TIME COMPILER SUPPORT
--enable-jit --enable-jit
This support is available only for certain hardware architectures. If This support is available only for certain hardware architectures. If
this option is set for an unsupported architecture, a compile time this option is set for an unsupported architecture, a building error
error occurs. See the pcre2jit documentation for a discussion of JIT occurs. See the pcre2jit documentation for a discussion of JIT usage.
usage. When JIT support is enabled, pcre2grep automatically makes use When JIT support is enabled, pcre2grep automatically makes use of it,
of it, unless you add unless you add
--disable-pcre2grep-jit --disable-pcre2grep-jit
to the "configure" command. to the "configure" command.
CODE VALUE OF NEWLINE NEWLINE RECOGNITION
By default, PCRE2 interprets the linefeed (LF) character as indicating By default, PCRE2 interprets the linefeed (LF) character as indicating
the end of a line. This is the normal newline character on Unix-like the end of a line. This is the normal newline character on Unix-like
@ -2744,11 +2742,12 @@ CODE VALUE OF NEWLINE
--enable-newline-is-cr --enable-newline-is-cr
to the configure command. There is also a --enable-newline-is-lf to the configure command. There is also an --enable-newline-is-lf
option, which explicitly specifies linefeed as the newline character. option, which explicitly specifies linefeed as the newline character.
Alternatively, you can specify that line endings are to be indicated by Alternatively, you can specify that line endings are to be indicated by
the two character sequence CRLF. If you want this, add the two-character sequence CRLF (CR immediately followed by LF). If you
want this, add
--enable-newline-is-crlf --enable-newline-is-crlf
@ -2761,24 +2760,28 @@ CODE VALUE OF NEWLINE
--enable-newline-is-any --enable-newline-is-any
causes PCRE2 to recognize any Unicode newline sequence. causes PCRE2 to recognize any Unicode newline sequence. The Unicode
newline sequences are the three just mentioned, plus the single charac-
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
U+2029).
Whatever line ending convention is selected when PCRE2 is built can be Whatever default line ending convention is selected when PCRE2 is built
overridden when the library functions are called. At build time it is can be overridden by applications that use the library. At build time
conventional to use the standard for your operating system. it is conventional to use the standard for your operating system.
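The list of Unicode newline sequences above can be sketched as a small classifier. This is only an illustration of the definition in the text, not PCRE2's own code: the function name unicode_newline_length and its interface are hypothetical, and the input is assumed to be an array of already-decoded code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: returns the length, in code
   points, of the Unicode newline sequence starting at s[0], or 0 if no
   newline sequence starts there. */
size_t unicode_newline_length(const uint32_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    if (n == 0) return 0;
    switch (s[0]) {
    case 0x000D: /* CR alone, or CRLF when LF follows immediately */
        return (n > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A: /* LF */
    case 0x000B: /* VT, vertical tab */
    case 0x000C: /* FF, form feed */
    case 0x0085: /* NEL, next line */
    case 0x2028: /* LS, line separator */
    case 0x2029: /* PS, paragraph separator */
        return 1;
    default:
        return 0;
    }
}
```

CRLF is treated as one two-unit sequence rather than two separate newlines, which is why the CR case looks one position ahead.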
WHAT \R MATCHES WHAT \R MATCHES
By default, the sequence \R in a pattern matches any Unicode newline By default, the sequence \R in a pattern matches any Unicode newline
sequence, whatever has been selected as the line ending sequence. If sequence, independently of what has been selected as the line ending
you specify sequence. If you specify
--enable-bsr-anycrlf --enable-bsr-anycrlf
the default is changed so that \R matches only CR, LF, or CRLF. What- the default is changed so that \R matches only CR, LF, or CRLF. What-
ever is selected when PCRE2 is built can be overridden when the library ever is selected when PCRE2 is built can be overridden by applications
functions are called. that use the library.
HANDLING VERY LARGE PATTERNS HANDLING VERY LARGE PATTERNS
@ -2787,10 +2790,11 @@ HANDLING VERY LARGE PATTERNS
part to another (for example, from an opening parenthesis to an alter- part to another (for example, from an opening parenthesis to an alter-
nation metacharacter). By default, in the 8-bit and 16-bit libraries, nation metacharacter). By default, in the 8-bit and 16-bit libraries,
two-byte values are used for these offsets, leading to a maximum size two-byte values are used for these offsets, leading to a maximum size
for a compiled pattern of around 64K. This is sufficient to handle all for a compiled pattern of around 64K code units. This is sufficient to
but the most gigantic patterns. Nevertheless, some people do want to handle all but the most gigantic patterns. Nevertheless, some people do
process truly enormous patterns, so it is possible to compile PCRE2 to want to process truly enormous patterns, so it is possible to compile
use three-byte or four-byte offsets by adding a setting such as PCRE2 to use three-byte or four-byte offsets by adding a setting such
as
--with-link-size=3 --with-link-size=3
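The relationship between link size and the "around 64K" figure is plain arithmetic on the width of the offset field. The sketch below shows only that arithmetic; max_link_offset is a hypothetical name, and the real limit on a compiled pattern may be somewhat lower than the raw maximum computed here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: the largest value that fits in
   an unsigned offset occupying link_size bytes (2, 3 or 4). */
uint64_t max_link_offset(unsigned link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (UINT64_C(1) << (8 * link_size)) - 1;
}
```

With this definition, a link size of 2 gives 65535, a link size of 3 gives 16777215, and a link size of 4 gives 4294967295.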
@ -2876,16 +2880,19 @@ CREATING CHARACTER TABLES AT BUILD TIME
USING EBCDIC CODE USING EBCDIC CODE
PCRE2 assumes by default that it will run in an environment where the PCRE2 assumes by default that it will run in an environment where the
character code is ASCII (or Unicode, which is a superset of ASCII). character code is ASCII or Unicode, which is a superset of ASCII. This
This is the case for most computer operating systems. PCRE2 can, how- is the case for most computer operating systems. PCRE2 can, however, be
ever, be compiled to run in an EBCDIC environment by adding compiled to run in an 8-bit EBCDIC environment by adding
--enable-ebcdic --disable-unicode --enable-ebcdic --disable-unicode
to the configure command. This setting implies --enable-rebuild-charta- to the configure command. This setting implies --enable-rebuild-charta-
bles. You should only use it if you know that you are in an EBCDIC bles. You should only use it if you know that you are in an EBCDIC
environment (for example, an IBM mainframe operating system). The environment (for example, an IBM mainframe operating system).
--enable-ebcdic option is incompatible with Unicode support.
It is not possible to support both EBCDIC and UTF-8 codes in the same
version of the library. Consequently, --enable-unicode and --enable-
ebcdic are mutually exclusive.
The EBCDIC character that corresponds to an ASCII LF is assumed to have The EBCDIC character that corresponds to an ASCII LF is assumed to have
the value 0x15 by default. However, in some EBCDIC environments, 0x25 the value 0x15 by default. However, in some EBCDIC environments, 0x25
@ -2929,8 +2936,8 @@ PCRE2GREP BUFFER SIZE
--with-pcre2grep-bufsize=50K --with-pcre2grep-bufsize=50K
to the configure command. The caller of pcre2grep can, however, over- to the configure command. The caller of pcre2grep can override this
ride this value by specifying a run-time option. value by using --buffer-size on the command line.
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
@ -2945,15 +2952,15 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
it reads it using the readline() function. This provides line-editing it reads it using the readline() function. This provides line-editing
and history facilities. Note that libreadline is GPL-licensed, so if and history facilities. Note that libreadline is GPL-licensed, so if
you distribute a binary of pcre2test linked in this way, there may be you distribute a binary of pcre2test linked in this way, there may be
licensing issues. These can be avoided by linking with libedit (which licensing issues. These can be avoided by linking instead with libedit,
has a BSD licence) instead. which has a BSD licence.
Setting this option causes the -lreadline option to be added to the Setting --enable-pcre2test-libreadline causes the -lreadline option to
pcre2test build. In many operating environments with a system-installed be added to the pcre2test build. In many operating environments with a
readline library this is sufficient. However, in some environments system-installed readline library this is sufficient. However, in some
(e.g. if an unmodified distribution version of readline is in use), environments (e.g. if an unmodified distribution version of readline is
some extra configuration may be necessary. The INSTALL file for in use), some extra configuration may be necessary. The INSTALL file
libreadline says this: for libreadline says this:
"Readline uses the termcap functions, but does not link with "Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications the termcap or curses library itself, allowing applications
@ -2969,14 +2976,14 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
DEBUGGING WITH VALGRIND SUPPORT DEBUGGING WITH VALGRIND SUPPORT
By adding the If you add
--enable-valgrind --enable-valgrind
option to to the configure command, PCRE2 will use valgrind annotations to the configure command, PCRE2 will use valgrind annotations to mark
to mark certain memory regions as unaddressable. This allows it to certain memory regions as unaddressable. This allows it to detect
detect invalid memory accesses, and is mostly useful for debugging invalid memory accesses, and is mostly useful for debugging PCRE2
PCRE2 itself. itself.
CODE COVERAGE REPORTING CODE COVERAGE REPORTING
@ -3049,7 +3056,7 @@ AUTHOR
REVISION REVISION
Last updated: 03 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -3122,29 +3129,26 @@ MISSING CALLOUTS
At compile time, PCRE2 "auto-possessifies" repeated items when it knows At compile time, PCRE2 "auto-possessifies" repeated items when it knows
that what follows cannot be part of the repeat. For example, a+[bc] is that what follows cannot be part of the repeat. For example, a+[bc] is
compiled as if it were a++[bc]. The pcre2test output when this pattern compiled as if it were a++[bc]. The pcre2test output when this pattern
is anchored and then applied with automatic callouts to the string is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
"aaaa" is: to the string "aaaa" is:
--->aaaa --->aaaa
+0 ^ ^ +0 ^ a+
+1 ^ a+ +2 ^ ^ [bc]
+3 ^ ^ [bc]
No match No match
This indicates that when matching [bc] fails, there is no backtracking This indicates that when matching [bc] fails, there is no backtracking
into a+ and therefore the callouts that would be taken for the back- into a+ and therefore the callouts that would be taken for the back-
tracks do not occur. You can disable the auto-possessify feature by tracks do not occur. You can disable the auto-possessify feature by
passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat- passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
tern with (*NO_AUTO_POSSESS). If this is done in pcre2test (using the tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
/no_auto_possess qualifier), the output changes to this:
--->aaaa --->aaaa
+0 ^ ^ +0 ^ a+
+1 ^ a+ +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^^ [bc]
+3 ^^ [bc]
No match No match
This time, when matching [bc] fails, the matcher backtracks into a+ and This time, when matching [bc] fails, the matcher backtracks into a+ and
@ -3173,11 +3177,11 @@ MISSING CALLOUTS
THE CALLOUT INTERFACE THE CALLOUT INTERFACE
During matching, when PCRE2 reaches a callout point, the external func- During matching, when PCRE2 reaches a callout point, if an external
tion that is set in the match context is called (if it is set). This function is set in the match context, it is called. This applies to
applies to both normal and DFA matching. The only argument to the call- both normal and DFA matching. The only argument to the callout function
out function is a pointer to a pcre2_callout block. This structure con- is a pointer to a pcre2_callout block. This structure contains the fol-
tains the following fields: lowing fields:
uint32_t version; uint32_t version;
uint32_t callout_number; uint32_t callout_number;
@ -3204,7 +3208,7 @@ THE CALLOUT INTERFACE
The offset_vector field is a pointer to the vector of capturing offsets The offset_vector field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match (the "ovector") that was passed to the matching function in the match
data block. When pcre2_match() is used, the contents can be inspected, data block. When pcre2_match() is used, the contents can be inspected
in order to extract substrings that have been matched so far, in the in order to extract substrings that have been matched so far, in the
same way as for extracting substrings after a match has completed. For same way as for extracting substrings after a match has completed. For
the DFA matching function, this field is not useful. the DFA matching function, this field is not useful.
@ -3285,7 +3289,7 @@ AUTHOR
REVISION REVISION
Last updated: 19 October 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -3487,12 +3491,12 @@ PCRE2 JUST-IN-TIME COMPILER SUPPORT
Just-in-time compiling is a heavyweight optimization that can greatly Just-in-time compiling is a heavyweight optimization that can greatly
speed up pattern matching. However, it comes at the cost of extra pro- speed up pattern matching. However, it comes at the cost of extra pro-
cessing before the match is performed. Therefore, it is of most benefit cessing before the match is performed, so it is of most benefit when
when the same pattern is going to be matched many times. This does not the same pattern is going to be matched many times. This does not nec-
necessarily mean many calls of a matching function; if the pattern is essarily mean many calls of a matching function; if the pattern is not
not anchored, matching attempts may take place many times at various anchored, matching attempts may take place many times at various posi-
positions in the subject, even for a single call. Therefore, if the tions in the subject, even for a single call. Therefore, if the subject
subject string is very long, it may still pay to use JIT for one-off string is very long, it may still pay to use JIT even for one-off
matches. JIT support is available for all of the 8-bit, 16-bit and matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries. 32-bit PCRE2 libraries.
@ -3558,8 +3562,8 @@ SIMPLE USE OF JIT
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match- will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
ing. If pcre2_jit_compile() is called with no option bits set, it imme- ing. If pcre2_jit_compile() is called with no option bits set, it imme-
diately returns zero. This is an alternative way of testing if JIT is diately returns zero. This is an alternative way of testing whether JIT
available. is available.
At present, it is not possible to free JIT compiled code except when At present, it is not possible to free JIT compiled code except when
the entire compiled pattern is freed by calling pcre2_free_code(). the entire compiled pattern is freed by calling pcre2_free_code().
@ -3745,7 +3749,7 @@ JIT STACK FAQ
an already freed stack, as that will cause SEGFAULT. (Also, do not free an already freed stack, as that will cause SEGFAULT. (Also, do not free
a stack currently used by pcre2_match() in another thread). You can a stack currently used by pcre2_match() in another thread). You can
also replace the stack in a context at any time when it is not in use. also replace the stack in a context at any time when it is not in use.
You can also free the previous stack before assigning a replacement. You should free the previous stack before assigning a replacement.
(5) Should I allocate/free a stack every time before/after calling (5) Should I allocate/free a stack every time before/after calling
pcre2_match()? pcre2_match()?
@ -3855,7 +3859,7 @@ AUTHOR
REVISION REVISION
Last updated: 12 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -4642,7 +4646,7 @@ WIDE CHARACTERS AND UTF MODES
UTF mode, but its use can lead to some strange effects because it UTF mode, but its use can lead to some strange effects because it
breaks up multi-unit characters (see the description of \C in the breaks up multi-unit characters (see the description of \C in the
pcre2pattern documentation). The use of \C is not supported in the pcre2pattern documentation). The use of \C is not supported in the
alternative matching function pcre2_dfa_exec(), nor is it supported in alternative matching function pcre2_dfa_match(), nor is it supported in
UTF mode by the JIT optimization. If JIT optimization is requested for UTF mode by the JIT optimization. If JIT optimization is requested for
a UTF pattern that contains \C, it will not succeed, and so the match- a UTF pattern that contains \C, it will not succeed, and so the match-
ing will be carried out by the normal interpretive function. ing will be carried out by the normal interpretive function.
@ -4701,14 +4705,14 @@ VALIDITY OF UTF STRINGS
In some situations, you may already know that your strings are valid, In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor- and therefore want to skip these checks in order to improve perfor-
mance, for example in the case of a long subject string that is being mance, for example in the case of a long subject string that is being
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK flag at compile scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
time or at run time, PCRE2 assumes that the pattern or subject it is pile time or at match time, PCRE2 assumes that the pattern or subject
given (respectively) contains only valid UTF code unit sequences. it is given (respectively) contains only valid UTF code unit sequences.
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
for the pattern; it does not also apply to subject strings. If you want for the pattern; it does not also apply to subject strings. If you want
to disable the check for a subject string you must pass this option to to disable the check for a subject string you must pass this option to
pcre2_exec() or pcre2_dfa_exec(). pcre2_match() or pcre2_dfa_match().
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
result is undefined and your program may crash or loop indefinitely. result is undefined and your program may crash or loop indefinitely.
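Because an invalid string combined with PCRE2_NO_UTF_CHECK gives undefined behaviour, an application that wants to skip the checks on repeated calls might validate the subject once itself. The following is a minimal sketch of such a check for UTF-8 only; utf8_is_valid is a hypothetical helper, not a PCRE2 function, and it is not claimed to reject exactly the same inputs as PCRE2's own validity check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: returns true if the len bytes at
   s form well-formed UTF-8 (no overlong forms, no surrogates, no code
   points above 0x10ffff, no truncated sequences). */
bool utf8_is_valid(const uint8_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t extra;     /* number of continuation bytes expected */
        uint32_t cp, min; /* decoded code point and smallest legal value */

        if (b < 0x80) { i++; continue; }  /* single-byte (ASCII) */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra) return false;  /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;                      /* overlong form */
        if (cp > 0x10FFFF) return false;                 /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        i += extra + 1;
    }
    return true;
}
```

An application would call this once on a subject it is about to scan repeatedly, and pass PCRE2_NO_UTF_CHECK to the matching function only if the result is true.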
@ -4807,7 +4811,7 @@ AUTHOR
REVISION REVISION
Last updated: 03 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2API 3 "21 November 2014" "PCRE2 10.00" .TH PCRE2API 3 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -2090,6 +2090,13 @@ returned by \fBpcre2_get_startchar()\fP. For a non-partial match, this can be
different to the value of \fIovector[0]\fP if the pattern contains the \eK different to the value of \fIovector[0]\fP if the pattern contains the \eK
escape sequence. After a partial match, however, this value is always the same escape sequence. After a partial match, however, this value is always the same
as \fIovector[0]\fP because \eK does not affect the result of a partial match. as \fIovector[0]\fP because \eK does not affect the result of a partial match.
.P
The \fBstartchar\fP field is also used to return the offset of an invalid
UTF character when UTF checking fails. Details are given in the
.\" HREF
\fBpcre2unicode\fP
.\"
page.
. .
. .
.\" HTML <a name="errorlist"></a> .\" HTML <a name="errorlist"></a>
@ -2707,6 +2714,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 21 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2BUILD 3 "03 November 2014" "PCRE2 10.00" .TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
. .
@ -74,12 +74,12 @@ respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the \fBconfigure\fP command: the following to the \fBconfigure\fP command:
.sp .sp
--enable-pcre16 --enable-pcre2-16
--enable-pcre32 --enable-pcre2-32
.sp .sp
If you do not want the 8-bit library, add If you do not want the 8-bit library, add
.sp .sp
--disable-pcre8 --disable-pcre2-8
.sp .sp
as well. At least one of the three libraries must be built. Note that the POSIX as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit wrapper is for the 8-bit library only, and that \fBpcre2grep\fP is an 8-bit
@ -91,15 +91,16 @@ libraries.
.rs .rs
.sp .sp
The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared The Autotools PCRE2 building process uses \fBlibtool\fP to build both shared
and static libraries by default. You can suppress one of these by adding one of and static libraries by default. You can suppress an unwanted library by adding
one of
.sp .sp
--disable-shared --disable-shared
--disable-static --disable-static
.sp .sp
to the \fBconfigure\fP command, as required. to the \fBconfigure\fP command.
. .
. .
.SH "Unicode and UTF SUPPORT" .SH "UNICODE AND UTF SUPPORT"
.rs .rs
.sp .sp
By default, PCRE2 is built with support for Unicode and UTF character strings. By default, PCRE2 is built with support for Unicode and UTF character strings.
@ -112,18 +113,14 @@ is not possible to build one library with Unicode support, and another without,
in the same configuration. in the same configuration.
.P .P
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
or UTF-32. To do that you have have to set the PCRE2_UTF option when you call or UTF-32. To do that, applications that use the library have to set the
\fBpcre2_compile()\fP to compile a pattern. PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
.P .P
It is not possible to support both EBCDIC and UTF-8 codes in the same version UTF support allows the libraries to process character code points up to
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually 0x10ffff in the strings that they handle. It also provides support for
exclusive. accessing the Unicode properties of such characters, using pattern escapes such
.P as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
UTF support allows the libraries to process character codepoints up to 0x10ffff \fINd\fP are supported. Details are given in the
in the strings that they handle. It also provides support for accessing the
properties of such characters, using pattern escapes such as \eP, \ep, and \eX.
Only the general category properties such as \fILu\fP and \fINd\fP are
supported. Details are given in the
.\" HREF .\" HREF
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
@ -138,7 +135,7 @@ Just-in-time compiler support is included in the build by specifying
--enable-jit --enable-jit
.sp .sp
This support is available only for certain hardware architectures. If this This support is available only for certain hardware architectures. If this
option is set for an unsupported architecture, a compile time error occurs. option is set for an unsupported architecture, a building error occurs.
See the See the
.\" HREF .\" HREF
\fBpcre2jit\fP \fBpcre2jit\fP
@ -151,7 +148,7 @@ pcre2grep automatically makes use of it, unless you add
to the "configure" command. to the "configure" command.
. .
. .
.SH "CODE VALUE OF NEWLINE" .SH "NEWLINE RECOGNITION"
.rs .rs
.sp .sp
By default, PCRE2 interprets the linefeed (LF) character as indicating the end By default, PCRE2 interprets the linefeed (LF) character as indicating the end
@ -160,11 +157,12 @@ compile PCRE2 to use carriage return (CR) instead, by adding
.sp .sp
--enable-newline-is-cr --enable-newline-is-cr
.sp .sp
to the \fBconfigure\fP command. There is also a --enable-newline-is-lf option, to the \fBconfigure\fP command. There is also an --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character. which explicitly specifies linefeed as the newline character.
.sp .P
Alternatively, you can specify that line endings are to be indicated by the two Alternatively, you can specify that line endings are to be indicated by the
character sequence CRLF. If you want this, add two-character sequence CRLF (CR immediately followed by LF). If you want this,
add
.sp .sp
--enable-newline-is-crlf --enable-newline-is-crlf
.sp .sp
@ -177,10 +175,13 @@ indicating a line ending. Finally, a fifth option, specified by
.sp .sp
--enable-newline-is-any --enable-newline-is-any
.sp .sp
causes PCRE2 to recognize any Unicode newline sequence. causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
.P .P
Whatever line ending convention is selected when PCRE2 is built can be Whatever default line ending convention is selected when PCRE2 is built can be
overridden when the library functions are called. At build time it is overridden by applications that use the library. At build time it is
conventional to use the standard for your operating system. conventional to use the standard for your operating system.
. .
. .
@ -188,12 +189,13 @@ conventional to use the standard for your operating system.
.rs .rs
.sp .sp
By default, the sequence \eR in a pattern matches any Unicode newline sequence, By default, the sequence \eR in a pattern matches any Unicode newline sequence,
whatever has been selected as the line ending sequence. If you specify independently of what has been selected as the line ending sequence. If you
specify
.sp .sp
--enable-bsr-anycrlf --enable-bsr-anycrlf
.sp .sp
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden when the library functions are selected when PCRE2 is built can be overridden by applications that use the
called. library.
. .
. .
@ -204,10 +206,10 @@ Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of are used for these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic patterns. around 64K code units. This is sufficient to handle all but the most gigantic
Nevertheless, some people do want to process truly enormous patterns, so it is patterns. Nevertheless, some people do want to process truly enormous patterns,
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a so it is possible to compile PCRE2 to use three-byte or four-byte offsets by
setting such as adding a setting such as
.sp .sp
--with-link-size=3 --with-link-size=3
.sp .sp
@ -299,16 +301,19 @@ hand".)
.rs .rs
.sp .sp
PCRE2 assumes by default that it will run in an environment where the character PCRE2 assumes by default that it will run in an environment where the character
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for code is ASCII or Unicode, which is a superset of ASCII. This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an most computer operating systems. PCRE2 can, however, be compiled to run in an
EBCDIC environment by adding 8-bit EBCDIC environment by adding
.sp .sp
--enable-ebcdic --disable-unicode --enable-ebcdic --disable-unicode
.sp .sp
to the \fBconfigure\fP command. This setting implies to the \fBconfigure\fP command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in --enable-rebuild-chartables. You should only use it if you know that you are in
an EBCDIC environment (for example, an IBM mainframe operating system). The an EBCDIC environment (for example, an IBM mainframe operating system).
--enable-ebcdic option is incompatible with Unicode support. .P
It is not possible to support both EBCDIC and UTF-8 codes in the same version
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
exclusive.
.P .P
The EBCDIC character that corresponds to an ASCII LF is assumed to have the The EBCDIC character that corresponds to an ASCII LF is assumed to have the
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
@ -354,8 +359,8 @@ parameter value by adding, for example,
.sp .sp
--with-pcre2grep-bufsize=50K --with-pcre2grep-bufsize=50K
.sp .sp
to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can, however, to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override this
override this value by specifying a run-time option. value by using --buffer-size on the command line.
. .
. .
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT" .SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
@ -371,15 +376,15 @@ to the \fBconfigure\fP command, \fBpcre2test\fP is linked with the
from a terminal, it reads it using the \fBreadline()\fP function. This provides from a terminal, it reads it using the \fBreadline()\fP function. This provides
line-editing and history facilities. Note that \fBlibreadline\fP is line-editing and history facilities. Note that \fBlibreadline\fP is
GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this GPL-licensed, so if you distribute a binary of \fBpcre2test\fP linked in this
way, there may be licensing issues. These can be avoided by linking with way, there may be licensing issues. These can be avoided by linking instead
\fBlibedit\fP (which has a BSD licence) instead. with \fBlibedit\fP, which has a BSD licence.
.P .P
Setting this option causes the \fB-lreadline\fP option to be added to the Setting --enable-pcre2test-libreadline causes the \fB-lreadline\fP option to be
\fBpcre2test\fP build. In many operating environments with a system-installed added to the \fBpcre2test\fP build. In many operating environments with a
readline library this is sufficient. However, in some environments (e.g. if an system-installed readline library this is sufficient. However, in some
unmodified distribution version of readline is in use), some extra environments (e.g. if an unmodified distribution version of readline is in
configuration may be necessary. The INSTALL file for \fBlibreadline\fP says use), some extra configuration may be necessary. The INSTALL file for
this: \fBlibreadline\fP says this:
.sp .sp
"Readline uses the termcap functions, but does not link with "Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications the termcap or curses library itself, allowing applications
@ -396,13 +401,13 @@ immediately before the \fBconfigure\fP command.
.SH "DEBUGGING WITH VALGRIND SUPPORT" .SH "DEBUGGING WITH VALGRIND SUPPORT"
.rs .rs
.sp .sp
By adding the If you add
.sp .sp
--enable-valgrind --enable-valgrind
.sp .sp
option to to the \fBconfigure\fP command, PCRE2 will use valgrind annotations to the \fBconfigure\fP command, PCRE2 will use valgrind annotations to mark
to mark certain memory regions as unaddressable. This allows it to detect certain memory regions as unaddressable. This allows it to detect invalid
invalid memory accesses, and is mostly useful for debugging PCRE2 itself. memory accesses, and is mostly useful for debugging PCRE2 itself.
. .
. .
.SH "CODE COVERAGE REPORTING" .SH "CODE COVERAGE REPORTING"
@ -482,6 +487,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "19 October 2014" "PCRE2 10.00" .TH PCRE2CALLOUT 3 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -68,29 +68,27 @@ expect.
.P .P
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as what follows cannot be part of the repeat. For example, a+[bc] is compiled as
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is anchored if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
and then applied with automatic callouts to the string "aaaa" is: with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
"aaaa" is:
.sp .sp
--->aaaa --->aaaa
+0 ^ ^ +0 ^ a+
+1 ^ a+ +2 ^ ^ [bc]
+3 ^ ^ [bc]
No match No match
.sp .sp
This indicates that when matching [bc] fails, there is no backtracking into a+ This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur. and therefore the callouts that would be taken for the backtracks do not occur.
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
to \fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). If \fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
this is done in \fBpcre2test\fP (using the /no_auto_possess qualifier), the case, the output changes to this:
output changes to this:
.sp .sp
--->aaaa --->aaaa
+0 ^ ^ +0 ^ a+
+1 ^ a+ +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^ ^ [bc]
+3 ^ ^ [bc] +2 ^^ [bc]
+3 ^^ [bc]
No match No match
.sp .sp
This time, when matching [bc] fails, the matcher backtracks into a+ and tries This time, when matching [bc] fails, the matcher backtracks into a+ and tries
@ -119,10 +117,10 @@ callouts such as the example above are obeyed.
.SH "THE CALLOUT INTERFACE" .SH "THE CALLOUT INTERFACE"
.rs .rs
.sp .sp
During matching, when PCRE2 reaches a callout point, the external function that During matching, when PCRE2 reaches a callout point, if an external function is
is set in the match context is called (if it is set). This applies to both set in the match context, it is called. This applies to both normal and DFA
normal and DFA matching. The only argument to the callout function is a pointer matching. The only argument to the callout function is a pointer to a
to a \fBpcre2_callout\fP block. This structure contains the following fields: \fBpcre2_callout\fP block. This structure contains the following fields:
.sp .sp
uint32_t \fIversion\fP; uint32_t \fIversion\fP;
uint32_t \fIcallout_number\fP; uint32_t \fIcallout_number\fP;
@ -149,7 +147,7 @@ automatically generated callouts).
.P .P
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match data (the "ovector") that was passed to the matching function in the match data
block. When \fBpcre2_match()\fP is used, the contents can be inspected, in block. When \fBpcre2_match()\fP is used, the contents can be inspected in
order to extract substrings that have been matched so far, in the same way as order to extract substrings that have been matched so far, in the same way as
for extracting substrings after a match has completed. For the DFA matching for extracting substrings after a match has completed. For the DFA matching
function, this field is not useful. function, this field is not useful.
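As a rough illustration of what "inspecting the ovector" means, the following Python sketch models the vector as a flat list of offset pairs. It assumes the usual convention that pair n holds the start and end offsets of capturing group n, and it uses None to stand in for an unset offset. The function name and the sample data are hypothetical; the real vector is a C array reached through the callout block.

```python
# Illustrative model only: the real ovector is a C array of offsets, and
# UNSET here merely stands in for whatever value marks an unset entry.
UNSET = None


def substrings_so_far(subject, ovector):
    """Return one entry per offset pair: the substring, or None if unset."""
    result = []
    for n in range(0, len(ovector) - 1, 2):
        start, end = ovector[n], ovector[n + 1]
        if start is UNSET or end is UNSET:
            result.append(None)
        else:
            result.append(subject[start:end])
    return result


# Hypothetical state part-way through a match: groups 1 and 3 are set.
captured = substrings_so_far("abcdef", [UNSET, UNSET, 0, 2, UNSET, UNSET, 3, 5])
```

In this made-up state, group 1 has captured "ab" and group 3 has captured "de", while pair 0 and group 2 are still unset.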
@ -238,6 +236,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 19 October 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2GREP 1 "28 September 2014" "PCRE2 10.00" .TH PCRE2GREP 1 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
pcre2grep - a grep with Perl-compatible regular expressions. pcre2grep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -403,8 +403,8 @@ used. There is no short form for this option.
Processing some regular expression patterns can require a very large amount of Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available. memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching Other patterns may take a very long time to search for all possible matching
strings. The \fBpcre2_exec()\fP function that is called by \fBpcre2grep\fP to do strings. The \fBpcre2_match()\fP function that is called by \fBpcre2grep\fP to
the matching has two parameters that can limit the resources that it uses. do the matching has two parameters that can limit the resources that it uses.
.sp .sp
The \fB--match-limit\fP option provides a means of limiting resource usage The \fB--match-limit\fP option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a very when processing patterns that are not going to match, but which have a very
@ -678,6 +678,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 28 September 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -446,7 +446,7 @@ OPTIONS
very large amount of memory, leading in some cases to a pro- very large amount of memory, leading in some cases to a pro-
gram crash if not enough is available. Other patterns may gram crash if not enough is available. Other patterns may
take a very long time to search for all possible matching take a very long time to search for all possible matching
strings. The pcre2_exec() function that is called by strings. The pcre2_match() function that is called by
pcre2grep to do the matching has two parameters that can pcre2grep to do the matching has two parameters that can
limit the resources that it uses. limit the resources that it uses.
@ -737,5 +737,5 @@ AUTHOR
REVISION REVISION
Last updated: 28 September 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.


@ -1,4 +1,4 @@
.TH PCRE2JIT 3 "12 November 2014" "PCRE2 10.00" .TH PCRE2JIT 3 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT" .SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -6,11 +6,11 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
Just-in-time compiling is a heavyweight optimization that can greatly speed up Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is match is performed, so it is of most benefit when the same pattern is going to
going to be matched many times. This does not necessarily mean many calls of a be matched many times. This does not necessarily mean many calls of a matching
matching function; if the pattern is not anchored, matching attempts may take function; if the pattern is not anchored, matching attempts may take place many
place many times at various positions in the subject, even for a single call. times at various positions in the subject, even for a single call. Therefore,
Therefore, if the subject string is very long, it may still pay to use JIT for if the subject string is very long, it may still pay to use JIT even for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries. 32-bit PCRE2 libraries.
.P .P
@ -77,7 +77,7 @@ option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If PCRE2_JIT_COMPLETE and just compile code for partial matching. If
\fBpcre2_jit_compile()\fP is called with no option bits set, it immediately \fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
returns zero. This is an alternative way of testing if JIT is available. returns zero. This is an alternative way of testing whether JIT is available.
.P .P
At present, it is not possible to free JIT compiled code except when the entire At present, it is not possible to free JIT compiled code except when the entire
compiled pattern is freed by calling \fBpcre2_free_code()\fP. compiled pattern is freed by calling \fBpcre2_free_code()\fP.
@ -276,7 +276,7 @@ compiled patterns, contexts, and stacks in any order, anytime. Just \fIdo
not\fP call \fBpcre2_match()\fP with a match context pointing to an already not\fP call \fBpcre2_match()\fP with a match context pointing to an already
freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently freed stack, as that will cause SEGFAULT. (Also, do not free a stack currently
used by \fBpcre2_match()\fP in another thread). You can also replace the stack used by \fBpcre2_match()\fP in another thread). You can also replace the stack
in a context at any time when it is not in use. You can also free the previous in a context at any time when it is not in use. You should free the previous
stack before assigning a replacement. stack before assigning a replacement.
.P .P
(5) Should I allocate/free a stack every time before/after calling (5) Should I allocate/free a stack every time before/after calling
@ -398,6 +398,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 12 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00" .TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -394,7 +394,7 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp .sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_exec(), not increase them. limits set by the caller of pcre2_match(), not increase them.
. .
. .
.SH "NEWLINE CONVENTION" .SH "NEWLINE CONVENTION"
@ -536,6 +536,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 14 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "14 November 2014" "PCRE2 10.00" .TH PCRE2TEST 1 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -200,7 +200,7 @@ input lines. Each set starts with a regular expression pattern, followed by any
number of subject lines to be matched against that pattern. In between sets of number of subject lines to be matched against that pattern. In between sets of
test data, command lines that begin with a hash (#) character may appear. This test data, command lines that begin with a hash (#) character may appear. This
file format, with some restrictions, can also be processed by the file format, with some restrictions, can also be processed by the
\fBperltest.pl\fP script that is distributed with PCRE2 as a means of checking \fBperltest.sh\fP script that is distributed with PCRE2 as a means of checking
that the behaviour of PCRE2 and Perl is the same. that the behaviour of PCRE2 and Perl is the same.
.P .P
Each subject line is matched separately and independently. If you want to do Each subject line is matched separately and independently. If you want to do
@ -243,11 +243,11 @@ patterns. Modifiers on a pattern can change these settings.
#perltest #perltest
.sp .sp
The appearance of this line causes all subsequent modifier settings to be The appearance of this line causes all subsequent modifier settings to be
checked for compatibility with the \fBperltest.pl\fP script, which is used to checked for compatibility with the \fBperltest.sh\fP script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to \fBpcre2test\fP, and should not be used in of the modifiers are specific to \fBpcre2test\fP, and should not be used in
test files that are also processed by \fBperltest.pl\fP. The \fB#perltest\fP test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
command helps detect tests that are accidentally put in the wrong file. command helps detect tests that are accidentally put in the wrong file.
.sp .sp
#subject <modifier-list> #subject <modifier-list>
@ -265,7 +265,7 @@ for both patterns and subject lines, whereas others are valid for one or the
other only. Each modifier has a long name, for example "anchored", and some of other only. Each modifier has a long name, for example "anchored", and some of
them must be followed by an equals sign and a value, for example, "offset=12". them must be followed by an equals sign and a value, for example, "offset=12".
Modifiers that do not take values may be preceded by a minus sign to turn off a Modifiers that do not take values may be preceded by a minus sign to turn off a
previous default setting. previous setting.
.P .P
A few of the more common modifiers can also be specified as single letters, for A few of the more common modifiers can also be specified as single letters, for
example "i" for "caseless". In documentation, following the Perl convention, example "i" for "caseless". In documentation, following the Perl convention,
@ -336,7 +336,7 @@ encoding non-printing characters in a visible way:
\exhh hexadecimal byte (up to 2 hex digits) \exhh hexadecimal byte (up to 2 hex digits)
\ex{hh...} hexadecimal character (any number of hex digits) \ex{hh...} hexadecimal character (any number of hex digits)
.sp .sp
The use of \ex{hh...} is not dependent on the use of the utf modifier on The use of \ex{hh...} is not dependent on the use of the \fButf\fP modifier on
the pattern. It is recognized always. There may be any number of hexadecimal the pattern. It is recognized always. There may be any number of hexadecimal
digits inside the braces; invalid values provoke error messages. digits inside the braces; invalid values provoke error messages.
.P .P
@ -366,7 +366,7 @@ part of the file. For example:
is converted to "abcabcabcabc". This feature does not support nesting. To is converted to "abcabcabcabc". This feature does not support nesting. To
include a closing square bracket in the characters, code it as \ex5D. include a closing square bracket in the characters, code it as \ex5D.
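The expansion described here can be sketched in a few lines of Python. This is only an approximation of what pcre2test does: it assumes the repetition syntax has the form \[<characters>]{<count>} (this hunk does not show it in full), it does not decode escapes such as \x5D, and the function name is invented for the example.

```python
import re

# Assumed syntax: backslash, '[', characters other than ']', ']', '{count}'.
_REPEAT = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_repeats(line):
    """Replace each repetition item with its characters repeated count times."""
    return _REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), line)


expanded = expand_repeats(r"\[abc]{4}")
```

Because the character class in the expression excludes "]", a nested item is not recognized, which mirrors the statement that nesting is not supported.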
.P .P
A backslash followed by an equals sign marke the end of the subject string and A backslash followed by an equals sign marks the end of the subject string and
the start of a modifier list. For example: the start of a modifier list. For example:
.sp .sp
abc\e=notbol,notempty abc\e=notbol,notempty
@ -461,8 +461,8 @@ set to "anycrlf", \eR matches CR, LF, or CRLF only. If it is set to "unicode",
is built, with the default being Unicode. is built, with the default being Unicode.
.P .P
The \fBnewline\fP modifier specifies which characters are to be interpreted as The \fBnewline\fP modifier specifies which characters are to be interpreted as
newlines, both in the pattern and (by default) in subject lines. The type must newlines, both in the pattern and in subject lines. The type must be one of CR,
be one of CR, LF, CRLF, ANYCRLF, or ANY. LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
. .
. .
.SS "Information about a pattern" .SS "Information about a pattern"
@ -478,8 +478,8 @@ link sizes and different code unit widths. By using \fBbincode\fP, the same
regression tests can be used in different environments. regression tests can be used in different environments.
.P .P
The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and The \fBfullbincode\fP modifier, by contrast, \fIdoes\fP include length and
offset values. This is used in a few special tests and is also useful for offset values. This is used in a few special tests that run only for specific
one-off tests. code unit widths and link sizes, and is also useful for one-off tests.
.P .P
The \fBinfo\fP modifier requests information about the compiled pattern The \fBinfo\fP modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The (whether it is anchored, has a fixed first character, and so on). The
@ -501,13 +501,14 @@ some typical examples:
Last code unit = 'c' (caseless) Last code unit = 'c' (caseless)
Subject length lower bound = 3 Subject length lower bound = 3
.sp .sp
"Compile options" are those specified to the compile function; "overall "Compile options" are those specified by modifiers; "overall options" have
options" have added options that are taken or deduced from the pattern. If both added options that are taken or deduced from the pattern. If both sets of
sets of options are the same, just a single "options" line is output. "First options are the same, just a single "options" line is output; if there are no
code unit" is where any match must start; if there is more than one they are options, the line is omitted. "First code unit" is where any match must start;
listed as "starting code units". "Last code unit" is the last literal code unit if there is more than one they are listed as "starting code units". "Last code
that must be present in any match. This is not necessarily the last character. unit" is the last literal code unit that must be present in any match. This is
These lines are omitted if no starting or ending code units are recorded. not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
. .
. .
.SS "Specifying a pattern in hex" .SS "Specifying a pattern in hex"
@ -520,16 +521,16 @@ pairs. For example:
/ab 32 59/hex /ab 32 59/hex
.sp .sp
This feature is provided as a way of creating patterns that contain binary zero This feature is provided as a way of creating patterns that contain binary zero
characters. By default, \fBpcre2test\fP passes patterns as zero-terminated and other non-printing characters. By default, \fBpcre2test\fP passes patterns
strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. as zero-terminated strings to \fBpcre2_compile()\fP, giving the length as
However, for patterns specified in hexadecimal, the actual length of the PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
pattern is passed. actual length of the pattern is passed.
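To see why the actual length matters, here is a small Python sketch of the conversion from hex pairs to pattern bytes. It is not pcre2test's code; the function name is hypothetical, and it relies on bytes.fromhex() ignoring the spaces between pairs.

```python
def hex_pattern_to_bytes(text):
    """Convert space-separated hex pairs to bytes and report their length."""
    data = bytes.fromhex(text)
    return data, len(data)


# The example from the text above: three code units, none of them zero.
pattern, length = hex_pattern_to_bytes("ab 32 59")

# A pattern with an embedded binary zero: a zero-terminated string would
# stop after the first byte, so the explicit length (3) is needed.
with_zero, zero_length = hex_pattern_to_bytes("61 00 62")
```

In the second case, treating the data as zero-terminated would yield a one-byte pattern, which is the situation the explicit length avoids.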
. .
. .
.SS "JIT compilation" .SS "JIT compilation"
.rs .rs
.sp .sp
The \fB/jit\fP modifier may optionally be followed by and equals sign and a The \fB/jit\fP modifier may optionally be followed by an equals sign and a
number in the range 0 to 7: number in the range 0 to 7:
.sp .sp
0 disable JIT 0 disable JIT
@ -561,7 +562,7 @@ pattern shows whether JIT compilation was or was not successful. If
\fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT \fBjitverify\fP is specified without \fBjit\fP, jit=7 is assumed. If JIT
compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is compilation is successful when \fBjitverify\fP is set, the text "(JIT)" is
added to the first output line after a match or non match when JIT-compiled added to the first output line after a match or non match when JIT-compiled
code was actually used. code was actually used in the match.
. .
. .
.SS "Setting a locale" .SS "Setting a locale"
@ -645,8 +646,8 @@ be aborted.
.SS "Using alternative character tables" .SS "Using alternative character tables"
.rs .rs
.sp .sp
The \fB/tables\fP modifier must be followed by a single digit. It causes a The value specified for the \fB/tables\fP modifier must be one of the digits 0,
specific set of built-in character tables to be passed to 1, or 2. It causes a specific set of built-in character tables to be passed to
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with \fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows: different character tables. The digit specifies the tables as follows:
.sp .sp
@ -759,13 +760,13 @@ The effects of these modifiers are described in the following sections.
.SS "Showing more text" .SS "Showing more text"
.rs .rs
.sp .sp
The \fBaftertext\fP modifier requests that as well as outputting the substring The \fBaftertext\fP modifier requests that as well as outputting the part of
that matched the entire pattern, \fBpcre2test\fP should in addition output the the subject string that matched the entire pattern, \fBpcre2test\fP should in
remainder of the subject string. This is useful for tests where the subject addition output the remainder of the subject string. This is useful for tests
contains multiple copies of the same substring. The \fBallaftertext\fP modifier where the subject contains multiple copies of the same substring. The
requests the same action for captured substrings as well as the main matched \fBallaftertext\fP modifier requests the same action for captured substrings as
substring. In each case the remainder is output on the following line with a well as the main matched substring. In each case the remainder is output on the
plus character following the capture number. following line with a plus character following the capture number.
.P .P
The \fBallusedtext\fP modifier requests that all the text that was consulted The \fBallusedtext\fP modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown. This during a successful pattern match by the interpreter should be shown. This
@ -782,7 +783,8 @@ underneath them. Here is an example:
<<< >>> <<< >>>
.sp .sp
This shows that the matched string is "abc", with the preceding and following This shows that the matched string is "abc", with the preceding and following
strings "pqr" and "xyz" also consulted during the match. strings "pqr" and "xyz" having been consulted during the match (when processing
the assertions).
.P .P
The \fBstartchar\fP modifier requests that the starting character for the match The \fBstartchar\fP modifier requests that the starting character for the match
be indicated, if it is different to the start of the matched string. The only be indicated, if it is different to the start of the matched string. The only
@ -836,7 +838,7 @@ function is called again to search the remainder of the subject. The difference
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP \fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
to start searching at a new point within the entire string (which is what Perl to start searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened substring. This makes a does), whereas the latter passes over a shortened subject. This makes a
difference to the matching process if the pattern begins with a lookbehind difference to the matching process if the pattern begins with a lookbehind
assertion (including \eb or \eB). assertion (including \eb or \eB).
.P .P
@ -847,7 +849,7 @@ fails, the start offset is advanced, and the normal match is retried. This
imitates the way Perl handles such cases when using the \fB/g\fP modifier or imitates the way Perl handles such cases when using the \fB/g\fP modifier or
the \fBsplit()\fP function. Normally, the start offset is advanced by one the \fBsplit()\fP function. Normally, the start offset is advanced by one
character, but if the newline convention recognizes CRLF as a newline, and the character, but if the newline convention recognizes CRLF as a newline, and the
current character is CR followed by LF, an advance of two is used. current character is CR followed by LF, an advance of two characters occurs.
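The advance rule in the last sentence can be written as a tiny Python function. This is a sketch of the rule as stated, not pcre2test's implementation: the function and parameter names are invented, and a Python string is used so that one element is one character (in the real program a UTF character may occupy several code units).

```python
def next_start_offset(subject, offset, crlf_is_newline):
    """Return the start offset to use when retrying after an empty match."""
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2  # step over CR and LF together
    return offset + 1      # the normal case: advance by one character


after_crlf = next_start_offset("a\r\nb", 1, crlf_is_newline=True)
after_cr_only = next_start_offset("a\r\nb", 1, crlf_is_newline=False)
```

Stepping over both characters keeps the next attempt from starting between CR and LF, that is, in the middle of a newline sequence.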
. .
. .
.SS "Testing substring extraction functions" .SS "Testing substring extraction functions"
@ -860,9 +862,9 @@ for example:
.sp .sp
abcd\e=copy=1,copy=3,get=G1 abcd\e=copy=1,copy=3,get=G1
.sp .sp
If the \fB#subject\fP command is used to set default copy and get lists, these If the \fB#subject\fP command is used to set default copy and/or get lists,
can be unset by specifying a negative number for numbered groups and an empty these can be unset by specifying a negative number to cancel all numbered
name for named groups. groups and an empty name to cancel all named groups.
.P .P
The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which The \fBgetall\fP modifier tests \fBpcre2_substring_list_get()\fP, which
extracts all captured substrings. extracts all captured substrings.
@ -871,7 +873,8 @@ If the subject line is successfully matched, the substrings extracted by the
convenience functions are output with C, G, or L after the string number convenience functions are output with C, G, or L after the string number
instead of a colon. This is in addition to the normal full list. The string instead of a colon. This is in addition to the normal full list. The string
length (that is, the return from the extraction function) is given in length (that is, the return from the extraction function) is given in
parentheses after each substring. parentheses after each substring, followed by the name when the extraction was
by name.
. .
. .
.SS "Testing the substitution function" .SS "Testing the substitution function"
@ -1044,11 +1047,10 @@ entire substring that was inspected during the partial match; it may include
characters before the actual match start if a lookbehind assertion, \eK, \eb, characters before the actual match start if a lookbehind assertion, \eK, \eb,
or \eB was involved.) or \eB was involved.)
.P .P
For any other return, \fBpcre2test\fP outputs the PCRE2 For any other return, \fBpcre2test\fP outputs the PCRE2 negative error number
negative error number and a short descriptive phrase. If the error is a failed and a short descriptive phrase. If the error is a failed UTF string check, the
UTF string check, the offset of the start of the failing character and the code unit offset of the start of the failing character is also output. Here is
reason code are also output. Here is an example of an interactive an example of an interactive \fBpcre2test\fP run.
\fBpcre2test\fP run.
.sp .sp
$ pcre2test $ pcre2test
PCRE2 version 9.00 2014-05-10 PCRE2 version 9.00 2014-05-10
@ -1061,10 +1063,10 @@ reason code are also output. Here is an example of an interactive
No match No match
.sp .sp
Unset capturing substrings that are not followed by one that is set are not Unset capturing substrings that are not followed by one that is set are not
returned by \fBpcre2_match()\fP, and are not shown by \fBpcre2test\fP. In the shown by \fBpcre2test\fP unless the \fBallcaptures\fP modifier is specified. In
following example, there are two capturing substrings, but when the first data the following example, there are two capturing substrings, but when the first
line is matched, the second, unset substring is not shown. An "internal" unset data line is matched, the second, unset substring is not shown. An "internal"
substring is shown as "<unset>", as for the second data line. unset substring is shown as "<unset>", as for the second data line.
.sp .sp
re> /(a)|(b)/ re> /(a)|(b)/
data> a data> a
@ -1100,8 +1102,8 @@ are output in sequence, like this:
1: pp 1: pp
.sp .sp
"No match" is output only if the first match attempt fails. Here is an example "No match" is output only if the first match attempt fails. Here is an example
of a failure message (the offset 4 that is specified by \e>4 is past the end of of a failure message (the offset 4 that is specified by the \fBoffset\fP
the subject string): modifier is past the end of the subject string):
.sp .sp
re> /xyz/ re> /xyz/
data> xyz\e=offset=4 data> xyz\e=offset=4
@ -1127,12 +1129,13 @@ the subject where there is at least one match. For example:
1: tang 1: tang
2: tan 2: tan
.sp .sp
(Using the normal matching function on this data finds only "tang".) The Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero). After a longest matching string is always given first (and numbered zero). After a
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
partially matching substring. (Note that this is the entire substring that was partially matching substring. Note that this is the entire substring that was
inspected during the partial match; it may include characters before the actual inspected during the partial match; it may include characters before the actual
match start if a lookbehind assertion, \eK, \eb, or \eB was involved.) match start if a lookbehind assertion, \eb, or \eB was involved. (\eK is not
supported for DFA matching.)
.P .P
If global matching is requested, the search for further matches resumes If global matching is requested, the search for further matches resumes
at the end of the longest match. For example: at the end of the longest match. For example:
@ -1174,9 +1177,9 @@ documentation.
.SH CALLOUTS .SH CALLOUTS
.rs .rs
.sp .sp
If the pattern contains any callout requests, \fBpcre2test\fP's callout function If the pattern contains any callout requests, \fBpcre2test\fP's callout
is called during matching. This works with both matching functions. By default, function is called during matching. This works with both matching functions. By
the called function displays the callout number, the start and current default, the called function displays the callout number, the start and current
positions in the text at the callout time, and the next pattern item to be positions in the text at the callout time, and the next pattern item to be
tested. For example: tested. For example:
.sp .sp
@ -1271,6 +1274,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 14 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "03 November 2014" "PCRE2 10.00" .TH PCRE2UNICODE 3 "23 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE - Perl-compatible regular expressions (revised API) PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT" .SH "UNICODE AND UTF SUPPORT"
@ -64,7 +64,7 @@ characters (see the description of \eC in the
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
documentation). The use of \eC is not supported in the alternative matching documentation). The use of \eC is not supported in the alternative matching
function \fBpcre2_dfa_exec()\fP, nor is it supported in UTF mode by the JIT function \fBpcre2_dfa_match()\fP, nor is it supported in UTF mode by the JIT
optimization. If JIT optimization is requested for a UTF pattern that contains optimization. If JIT optimization is requested for a UTF pattern that contains
\eC, it will not succeed, and so the matching will be carried out by the normal \eC, it will not succeed, and so the matching will be carried out by the normal
interpretive function. interpretive function.
@ -108,7 +108,10 @@ case-equivalent, and these are treated as such.
.sp .sp
When the PCRE2_UTF option is set, the strings passed as patterns and subjects When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions. are (by default) checked for validity on entry to the relevant functions.
If an invalid UTF string is passed, an error return is given. If an invalid UTF string is passed, a negative error code is returned. The
code unit offset to the offending character can be extracted from the match
data block by calling \fBpcre2_get_startchar()\fP, which is used for this
purpose after a UTF error.
.P .P
UTF-16 and UTF-32 strings can indicate their endianness by special code known UTF-16 and UTF-32 strings can indicate their endianness by special code known
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
@ -130,14 +133,14 @@ UTF-32.)
In some situations, you may already know that your strings are valid, and In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly. example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK flag at compile time or at run time, PCRE2 If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
assumes that the pattern or subject it is given (respectively) contains only PCRE2 assumes that the pattern or subject it is given (respectively) contains
valid UTF code unit sequences. only valid UTF code unit sequences.
.P .P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this option to \fBpcre2_exec()\fP the check for a subject string you must pass this option to \fBpcre2_match()\fP
or \fBpcre2_dfa_exec()\fP. or \fBpcre2_dfa_match()\fP.
.P .P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely. is undefined and your program may crash or loop indefinitely.
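The offset reporting described in this hunk can be illustrated without the library. What follows is a minimal, self-contained sketch of a UTF-8 validity check that returns zero on success, or a negative code together with the code unit offset of the offending character. It is not PCRE2's internal validator: the names `check_utf8`, `ERR_TRUNCATED`, `ERR_BAD_CONT` and `ERR_BAD_LEAD` are hypothetical, the error values are not PCRE2's, and only a few error classes are covered (overlong sequences, surrogates and code points above 0x10ffff are not detected).

```c
#include <stddef.h>

/* Hypothetical error codes for this sketch; they are not PCRE2's values. */
enum { ERR_TRUNCATED = -1, ERR_BAD_CONT = -2, ERR_BAD_LEAD = -3 };

/* Returns 0 if s[0..len) passes these simplified checks. Otherwise returns a
   negative code and stores, in *erroroffset, the offset of the first byte of
   the offending character. */
static int check_utf8(const unsigned char *s, size_t len, size_t *erroroffset)
{
  size_t i = 0;
  while (i < len)
    {
    unsigned char c = s[i];
    size_t extra;                       /* continuation bytes expected */

    if (c < 0x80) { i++; continue; }    /* single-byte (ASCII) character */
    else if ((c & 0xe0) == 0xc0) extra = 1;
    else if ((c & 0xf0) == 0xe0) extra = 2;
    else if ((c & 0xf8) == 0xf0) extra = 3;
    else { *erroroffset = i; return ERR_BAD_LEAD; }

    /* Not enough bytes left in the subject for this character. */
    if (len - i - 1 < extra) { *erroroffset = i; return ERR_TRUNCATED; }

    /* Every continuation byte must have the top bits 10. */
    for (size_t k = 1; k <= extra; k++)
      if ((s[i + k] & 0xc0) != 0x80) { *erroroffset = i; return ERR_BAD_CONT; }

    i += extra + 1;
    }
  return 0;
}
```

For the subject bytes `X X 0xef 0x80` the sketch returns `ERR_TRUNCATED` with the offset set to 2, which mirrors the way the updated tests prefix subjects with `X` characters so that the reported offset is non-zero. A caller that already knows its subject is valid would skip this check entirely, which is the role PCRE2_NO_UTF_CHECK plays for the real functions.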
@ -249,6 +252,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 November 2014 Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -3224,12 +3224,8 @@ multiunit character. */
#ifdef SUPPORT_UNICODE #ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0) if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{ {
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar)); match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
if (match_data->rc != 0) if (match_data->rc != 0) return match_data->rc;
{
match_data->leftchar = 0;
return match_data->rc;
}
#if PCRE2_CODE_UNIT_WIDTH != 32 #if PCRE2_CODE_UNIT_WIDTH != 32
if (start_offset > 0 && start_offset < length && if (start_offset > 0 && start_offset < length &&
NOT_FIRSTCHAR(subject[start_offset])) NOT_FIRSTCHAR(subject[start_offset]))


@ -6459,12 +6459,8 @@ multiunit character. */
#ifdef SUPPORT_UNICODE #ifdef SUPPORT_UNICODE
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0) if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
{ {
match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->rightchar)); match_data->rc = PRIV(valid_utf)(subject, length, &(match_data->startchar));
if (match_data->rc != 0) if (match_data->rc != 0) return match_data->rc;
{
match_data->leftchar = 0;
return match_data->rc;
}
#if PCRE2_CODE_UNIT_WIDTH != 32 #if PCRE2_CODE_UNIT_WIDTH != 32
if (start_offset > 0 && start_offset < length && if (start_offset > 0 && start_offset < length &&
NOT_FIRSTCHAR(subject[start_offset])) NOT_FIRSTCHAR(subject[start_offset]))


@ -5570,6 +5570,13 @@ else for (gmatched = 0;; gmatched++)
fprintf(outfile, "Failed: error %d: ", capcount); fprintf(outfile, "Failed: error %d: ", capcount);
PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer); PCRE2_GET_ERROR_MESSAGE(mlen, capcount, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile); PCHARSV(CASTVAR(void *, pbuffer), 0, mlen, FALSE, outfile);
if (capcount <= PCRE2_ERROR_UTF8_ERR1 &&
capcount >= PCRE2_ERROR_UTF32_ERR2)
{
PCRE2_SIZE startchar;
PCRE2_GET_STARTCHAR(startchar, match_data);
fprintf(outfile, " at offset %ld", startchar);
}
fprintf(outfile, "\n"); fprintf(outfile, "\n");
break; break;
} }
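The range test added to pcre2test in this hunk relies on the UTF validity errors forming one contiguous block of negative codes. A minimal standalone sketch of that idea follows. The bounds -3 and -28 are assumptions read off the error numbers in the test output of this commit; real code should use PCRE2_ERROR_UTF8_ERR1 and PCRE2_ERROR_UTF32_ERR2 from pcre2.h. The helpers `is_utf_error` and `format_failure` are hypothetical names, not part of pcre2test.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed values, taken from the error numbers in the test output; real code
   should use PCRE2_ERROR_UTF8_ERR1 and PCRE2_ERROR_UTF32_ERR2 from pcre2.h. */
#define UTF_ERR_FIRST (-3)
#define UTF_ERR_LAST  (-28)

/* Nonzero if rc lies in the contiguous block of UTF validity errors. */
static int is_utf_error(int rc)
{
  return rc <= UTF_ERR_FIRST && rc >= UTF_ERR_LAST;
}

/* Hypothetical helper: writes "Failed: error N: msg" into buf, appending
   " at offset K" only when rc is a UTF validity error. */
static int format_failure(char *buf, size_t size, int rc, const char *msg,
                          size_t startchar)
{
  if (is_utf_error(rc))
    return snprintf(buf, size, "Failed: error %d: %s at offset %zu",
                    rc, msg, startchar);
  return snprintf(buf, size, "Failed: error %d: %s", rc, msg);
}
```

The sketch prints the offset with `%zu` because it is held in a `size_t`; this is a choice made for the sketch rather than a description of what pcre2test does.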

testdata/testinput10

@ -48,12 +48,12 @@
/テテテxxx/utf /テテテxxx/utf
/badutf/utf /badutf/utf
\xdf X\xdf
\xef XX\xef
\xef\x80 XXX\xef\x80
\xf7 X\xf7
\xf7\x80 XX\xf7\x80
\xf7\x80\x80 XXX\xf7\x80\x80
\xfb \xfb
\xfb\x80 \xfb\x80
\xfb\x80\x80 \xfb\x80\x80
@ -89,14 +89,14 @@
\xff \xff
/badutf/utf /badutf/utf
\xfb\x80\x80\x80\x80 XX\xfb\x80\x80\x80\x80
\xfd\x80\x80\x80\x80\x80 XX\xfd\x80\x80\x80\x80\x80
\xf7\xbf\xbf\xbf XX\xf7\xbf\xbf\xbf
/shortutf/utf /shortutf/utf
\xdf\=ph XX\xdf\=ph
\xef\=ph XX\xef\=ph
\xef\x80\=ph XX\xef\x80\=ph
\xf7\=ph \xf7\=ph
\xf7\x80\=ph \xf7\x80\=ph
\xf7\x80\x80\=ph \xf7\x80\x80\=ph
@ -111,9 +111,9 @@
\xfd\x80\x80\x80\x80\=ph \xfd\x80\x80\x80\x80\=ph
/anything/utf /anything/utf
\xc0\x80 X\xc0\x80
\xc1\x8f XX\xc1\x8f
\xe0\x9f\x80 XXX\xe0\x9f\x80
\xf0\x8f\x80\x80 \xf0\x8f\x80\x80
\xf8\x87\x80\x80\x80 \xf8\x87\x80\x80\x80
\xfc\x83\x80\x80\x80\x80 \xfc\x83\x80\x80\x80\x80

testdata/testinput12

@ -157,18 +157,18 @@
/^[\QĀ\E-\QŐ\E/B,utf /^[\QĀ\E-\QŐ\E/B,utf
/X/utf /X/utf
\x{d800} XX\x{d800}
\x{d800}\=no_utf_check XX\x{d800}\=no_utf_check
\x{da00} XX\x{da00}
\x{da00}\=no_utf_check XX\x{da00}\=no_utf_check
\x{dc00} XX\x{dc00}
\x{dc00}\=no_utf_check XX\x{dc00}\=no_utf_check
\x{de00} XX\x{de00}
\x{de00}\=no_utf_check XX\x{de00}\=no_utf_check
\x{dfff} XX\x{dfff}
\x{dfff}\=no_utf_check XX\x{dfff}\=no_utf_check
\x{110000} XX\x{110000}
\x{d800}\x{1234} XX\x{d800}\x{1234}
/(*UTF16)\x{11234}/ /(*UTF16)\x{11234}/
abcd\x{11234}pqr abcd\x{11234}pqr

testdata/testoutput10

@ -73,142 +73,142 @@ Failed: error -3 at offset 0: UTF-8 error: 1 byte missing at end
Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80 Failed: error -8 at offset 0: UTF-8 error: byte 2 top bits not 0x80
/badutf/utf /badutf/utf
\xdf X\xdf
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 1
\xef XX\xef
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
\xef\x80 XXX\xef\x80
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
\xf7 X\xf7
Failed: error -5: UTF-8 error: 3 bytes missing at end Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 1
\xf7\x80 XX\xf7\x80
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
\xf7\x80\x80 XXX\xf7\x80\x80
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 3
\xfb \xfb
Failed: error -6: UTF-8 error: 4 bytes missing at end Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfb\x80 \xfb\x80
Failed: error -5: UTF-8 error: 3 bytes missing at end Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfb\x80\x80 \xfb\x80\x80
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfb\x80\x80\x80 \xfb\x80\x80\x80
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfd \xfd
Failed: error -7: UTF-8 error: 5 bytes missing at end Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
\xfd\x80 \xfd\x80
Failed: error -6: UTF-8 error: 4 bytes missing at end Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfd\x80\x80 \xfd\x80\x80
Failed: error -5: UTF-8 error: 3 bytes missing at end Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfd\x80\x80\x80 \xfd\x80\x80\x80
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfd\x80\x80\x80\x80 \xfd\x80\x80\x80\x80
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xdf\x7f \xdf\x7f
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xef\x7f\x80 \xef\x7f\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xef\x80\x7f \xef\x80\x7f
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xf7\x7f\x80\x80 \xf7\x7f\x80\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xf7\x80\x7f\x80 \xf7\x80\x7f\x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xf7\x80\x80\x7f \xf7\x80\x80\x7f
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfb\x7f\x80\x80\x80 \xfb\x7f\x80\x80\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xfb\x80\x7f\x80\x80 \xfb\x80\x7f\x80\x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xfb\x80\x80\x7f\x80 \xfb\x80\x80\x7f\x80
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfb\x80\x80\x80\x7f \xfb\x80\x80\x80\x7f
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
\xfd\x7f\x80\x80\x80\x80 \xfd\x7f\x80\x80\x80\x80
Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 Failed: error -8: UTF-8 error: byte 2 top bits not 0x80 at offset 0
\xfd\x80\x7f\x80\x80\x80 \xfd\x80\x7f\x80\x80\x80
Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 Failed: error -9: UTF-8 error: byte 3 top bits not 0x80 at offset 0
\xfd\x80\x80\x7f\x80\x80 \xfd\x80\x80\x7f\x80\x80
Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 Failed: error -10: UTF-8 error: byte 4 top bits not 0x80 at offset 0
\xfd\x80\x80\x80\x7f\x80 \xfd\x80\x80\x80\x7f\x80
Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 Failed: error -11: UTF-8 error: byte 5 top bits not 0x80 at offset 0
\xfd\x80\x80\x80\x80\x7f \xfd\x80\x80\x80\x80\x7f
Failed: error -12: UTF-8 error: byte 6 top bits not 0x80 Failed: error -12: UTF-8 error: byte 6 top bits not 0x80 at offset 0
\xed\xa0\x80 \xed\xa0\x80
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\xc0\x8f \xc0\x8f
Failed: error -17: UTF-8 error: overlong 2-byte sequence Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 0
\xe0\x80\x8f \xe0\x80\x8f
Failed: error -18: UTF-8 error: overlong 3-byte sequence Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 0
\xf0\x80\x80\x8f \xf0\x80\x80\x8f
Failed: error -19: UTF-8 error: overlong 4-byte sequence Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
\xf8\x80\x80\x80\x8f \xf8\x80\x80\x80\x8f
Failed: error -20: UTF-8 error: overlong 5-byte sequence Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
\xfc\x80\x80\x80\x80\x8f \xfc\x80\x80\x80\x80\x8f
Failed: error -21: UTF-8 error: overlong 6-byte sequence Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
\x80 \x80
Failed: error -22: UTF-8 error: isolated 0x80 byte Failed: error -22: UTF-8 error: isolated 0x80 byte at offset 0
\xfe \xfe
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xff \xff
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
/badutf/utf /badutf/utf
\xfb\x80\x80\x80\x80 XX\xfb\x80\x80\x80\x80
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 2
\xfd\x80\x80\x80\x80\x80 XX\xfd\x80\x80\x80\x80\x80
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 2
\xf7\xbf\xbf\xbf XX\xf7\xbf\xbf\xbf
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 2
/shortutf/utf /shortutf/utf
\xdf\=ph XX\xdf\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
\xef\=ph XX\xef\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 2
\xef\x80\=ph XX\xef\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 2
\xf7\=ph \xf7\=ph
Failed: error -5: UTF-8 error: 3 bytes missing at end Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xf7\x80\=ph \xf7\x80\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xf7\x80\x80\=ph \xf7\x80\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfb\=ph \xfb\=ph
Failed: error -6: UTF-8 error: 4 bytes missing at end Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfb\x80\=ph \xfb\x80\=ph
Failed: error -5: UTF-8 error: 3 bytes missing at end Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfb\x80\x80\=ph \xfb\x80\x80\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfb\x80\x80\x80\=ph \xfb\x80\x80\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
\xfd\=ph \xfd\=ph
Failed: error -7: UTF-8 error: 5 bytes missing at end Failed: error -7: UTF-8 error: 5 bytes missing at end at offset 0
\xfd\x80\=ph \xfd\x80\=ph
Failed: error -6: UTF-8 error: 4 bytes missing at end Failed: error -6: UTF-8 error: 4 bytes missing at end at offset 0
\xfd\x80\x80\=ph \xfd\x80\x80\=ph
Failed: error -5: UTF-8 error: 3 bytes missing at end Failed: error -5: UTF-8 error: 3 bytes missing at end at offset 0
\xfd\x80\x80\x80\=ph \xfd\x80\x80\x80\=ph
Failed: error -4: UTF-8 error: 2 bytes missing at end Failed: error -4: UTF-8 error: 2 bytes missing at end at offset 0
\xfd\x80\x80\x80\x80\=ph \xfd\x80\x80\x80\x80\=ph
Failed: error -3: UTF-8 error: 1 byte missing at end Failed: error -3: UTF-8 error: 1 byte missing at end at offset 0
/anything/utf /anything/utf
\xc0\x80 X\xc0\x80
Failed: error -17: UTF-8 error: overlong 2-byte sequence Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 1
\xc1\x8f XX\xc1\x8f
Failed: error -17: UTF-8 error: overlong 2-byte sequence Failed: error -17: UTF-8 error: overlong 2-byte sequence at offset 2
\xe0\x9f\x80 XXX\xe0\x9f\x80
Failed: error -18: UTF-8 error: overlong 3-byte sequence Failed: error -18: UTF-8 error: overlong 3-byte sequence at offset 3
\xf0\x8f\x80\x80 \xf0\x8f\x80\x80
Failed: error -19: UTF-8 error: overlong 4-byte sequence Failed: error -19: UTF-8 error: overlong 4-byte sequence at offset 0
\xf8\x87\x80\x80\x80 \xf8\x87\x80\x80\x80
Failed: error -20: UTF-8 error: overlong 5-byte sequence Failed: error -20: UTF-8 error: overlong 5-byte sequence at offset 0
\xfc\x83\x80\x80\x80\x80 \xfc\x83\x80\x80\x80\x80
Failed: error -21: UTF-8 error: overlong 6-byte sequence Failed: error -21: UTF-8 error: overlong 6-byte sequence at offset 0
\xfe\x80\x80\x80\x80\x80 \xfe\x80\x80\x80\x80\x80
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xff\x80\x80\x80\x80\x80 \xff\x80\x80\x80\x80\x80
Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) Failed: error -23: UTF-8 error: illegal byte (0xfe or 0xff) at offset 0
\xc3\x8f \xc3\x8f
No match No match
\xe0\xaf\x80 \xe0\xaf\x80
@ -220,13 +220,13 @@ No match
\xf1\x8f\x80\x80 \xf1\x8f\x80\x80
No match No match
\xf8\x88\x80\x80\x80 \xf8\x88\x80\x80\x80
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\xf9\x87\x80\x80\x80 \xf9\x87\x80\x80\x80
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\xfc\x84\x80\x80\x80\x80 \xfc\x84\x80\x80\x80\x80
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\xfd\x83\x80\x80\x80\x80 \xfd\x83\x80\x80\x80\x80
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\xf8\x88\x80\x80\x80\=no_utf_check \xf8\x88\x80\x80\x80\=no_utf_check
No match No match
\xf9\x87\x80\x80\x80\=no_utf_check \xf9\x87\x80\x80\x80\=no_utf_check
@ -751,27 +751,27 @@ Failed: error 106 at offset 15: missing terminating ] for character class
/X/utf /X/utf
\x{d800} \x{d800}
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{d800}\=no_utf_check \x{d800}\=no_utf_check
No match No match
\x{da00} \x{da00}
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{da00}\=no_utf_check \x{da00}\=no_utf_check
No match No match
\x{dfff} \x{dfff}
Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined Failed: error -16: UTF-8 error: code points 0xd800-0xdfff are not defined at offset 0
\x{dfff}\=no_utf_check \x{dfff}\=no_utf_check
No match No match
\x{110000} \x{110000}
Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined Failed: error -15: UTF-8 error: code points greater than 0x10ffff are not defined at offset 0
\x{110000}\=no_utf_check \x{110000}\=no_utf_check
No match No match
\x{2000000} \x{2000000}
Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) Failed: error -13: UTF-8 error: 5-byte character is not allowed (RFC 3629) at offset 0
\x{2000000}\=no_utf_check \x{2000000}\=no_utf_check
No match No match
\x{7fffffff} \x{7fffffff}
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
\x{7fffffff}\=no_utf_check \x{7fffffff}\=no_utf_check
No match No match
@ -1106,7 +1106,7 @@ Subject length lower bound = 1
\x{ff000041} \x{ff000041}
** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8 ** Character \x{ff000041} is greater than 0x7fffffff and so cannot be converted to UTF-8
\x{7f000041} \x{7f000041}
Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) Failed: error -14: UTF-8 error: 6-byte character is not allowed (RFC 3629) at offset 0
/(*UTF8)abc/never_utf /(*UTF8)abc/never_utf
Failed: error 174 at offset 7: using UTF is disabled by the application Failed: error 174 at offset 7: using UTF is disabled by the application


@ -607,30 +607,30 @@ Subject length lower bound = 2
Failed: error 106 at offset 13: missing terminating ] for character class Failed: error 106 at offset 13: missing terminating ] for character class
/X/utf /X/utf
\x{d800} XX\x{d800}
Failed: error -24: UTF-16 error: missing low surrogate at end Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
\x{d800}\=no_utf_check XX\x{d800}\=no_utf_check
No match 0: X
\x{da00} XX\x{da00}
Failed: error -24: UTF-16 error: missing low surrogate at end Failed: error -24: UTF-16 error: missing low surrogate at end at offset 2
\x{da00}\=no_utf_check XX\x{da00}\=no_utf_check
No match 0: X
\x{dc00} XX\x{dc00}
Failed: error -26: UTF-16 error: isolated low surrogate Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
\x{dc00}\=no_utf_check XX\x{dc00}\=no_utf_check
No match 0: X
\x{de00} XX\x{de00}
Failed: error -26: UTF-16 error: isolated low surrogate Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
\x{de00}\=no_utf_check XX\x{de00}\=no_utf_check
No match 0: X
\x{dfff} XX\x{dfff}
Failed: error -26: UTF-16 error: isolated low surrogate Failed: error -26: UTF-16 error: isolated low surrogate at offset 2
\x{dfff}\=no_utf_check XX\x{dfff}\=no_utf_check
No match 0: X
\x{110000} XX\x{110000}
** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16 ** Failed: character \x{110000} is greater than 0x10ffff and so cannot be converted to UTF-16
\x{d800}\x{1234} XX\x{d800}\x{1234}
Failed: error -25: UTF-16 error: invalid low surrogate Failed: error -25: UTF-16 error: invalid low surrogate at offset 3
/(*UTF16)\x{11234}/ /(*UTF16)\x{11234}/
abcd\x{11234}pqr abcd\x{11234}pqr


@ -600,30 +600,30 @@ Subject length lower bound = 2
Failed: error 106 at offset 13: missing terminating ] for character class Failed: error 106 at offset 13: missing terminating ] for character class
/X/utf /X/utf
\x{d800} XX\x{d800}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
\x{d800}\=no_utf_check XX\x{d800}\=no_utf_check
No match 0: X
\x{da00} XX\x{da00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
\x{da00}\=no_utf_check XX\x{da00}\=no_utf_check
No match 0: X
\x{dc00} XX\x{dc00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
\x{dc00}\=no_utf_check XX\x{dc00}\=no_utf_check
No match 0: X
\x{de00} XX\x{de00}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
\x{de00}\=no_utf_check XX\x{de00}\=no_utf_check
No match 0: X
\x{dfff} XX\x{dfff}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
\x{dfff}\=no_utf_check XX\x{dfff}\=no_utf_check
No match 0: X
\x{110000} XX\x{110000}
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 2
\x{d800}\x{1234} XX\x{d800}\x{1234}
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
/(*UTF16)\x{11234}/ /(*UTF16)\x{11234}/
Failed: error 160 at offset 5: (*VERB) not recognized or malformed Failed: error 160 at offset 5: (*VERB) not recognized or malformed
@ -1113,7 +1113,7 @@ Failed: error 134 at offset 10: character code point value in \x{} or \o{} is to
/\C/utf /\C/utf
\x{110000} \x{110000}
Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined Failed: error -28: UTF-32 error: code points greater than 0x10ffff are not defined at offset 0
/\x{100}*A/IB,utf /\x{100}*A/IB,utf
------------------------------------------------------------------ ------------------------------------------------------------------