Further substitution tests (code and data), and more documentation.

Philip.Hazel 2014-11-14 18:41:20 +00:00
parent adc7be2d3a
commit 07f8372202
21 changed files with 1014 additions and 693 deletions

@ -51,4 +51,6 @@ the currrent group as "unset". Thus, the ovector for those groups contained
whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when
matched against "abcd". matched against "abcd".
8. The pcre2_substitute() function has been implemented.
**** ****

@ -135,7 +135,7 @@ remaining sections, except for the <b>pcre2demo</b> section (which is a program
listing), and the short pages for individual functions, are concatenated in listing), and the short pages for individual functions, are concatenated in
<b>pcre2.txt</b>, for ease of searching. The sections are as follows: <b>pcre2.txt</b>, for ease of searching. The sections are as follows:
<pre> <pre>
pcre2 this document FIXME CHECK THIS LIST pcre2 this document
pcre2-config show PCRE2 installation configuration information pcre2-config show PCRE2 installation configuration information
pcre2api details of PCRE2's native C API pcre2api details of PCRE2's native C API
pcre2build building PCRE2 pcre2build building PCRE2

@ -1089,7 +1089,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
Which characters are interpreted as newlines can be specified by a setting in Which characters are interpreted as newlines can be specified by a setting in
the compile context that is passed to <b>pcre2_compile()</b> or by a special the compile context that is passed to <b>pcre2_compile()</b> or by a special
sequence at the start of the pattern, as described in the section entitled sequence at the start of the pattern, as described in the section entitled
<a href="pcrepattern.html#newlines">"Newline conventions"</a> <a href="pcre2pattern.html#newlines">"Newline conventions"</a>
in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
built. built.
<pre> <pre>
@ -1243,7 +1243,7 @@ This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII characters \w, and some of the POSIX character classes. By default, only ASCII characters
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
classify characters. More details are given in the section on classify characters. More details are given in the section on
<a href="pcre2.html#genericchartypes">generic character types</a> <a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much page. If you set PCRE2_UCP, matching one of the items it affects takes much
@ -1924,11 +1924,8 @@ documentation.
<P> <P>
When PCRE2 is built, a default newline convention is set; this is usually the When PCRE2 is built, a default newline convention is set; this is usually the
standard convention for the operating system. The default can be overridden in standard convention for the operating system. The default can be overridden in
either a a
<a href="#compilecontext">compile context</a> <a href="#compilecontext">compile context.</a>
or a
<a href="#matchcontext">match context.</a>
However, changing the newline convention at match time disables JIT matching.
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. position is advanced after a match failure for an unanchored pattern.
@ -2290,7 +2287,7 @@ subpattern <i>n</i> has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings. substrings.
<a name="extractbynname"></a></P> <a name="extractbyname"></a></P>
<br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br> <br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P> <P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b> <b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
@ -2358,7 +2355,8 @@ string in <i>outputbuffer</i>, replacing the part that was matched with the
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
</P> </P>
<P> <P>
In the replacement string, which is interpreted as a UTF string in UTF mode, a In the replacement string, which is interpreted as a UTF string in UTF mode,
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
dollar character is an escape character that can specify the insertion of dollar character is an escape character that can specify the insertion of
characters from capturing groups in the pattern. The following forms are characters from capturing groups in the pattern. The following forms are
recognized: recognized:

@ -51,11 +51,12 @@ JIT support is an optional feature of PCRE2. The "configure" option
you want to use JIT. The support is limited to the following hardware you want to use JIT. The support is limited to the following hardware
platforms: platforms:
<pre> <pre>
ARM v5, v7, and Thumb2 ARM 32-bit (v5, v7, and Thumb2)
ARM 64-bit
Intel x86 32-bit and 64-bit Intel x86 32-bit and 64-bit
MIPS 32-bit MIPS 32-bit and 64-bit
Power PC 32-bit and 64-bit Power PC 32-bit and 64-bit
SPARC 32-bit (experimental) SPARC 32-bit
</pre> </pre>
If --enable-jit is set on an unsupported platform, compilation fails. If --enable-jit is set on an unsupported platform, compilation fails.
</P> </P>
@ -73,11 +74,11 @@ To make use of the JIT support in the simplest way, all you have to do is to
call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with
<b>pcre2_compile()</b>. This function has two arguments: the first is the <b>pcre2_compile()</b>. This function has two arguments: the first is the
compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the
second is a set of option bits, which must include at least one of second is zero or more of the following option bits: PCRE2_JIT_COMPLETE,
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
</P> </P>
<P> <P>
If JIT support is not available, a call to <b>pcre2_jit_comple()</b> does If JIT support is not available, a call to <b>pcre2_jit_compile()</b> does
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern
is passed to the JIT compiler, which turns it into machine code that executes is passed to the JIT compiler, which turns it into machine code that executes
much faster than the normal interpretive code, but yields exactly the same much faster than the normal interpretive code, but yields exactly the same
@ -95,6 +96,20 @@ appropriate code is run if it is available. Otherwise, the pattern is matched
using interpretive code. using interpretive code.
</P> </P>
<P> <P>
You can call <b>pcre2_jit_compile()</b> multiple times for the same compiled
pattern. It does nothing if it has previously compiled code for any of the
option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
(perhaps later, when you find you need partial matching) again with
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
returns zero. This is an alternative way of testing if JIT is available.
</P>
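
The repeated-call behaviour described in the paragraph above can be pictured as bit-mask bookkeeping. The following is a stand-alone model, not PCRE2 code: the flag values, the `jit_state` struct, and `model_jit_compile()` are hypothetical names invented for illustration, and the model assumes JIT support is available. Unlike the real function, it returns the set of modes that the call actually compiled, so that the "ignore what already exists" step is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag values for this model only. */
#define MODEL_JIT_COMPLETE      0x1u
#define MODEL_JIT_PARTIAL_SOFT  0x2u
#define MODEL_JIT_PARTIAL_HARD  0x4u
#define MODEL_JIT_ALL \
  (MODEL_JIT_COMPLETE | MODEL_JIT_PARTIAL_SOFT | MODEL_JIT_PARTIAL_HARD)

/* Hypothetical record of which matching modes already have machine code. */
typedef struct {
  uint32_t compiled;
} jit_state;

/* Compile only the requested modes that have not been compiled before.
   Returns the set of modes newly compiled by this call; a call with no
   option bits set therefore returns zero without doing any work. */
static uint32_t model_jit_compile(jit_state *st, uint32_t options)
{
  uint32_t todo;
  assert(st != NULL);
  todo = options & MODEL_JIT_ALL & ~st->compiled;
  st->compiled |= todo;
  return todo;
}
```

In this model, calling with `MODEL_JIT_COMPLETE` and later with `MODEL_JIT_COMPLETE | MODEL_JIT_PARTIAL_HARD` compiles only the partial-hard code on the second call, which mirrors the example in the text.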
<P>
At present, it is not possible to free JIT compiled code except when the entire
compiled pattern is freed by calling <b>pcre2_free_code()</b>.
</P>
<P>
In some circumstances you may need to call additional functions. These are In some circumstances you may need to call additional functions. These are
described in the section entitled described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a> <a href="#stackcontrol">"Controlling the JIT stack"</a>
@ -167,7 +182,7 @@ memory allocation), a starting size and a maximum size, and it returns a
pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
that is no longer needed. (For the technically minded: the address space is that is no longer needed. (For the technically minded: the address space is
allocated by mmap or VirtualAlloc.) FIXME Is this right? allocated by mmap or VirtualAlloc.)
</P> </P>
<P> <P>
JIT uses far less memory for recursion than the interpretive code, JIT uses far less memory for recursion than the interpretive code,
@ -187,7 +202,8 @@ passed to a matching function, its information determines which JIT stack is
used. There are three cases for the values of the other two options: used. There are three cases for the values of the other two options:
<pre> <pre>
(1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block (1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block
on the machine stack is used. on the machine stack is used. This is the default when a match
context is created.
(2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be (2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
a pointer to a valid JIT stack, the result of calling a pointer to a valid JIT stack, the result of calling
@ -402,7 +418,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br> <br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 08 November 2014 Last updated: 12 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

@ -100,8 +100,8 @@ page.
<P> <P>
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is set at compile time, (*UTF) is not allowed, and its appearance causes option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
an error. appearance in a pattern causes an error.
</P> </P>
<br><b> <br><b>
Unicode property support Unicode property support
@ -113,6 +113,22 @@ such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup instead of recognizing only characters with codes less than 128 via a lookup
table. table.
</P> </P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
</P>
<br><b>
Locking out empty string matching
</b><br>
<P>
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
</P>
<br><b> <br><b>
Disabling auto-possessification Disabling auto-possessification
</b><br> </b><br>
@ -133,6 +149,28 @@ PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
reaching "no match" results. For more details, see the reaching "no match" results. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>
documentation. documentation.
</P>
<br><b>
Setting match and recursion limits
</b><br>
<P>
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
internal <b>match()</b> function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
gives an error return. The limits can also be set by items at the start of the
pattern of the form
<pre>
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
<a name="newlines"></a></P> <a name="newlines"></a></P>
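
The rule that a pattern can lower a limit but never raise it amounts to taking a minimum. The sketch below is a stand-alone illustration, not PCRE2 code: `effective_limit()` is a hypothetical helper, and the values in `pattern_limits` stand for the numbers given in any `(*LIMIT_MATCH=d)` or `(*LIMIT_RECURSION=d)` items at the start of a pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine the caller's limit with the settings found
   at the start of a pattern. A pattern setting takes effect only when it is
   lower than the limit currently in force, so the result never exceeds the
   caller's value, and with several settings the lowest one wins. */
static uint32_t effective_limit(uint32_t caller_limit,
                                const uint32_t *pattern_limits, size_t count)
{
  uint32_t limit = caller_limit;
  size_t i;
  assert(pattern_limits != NULL || count == 0);
  for (i = 0; i < count; i++)
    if (pattern_limits[i] < limit) limit = pattern_limits[i];
  return limit;
}
```

With a caller limit of 10000, pattern settings of 5000 and 3000 give 3000, whereas a pattern setting of 20000000 is ignored and the limit stays at 10000.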
<br><b> <br><b>
Newline conventions Newline conventions
@ -179,26 +217,14 @@ below. A change of \R setting can be combined with a change of newline
convention. convention.
</P> </P>
<br><b> <br><b>
Setting match and recursion limits Specifying what \R matches
</b><br> </b><br>
<P> <P>
The caller of <b>pcre2_match()</b> can set a limit on the number of times the It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
internal <b>match()</b> function is called and on the maximum depth of complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
recursive calls. These facilities are provided to catch runaway matches that at compile time. This effect can also be achieved by starting a pattern with
are provoked by patterns with huge matching trees (a typical example is a (*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
pattern with nested unlimited repeats) and to avoid running out of system stack corresponding to PCRE2_BSR_UNICODE.
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
gives an error return. The limits can also be set by items at the start of the
pattern of the form
<pre>
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
</P> </P>
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br> <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P> <P>
@ -2280,8 +2306,8 @@ complex:
</PRE> </PRE>
</P> </P>
<P> <P>
There are four kinds of condition: references to subpatterns, references to There are five kinds of condition: references to subpatterns, references to
recursion, a pseudo-condition called DEFINE, and assertions. recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
</P> </P>
<br><b> <br><b>
Checking for a used subpattern by number Checking for a used subpattern by number
@ -2389,6 +2415,23 @@ pattern uses references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at each end. components of an IPv4 address, insisting on a word boundary at each end.
</P> </P>
<br><b> <br><b>
Checking the PCRE2 version
</b><br>
<P>
Programs that link with a PCRE2 library can check the version by calling
<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
not have access to the underlying code cannot do this. A special "condition"
called VERSION exists to allow such users to discover which version of PCRE2
they are dealing with by using this condition to match a string such as
"yesno". VERSION must be followed either by "=" or "&#62;=" and a version number.
For example:
<pre>
(?(VERSION&#62;=10.4)yes|no)
</pre>
This pattern matches "yes" if the PCRE2 version is greater or equal to 10.4, or
"no" otherwise.
</P>
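
The test performed by the VERSION condition can be modelled as a comparison of (major, minor) pairs. This is a stand-alone sketch, not PCRE2 code: `version_condition()` is a hypothetical helper, and it takes the two parts of each version as already-parsed integers, because the text above does not specify how a string such as "10.4" is converted to a minor number.

```c
#include <assert.h>

/* Hypothetical model of the VERSION condition.
   at_least != 0 models "VERSION>=want", at_least == 0 models "VERSION=want".
   Returns 1 if the condition is true (the "yes" branch is used) and 0
   otherwise (the "no" branch is used). */
static int version_condition(int lib_major, int lib_minor,
                             int want_major, int want_minor, int at_least)
{
  assert(lib_minor >= 0 && want_minor >= 0);
  if (at_least) {
    if (lib_major != want_major) return lib_major > want_major;
    return lib_minor >= want_minor;
  }
  return lib_major == want_major && lib_minor == want_minor;
}
```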
<br><b>
Assertion conditions Assertion conditions
</b><br> </b><br>
<P> <P>
@ -3180,7 +3223,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3), <b>pcre216(3)</b>, <b>pcre232(3)</b>. <b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P> </P>
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br> <br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
<P> <P>
@ -3193,7 +3236,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 November 2014 Last updated: 14 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

@ -493,17 +493,18 @@ Each top-level branch of a look behind must be of a fixed length.
(?(condition)yes-pattern) (?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern) (?(condition)yes-pattern|no-pattern)
(?(n)... absolute reference condition (?(n) absolute reference condition
(?(+n)... relative reference condition (?(+n) relative reference condition
(?(-n)... relative reference condition (?(-n) relative reference condition
(?(&#60;name&#62;)... named reference condition (Perl) (?(&#60;name&#62;) named reference condition (Perl)
(?('name')... named reference condition (Perl) (?('name') named reference condition (Perl)
(?(name)... named reference condition (PCRE2) (?(name) named reference condition (PCRE2)
(?(R)... overall recursion condition (?(R) overall recursion condition
(?(Rn)... specific group recursion condition (?(Rn) specific group recursion condition
(?(R&name)... specific recursion condition (?(R&name) specific recursion condition
(?(DEFINE)... define subpattern for reference (?(DEFINE) define subpattern for reference
(?(assert)... assertion condition (?(VERSION[&#62;]=n.m) test PCRE2 version
(?(assert) assertion condition
</PRE> </PRE>
</P> </P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br> <br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
@ -552,7 +553,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 20 October 2014 Last updated: 14 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

@ -201,10 +201,11 @@ Behave as if each subject line contains the given modifiers.
<P> <P>
<b>-t</b> <b>-t</b>
Run each compile and match many times with a timer, and output the resulting Run each compile and match many times with a timer, and output the resulting
times per compile or match. You can control the number of iterations that are times per compile or match. When JIT is used, separate times are given for the
used for timing by following <b>-t</b> with a number (as a separate item on the initial compile and the JIT compile. You can control the number of iterations
command line). For example, "-t 1000" iterates 1000 times. The default is to that are used for timing by following <b>-t</b> with a number (as a separate
iterate 500,000 times. item on the command line). For example, "-t 1000" iterates 1000 times. The
default is to iterate 500,000 times.
</P> </P>
<P> <P>
<b>-tm</b> <b>-tm</b>
@ -490,7 +491,6 @@ about the pattern:
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
</pre> </pre>
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
</P> </P>
<br><b> <br><b>
Newline and \R handling Newline and \R handling
@ -528,7 +528,31 @@ one-off tests.
<P> <P>
The <b>info</b> modifier requests information about the compiled pattern The <b>info</b> modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The (whether it is anchored, has a fixed first character, and so on). The
information is obtained from the <b>pcre2_pattern_info()</b> function. information is obtained from the <b>pcre2_pattern_info()</b> function. Here are
some typical examples:
<pre>
re&#62; /(?i)(^a|^b)/m,info
Capturing subpattern count = 1
Compile options: multiline
Overall options: caseless multiline
First code unit at start or follows newline
Subject length lower bound = 1
re&#62; /(?i)abc/info
Capturing subpattern count = 0
Compile options: &#60;none&#62;
Overall options: caseless
First code unit = 'a' (caseless)
Last code unit = 'c' (caseless)
Subject length lower bound = 3
</pre>
"Compile options" are those specified to the compile function; "overall
options" have added options that are taken or deduced from the pattern. If both
sets of options are the same, just a single "options" line is output. "First
code unit" is where any match must start; if there is more than one they are
listed as "starting code units". "Last code unit" is the last literal code unit
that must be present in any match. This is not necessarily the last character.
These lines are omitted if no starting or ending code units are recorded.
</P> </P>
<br><b> <br><b>
Specifying a pattern in hex Specifying a pattern in hex
@ -543,8 +567,8 @@ pairs. For example:
This feature is provided as a way of creating patterns that contain binary zero This feature is provided as a way of creating patterns that contain binary zero
characters. By default, <b>pcre2test</b> passes patterns as zero-terminated characters. By default, <b>pcre2test</b> passes patterns as zero-terminated
strings to <b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED. strings to <b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED.
However, for patterns specified in hexadecimal, the length of the pattern is However, for patterns specified in hexadecimal, the actual length of the
passed. pattern is passed.
</P> </P>
<br><b> <br><b>
JIT compilation JIT compilation
@ -571,7 +595,7 @@ setting the size of the JIT stack.
</P> </P>
<P> <P>
If the <b>jitfast</b> modifier is specified, matching is done using the JIT If the <b>jitfast</b> modifier is specified, matching is done using the JIT
"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity "fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
checks that are done by <b>pcre2_match()</b>, and of course does not work when checks that are done by <b>pcre2_match()</b>, and of course does not work when
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
assumed. assumed.
@ -604,11 +628,17 @@ character tables are mutually exclusive.
Showing pattern memory Showing pattern memory
</b><br> </b><br>
<P> <P>
The <b>/memory</b> modifier causes the size in bytes of the memory block used to The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
hold the compiled pattern to be output. This does not include the size of the the compiled pattern to be output. This does not include the size of the
<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is <b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is subsequently passed to the JIT compiler, the size of the JIT compiled code is
also output. also output. Here is an example:
<pre>
re&#62; /a(b)c/jit,memory
Memory allocation (code space): 21
Memory allocation (JIT code): 1910
</PRE>
</P> </P>
<br><b> <br><b>
Limiting nested parentheses Limiting nested parentheses
@ -650,8 +680,8 @@ enable stack availability to be checked during compilation (see the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>
documentation for details). If the number specified by the modifier is greater documentation for details). If the number specified by the modifier is greater
than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up
callback from <b>pcre2_compile()</b> to a local function. The argument it is callback from <b>pcre2_compile()</b> to a local function. The argument it
passed is the current nesting parenthesis depth; if this is greater than the receives is the current nesting parenthesis depth; if this is greater than the
value given by the modifier, non-zero is returned, causing the compilation to value given by the modifier, non-zero is returned, causing the compilation to
be aborted. be aborted.
</P> </P>
@ -688,6 +718,7 @@ not affect the compilation process.
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
mark show mark values mark show mark values
replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
</pre> </pre>
These modifiers may not appear in a <b>#pattern</b> command. If you want them as These modifiers may not appear in a <b>#pattern</b> command. If you want them as
@ -759,11 +790,11 @@ pattern.
offset=&#60;n&#62; set starting offset offset=&#60;n&#62; set starting offset
ovector=&#60;n&#62; set size of output vector ovector=&#60;n&#62; set size of output vector
recursion_limit=&#60;n&#62; set a recursion limit recursion_limit=&#60;n&#62; set a recursion limit
replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
</pre> </pre>
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
</P> </P>
<br><b> <br><b>
Showing more text Showing more text
@ -841,6 +872,30 @@ Any value other than zero is used as a return from <b>pcre2test</b>'s callout
function. function.
</P> </P>
<br><b> <br><b>
Finding all matches in a string
</b><br>
<P>
Searching for all possible matches within a subject can be requested by the
<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between <b>global</b> and <b>altglobal</b> is that the former uses the
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
to start searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened substring. This makes a
difference to the matching process if the pattern begins with a lookbehind
assertion (including \b or \B).
</P>
<P>
If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
another, non-empty, match at the same point in the subject. If this match
fails, the start offset is advanced, and the normal match is retried. This
imitates the way Perl handles such cases when using the <b>/g</b> modifier or
the <b>split()</b> function. Normally, the start offset is advanced by one
character, but if the newline convention recognizes CRLF as a newline, and the
current character is CR followed by LF, an advance of two is used.
</P>
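
The advance step at the end of the paragraph above can be sketched as follows. This is a stand-alone illustration, not code taken from pcre2test: `advance_after_empty_match()` is a hypothetical helper, and it works in single code units, so it assumes a non-UTF subject in which one character is one code unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: after an empty-string match at `offset`, and after
   the anchored, not-empty retry at the same point has failed, return the
   offset at which the normal match is retried. The advance is two code
   units when CRLF counts as a newline and the subject has CR LF at
   `offset`; otherwise it is one. */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t offset, int crlf_is_newline)
{
  assert(subject != NULL && offset < length);
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;
  return offset + 1;
}
```

Stepping over CR LF as a unit prevents the next attempt from starting between the two characters of a single newline.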
<br><b>
Testing substring extraction functions Testing substring extraction functions
</b><br> </b><br>
<P> <P>
@ -867,28 +922,46 @@ length (that is, the return from the extraction function) is given in
parentheses after each substring. parentheses after each substring.
</P> </P>
<br><b> <br><b>
Finding all matches in a string Testing the substitution function
</b><br> </b><br>
<P> <P>
Searching for all possible matches within a subject can be requested by the If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching called instead of one of the matching functions. Unlike subject strings,
function is called again to search the remainder of the subject. The difference <b>pcre2test</b> does not process replacement strings for escape sequences. In
between <b>global</b> and <b>altglobal</b> is that the former uses the UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> If so, it is correctly converted to a UTF string of the appropriate code unit
to start searching at a new point within the entire string (which is what Perl width. If it is not a valid UTF-8 string, the individual code units are copied
does), whereas the latter passes over a shortened substring. This makes a directly. This provides a means of passing an invalid UTF-8 string for testing
difference to the matching process if the pattern begins with a lookbehind purposes.
assertion (including \b or \B).
</P> </P>
<P> <P>
If an empty string is matched, the next match is done with the If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for <b>pcre2_substitute()</b>. After a successful substitution, the modified string
another, non-empty, match at the same point in the subject. If this match is output, preceded by the number of replacements. This may be zero if there
fails, the start offset is advanced, and the normal match is retried. This were no matches. Here is a simple example of a substitution test:
imitates the way Perl handles such cases when using the <b>/g</b> modifier or <pre>
the <b>split()</b> function. Normally, the start offset is advanced by one /abc/replace=xxx
character, but if the newline convention recognizes CRLF as a newline, and the =abc=abc=
current character is CR followed by LF, an advance of two is used. 1: =xxx=abc=
=abc=abc=\=global
2: =xxx=xxx=
</pre>
Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to test for
buffer overflow, if the replacement string starts with a number in square
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
output buffer, with the replacement string starting at the next character. Here
is an example that tests the edge case:
<pre>
/abc/
123abc123\=replace=[10]XYZ
1: 123XYZ123
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory
</pre>
A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from
<b>pcre2_substitute()</b>.
</P> </P>
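The arithmetic behind the edge case above can be sketched in plain C. This is not the code that pcre2test uses; `parse_size_prefix` and `fits_in_buffer` are hypothetical helper names chosen for illustration. The only facts taken from the text are the `[n]` prefix on the replacement string and the need for room for a trailing zero, which is why a size of 10 succeeds and a size of 9 fails for the 9-character result `123XYZ123`.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if repl starts with "[digits]", store the number in
   *size and return a pointer to the character after ']'. Otherwise leave
   *size untouched and return repl unchanged. */
static const char *parse_size_prefix(const char *repl, size_t *size)
{
    if (repl[0] != '[' || !isdigit((unsigned char)repl[1]))
        return repl;

    char *end;
    unsigned long n = strtoul(repl + 1, &end, 10);
    if (*end != ']')
        return repl;              /* digits not followed by ']' */

    *size = (size_t)n;
    return end + 1;               /* replacement starts after the ']' */
}

/* The output buffer must hold the new string plus one trailing zero. */
static int fits_in_buffer(const char *result, size_t bufsize)
{
    return strlen(result) + 1 <= bufsize;
}
```

Under these assumptions, `parse_size_prefix("[10]XYZ", &size)` yields a size of 10 and the replacement `XYZ`, and `fits_in_buffer("123XYZ123", 9)` is false because the trailing zero needs a tenth code unit.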
<br><b> <br><b>
Setting the JIT stack size Setting the JIT stack size
@ -969,10 +1042,10 @@ available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
<b>regexec()</b> to be called with a NULL capture vector. When not testing the <b>regexec()</b> to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause POSIX API, a value of zero is used to cause
<b>pcre2_match_data_create_from_pattern</b> to be called, in order to create a <b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a
match block of exactly the right size for the pattern. (It is not possible to match block of exactly the right size for the pattern. (It is not possible to
create a match block with a zero-length ovector; there is always one pair of create a match block with a zero-length ovector; there is always at least one
offsets.) pair of offsets.)
</P> </P>
<br><b> <br><b>
Passing the subject as zero-terminated Passing the subject as zero-terminated
@ -985,7 +1058,7 @@ be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
this modifier has no effect, as there is no facility for passing a length.) this modifier has no effect, as there is no facility for passing a length.)
</P> </P>
<P> <P>
When testing <b>pcre2_substitute</b>, this modifier also has the effect of When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
</P> </P>
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br> <br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
@ -1233,7 +1306,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC20" href="#TOC1">REVISION</a><br> <br><a name="SEC20" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 09 November 2014 Last updated: 14 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -132,7 +132,7 @@ remaining sections, except for the \fBpcre2demo\fP section (which is a program
listing), and the short pages for individual functions, are concatenated in listing), and the short pages for individual functions, are concatenated in
\fBpcre2.txt\fP, for ease of searching. The sections are as follows: \fBpcre2.txt\fP, for ease of searching. The sections are as follows:
.sp .sp
pcre2 this document FIXME CHECK THIS LIST pcre2 this document
pcre2-config show PCRE2 installation configuration information pcre2-config show PCRE2 installation configuration information
pcre2api details of PCRE2's native C API pcre2api details of PCRE2's native C API
pcre2build building PCRE2 pcre2build building PCRE2


@ -116,7 +116,7 @@ USER DOCUMENTATION
tions, are concatenated in pcre2.txt, for ease of searching. The sec- tions, are concatenated in pcre2.txt, for ease of searching. The sec-
tions are as follows: tions are as follows:
pcre2 this document FIXME CHECK THIS LIST pcre2 this document
pcre2-config show PCRE2 installation configuration information pcre2-config show PCRE2 installation configuration information
pcre2api details of PCRE2's native C API pcre2api details of PCRE2's native C API
pcre2build building PCRE2 pcre2build building PCRE2
@ -1928,12 +1928,10 @@ NEWLINE HANDLING WHEN MATCHING
When PCRE2 is built, a default newline convention is set; this is usu- When PCRE2 is built, a default newline convention is set; this is usu-
ally the standard convention for the operating system. The default can ally the standard convention for the operating system. The default can
be overridden in either a compile context or a match context. However, be overridden in a compile context. During matching, the newline
changing the newline convention at match time disables JIT matching. choice affects the behaviour of the dot, circumflex, and dollar
During matching, the newline choice affects the behaviour of the dot, metacharacters. It may also alter the way the match position is
circumflex, and dollar metacharacters. It may also alter the way the advanced after a match failure for an unanchored pattern.
match position is advanced after a match failure for an unanchored pat-
tern.
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
set, and a match attempt for an unanchored pattern fails when the cur- set, and a match attempt for an unanchored pattern fails when the cur-
@ -2320,46 +2318,47 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
given as PCRE2_ZERO_TERMINATED for a zero-terminated string. given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
In the replacement string, which is interpreted as a UTF string in UTF In the replacement string, which is interpreted as a UTF string in UTF
mode, a dollar character is an escape character that can specify the mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
insertion of characters from capturing groups in the pattern. The fol- option is set, a dollar character is an escape character that can spec-
lowing forms are recognized: ify the insertion of characters from capturing groups in the pattern.
The following forms are recognized:
$$ insert a dollar character $$ insert a dollar character
$<n> insert the contents of group <n> $<n> insert the contents of group <n>
${<n>} insert the contents of group <n> ${<n>} insert the contents of group <n>
Either a group number or a group name can be given for <n>. Curly Either a group number or a group name can be given for <n>. Curly
brackets are required only if the following character would be inter- brackets are required only if the following character would be inter-
preted as part of the number or name. The number may be zero to include preted as part of the number or name. The number may be zero to include
the entire matched string. For example, if the pattern a(b)c is the entire matched string. For example, if the pattern a(b)c is
matched with "[abc]" and the replacement string "+$1$0$1+", the result matched with "[abc]" and the replacement string "+$1$0$1+", the result
is "[+babcb+]". Group insertion is done by calling pcre2_copy_byname() is "[+babcb+]". Group insertion is done by calling pcre2_copy_byname()
or pcre2_copy_bynumber() as appropriate. or pcre2_copy_bynumber() as appropriate.
The first seven arguments of pcre2_substitute() are the same as for The first seven arguments of pcre2_substitute() are the same as for
pcre2_match(), except that the partial matching options are not permit- pcre2_match(), except that the partial matching options are not permit-
ted, and match_data may be passed as NULL, in which case a match data ted, and match_data may be passed as NULL, in which case a match data
block is obtained and freed within this function, using memory manage- block is obtained and freed within this function, using memory manage-
ment functions from the match context, if provided, or else those that ment functions from the match context, if provided, or else those that
were used to allocate memory for the compiled code. were used to allocate memory for the compiled code.
There is one additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes There is one additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes
the function to iterate over the subject string, replacing every match- the function to iterate over the subject string, replacing every match-
ing substring. If this is not set, only the first matching substring is ing substring. If this is not set, only the first matching substring is
replaced. replaced.
The outlengthptr argument must point to a variable that contains the The outlengthptr argument must point to a variable that contains the
length, in code units, of the output buffer. It is updated to contain length, in code units, of the output buffer. It is updated to contain
the length of the new string, excluding the trailing zero that is auto- the length of the new string, excluding the trailing zero that is auto-
matically added. matically added.
The function returns the number of replacements that were made. This The function returns the number of replacements that were made. This
may be zero if no matches were found, and is never greater than 1 may be zero if no matches were found, and is never greater than 1
unless PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a neg- unless PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a neg-
ative error code is returned. Except for PCRE2_ERROR_NOMATCH (which is ative error code is returned. Except for PCRE2_ERROR_NOMATCH (which is
never returned), any errors from pcre2_match() or the substring copying never returned), any errors from pcre2_match() or the substring copying
functions are passed straight back. PCRE2_ERROR_BADREPLACEMENT is functions are passed straight back. PCRE2_ERROR_BADREPLACEMENT is
returned for an invalid replacement string (unrecognized sequence fol- returned for an invalid replacement string (unrecognized sequence fol-
lowing a dollar sign), and PCRE2_ERROR_NOMEMORY is returned if the out- lowing a dollar sign), and PCRE2_ERROR_NOMEMORY is returned if the out-
put buffer is not big enough. put buffer is not big enough.
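The replacement-string rules described above can be illustrated with a toy expander in plain C. This is only a sketch under stated assumptions, not the implementation of pcre2_substitute(): it handles group numbers but not group names, it takes the captured strings as a ready-made array, and the function name and the two error values are placeholders, not the real PCRE2 error codes.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Placeholder error values for this sketch (not PCRE2's real codes). */
enum { ERR_BADREPLACEMENT = -1, ERR_NOMEMORY = -2 };

/* groups[0] is the whole match; groups[1..ngroups-1] are captured groups.
   Returns the output length excluding the trailing zero, or a negative
   placeholder error value. */
static long expand_replacement(const char *repl, const char *const *groups,
                               size_t ngroups, char *out, size_t outsize)
{
    size_t len = 0;

    if (outsize == 0) return ERR_NOMEMORY;   /* no room for trailing zero */

    while (*repl != '\0') {
        const char *piece = repl;            /* default: one literal char */
        size_t piecelen = 1;

        if (*repl != '$') {
            repl++;
        } else if (repl[1] == '$') {         /* "$$" inserts one dollar */
            piece = repl + 1;
            repl += 2;
        } else {                             /* "$<n>" or "${<n>}" */
            size_t n = 0;
            int braced = (repl[1] == '{');
            repl += braced ? 2 : 1;
            if (!isdigit((unsigned char)*repl)) return ERR_BADREPLACEMENT;
            while (isdigit((unsigned char)*repl)) {
                n = n * 10 + (size_t)(*repl++ - '0');
                if (n > 65535) return ERR_BADREPLACEMENT;
            }
            if (braced && *repl++ != '}') return ERR_BADREPLACEMENT;
            if (n >= ngroups) return ERR_BADREPLACEMENT;
            piece = groups[n];
            piecelen = strlen(piece);
        }

        /* Always keep one code unit free for the trailing zero. */
        if (len + piecelen + 1 > outsize) return ERR_NOMEMORY;
        memcpy(out + len, piece, piecelen);
        len += piecelen;
    }

    out[len] = '\0';
    return (long)len;
}
```

With `groups` set to `{"abc", "b"}`, as for the pattern a(b)c in the text, the replacement `+$1$0$1+` expands to `+babcb+`; wrapped in the unmatched brackets of the subject this gives the documented result `[+babcb+]`. The returned length excludes the trailing zero, mirroring the description of outlengthptr.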
@ -2369,54 +2368,54 @@ DUPLICATE SUBPATTERN NAMES
int pcre2_substring_nametable_scan(const pcre2_code *code, int pcre2_substring_nametable_scan(const pcre2_code *code,
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
When a pattern is compiled with the PCRE2_DUPNAMES option, names for When a pattern is compiled with the PCRE2_DUPNAMES option, names for
subpatterns are not required to be unique. Duplicate names are always subpatterns are not required to be unique. Duplicate names are always
allowed for subpatterns with the same number, created by using the (?| allowed for subpatterns with the same number, created by using the (?|
feature. Indeed, if such subpatterns are named, they are required to feature. Indeed, if such subpatterns are named, they are required to
use the same names. use the same names.
Normally, patterns with duplicate names are such that in any one match, Normally, patterns with duplicate names are such that in any one match,
only one of the named subpatterns participates. An example is shown in only one of the named subpatterns participates. An example is shown in
the pcre2pattern documentation. the pcre2pattern documentation.
When duplicates are present, pcre2_substring_copy_byname() and When duplicates are present, pcre2_substring_copy_byname() and
pcre2_substring_get_byname() return the first substring corresponding pcre2_substring_get_byname() return the first substring corresponding
to the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING to the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING
is returned. The pcre2_substring_number_from_name() function returns is returned. The pcre2_substring_number_from_name() function returns
one of the numbers that are associated with the name, but it is not one of the numbers that are associated with the name, but it is not
defined which it is. defined which it is.
If you want to get full details of all captured substrings for a given If you want to get full details of all captured substrings for a given
name, you must use the pcre2_substring_nametable_scan() function. The name, you must use the pcre2_substring_nametable_scan() function. The
first argument is the compiled pattern, and the second is the name. If first argument is the compiled pattern, and the second is the name. If
the third and fourth arguments are NULL, the function returns a group the third and fourth arguments are NULL, the function returns a group
number (it is not defined which). Otherwise, the third and fourth argu- number (it is not defined which). Otherwise, the third and fourth argu-
ments must be pointers to variables that are updated by the function. ments must be pointers to variables that are updated by the function.
After it has run, they point to the first and last entries in the name- After it has run, they point to the first and last entries in the name-
to-number table for the given name, and the function returns the length to-number table for the given name, and the function returns the length
of each entry. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if of each entry. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if
there are no entries for the given name. there are no entries for the given name.
The format of the name table is described above in the section entitled The format of the name table is described above in the section entitled
"Information about a pattern". Given all the relevant entries for "Information about a pattern". Given all the relevant entries for
the name, you can extract each of their numbers, and hence the captured the name, you can extract each of their numbers, and hence the captured
data. data.
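Extracting the numbers from a range of name-table entries can be sketched as follows. This assumes the 8-bit library layout described in the "Information about a pattern" section, where each entry starts with the group number in two bytes, most significant first, followed by the zero-terminated name. The function name is hypothetical, and the table is passed in directly so that the sketch stands alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk the entries from first to last (inclusive), entry_size bytes apart,
   and store each group number in numbers[]. Returns how many were stored
   (at most max). Assumes the 8-bit table layout: a 2-byte big-endian group
   number followed by the zero-terminated name. */
static size_t collect_group_numbers(const uint8_t *first, const uint8_t *last,
                                    size_t entry_size,
                                    unsigned *numbers, size_t max)
{
    size_t count = 0;

    if (entry_size == 0) return 0;           /* nothing sensible to walk */

    for (const uint8_t *e = first; e <= last && count < max; e += entry_size)
        numbers[count++] = ((unsigned)e[0] << 8) | e[1];

    return count;
}
```

In real use, the first and last pointers and the entry length would come from pcre2_substring_nametable_scan() as described above, and each collected number identifies one group whose captured data can then be examined.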
FINDING ALL POSSIBLE MATCHES FINDING ALL POSSIBLE MATCHES
The traditional matching function uses a similar algorithm to Perl, The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in which stops when it finds the first match, starting at a given point in
the subject. If you want to find all possible matches, or the longest the subject. If you want to find all possible matches, or the longest
possible match at a given position, consider using the alternative possible match at a given position, consider using the alternative
matching function (see below) instead. If you cannot use the alterna- matching function (see below) instead. If you cannot use the alterna-
tive function, you can kludge it up by making use of the callout facil- tive function, you can kludge it up by making use of the callout facil-
ity, which is described in the pcre2callout documentation. ity, which is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pat- What you have to do is to insert a callout right at the end of the pat-
tern. When your callout function is called, extract and save the cur- tern. When your callout function is called, extract and save the cur-
rent matched substring. Then return 1, which forces pcre2_match() to rent matched substring. Then return 1, which forces pcre2_match() to
backtrack and try other alternatives. Ultimately, when it runs out of backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
@ -2428,26 +2427,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
pcre2_match_context *mcontext, pcre2_match_context *mcontext,
int *workspace, PCRE2_SIZE wscount); int *workspace, PCRE2_SIZE wscount);
The function pcre2_dfa_match() is called to match a subject string The function pcre2_dfa_match() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the against a compiled pattern, using a matching algorithm that scans the
subject string just once, and does not backtrack. This has different subject string just once, and does not backtrack. This has different
characteristics to the normal algorithm, and is not compatible with characteristics to the normal algorithm, and is not compatible with
Perl. Some of the features of PCRE2 patterns are not supported. Never- Perl. Some of the features of PCRE2 patterns are not supported. Never-
theless, there are times when this kind of matching can be useful. For theless, there are times when this kind of matching can be useful. For
a discussion of the two matching algorithms, and a list of features a discussion of the two matching algorithms, and a list of features
that pcre2_dfa_match() does not support, see the pcre2matching documen- that pcre2_dfa_match() does not support, see the pcre2matching documen-
tation. tation.
The arguments for the pcre2_dfa_match() function are the same as for The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block pcre2_match(), plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other com- is used in a different way, and this is described below. The other com-
mon arguments are used in the same way as for pcre2_match(), so their mon arguments are used in the same way as for pcre2_match(), so their
description is not repeated here. description is not repeated here.
The two additional arguments provide workspace for the function. The The two additional arguments provide workspace for the function. The
workspace vector should contain at least 20 elements. It is used for workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More keeping track of multiple paths through the pattern tree. More
workspace is needed for patterns and subjects where there are a lot of workspace is needed for patterns and subjects where there are a lot of
potential matches. potential matches.
Here is an example of a simple call to pcre2_dfa_match(): Here is an example of a simple call to pcre2_dfa_match():
@ -2467,45 +2466,45 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre2_dfa_match() Option bits for pcre2_dfa_match()
The unused bits of the options argument for pcre2_dfa_match() must be The unused bits of the options argument for pcre2_dfa_match() must be
zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT,
PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of
these are exactly the same as for pcre2_match(), so their description these are exactly the same as for pcre2_match(), so their description
is not repeated here. is not repeated here.
PCRE2_PARTIAL_HARD PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT PCRE2_PARTIAL_SOFT
These have the same general effect as they do for pcre2_match(), but These have the same general effect as they do for pcre2_match(), but
the details are slightly different. When PCRE2_PARTIAL_HARD is set for the details are slightly different. When PCRE2_PARTIAL_HARD is set for
pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility subject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete that requires additional characters. This happens even if some complete
matches have already been found. When PCRE2_PARTIAL_SOFT is set, the matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
if the end of the subject is reached, there have been no complete if the end of the subject is reached, there have been no complete
matches, but there is still at least one matching possibility. The por- matches, but there is still at least one matching possibility. The por-
tion of the string that was inspected when the longest partial match tion of the string that was inspected when the longest partial match
was found is set as the first matching string in both cases. There is a was found is set as the first matching string in both cases. There is a
more detailed discussion of partial and multi-segment matching, with more detailed discussion of partial and multi-segment matching, with
examples, in the pcre2partial documentation. examples, in the pcre2partial documentation.
PCRE2_DFA_SHORTEST PCRE2_DFA_SHORTEST
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna- stop as soon as it has found one match. Because of the way the alterna-
tive algorithm works, this is necessarily the shortest possible match tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string. at the first possible matching point in the subject string.
PCRE2_DFA_RESTART PCRE2_DFA_RESTART
When pcre2_dfa_match() returns a partial match, it is possible to call When pcre2_dfa_match() returns a partial match, it is possible to call
it again, with additional subject characters, and have it continue with it again, with additional subject characters, and have it continue with
the same match. The PCRE2_DFA_RESTART option requests this action; when the same match. The PCRE2_DFA_RESTART option requests this action; when
it is set, the workspace and wscount arguments must reference the same it is set, the workspace and wscount arguments must reference the same
vector as before because data about the match so far is left in them vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the after a partial match. There is more discussion of this facility in the
pcre2partial documentation. pcre2partial documentation.
@ -2513,8 +2512,8 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
When pcre2_dfa_match() succeeds, it may have matched more than one sub- When pcre2_dfa_match() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run string in the subject. Note, however, that all the matches from one run
of the function start at the same point in the subject. The shorter of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example, matches are all initial substrings of the longer matches. For example,
if the pattern if the pattern
<.*> <.*>
@ -2529,66 +2528,66 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else> <something> <something else>
<something> <something else> <something further> <something> <something else> <something further>
On success, the yield of the function is a number greater than zero, On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The offsets of the sub- which is the number of matched substrings. The offsets of the sub-
strings are returned in the ovector, and can be extracted in the same strings are returned in the ovector, and can be extracted in the same
way as for pcre2_match(). They are returned in reverse order of way as for pcre2_match(). They are returned in reverse order of
length; that is, the longest matching string is given first. If there length; that is, the longest matching string is given first. If there
were too many matches to fit into the ovector, the yield of the func- were too many matches to fit into the ovector, the yield of the func-
tion is zero, and the vector is filled with the longest matches. tion is zero, and the vector is filled with the longest matches.
NOTE: PCRE2's "auto-possessification" optimization usually applies to NOTE: PCRE2's "auto-possessification" optimization usually applies to
character repeats at the end of a pattern (as well as internally). For character repeats at the end of a pattern (as well as internally). For
example, the pattern "a\d+" is compiled as if it were "a\d++" because example, the pattern "a\d+" is compiled as if it were "a\d++" because
there is no point in backtracking into the repeated digits. For DFA there is no point in backtracking into the repeated digits. For DFA
matching, this means that only one possible match is found. If you matching, this means that only one possible match is found. If you
really do want multiple matches in such cases, either use an ungreedy really do want multiple matches in such cases, either use an ungreedy
repeat ("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compil- repeat ("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compil-
ing. ing.
Error returns from pcre2_dfa_match() Error returns from pcre2_dfa_match()
The pcre2_dfa_match() function returns a negative number when it fails. The pcre2_dfa_match() function returns a negative number when it fails.
Many of the errors are the same as for pcre2_match(), as described Many of the errors are the same as for pcre2_match(), as described
above. There are in addition the following errors that are specific to above. There are in addition the following errors that are specific to
pcre2_dfa_match(): pcre2_dfa_match():
PCRE2_ERROR_DFA_UITEM PCRE2_ERROR_DFA_UITEM
This return is given if pcre2_dfa_match() encounters an item in the This return is given if pcre2_dfa_match() encounters an item in the
pattern that it does not support, for instance, the use of \C or a back pattern that it does not support, for instance, the use of \C or a back
reference. reference.
PCRE2_ERROR_DFA_UCOND PCRE2_ERROR_DFA_UCOND
This return is given if pcre2_dfa_match() encounters a condition item This return is given if pcre2_dfa_match() encounters a condition item
that uses a back reference for the condition, or a test for recursion that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported. in a specific group. These are not supported.
PCRE2_ERROR_DFA_WSSIZE PCRE2_ERROR_DFA_WSSIZE
This return is given if pcre2_dfa_match() runs out of space in the This return is given if pcre2_dfa_match() runs out of space in the
workspace vector. workspace vector.
PCRE2_ERROR_DFA_RECURSE PCRE2_ERROR_DFA_RECURSE
When a recursive subpattern is processed, the matching function calls When a recursive subpattern is processed, the matching function calls
itself recursively, using private memory for the ovector and workspace. itself recursively, using private memory for the ovector and workspace.
This error is given if the internal ovector is not large enough. This This error is given if the internal ovector is not large enough. This
should be extremely rare, as a vector of size 1000 is used. should be extremely rare, as a vector of size 1000 is used.
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
some plausibility checks are made on the contents of the workspace, some plausibility checks are made on the contents of the workspace,
which should contain data about the previous partial match. If any of which should contain data about the previous partial match. If any of
these checks fail, this error is given. these checks fail, this error is given.
SEE ALSO SEE ALSO
pcre2build(3), pcre2libs(3), pcre2callout(3), pcre2matching(3), pcre2build(3), pcre2libs(3), pcre2callout(3), pcre2matching(3),
pcre2partial(3), pcre2posix(3), pcre2demo(3), pcre2sample(3), pcre2partial(3), pcre2posix(3), pcre2demo(3), pcre2sample(3),
pcre2stack(3). pcre2stack(3).
@ -3508,11 +3507,12 @@ AVAILABILITY OF JIT SUPPORT
built if you want to use JIT. The support is limited to the following built if you want to use JIT. The support is limited to the following
hardware platforms: hardware platforms:
ARM v5, v7, and Thumb2 ARM 32-bit (v5, v7, and Thumb2)
ARM 64-bit
Intel x86 32-bit and 64-bit Intel x86 32-bit and 64-bit
MIPS 32-bit MIPS 32-bit and 64-bit
Power PC 32-bit and 64-bit Power PC 32-bit and 64-bit
SPARC 32-bit (experimental) SPARC 32-bit
If --enable-jit is set on an unsupported platform, compilation fails. If --enable-jit is set on an unsupported platform, compilation fails.
@ -3531,10 +3531,10 @@ SIMPLE USE OF JIT
is to call pcre2_jit_compile() after successfully compiling a pattern is to call pcre2_jit_compile() after successfully compiling a pattern
with pcre2_compile(). This function has two arguments: the first is the with pcre2_compile(). This function has two arguments: the first is the
compiled pattern pointer that was returned by pcre2_compile(), and the compiled pattern pointer that was returned by pcre2_compile(), and the
second is a set of option bits, which must include at least one of second is zero or more of the following option bits: PCRE2_JIT_COM-
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
If JIT support is not available, a call to pcre2_jit_comple() does If JIT support is not available, a call to pcre2_jit_compile() does
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
pattern is passed to the JIT compiler, which turns it into machine code pattern is passed to the JIT compiler, which turns it into machine code
that executes much faster than the normal interpretive code, but yields that executes much faster than the normal interpretive code, but yields
@ -3550,81 +3550,94 @@ SIMPLE USE OF JIT
pcre2_match() is called, the appropriate code is run if it is avail- pcre2_match() is called, the appropriate code is run if it is avail-
able. Otherwise, the pattern is matched using interpretive code. able. Otherwise, the pattern is matched using interpretive code.
In some circumstances you may need to call additional functions. These You can call pcre2_jit_compile() multiple times for the same compiled
are described in the section entitled "Controlling the JIT stack" pattern. It does nothing if it has previously compiled code for any of
the option bits. For example, you can call it once with PCRE2_JIT_COM-
PLETE and (perhaps later, when you find you need partial matching)
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
diately returns zero. This is an alternative way of testing if JIT is
available.
At present, it is not possible to free JIT compiled code except when
the entire compiled pattern is freed by calling pcre2_free_code().
In some circumstances you may need to call additional functions. These
are described in the section entitled "Controlling the JIT stack"
below. below.
There are some pcre2_match() options that are not supported by JIT, and There are some pcre2_match() options that are not supported by JIT, and
there are also some pattern items that JIT cannot handle. Details are there are also some pattern items that JIT cannot handle. Details are
given below. In both cases, matching automatically falls back to the given below. In both cases, matching automatically falls back to the
interpretive code. If you want to know whether JIT was actually used interpretive code. If you want to know whether JIT was actually used
for a particular match, you should arrange for a JIT callback function for a particular match, you should arrange for a JIT callback function
to be set up as described in the section entitled "Controlling the JIT to be set up as described in the section entitled "Controlling the JIT
stack" below, even if you do not need to supply a non-default JIT stack" below, even if you do not need to supply a non-default JIT
stack. Such a callback function is called whenever JIT code is about to stack. Such a callback function is called whenever JIT code is about to
be obeyed. If the match-time options are not right for JIT execution, be obeyed. If the match-time options are not right for JIT execution,
the callback function is not obeyed. the callback function is not obeyed.
If the JIT compiler finds an unsupported item, no JIT data is gener- If the JIT compiler finds an unsupported item, no JIT data is gener-
ated. You can find out if JIT matching is available after compiling a ated. You can find out if JIT matching is available after compiling a
pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JIT option. pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JIT option.
A result of 1 means that JIT compilation was successful. A result of 0 A result of 1 means that JIT compilation was successful. A result of 0
means that JIT support is not available, or the pattern was not pro- means that JIT support is not available, or the pattern was not pro-
cessed by pcre2_jit_compile(), or the JIT compiler was not able to han- cessed by pcre2_jit_compile(), or the JIT compiler was not able to han-
dle the pattern. dle the pattern.
UNSUPPORTED OPTIONS AND PATTERN ITEMS UNSUPPORTED OPTIONS AND PATTERN ITEMS
The pcre2_match() options that are supported for JIT matching are The pcre2_match() options that are supported for JIT matching are
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
PCRE2_ANCHORED option is not supported at match time. PCRE2_ANCHORED option is not supported at match time.
The only unsupported pattern items are \C (match a single data unit) The only unsupported pattern items are \C (match a single data unit)
when running in a UTF mode, and a callout immediately before an asser- when running in a UTF mode, and a callout immediately before an asser-
tion condition in a conditional group. tion condition in a conditional group.
RETURN VALUES FROM JIT MATCHING RETURN VALUES FROM JIT MATCHING
When a pattern is matched using JIT matching, the return values are the When a pattern is matched using JIT matching, the return values are the
same as those given by the interpretive pcre2_match() code, with the same as those given by the interpretive pcre2_match() code, with the
addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
that the memory used for the JIT stack was insufficient. See "Control- that the memory used for the JIT stack was insufficient. See "Control-
ling the JIT stack" below for a discussion of JIT stack usage. ling the JIT stack" below for a discussion of JIT stack usage.
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
searching a very large pattern tree goes on for too long, as it is in searching a very large pattern tree goes on for too long, as it is in
the same circumstance when JIT is not used, but the details of exactly the same circumstance when JIT is not used, but the details of exactly
what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
code is never returned when JIT matching is used. code is never returned when JIT matching is used.
CONTROLLING THE JIT STACK CONTROLLING THE JIT STACK
When the compiled JIT code runs, it needs a block of memory to use as a When the compiled JIT code runs, it needs a block of memory to use as a
stack. By default, it uses 32K on the machine stack. However, some stack. By default, it uses 32K on the machine stack. However, some
large or complicated patterns need more than this. The error large or complicated patterns need more than this. The error
PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack. PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
Three functions are provided for managing blocks of memory for use as Three functions are provided for managing blocks of memory for use as
JIT stacks. There is further discussion about the use of JIT stacks in JIT stacks. There is further discussion about the use of JIT stacks in
the section entitled "JIT stack FAQ" below. the section entitled "JIT stack FAQ" below.
The pcre2_jit_stack_create() function creates a JIT stack. Its argu- The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
ments are a general context (for memory allocation functions, or NULL ments are a general context (for memory allocation functions, or NULL
for standard memory allocation), a starting size and a maximum size, for standard memory allocation), a starting size and a maximum size,
and it returns a pointer to an opaque structure of type and it returns a pointer to an opaque structure of type
pcre2_jit_stack, or NULL if there is an error. The pcre2_jit_stack, or NULL if there is an error. The
pcre2_jit_stack_free() function is used to free a stack that is no pcre2_jit_stack_free() function is used to free a stack that is no
longer needed. (For the technically minded: the address space is allo- longer needed. (For the technically minded: the address space is allo-
cated by mmap or VirtualAlloc.) FIXME Is this right? cated by mmap or VirtualAlloc.)
JIT uses far less memory for recursion than the interpretive code, and JIT uses far less memory for recursion than the interpretive code, and
a maximum stack size of 512K to 1M should be more than enough for any a maximum stack size of 512K to 1M should be more than enough for any
pattern. pattern.
The pcre2_jit_stack_assign() function specifies which stack JIT code The pcre2_jit_stack_assign() function specifies which stack JIT code
should use. Its arguments are as follows: should use. Its arguments are as follows:
pcre2_match_context *mcontext pcre2_match_context *mcontext
@ -3633,11 +3646,12 @@ CONTROLLING THE JIT STACK
The first argument is a pointer to a match context. When this is subse- The first argument is a pointer to a match context. When this is subse-
quently passed to a matching function, its information determines which quently passed to a matching function, its information determines which
JIT stack is used. There are three cases for the values of the other JIT stack is used. There are three cases for the values of the other
two options: two options:
(1) If callback is NULL and data is NULL, an internal 32K block (1) If callback is NULL and data is NULL, an internal 32K block
on the machine stack is used. on the machine stack is used. This is the default when a match
context is created.
(2) If callback is NULL and data is not NULL, data must be (2) If callback is NULL and data is not NULL, data must be
a pointer to a valid JIT stack, the result of calling a pointer to a valid JIT stack, the result of calling
@ -3650,30 +3664,30 @@ CONTROLLING THE JIT STACK
return value must be a valid JIT stack, the result of calling return value must be a valid JIT stack, the result of calling
pcre2_jit_stack_create(). pcre2_jit_stack_create().
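The three cases above can be modelled in a few lines of C. This is only a sketch of the selection logic, not PCRE2 code: `model_jit_stack`, `model_jit_callback`, `model_select_stack()` and `model_return_data()` are hypothetical stand-ins for the real opaque `pcre2_jit_stack` type and the callback passed to pcre2_jit_stack_assign(). It also assumes that a callback returning NULL selects the internal 32K block, which is how the full pcre2jit page describes case (3).

```c
#include <stddef.h>

/* Stand-ins for illustration only; the real types are pcre2_jit_stack and
   the callback type accepted by pcre2_jit_stack_assign(). */
typedef struct model_jit_stack { size_t max_size; } model_jit_stack;
typedef model_jit_stack *(*model_jit_callback)(void *data);

/* Returns the stack that JIT code would run on, or NULL to mean
   "use the internal 32K block on the machine stack". */
model_jit_stack *model_select_stack(model_jit_callback callback, void *data)
{
    if (callback != NULL)
        return callback(data);       /* case 3: a NULL result means default */
    return (model_jit_stack *)data;  /* case 1 (NULL) or case 2 (a stack) */
}

/* A one-line callback that simply hands back whatever was registered. */
model_jit_stack *model_return_data(void *data)
{
    return (model_jit_stack *)data;
}
```

In this model a NULL result always stands for the default machine-stack block, so a caller never has to distinguish "no stack assigned" from "callback declined to supply one".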
A callback function is obeyed whenever JIT code is about to be run; it A callback function is obeyed whenever JIT code is about to be run; it
is not obeyed when pcre2_match() is called with options that are incom- is not obeyed when pcre2_match() is called with options that are incom-
patible for JIT matching. A callback function can therefore be used to patible for JIT matching. A callback function can therefore be used to
determine whether a match operation was executed by JIT or by the determine whether a match operation was executed by JIT or by the
interpreter. interpreter.
You may safely use the same JIT stack for more than one pattern (either You may safely use the same JIT stack for more than one pattern (either
by assigning directly or by callback), as long as the patterns are all by assigning directly or by callback), as long as the patterns are all
matched sequentially in the same thread. In a multithread application, matched sequentially in the same thread. In a multithread application,
if you do not specify a JIT stack, or if you assign or pass back NULL if you do not specify a JIT stack, or if you assign or pass back NULL
from a callback, that is thread-safe, because each thread has its own from a callback, that is thread-safe, because each thread has its own
machine stack. However, if you assign or pass back a non-NULL JIT machine stack. However, if you assign or pass back a non-NULL JIT
stack, this must be a different stack for each thread so that the stack, this must be a different stack for each thread so that the
application is thread-safe. application is thread-safe.
Strictly speaking, even more is allowed. You can assign the same non- Strictly speaking, even more is allowed. You can assign the same non-
NULL stack to a match context that is used by any number of patterns, NULL stack to a match context that is used by any number of patterns,
as long as they are not used for matching by multiple threads at the as long as they are not used for matching by multiple threads at the
same time. For example, you could use the same stack in all compiled same time. For example, you could use the same stack in all compiled
patterns, with a global mutex in the callback to wait until the stack patterns, with a global mutex in the callback to wait until the stack
is available for use. However, this is an inefficient solution, and not is available for use. However, this is an inefficient solution, and not
recommended. recommended.
This is a suggestion for how a multithreaded program that needs to set This is a suggestion for how a multithreaded program that needs to set
up non-default JIT stacks might operate: up non-default JIT stacks might operate:
During thread initialization During thread initialization
@ -3685,7 +3699,7 @@ CONTROLLING THE JIT STACK
Use a one-line callback function Use a one-line callback function
return thread_local_var return thread_local_var
All the functions described in this section do nothing if JIT is not All the functions described in this section do nothing if JIT is not
available. available.
@ -3694,20 +3708,20 @@ JIT STACK FAQ
(1) Why do we need JIT stacks? (1) Why do we need JIT stacks?
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
where the local data of the current node is pushed before checking its where the local data of the current node is pushed before checking its
child nodes. Allocating real machine stack on some platforms is diffi- child nodes. Allocating real machine stack on some platforms is diffi-
cult. For example, the stack chain needs to be updated every time if we cult. For example, the stack chain needs to be updated every time if we
extend the stack on PowerPC. Although it is possible, its updating extend the stack on PowerPC. Although it is possible, its updating
time overhead decreases performance. So we do the recursion in memory. time overhead decreases performance. So we do the recursion in memory.
(2) Why don't we simply allocate blocks of memory with malloc()? (2) Why don't we simply allocate blocks of memory with malloc()?
Modern operating systems have a nice feature: they can reserve an Modern operating systems have a nice feature: they can reserve an
address space instead of allocating memory. We can safely allocate mem- address space instead of allocating memory. We can safely allocate mem-
ory pages inside this address space, so the stack could grow without ory pages inside this address space, so the stack could grow without
moving memory data (this is important because of pointers). Thus we can moving memory data (this is important because of pointers). Thus we can
allocate 1M address space, and use only a single memory page (usually allocate 1M address space, and use only a single memory page (usually
4K) if that is enough. However, we can still grow up to 1M anytime if 4K) if that is enough. However, we can still grow up to 1M anytime if
needed. needed.
(3) Who "owns" a JIT stack? (3) Who "owns" a JIT stack?
@ -3715,8 +3729,8 @@ JIT STACK FAQ
The owner of the stack is the user program, not the JIT studied pattern The owner of the stack is the user program, not the JIT studied pattern
or anything else. The user program must ensure that if a stack is being or anything else. The user program must ensure that if a stack is being
used by pcre2_match(), (that is, it is assigned to a match context that used by pcre2_match(), (that is, it is assigned to a match context that
is passed to the pattern currently running), that stack must not be is passed to the pattern currently running), that stack must not be
used by any other threads (to avoid overwriting the same memory area). used by any other threads (to avoid overwriting the same memory area).
The best practice for multithreaded programs is to allocate a stack for The best practice for multithreaded programs is to allocate a stack for
each thread, and return this stack through the JIT callback function. each thread, and return this stack through the JIT callback function.
@ -3724,36 +3738,36 @@ JIT STACK FAQ
You can free a JIT stack at any time, as long as it will not be used by You can free a JIT stack at any time, as long as it will not be used by
pcre2_match() again. When you assign the stack to a match context, only pcre2_match() again. When you assign the stack to a match context, only
a pointer is set. There is no reference counting or any other magic. a pointer is set. There is no reference counting or any other magic.
You can free compiled patterns, contexts, and stacks in any order, any- You can free compiled patterns, contexts, and stacks in any order, any-
time. Just do not call pcre2_match() with a match context pointing to time. Just do not call pcre2_match() with a match context pointing to
an already freed stack, as that will cause SEGFAULT. (Also, do not free an already freed stack, as that will cause SEGFAULT. (Also, do not free
a stack currently used by pcre2_match() in another thread). You can a stack currently used by pcre2_match() in another thread). You can
also replace the stack in a context at any time when it is not in use. also replace the stack in a context at any time when it is not in use.
You can also free the previous stack before assigning a replacement. You can also free the previous stack before assigning a replacement.
(5) Should I allocate/free a stack every time before/after calling (5) Should I allocate/free a stack every time before/after calling
pcre2_match()? pcre2_match()?
No, because this is too costly in terms of resources. However, you No, because this is too costly in terms of resources. However, you
could implement some clever idea which releases the stack if it is not could implement some clever idea which releases the stack if it is not
used in let's say two minutes. The JIT callback can help to achieve used in let's say two minutes. The JIT callback can help to achieve
this without keeping a list of patterns. this without keeping a list of patterns.
(6) OK, the stack is for long term memory allocation. But what happens (6) OK, the stack is for long term memory allocation. But what happens
if a pattern causes stack overflow with a stack of 1M? Is that 1M kept if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
until the stack is freed? until the stack is freed?
Especially on embedded systems, it might be a good idea to release mem- Especially on embedded systems, it might be a good idea to release mem-
ory sometimes without freeing the stack. There is no API for this at ory sometimes without freeing the stack. There is no API for this at
the moment. Probably a function call which returns with the currently the moment. Probably a function call which returns with the currently
allocated memory for any stack and another which allows releasing mem- allocated memory for any stack and another which allows releasing mem-
ory (shrinking the stack) would be a good idea if someone needs this. ory (shrinking the stack) would be a good idea if someone needs this.
(7) This is too much of a headache. Isn't there any better solution for (7) This is too much of a headache. Isn't there any better solution for
JIT stack handling? JIT stack handling?
No, thanks to Windows. If POSIX threads were used everywhere, we could No, thanks to Windows. If POSIX threads were used everywhere, we could
throw out this complicated API. throw out this complicated API.
@ -3762,18 +3776,18 @@ FREEING JIT SPECULATIVE MEMORY
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
The JIT executable allocator does not free all memory when it is possi- The JIT executable allocator does not free all memory when it is possi-
ble. It expects new allocations, and keeps some free memory around to ble. It expects new allocations, and keeps some free memory around to
improve allocation speed. However, in low memory conditions, it might improve allocation speed. However, in low memory conditions, it might
be better to free all possible memory. You can cause this to happen by be better to free all possible memory. You can cause this to happen by
calling pcre2_jit_free_unused_memory(). Its argument is a general con- calling pcre2_jit_free_unused_memory(). Its argument is a general con-
text, for custom memory management, or NULL for standard memory manage- text, for custom memory management, or NULL for standard memory manage-
ment. ment.
EXAMPLE CODE EXAMPLE CODE
This is a single-threaded example that specifies a JIT stack without This is a single-threaded example that specifies a JIT stack without
using a callback. A real program should include error checking after using a callback. A real program should include error checking after
all the function calls. all the function calls.
int rc; int rc;
@ -3801,28 +3815,28 @@ EXAMPLE CODE
JIT FAST PATH API JIT FAST PATH API
Because the API described above falls back to interpreted matching when Because the API described above falls back to interpreted matching when
JIT is not available, it is convenient for programs that are written JIT is not available, it is convenient for programs that are written
for general use in many environments. However, calling JIT via for general use in many environments. However, calling JIT via
pcre2_match() does have a performance impact. Programs that are written pcre2_match() does have a performance impact. Programs that are written
for use where JIT is known to be available, and which need the best for use where JIT is known to be available, and which need the best
possible performance, can instead use a "fast path" API to call JIT possible performance, can instead use a "fast path" API to call JIT
matching directly instead of calling pcre2_match() (obviously only for matching directly instead of calling pcre2_match() (obviously only for
patterns that have been successfully processed by pcre2_jit_compile()). patterns that have been successfully processed by pcre2_jit_compile()).
The fast path function is called pcre2_jit_match(), and it takes The fast path function is called pcre2_jit_match(), and it takes
exactly the same arguments as pcre2_match(). The return values are also exactly the same arguments as pcre2_match(). The return values are also
the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
complete) is requested that was not compiled. Unsupported option bits complete) is requested that was not compiled. Unsupported option bits
(for example, PCRE2_ANCHORED) are ignored. (for example, PCRE2_ANCHORED) are ignored.
When you call pcre2_match(), as well as testing for invalid options, a When you call pcre2_match(), as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For exam- number of other sanity checks are performed on the arguments. For exam-
ple, if the subject pointer is NULL, an immediate error is given. Also, ple, if the subject pointer is NULL, an immediate error is given. Also,
unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
validity. In the interests of speed, these checks do not happen on the validity. In the interests of speed, these checks do not happen on the
JIT fast path, and if invalid data is passed, the result is undefined. JIT fast path, and if invalid data is passed, the result is undefined.
Bypassing the sanity checks and the pcre2_match() wrapping can give Bypassing the sanity checks and the pcre2_match() wrapping can give
speedups of more than 10%. speedups of more than 10%.
@ -3840,7 +3854,7 @@ AUTHOR
REVISION REVISION
Last updated: 08 November 2014 Last updated: 12 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1063,7 +1063,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
Which characters are interpreted as newlines can be specified by a setting in Which characters are interpreted as newlines can be specified by a setting in
the compile context that is passed to \fBpcre2_compile()\fP or by a special the compile context that is passed to \fBpcre2_compile()\fP or by a special
sequence at the start of the pattern, as described in the section entitled sequence at the start of the pattern, as described in the section entitled
.\" HTML <a href="pcrepattern.html#newlines"> .\" HTML <a href="pcre2pattern.html#newlines">
.\" </a> .\" </a>
"Newline conventions" "Newline conventions"
.\" .\"
@ -1226,7 +1226,7 @@ This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
\ew, and some of the POSIX character classes. By default, only ASCII characters \ew, and some of the POSIX character classes. By default, only ASCII characters
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
classify characters. More details are given in the section on classify characters. More details are given in the section on
.\" HTML <a href="pcre2.html#genericchartypes"> .\" HTML <a href="pcre2pattern.html#genericchartypes">
.\" </a> .\" </a>
generic character types generic character types
.\" .\"
@ -1939,17 +1939,11 @@ documentation.
.sp .sp
When PCRE2 is built, a default newline convention is set; this is usually the When PCRE2 is built, a default newline convention is set; this is usually the
standard convention for the operating system. The default can be overridden in standard convention for the operating system. The default can be overridden in
either a a
.\" HTML <a href="#compilecontext"> .\" HTML <a href="#compilecontext">
.\" </a> .\" </a>
compile context compile context.
.\" .\"
or a
.\" HTML <a href="#matchcontext">
.\" </a>
match context.
.\"
However, changing the newline convention at match time disables JIT matching.
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. position is advanced after a match failure for an unanchored pattern.
@ -2322,7 +2316,7 @@ appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings. substrings.
. .
. .
.\" HTML <a name="extractbynname"></a> .\" HTML <a name="extractbyname"></a>
.SH "EXTRACTING CAPTURED SUBSTRINGS BY NAME" .SH "EXTRACTING CAPTURED SUBSTRINGS BY NAME"
.rs .rs
.sp .sp


@ -28,7 +28,7 @@ you want to use JIT. The support is limited to the following hardware
platforms: platforms:
.sp .sp
ARM 32-bit (v5, v7, and Thumb2) ARM 32-bit (v5, v7, and Thumb2)
ARM 64-bit ARM 64-bit
Intel x86 32-bit and 64-bit Intel x86 32-bit and 64-bit
MIPS 32-bit and 64-bit MIPS 32-bit and 64-bit
Power PC 32-bit and 64-bit Power PC 32-bit and 64-bit
@ -79,7 +79,7 @@ PCRE2_JIT_COMPLETE and just compile code for partial matching. If
\fBpcre2_jit_compile()\fP is called with no option bits set, it immediately \fBpcre2_jit_compile()\fP is called with no option bits set, it immediately
returns zero. This is an alternative way of testing if JIT is available. returns zero. This is an alternative way of testing if JIT is available.
.P .P
At present, it is not possible to free JIT compiled code except when the entire At present, it is not possible to free JIT compiled code except when the entire
compiled pattern is freed by calling \fBpcre2_free_code()\fP. compiled pattern is freed by calling \fBpcre2_free_code()\fP.
.P .P
In some circumstances you may need to call additional functions. These are In some circumstances you may need to call additional functions. These are
@ -186,8 +186,8 @@ passed to a matching function, its information determines which JIT stack is
used. There are three cases for the values of the other two options: used. There are three cases for the values of the other two options:
.sp .sp
(1) If \fIcallback\fP is NULL and \fIdata\fP is NULL, an internal 32K block (1) If \fIcallback\fP is NULL and \fIdata\fP is NULL, an internal 32K block
on the machine stack is used. This is the default when a match on the machine stack is used. This is the default when a match
context is created. context is created.
.sp .sp
(2) If \fIcallback\fP is NULL and \fIdata\fP is not NULL, \fIdata\fP must be (2) If \fIcallback\fP is NULL and \fIdata\fP is not NULL, \fIdata\fP must be
a pointer to a valid JIT stack, the result of calling a pointer to a valid JIT stack, the result of calling


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "03 November 2014" "PCRE2 10.00" .TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -63,8 +63,8 @@ page.
.P .P
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is set at compile time, (*UTF) is not allowed, and its appearance causes option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
an error. appearance in a pattern causes an error.
. .
. .
.SS "Unicode property support" .SS "Unicode property support"
@ -75,6 +75,21 @@ This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types, such as \ed and \ew to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup instead of recognizing only characters with codes less than 128 via a lookup
table. table.
.P
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
.
.
.SS "Locking out empty string matching"
.rs
.sp
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
. .
. .
.SS "Disabling auto-possessification" .SS "Disabling auto-possessification"
@ -102,6 +117,28 @@ reaching "no match" results. For more details, see the
documentation. documentation.
. .
. .
.SS "Setting match and recursion limits"
.rs
.sp
The caller of \fBpcre2_match()\fP can set a limit on the number of times the
internal \fBmatch()\fP function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
gives an error return. The limits can also be set by items at the start of the
pattern of the form
.sp
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
.sp
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
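The rule for combining limits can be sketched as a small C function. This is a model of the behaviour described above, not PCRE2 source: `effective_limit()` is a hypothetical helper, and its arguments stand for the limit set (or defaulted) by the caller of pcre2_match() and the values of any (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) items found at the start of the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API: combines a limit set by
   the caller with the settings that appear at the start of a pattern. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_settings, size_t count)
{
    uint32_t limit = caller_limit;
    assert(count == 0 || pattern_settings != NULL);
    for (size_t i = 0; i < count; i++) {
        /* A pattern setting has an effect only when it is lower than the
           limit in force, so the lowest value always wins. */
        if (pattern_settings[i] < limit)
            limit = pattern_settings[i];
    }
    return limit;
}
```

With a caller limit of 1000, a pattern setting of 5000000 leaves the limit at 1000, while settings of 500 and 200 in the same pattern reduce it to 200.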
.
.
.\" HTML <a name="newlines"></a> .\" HTML <a name="newlines"></a>
.SS "Newline conventions" .SS "Newline conventions"
.rs .rs
@ -153,26 +190,14 @@ below. A change of \eR setting can be combined with a change of newline
convention. convention.
. .
. .
.SS "Setting match and recursion limits" .SS "Specifying what \eR matches"
.rs .rs
.sp .sp
The caller of \fBpcre2_match()\fP can set a limit on the number of times the It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
internal \fBmatch()\fP function is called and on the maximum depth of complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
recursive calls. These facilities are provided to catch runaway matches that at compile time. This effect can also be achieved by starting a pattern with
are provoked by patterns with huge matching trees (a typical example is a (*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
pattern with nested unlimited repeats) and to avoid running out of system stack corresponding to PCRE2_BSR_UNICODE.
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
gives an error return. The limits can also be set by items at the start of the
pattern of the form
.sp
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
.sp
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
. .
. .
.SH "EBCDIC CHARACTER CODES" .SH "EBCDIC CHARACTER CODES"
@ -2302,8 +2327,8 @@ complex:
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
.sp .sp
.P .P
There are four kinds of condition: references to subpatterns, references to There are five kinds of condition: references to subpatterns, references to
recursion, a pseudo-condition called DEFINE, and assertions. recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
. .
. .
.SS "Checking for a used subpattern by number" .SS "Checking for a used subpattern by number"
@ -2418,6 +2443,23 @@ pattern uses references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at each end. components of an IPv4 address, insisting on a word boundary at each end.
. .
. .
.SS "Checking the PCRE2 version"
.rs
.sp
Programs that link with a PCRE2 library can check the version by calling
\fBpcre2_config()\fP with appropriate arguments. Users of applications that do
not have access to the underlying code cannot do this. A special "condition"
called VERSION exists to allow such users to discover which version of PCRE2
they are dealing with by using this condition to match a string such as
"yesno". VERSION must be followed either by "=" or ">=" and a version number.
For example:
.sp
(?(VERSION>=10.4)yes|no)
.sp
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
"no" otherwise.
.
.
.SS "Assertion conditions" .SS "Assertion conditions"
.rs .rs
.sp .sp
@ -3219,7 +3261,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
.rs .rs
.sp .sp
\fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
\fBpcre2syntax\fP(3), \fBpcre2\fP(3), \fBpcre216(3)\fP, \fBpcre232(3)\fP. \fBpcre2syntax\fP(3), \fBpcre2\fP(3).
. .
. .
.SH AUTHOR .SH AUTHOR
@ -3236,6 +3278,6 @@ Cambridge CB2 3QH, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 November 2014 Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00" .TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -470,17 +470,18 @@ Each top-level branch of a look behind must be of a fixed length.
(?(condition)yes-pattern) (?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern) (?(condition)yes-pattern|no-pattern)
.sp .sp
(?(n)... absolute reference condition (?(n) absolute reference condition
(?(+n)... relative reference condition (?(+n) relative reference condition
(?(-n)... relative reference condition (?(-n) relative reference condition
(?(<name>)... named reference condition (Perl) (?(<name>) named reference condition (Perl)
(?('name')... named reference condition (Perl) (?('name') named reference condition (Perl)
(?(name)... named reference condition (PCRE2) (?(name) named reference condition (PCRE2)
(?(R)... overall recursion condition (?(R) overall recursion condition
(?(Rn)... specific group recursion condition (?(Rn) specific group recursion condition
(?(R&name)... specific recursion condition (?(R&name) specific recursion condition
(?(DEFINE)... define subpattern for reference (?(DEFINE) define subpattern for reference
(?(assert)... assertion condition (?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition
. .
. .
.SH "BACKTRACKING CONTROL" .SH "BACKTRACKING CONTROL"
@ -535,6 +536,6 @@ Cambridge CB2 3QH, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 20 October 2014 Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "12 November 2014" "PCRE 10.00" .TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -450,7 +450,6 @@ about the pattern:
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
.sp .sp
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
. .
. .
.SS "Newline and \eR handling" .SS "Newline and \eR handling"
@ -484,7 +483,31 @@ one-off tests.
.P .P
The \fBinfo\fP modifier requests information about the compiled pattern The \fBinfo\fP modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The (whether it is anchored, has a fixed first character, and so on). The
information is obtained from the \fBpcre2_pattern_info()\fP function. information is obtained from the \fBpcre2_pattern_info()\fP function. Here are
some typical examples:
.sp
re> /(?i)(^a|^b)/m,info
Capturing subpattern count = 1
Compile options: multiline
Overall options: caseless multiline
First code unit at start or follows newline
Subject length lower bound = 1
.sp
re> /(?i)abc/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: caseless
First code unit = 'a' (caseless)
Last code unit = 'c' (caseless)
Subject length lower bound = 3
.sp
"Compile options" are those specified to the compile function; "overall
options" have added options that are taken or deduced from the pattern. If both
sets of options are the same, just a single "options" line is output. "First
code unit" is where any match must start; if there is more than one they are
listed as "starting code units". "Last code unit" is the last literal code unit
that must be present in any match. This is not necessarily the last character.
These lines are omitted if no starting or ending code units are recorded.
. .
. .
.SS "Specifying a pattern in hex" .SS "Specifying a pattern in hex"
@ -499,8 +522,8 @@ pairs. For example:
This feature is provided as a way of creating patterns that contain binary zero This feature is provided as a way of creating patterns that contain binary zero
characters. By default, \fBpcre2test\fP passes patterns as zero-terminated characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
However, for patterns specified in hexadecimal, the length of the pattern is However, for patterns specified in hexadecimal, the actual length of the
passed. pattern is passed.
. .
. .
.SS "JIT compilation" .SS "JIT compilation"
@ -528,7 +551,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
setting the size of the JIT stack. setting the size of the JIT stack.
.P .P
If the \fBjitfast\fP modifier is specified, matching is done using the JIT If the \fBjitfast\fP modifier is specified, matching is done using the JIT
"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity "fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
checks that are done by \fBpcre2_match()\fP, and of course does not work when checks that are done by \fBpcre2_match()\fP, and of course does not work when
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
assumed. assumed.
@ -560,11 +583,16 @@ character tables are mutually exclusive.
.SS "Showing pattern memory" .SS "Showing pattern memory"
.rs .rs
.sp .sp
The \fB/memory\fP modifier causes the size in bytes of the memory block used to The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
hold the compiled pattern to be output. This does not include the size of the the compiled pattern to be output. This does not include the size of the
\fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is \fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is subsequently passed to the JIT compiler, the size of the JIT compiled code is
also output. also output. Here is an example:
.sp
re> /a(b)c/jit,memory
Memory allocation (code space): 21
Memory allocation (JIT code): 1910
.sp
. .
. .
.SS "Limiting nested parentheses" .SS "Limiting nested parentheses"
@ -608,8 +636,8 @@ enable stack availability to be checked during compilation (see the
.\" .\"
documentation for details). If the number specified by the modifier is greater documentation for details). If the number specified by the modifier is greater
than zero, \fBpcre2_set_compile_recursion_guard()\fP is called to set up than zero, \fBpcre2_set_compile_recursion_guard()\fP is called to set up
callback from \fBpcre2_compile()\fP to a local function. The argument it is callback from \fBpcre2_compile()\fP to a local function. The argument it
passed is the current nesting parenthesis depth; if this is greater than the receives is the current nesting parenthesis depth; if this is greater than the
value given by the modifier, non-zero is returned, causing the compilation to value given by the modifier, non-zero is returned, causing the compilation to
be aborted. be aborted.
. .
@ -646,7 +674,7 @@ not affect the compilation process.
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
mark show mark values mark show mark values
replace=<string> specify a replacement string replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
.sp .sp
These modifiers may not appear in a \fB#pattern\fP command. If you want them as These modifiers may not appear in a \fB#pattern\fP command. If you want them as
@ -721,12 +749,11 @@ pattern.
offset=<n> set starting offset offset=<n> set starting offset
ovector=<n> set size of output vector ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit recursion_limit=<n> set a recursion limit
replace=<string> specify a replacement string replace=<string> specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
.sp .sp
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
. .
. .
.SS "Showing more text" .SS "Showing more text"
@ -850,14 +877,14 @@ parentheses after each substring.
.SS "Testing the substitution function" .SS "Testing the substitution function"
.rs .rs
.sp .sp
If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is
called instead of one of the matching functions. Unlike subject strings, called instead of one of the matching functions. Unlike subject strings,
\fBpcre2test\fP does not process replacement strings for escape sequences. In \fBpcre2test\fP does not process replacement strings for escape sequences. In
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string. UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
If so, it is correctly converted to a UTF string of the appropriate code unit If so, it is correctly converted to a UTF string of the appropriate code unit
width. If it is not a valid UTF-8 string, the individual code units are copied width. If it is not a valid UTF-8 string, the individual code units are copied
directly. This provides a means of passing an invalid UTF-8 string for testing directly. This provides a means of passing an invalid UTF-8 string for testing
purposes. purposes.
.P .P
If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
\fBpcre2_substitute()\fP. After a successful substitution, the modified string \fBpcre2_substitute()\fP. After a successful substitution, the modified string
@ -867,16 +894,23 @@ were no matches. Here is a simple example of a substitution test:
/abc/replace=xxx /abc/replace=xxx
=abc=abc= =abc=abc=
1: =xxx=abc= 1: =xxx=abc=
=abc=abc=\=global =abc=abc=\e=global
2: =xxx=xxx= 2: =xxx=xxx=
.sp .sp
Subject and replacement strings should be kept relatively short for Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to test for substitution tests, as fixed-size buffers are used. To make it easy to test for
buffer overflow, if the replacement string starts with a number in square buffer overflow, if the replacement string starts with a number in square
brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
output buffer, with the replacement string starting at the next character. output buffer, with the replacement string starting at the next character. Here
.P is an example that tests the edge case:
A replacement string is ignored with POSIX and DFA matching. Specifying partial .sp
/abc/
123abc123\e=replace=[10]XYZ
1: 123XYZ123
123abc123\e=replace=[9]XYZ
Failed: error -47: no more memory
.sp
A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from matching provokes an error return ("bad option value") from
\fBpcre2_substitute()\fP. \fBpcre2_substitute()\fP.
. .
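The buffer-size edge case in the hunk above (replace=[10]XYZ succeeds, replace=[9]XYZ fails) can be modelled with a small Python sketch. This is not pcre2test's code: the function names are hypothetical, the match is a plain literal replacement rather than a regex, and the assumption that the buffer must also hold one terminating zero code unit is inferred from the 10-versus-9 behaviour shown in the example.

```python
import re


def split_size_prefix(replacement):
    """Split an optional leading "[n]" from a replace= value.

    Returns (buffer_size or None, remaining replacement text).
    """
    m = re.match(r"\[(\d+)\](.*)", replacement, re.DOTALL)
    if m is None:
        return None, replacement
    return int(m.group(1)), m.group(2)


def substitute_once(subject, literal, replacement, buffer_size):
    """Replace the first occurrence of `literal`, modelling a fixed-size buffer."""
    result = subject.replace(literal, replacement, 1)
    # Assumption: the buffer needs room for the result plus one trailing zero.
    if buffer_size is not None and len(result) + 1 > buffer_size:
        raise MemoryError("no more memory")
    return result


size, text = split_size_prefix("[10]XYZ")
print(substitute_once("123abc123", "abc", text, size))
```

With a size of 10 the nine-character result plus its terminator just fits; with a size of 9 the sketch raises MemoryError, mirroring the "no more memory" failure in the test output.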
@ -957,10 +991,10 @@ available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
\fBregexec()\fP to be called with a NULL capture vector. When not testing the \fBregexec()\fP to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause POSIX API, a value of zero is used to cause
\fBpcre2_match_data_create_from_pattern\fP to be called, in order to create a \fBpcre2_match_data_create_from_pattern()\fP to be called, in order to create a
match block of exactly the right size for the pattern. (It is not possible to match block of exactly the right size for the pattern. (It is not possible to
create a match block with a zero-length ovector; there is always one pair of create a match block with a zero-length ovector; there is always at least one
offsets.) pair of offsets.)
. .
. .
.SS "Passing the subject as zero-terminated" .SS "Passing the subject as zero-terminated"
@ -972,7 +1006,7 @@ string, the \fBzero_terminate\fP modifier is provided. It causes the length to
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface, be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
this modifier has no effect, as there is no facility for passing a length.) this modifier has no effect, as there is no facility for passing a length.)
.P .P
When testing \fBpcre2_substitute\fP, this modifier also has the effect of When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
. .
. .
@ -1237,6 +1271,6 @@ Cambridge CB2 3QH, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 12 November 2014 Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -150,17 +150,18 @@ COMMAND LINE OPTIONS
Behave as if each subject line contains the given modifiers. Behave as if each subject line contains the given modifiers.
-t Run each compile and match many times with a timer, and out- -t Run each compile and match many times with a timer, and out-
put the resulting times per compile or match. You can control put the resulting times per compile or match. When JIT is
the number of iterations that are used for timing by follow- used, separate times are given for the initial compile and
ing -t with a number (as a separate item on the command the JIT compile. You can control the number of iterations
line). For example, "-t 1000" iterates 1000 times. The that are used for timing by following -t with a number (as a
default is to iterate 500,000 times. separate item on the command line). For example, "-t 1000"
iterates 1000 times. The default is to iterate 500,000 times.
-tm This is like -t except that it times only the matching phase, -tm This is like -t except that it times only the matching phase,
not the compile phase. not the compile phase.
-T -TM These behave like -t and -tm, but in addition, at the end of -T -TM These behave like -t and -tm, but in addition, at the end of
a run, the total times for all compiles and matches are out- a run, the total times for all compiles and matches are out-
put. put.
-version Output the PCRE2 version number and then exit. -version Output the PCRE2 version number and then exit.
@ -168,139 +169,139 @@ COMMAND LINE OPTIONS
DESCRIPTION DESCRIPTION
If pcre2test is given two filename arguments, it reads from the first If pcre2test is given two filename arguments, it reads from the first
and writes to the second. If the first name is "-", input is taken from and writes to the second. If the first name is "-", input is taken from
the standard input. If pcre2test is given only one argument, it reads the standard input. If pcre2test is given only one argument, it reads
from that file and writes to stdout. Otherwise, it reads from stdin and from that file and writes to stdout. Otherwise, it reads from stdin and
writes to stdout. When the input is a terminal, it prompts for each writes to stdout. When the input is a terminal, it prompts for each
line of input, using "re>" to prompt for regular expression patterns, line of input, using "re>" to prompt for regular expression patterns,
and "data>" to prompt for subject lines. and "data>" to prompt for subject lines.
When pcre2test is built, a configuration option can specify that it When pcre2test is built, a configuration option can specify that it
should be linked with the libreadline or libedit library. When this is should be linked with the libreadline or libedit library. When this is
done, if the input is from a terminal, it is read using the readline() done, if the input is from a terminal, it is read using the readline()
function. This provides line-editing and history facilities. The output function. This provides line-editing and history facilities. The output
from the -help option states whether or not readline() will be used. from the -help option states whether or not readline() will be used.
The program handles any number of tests, each of which consists of a The program handles any number of tests, each of which consists of a
set of input lines. Each set starts with a regular expression pattern, set of input lines. Each set starts with a regular expression pattern,
followed by any number of subject lines to be matched against that pat- followed by any number of subject lines to be matched against that pat-
tern. In between sets of test data, command lines that begin with a tern. In between sets of test data, command lines that begin with a
hash (#) character may appear. This file format, with some restric- hash (#) character may appear. This file format, with some restric-
tions, can also be processed by the perltest.pl script that is distrib- tions, can also be processed by the perltest.pl script that is distrib-
uted with PCRE2 as a means of checking that the behaviour of PCRE2 and uted with PCRE2 as a means of checking that the behaviour of PCRE2 and
Perl is the same. Perl is the same.
Each subject line is matched separately and independently. If you want Each subject line is matched separately and independently. If you want
to do multi-line matches, you have to use the \n escape sequence (or \r to do multi-line matches, you have to use the \n escape sequence (or \r
or \r\n, etc., depending on the newline setting) in a single line of or \r\n, etc., depending on the newline setting) in a single line of
input to encode the newline sequences. There is no limit on the length input to encode the newline sequences. There is no limit on the length
of subject lines; the input buffer is automatically extended if it is of subject lines; the input buffer is automatically extended if it is
too small. There is a replication feature that makes it possible to too small. There is a replication feature that makes it possible to
generate long subject lines without having to supply them explicitly. generate long subject lines without having to supply them explicitly.
An empty line or the end of the file signals the end of the subject An empty line or the end of the file signals the end of the subject
lines for a test, at which point a new pattern or command line is lines for a test, at which point a new pattern or command line is
expected if there is still input to be read. expected if there is still input to be read.
COMMAND LINES COMMAND LINES
In between sets of test data, a line that begins with a hash (#) char- In between sets of test data, a line that begins with a hash (#) char-
acter is interpreted as a command line. If the first character is fol- acter is interpreted as a command line. If the first character is fol-
lowed by white space or an exclamation mark, the line is treated as a lowed by white space or an exclamation mark, the line is treated as a
comment, and ignored. Otherwise, the following commands are recog- comment, and ignored. Otherwise, the following commands are recog-
nized: nized:
#forbid_utf #forbid_utf
Subsequent patterns automatically have the PCRE2_NEVER_UTF and Subsequent patterns automatically have the PCRE2_NEVER_UTF and
PCRE2_NEVER_UCP options set, which locks out the use of UTF and Unicode PCRE2_NEVER_UCP options set, which locks out the use of UTF and Unicode
property features. This is a trigger guard that is used in test files property features. This is a trigger guard that is used in test files
to ensure that UTF/Unicode tests are not accidentally added to files to ensure that UTF/Unicode tests are not accidentally added to files
that are used when UTF support is not included in the library. This that are used when UTF support is not included in the library. This
effect can also be obtained by the use of #pattern; the difference is effect can also be obtained by the use of #pattern; the difference is
that #forbid_utf cannot be unset, and the automatic options are not that #forbid_utf cannot be unset, and the automatic options are not
displayed in pattern information, to avoid cluttering up test output. displayed in pattern information, to avoid cluttering up test output.
#pattern <modifier-list> #pattern <modifier-list>
This command sets a default modifier list that applies to all subse- This command sets a default modifier list that applies to all subse-
quent patterns. Modifiers on a pattern can change these settings. quent patterns. Modifiers on a pattern can change these settings.
#perltest #perltest
The appearance of this line causes all subsequent modifier settings to The appearance of this line causes all subsequent modifier settings to
be checked for compatibility with the perltest.pl script, which is used be checked for compatibility with the perltest.pl script, which is used
to confirm that Perl gives the same results as PCRE2. Also, apart from to confirm that Perl gives the same results as PCRE2. Also, apart from
comment lines, none of the other command lines are permitted, because comment lines, none of the other command lines are permitted, because
they and many of the modifiers are specific to pcre2test, and should they and many of the modifiers are specific to pcre2test, and should
not be used in test files that are also processed by perltest.pl. The not be used in test files that are also processed by perltest.pl. The
#perltest command helps detect tests that are accidentally put in the #perltest command helps detect tests that are accidentally put in the
wrong file. wrong file.
#subject <modifier-list> #subject <modifier-list>
This command sets a default modifier list that applies to all subse- This command sets a default modifier list that applies to all subse-
quent subject lines. Modifiers on a subject line can change these set- quent subject lines. Modifiers on a subject line can change these set-
tings. tings.
MODIFIER SYNTAX MODIFIER SYNTAX
Modifier lists are used with both pattern and subject lines. Items in a Modifier lists are used with both pattern and subject lines. Items in a
list are separated by commas and optional white space. Some modifiers list are separated by commas and optional white space. Some modifiers
may be given for both patterns and subject lines, whereas others are may be given for both patterns and subject lines, whereas others are
valid for one or the other only. Each modifier has a long name, for valid for one or the other only. Each modifier has a long name, for
example "anchored", and some of them must be followed by an equals sign example "anchored", and some of them must be followed by an equals sign
and a value, for example, "offset=12". Modifiers that do not take val- and a value, for example, "offset=12". Modifiers that do not take val-
ues may be preceded by a minus sign to turn off a previous default set- ues may be preceded by a minus sign to turn off a previous default set-
ting. ting.
A few of the more common modifiers can also be specified as single let- A few of the more common modifiers can also be specified as single let-
ters, for example "i" for "caseless". In documentation, following the ters, for example "i" for "caseless". In documentation, following the
Perl convention, these are written with a slash ("the /i modifier") for Perl convention, these are written with a slash ("the /i modifier") for
clarity. Abbreviated modifiers must all be concatenated in the first clarity. Abbreviated modifiers must all be concatenated in the first
item of a modifier list. If the first item is not recognized as a long item of a modifier list. If the first item is not recognized as a long
modifier name, it is interpreted as a sequence of these abbreviations. modifier name, it is interpreted as a sequence of these abbreviations.
For example: For example:
/abc/ig,newline=cr,jit=3 /abc/ig,newline=cr,jit=3
This is a pattern line whose modifier list starts with two one-letter This is a pattern line whose modifier list starts with two one-letter
modifiers (/i and /g). The lower-case abbreviated modifiers are the modifiers (/i and /g). The lower-case abbreviated modifiers are the
same as used in Perl. same as used in Perl.
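The modifier-list rules above (comma-separated items, optional white space, abbreviations allowed only in the first item, a minus sign to turn a setting off) can be sketched in Python. This is an illustration only, not how pcre2test parses its input: the function name is hypothetical, and the tables list just a small assumed subset of the real modifiers.

```python
# Hypothetical subset of abbreviations and long names, for illustration only.
ABBREVIATIONS = {"i": "caseless", "g": "global"}
LONG_NAMES = {"anchored", "caseless", "global", "newline", "jit", "offset"}


def parse_modifier_list(text):
    """Return a list of (long_name, value) pairs for a modifier list."""
    items = [item.strip() for item in text.split(",") if item.strip()]
    parsed = []
    for index, item in enumerate(items):
        name, sep, value = item.partition("=")
        negated = name.startswith("-")
        bare = name[1:] if negated else name
        if bare in LONG_NAMES:
            # Valueless modifiers become True, or False when preceded by "-".
            parsed.append((bare, value if sep else not negated))
        elif index == 0 and all(ch in ABBREVIATIONS for ch in item):
            # First item not a long name: treat it as a run of abbreviations.
            parsed.extend((ABBREVIATIONS[ch], True) for ch in item)
        else:
            raise ValueError("unrecognized modifier: " + item)
    return parsed


print(parse_modifier_list("ig,newline=cr,jit=3"))
```

For the example line /abc/ig,newline=cr,jit=3, the text after the closing delimiter yields caseless and global from the abbreviations, followed by newline=cr and jit=3.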
PATTERN SYNTAX PATTERN SYNTAX
A pattern line must start with one of the following characters (common A pattern line must start with one of the following characters (common
symbols, excluding pattern meta-characters): symbols, excluding pattern meta-characters):
/ ! " ' ` - = _ : ; , % & @ ~ / ! " ' ` - = _ : ; , % & @ ~
This is interpreted as the pattern's delimiter. A regular expression This is interpreted as the pattern's delimiter. A regular expression
may be continued over several input lines, in which case the newline may be continued over several input lines, in which case the newline
characters are included within it. It is possible to include the delim- characters are included within it. It is possible to include the delim-
iter within the pattern by escaping it with a backslash, for example iter within the pattern by escaping it with a backslash, for example
/abc\/def/ /abc\/def/
If you do this, the escape and the delimiter form part of the pattern, If you do this, the escape and the delimiter form part of the pattern,
but since the delimiters are all non-alphanumeric, this does not affect but since the delimiters are all non-alphanumeric, this does not affect
its interpretation. If the terminating delimiter is immediately fol- its interpretation. If the terminating delimiter is immediately fol-
lowed by a backslash, for example, lowed by a backslash, for example,
/abc/\ /abc/\
then a backslash is added to the end of the pattern. This is done to then a backslash is added to the end of the pattern. This is done to
provide a way of testing the error condition that arises if a pattern provide a way of testing the error condition that arises if a pattern
finishes with a backslash, because finishes with a backslash, because
/abc\/ /abc\/
is interpreted as the first line of a pattern that starts with "abc/", is interpreted as the first line of a pattern that starts with "abc/",
causing pcre2test to read the next line as a continuation of the regu- causing pcre2test to read the next line as a continuation of the regu-
lar expression. lar expression.
A pattern can be followed by a modifier list (details below). A pattern can be followed by a modifier list (details below).
@ -308,7 +309,7 @@ PATTERN SYNTAX
SUBJECT LINE SYNTAX SUBJECT LINE SYNTAX
Before each subject line is passed to pcre2_match() or Before each subject line is passed to pcre2_match() or
pcre2_dfa_match(), leading and trailing white space is removed, and the pcre2_dfa_match(), leading and trailing white space is removed, and the
line is scanned for backslash escapes. The following provide a means of line is scanned for backslash escapes. The following provide a means of
encoding non-printing characters in a visible way: encoding non-printing characters in a visible way:
@ -328,23 +329,23 @@ SUBJECT LINE SYNTAX
\x{hh...} hexadecimal character (any number of hex digits) \x{hh...} hexadecimal character (any number of hex digits)
The use of \x{hh...} is not dependent on the use of the utf modifier on The use of \x{hh...} is not dependent on the use of the utf modifier on
the pattern. It is recognized always. There may be any number of hexa- the pattern. It is recognized always. There may be any number of hexa-
decimal digits inside the braces; invalid values provoke error mes- decimal digits inside the braces; invalid values provoke error mes-
sages. sages.
Note that \xhh specifies one byte rather than one character in UTF-8 Note that \xhh specifies one byte rather than one character in UTF-8
mode; this makes it possible to construct invalid UTF-8 sequences for mode; this makes it possible to construct invalid UTF-8 sequences for
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8 testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
character in UTF-8 mode, generating more than one byte if the value is character in UTF-8 mode, generating more than one byte if the value is
greater than 127. When testing the 8-bit library not in UTF-8 mode, greater than 127. When testing the 8-bit library not in UTF-8 mode,
\x{hh} generates one byte for values less than 256, and causes an error \x{hh} generates one byte for values less than 256, and causes an error
for greater values. for greater values.
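The difference between \xhh and \x{hh} for the 8-bit library, as described above, can be illustrated with a short Python sketch. The function names are hypothetical and the code only models the byte counts stated in the text; it is not taken from pcre2test.

```python
def encode_plain_hex(hex_digits):
    """\\xhh: always exactly one byte, even in UTF-8 mode."""
    return bytes([int(hex_digits, 16)])


def encode_braced_hex(hex_digits, utf):
    """\\x{hh...}: a UTF-8 character in UTF-8 mode, otherwise a single byte."""
    value = int(hex_digits, 16)
    if utf:
        # Values above 127 produce more than one byte.
        return chr(value).encode("utf-8")
    if value < 256:
        return bytes([value])
    raise ValueError("value too large for 8-bit non-UTF mode")


print(encode_plain_hex("e9"))
print(encode_braced_hex("e9", utf=True))
print(encode_braced_hex("e9", utf=False))
```

The first call yields a lone 0xE9 byte, which is not valid UTF-8 on its own; this is the property the text relies on for constructing invalid sequences.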
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes. possible to construct invalid UTF-16 sequences for testing purposes.
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
makes it possible to construct invalid UTF-32 sequences for testing makes it possible to construct invalid UTF-32 sequences for testing
purposes. purposes.
There is a special backslash sequence that specifies replication of one There is a special backslash sequence that specifies replication of one
@ -352,38 +353,38 @@ SUBJECT LINE SYNTAX
\[<characters>]{<count>} \[<characters>]{<count>}
This makes it possible to test long strings without having to provide This makes it possible to test long strings without having to provide
them as part of the file. For example: them as part of the file. For example:
\[abc]{4} \[abc]{4}
is converted to "abcabcabcabc". This feature does not support nesting. is converted to "abcabcabcabc". This feature does not support nesting.
To include a closing square bracket in the characters, code it as \x5D. To include a closing square bracket in the characters, code it as \x5D.
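The replication sequence above can be modelled with a short Python sketch. The function name is hypothetical, and the sketch does not process other escapes such as \x5D inside the brackets; it only shows the expansion of \[<characters>]{<count>}.

```python
import re

# A backslash, "[", any characters except "]", "]", then "{count}".
_REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_replication(line):
    """Expand each \\[chars]{count} item into chars repeated count times."""
    return _REPLICATION.sub(lambda m: m.group(1) * int(m.group(2)), line)


print(expand_replication(r"\[abc]{4}"))
```

Because the character class excludes "]", the sketch cannot nest, which matches the stated restriction.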
A backslash followed by an equals sign marks the end of the subject A backslash followed by an equals sign marks the end of the subject
string and the start of a modifier list. For example: string and the start of a modifier list. For example:
abc\=notbol,notempty abc\=notbol,notempty
A backslash followed by any other non-alphanumeric character just A backslash followed by any other non-alphanumeric character just
escapes that character. A backslash followed by anything else causes an escapes that character. A backslash followed by anything else causes an
error. However, if the very last character in the line is a backslash error. However, if the very last character in the line is a backslash
(and there is no modifier list), it is ignored. This gives a way of (and there is no modifier list), it is ignored. This gives a way of
passing an empty line as data, since a real empty line terminates the passing an empty line as data, since a real empty line terminates the
data input. data input.
PATTERN MODIFIERS PATTERN MODIFIERS
There are three types of modifier that can appear in pattern lines, two There are three types of modifier that can appear in pattern lines, two
of which may also be used in a #pattern command. A pattern's modifier of which may also be used in a #pattern command. A pattern's modifier
list can add to or override default modifiers that were set by a previ- list can add to or override default modifiers that were set by a previ-
ous #pattern command. ous #pattern command.
Setting compilation options Setting compilation options
The following modifiers set options for pcre2_compile(). The most The following modifiers set options for pcre2_compile(). The most
common ones have single-letter abbreviations. See pcre2api for a common ones have single-letter abbreviations. See pcre2api for a
description of their effects. description of their effects.
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
@ -409,13 +410,13 @@ PATTERN MODIFIERS
utf set PCRE2_UTF utf set PCRE2_UTF
As well as turning on the PCRE2_UTF option, the utf modifier causes all As well as turning on the PCRE2_UTF option, the utf modifier causes all
non-printing characters in output strings to be printed using the non-printing characters in output strings to be printed using the
\x{hh...} notation. Otherwise, those less than 0x100 are output in hex \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
without the curly brackets. without the curly brackets.
Setting compilation controls Setting compilation controls
The following modifiers affect the compilation process or request The following modifiers affect the compilation process or request
information about the pattern: information about the pattern:
bsr=[anycrlf|unicode] specify \R handling bsr=[anycrlf|unicode] specify \R handling
@ -437,7 +438,6 @@ PATTERN MODIFIERS
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
Newline and \R handling Newline and \R handling
@ -468,7 +468,32 @@ PATTERN MODIFIERS
The info modifier requests information about the compiled pattern The info modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The (whether it is anchored, has a fixed first character, and so on). The
information is obtained from the pcre2_pattern_info() function. information is obtained from the pcre2_pattern_info() function. Here
are some typical examples:
re> /(?i)(^a|^b)/m,info
Capturing subpattern count = 1
Compile options: multiline
Overall options: caseless multiline
First code unit at start or follows newline
Subject length lower bound = 1
re> /(?i)abc/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: caseless
First code unit = 'a' (caseless)
Last code unit = 'c' (caseless)
Subject length lower bound = 3
"Compile options" are those specified to the compile function; "overall
options" have added options that are taken or deduced from the pattern.
If both sets of options are the same, just a single "options" line is
output. "First code unit" is where any match must start; if there is
more than one they are listed as "starting code units". "Last code
unit" is the last literal code unit that must be present in any match.
This is not necessarily the last character. These lines are omitted if
no starting or ending code units are recorded.
Specifying a pattern in hex Specifying a pattern in hex
@ -482,7 +507,7 @@ PATTERN MODIFIERS
binary zero characters. By default, pcre2test passes patterns as zero- binary zero characters. By default, pcre2test passes patterns as zero-
terminated strings to pcre2_compile(), giving the length as terminated strings to pcre2_compile(), giving the length as
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal,
the length of the pattern is passed. the actual length of the pattern is passed.
JIT compilation JIT compilation
@ -505,7 +530,7 @@ PATTERN MODIFIERS
size of the JIT stack. size of the JIT stack.
If the jitfast modifier is specified, matching is done using the JIT If the jitfast modifier is specified, matching is done using the JIT
"fast path" interface (pcre2_jit_match()), which skips some of the san- "fast path" interface, pcre2_jit_match(), which skips some of the san-
ity checks that are done by pcre2_match(), and of course does not work ity checks that are done by pcre2_match(), and of course does not work
when JIT is not supported. If jitfast is specified without jit, jit=7 when JIT is not supported. If jitfast is specified without jit, jit=7
is assumed. is assumed.
@ -533,11 +558,16 @@ PATTERN MODIFIERS
Showing pattern memory Showing pattern memory
The /memory modifier causes the size in bytes of the memory block used The /memory modifier causes the size in bytes of the memory used to
to hold the compiled pattern to be output. This does not include the hold the compiled pattern to be output. This does not include the size
size of the pcre2_code block; it is just the actual compiled data. If of the pcre2_code block; it is just the actual compiled data. If the
the pattern is subsequently passed to the JIT compiler, the size of the pattern is subsequently passed to the JIT compiler, the size of the JIT
JIT compiled code is also output. compiled code is also output. Here is an example:
re> /a(b)c/jit,memory
Memory allocation (code space): 21
Memory allocation (JIT code): 1910
Limiting nested parentheses Limiting nested parentheses
@ -573,7 +603,7 @@ PATTERN MODIFIERS
mentation for details). If the number specified by the modifier is mentation for details). If the number specified by the modifier is
greater than zero, pcre2_set_compile_recursion_guard() is called to set greater than zero, pcre2_set_compile_recursion_guard() is called to set
up a callback from pcre2_compile() to a local function. The argument it up a callback from pcre2_compile() to a local function. The argument it
is passed is the current nesting parenthesis depth; if this is greater receives is the current nesting parenthesis depth; if this is greater
than the value given by the modifier, non-zero is returned, causing the than the value given by the modifier, non-zero is returned, causing the
compilation to be aborted. compilation to be aborted.
@ -606,6 +636,7 @@ PATTERN MODIFIERS
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
mark show mark values mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
These modifiers may not appear in a #pattern command. If you want them These modifiers may not appear in a #pattern command. If you want them
@ -671,31 +702,31 @@ SUBJECT MODIFIERS
offset=<n> set starting offset offset=<n> set starting offset
ovector=<n> set size of output vector ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit recursion_limit=<n> set a recursion limit
replace=<string> specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
Showing more text Showing more text
The aftertext modifier requests that as well as outputting the sub- The aftertext modifier requests that as well as outputting the sub-
string that matched the entire pattern, pcre2test should in addition string that matched the entire pattern, pcre2test should in addition
output the remainder of the subject string. This is useful for tests output the remainder of the subject string. This is useful for tests
where the subject contains multiple copies of the same substring. The where the subject contains multiple copies of the same substring. The
allaftertext modifier requests the same action for captured substrings allaftertext modifier requests the same action for captured substrings
as well as the main matched substring. In each case the remainder is as well as the main matched substring. In each case the remainder is
output on the following line with a plus character following the cap- output on the following line with a plus character following the cap-
ture number. ture number.
The allusedtext modifier requests that all the text that was consulted The allusedtext modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown. during a successful pattern match by the interpreter should be shown.
This feature is not supported for JIT matching, and if requested with This feature is not supported for JIT matching, and if requested with
JIT it is ignored (with a warning message). Setting this modifier JIT it is ignored (with a warning message). Setting this modifier
affects the output if there is a lookbehind at the start of a match, or affects the output if there is a lookbehind at the start of a match, or
a lookahead at the end, or if \K is used in the pattern. Characters a lookahead at the end, or if \K is used in the pattern. Characters
that precede or follow the start and end of the actual match are indi- that precede or follow the start and end of the actual match are indi-
cated in the output by '<' or '>' characters underneath them. Here is cated in the output by '<' or '>' characters underneath them. Here is
an example: an example:
re> /(?<=pqr)abc(?=xyz)/ re> /(?<=pqr)abc(?=xyz)/
@ -703,15 +734,15 @@ SUBJECT MODIFIERS
0: pqrabcxyz 0: pqrabcxyz
<<< >>> <<< >>>
This shows that the matched string is "abc", with the preceding and This shows that the matched string is "abc", with the preceding and
following strings "pqr" and "xyz" also consulted during the match. following strings "pqr" and "xyz" also consulted during the match.
The startchar modifier requests that the starting character for the The startchar modifier requests that the starting character for the
match be indicated, if it is different to the start of the matched match be indicated, if it is different to the start of the matched
string. The only time when this occurs is when \K has been processed as string. The only time when this occurs is when \K has been processed as
part of the match. In this situation, the output for the matched string part of the match. In this situation, the output for the matched string
is displayed from the starting character instead of from the match is displayed from the starting character instead of from the match
point, with circumflex characters under the earlier characters. For point, with circumflex characters under the earlier characters. For
example: example:
re> /abc\Kxyz/ re> /abc\Kxyz/
@ -719,7 +750,7 @@ SUBJECT MODIFIERS
0: abcxyz 0: abcxyz
^^^ ^^^
Unlike allusedtext, the startchar modifier can be used with JIT. How- Unlike allusedtext, the startchar modifier can be used with JIT. How-
ever, these two modifiers are mutually exclusive. ever, these two modifiers are mutually exclusive.
Showing the value of all capture groups Showing the value of all capture groups
@ -727,183 +758,223 @@ SUBJECT MODIFIERS
The allcaptures modifier requests that the values of all potential cap- The allcaptures modifier requests that the values of all potential cap-
tured parentheses be output after a match. By default, only those up to tured parentheses be output after a match. By default, only those up to
the highest one actually used in the match are output (corresponding to the highest one actually used in the match are output (corresponding to
the return code from pcre2_match()). Groups that did not take part in the return code from pcre2_match()). Groups that did not take part in
the match are output as "<unset>". the match are output as "<unset>".
Testing callouts Testing callouts
A callout function is supplied when pcre2test calls the library match- A callout function is supplied when pcre2test calls the library match-
ing functions, unless callout_none is specified. If callout_capture is ing functions, unless callout_none is specified. If callout_capture is
set, the current captured groups are output when a callout occurs. set, the current captured groups are output when a callout occurs.
The callout_fail modifier can be given one or two numbers. If there is The callout_fail modifier can be given one or two numbers. If there is
only one number, 1 is returned instead of 0 when a callout of that num- only one number, 1 is returned instead of 0 when a callout of that num-
ber is reached. If two numbers are given, 1 is returned when callout ber is reached. If two numbers are given, 1 is returned when callout
<n> is reached for the <m>th time. <n> is reached for the <m>th time.
The callout_data modifier can be given an unsigned or a negative num- The callout_data modifier can be given an unsigned or a negative num-
ber. Any value other than zero is used as a return from pcre2test's ber. Any value other than zero is used as a return from pcre2test's
callout function. callout function.
Testing substring extraction functions
The copy and get modifiers can be used to test the pcre2_sub-
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
given more than once, and each can specify a group name or number, for
example:
abcd\=copy=1,copy=3,get=G1
If the #subject command is used to set default copy and get lists,
these can be unset by specifying a negative number for numbered groups
and an empty name for named groups.
The getall modifier tests pcre2_substring_list_get(), which extracts
all captured substrings.
If the subject line is successfully matched, the substrings extracted
by the convenience functions are output with C, G, or L after the
string number instead of a colon. This is in addition to the normal
full list. The string length (that is, the return from the extraction
function) is given in parentheses after each substring.
Finding all matches in a string Finding all matches in a string
Searching for all possible matches within a subject can be requested by Searching for all possible matches within a subject can be requested by
the global or /altglobal modifier. After finding a match, the matching the global or /altglobal modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The function is called again to search the remainder of the subject. The
difference between global and altglobal is that the former uses the difference between global and altglobal is that the former uses the
start_offset argument to pcre2_match() or pcre2_dfa_match() to start start_offset argument to pcre2_match() or pcre2_dfa_match() to start
searching at a new point within the entire string (which is what Perl searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened substring. This makes does), whereas the latter passes over a shortened substring. This makes
a difference to the matching process if the pattern begins with a look- a difference to the matching process if the pattern begins with a look-
behind assertion (including \b or \B). behind assertion (including \b or \B).
If an empty string is matched, the next match is done with the If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
for another, non-empty, match at the same point in the subject. If this for another, non-empty, match at the same point in the subject. If this
match fails, the start offset is advanced, and the normal match is match fails, the start offset is advanced, and the normal match is
retried. This imitates the way Perl handles such cases when using the retried. This imitates the way Perl handles such cases when using the
/g modifier or the split() function. Normally, the start offset is /g modifier or the split() function. Normally, the start offset is
advanced by one character, but if the newline convention recognizes advanced by one character, but if the newline convention recognizes
CRLF as a newline, and the current character is CR followed by LF, an CRLF as a newline, and the current character is CR followed by LF, an
advance of two is used. advance of two is used.
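
The advance rule used after a failed retry at an empty match can be sketched as follows. This is a simplified model, not the pcre2test source: `advance_after_empty_match` is a hypothetical name, and a Python string index stands in for "one character", so code units wider than a character are not modelled.

```python
def advance_after_empty_match(subject: str, offset: int,
                              crlf_is_newline: bool) -> int:
    # If the newline convention recognizes CRLF and the current character
    # is CR followed by LF, step over both so the pair is never split.
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    # Otherwise advance by a single character.
    return offset + 1


print(advance_after_empty_match("ab\r\ncd", 2, crlf_is_newline=True))
```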
Testing substring extraction functions
The copy and get modifiers can be used to test the pcre2_sub-
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
given more than once, and each can specify a group name or number, for
example:
abcd\=copy=1,copy=3,get=G1
If the #subject command is used to set default copy and get lists,
these can be unset by specifying a negative number for numbered groups
and an empty name for named groups.
The getall modifier tests pcre2_substring_list_get(), which extracts
all captured substrings.
If the subject line is successfully matched, the substrings extracted
by the convenience functions are output with C, G, or L after the
string number instead of a colon. This is in addition to the normal
full list. The string length (that is, the return from the extraction
function) is given in parentheses after each substring.
Testing the substitution function
If the replace modifier is set, the pcre2_substitute() function is
called instead of one of the matching functions. Unlike subject
strings, pcre2test does not process replacement strings for escape
sequences. In UTF mode, a replacement string is checked to see if it is
a valid UTF-8 string. If so, it is correctly converted to a UTF string
of the appropriate code unit width. If it is not a valid UTF-8 string,
the individual code units are copied directly. This provides a means of
passing an invalid UTF-8 string for testing purposes.
If the global modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
pcre2_substitute(). After a successful substitution, the modified
string is output, preceded by the number of replacements. This may be
zero if there were no matches. Here is a simple example of a substitu-
tion test:
/abc/replace=xxx
=abc=abc=
1: =xxx=abc=
=abc=abc=\=global
2: =xxx=xxx=
Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to
test for buffer overflow, if the replacement string starts with a num-
ber in square brackets, that number is passed to pcre2_substitute() as
the size of the output buffer, with the replacement string starting at
the next character. Here is an example that tests the edge case:
/abc/
123abc123\=replace=[10]XYZ
1: 123XYZ123
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory
A replacement string is ignored with POSIX and DFA matching. Specifying
partial matching provokes an error return ("bad option value") from
pcre2_substitute().
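
The behaviour shown in the two examples above can be imitated with Python's `re` module. This is a sketch under stated assumptions, not a binding to pcre2_substitute(): the function names are hypothetical, the replacement text is inserted literally, and the rule that the buffer must hold the result plus one terminating code unit is inferred from the [10]/[9] example rather than stated in the text.

```python
import re


def parse_replacement(spec: str):
    # Split an optional leading "[n]" buffer size from the replacement text.
    m = re.match(r"\[(\d+)\]", spec)
    if m:
        return int(m.group(1)), spec[m.end():]
    return None, spec


def substitute(pattern: str, subject: str, spec: str, global_: bool = False):
    size, replacement = parse_replacement(spec)
    # count=0 replaces every match (the "global" case); count=1 only the first.
    result, count = re.subn(pattern, lambda _m: replacement, subject,
                            count=0 if global_ else 1)
    # Assumption: the output buffer needs room for a trailing zero as well.
    if size is not None and len(result) + 1 > size:
        raise MemoryError("no more memory")
    return count, result


print(substitute("abc", "=abc=abc=", "xxx"))
print(substitute("abc", "=abc=abc=", "xxx", global_=True))
print(substitute("abc", "123abc123", "[10]XYZ"))
```

With "[9]XYZ" the sketch raises MemoryError, mirroring the "no more memory" failure in the example.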
Setting the JIT stack size Setting the JIT stack size
The jitstack modifier provides a way of setting the maximum stack size The jitstack modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if that is used by the just-in-time optimization code. It is ignored if
JIT optimization is not being used. The value is a number of kilobytes. JIT optimization is not being used. The value is a number of kilobytes.
Providing a stack that is larger than the default 32K is necessary only Providing a stack that is larger than the default 32K is necessary only
for very complicated patterns. for very complicated patterns.
Setting match and recursion limits Setting match and recursion limits
The match_limit and recursion_limit modifiers set the appropriate lim- The match_limit and recursion_limit modifiers set the appropriate lim-
its in the match context. These values are ignored when the find_limits its in the match context. These values are ignored when the find_limits
modifier is specified. modifier is specified.
Finding minimum limits Finding minimum limits
If the find_limits modifier is present, pcre2test calls pcre2_match() If the find_limits modifier is present, pcre2test calls pcre2_match()
several times, setting different values in the match context via several times, setting different values in the match context via
pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
the minimum values for each parameter that allow pcre2_match() to com- the minimum values for each parameter that allow pcre2_match() to com-
plete without error. plete without error.
If JIT is being used, only the match limit is relevant. If DFA matching If JIT is being used, only the match limit is relevant. If DFA matching
is being used, neither limit is relevant, and this modifier is ignored is being used, neither limit is relevant, and this modifier is ignored
(with a warning message). (with a warning message).
The match_limit number is a measure of the amount of backtracking that The match_limit number is a measure of the amount of backtracking that
takes place, and learning the minimum value can be instructive. For takes place, and learning the minimum value can be instructive. For
most simple matches, the number is quite small, but for patterns with most simple matches, the number is quite small, but for patterns with
very large numbers of matching possibilities, it can become large very very large numbers of matching possibilities, it can become large very
quickly with increasing length of subject string. The quickly with increasing length of subject string. The
recursion_limit number is a measure of how much stack (or, if recursion_limit number is a measure of how much stack (or, if
PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to
complete the match attempt. complete the match attempt.
Showing MARK names Showing MARK names
The mark modifier causes the names from backtracking control verbs that The mark modifier causes the names from backtracking control verbs that
are returned from calls to pcre2_match() to be displayed. If a mark is are returned from calls to pcre2_match() to be displayed. If a mark is
returned for a match, non-match, or partial match, pcre2test shows it. returned for a match, non-match, or partial match, pcre2test shows it.
For a match, it is on a line by itself, tagged with "MK:". Otherwise, For a match, it is on a line by itself, tagged with "MK:". Otherwise,
it is added to the non-match message. it is added to the non-match message.
Showing memory usage Showing memory usage
The memory modifier causes pcre2test to log all memory allocation and The memory modifier causes pcre2test to log all memory allocation and
freeing calls that occur during a match operation. freeing calls that occur during a match operation.
Setting a starting offset Setting a starting offset
The offset modifier sets an offset in the subject string at which The offset modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters. matching starts. Its value is a number of code units, not characters.
Setting the size of the output vector Setting the size of the output vector
The ovector modifier applies only to the subject line in which it The ovector modifier applies only to the subject line in which it
appears, though of course it can also be used to set a default in a appears, though of course it can also be used to set a default in a
#subject command. It specifies the number of pairs of offsets that are #subject command. It specifies the number of pairs of offsets that are
available for storing matching information. The default is 15. available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
regexec() to be called with a NULL capture vector. When not testing the regexec() to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause pcre2_match_data_cre- POSIX API, a value of zero is used to cause pcre2_match_data_cre-
ate_from_pattern to be called, in order to create a match block of ate_from_pattern() to be called, in order to create a match block of
exactly the right size for the pattern. (It is not possible to create a exactly the right size for the pattern. (It is not possible to create a
match block with a zero-length ovector; there is always one pair of match block with a zero-length ovector; there is always at least one
offsets.) pair of offsets.)
Passing the subject as zero-terminated Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching func- By default, the subject string is passed to a native API matching func-
tion with its correct length. In order to test the facility for passing tion with its correct length. In order to test the facility for passing
a zero-terminated string, the zero_terminate modifier is provided. It a zero-terminated string, the zero_terminate modifier is provided. It
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
via the POSIX interface, this modifier has no effect, as there is no via the POSIX interface, this modifier has no effect, as there is no
facility for passing a length.) facility for passing a length.)
When testing pcre2_substitute, this modifier also has the effect of When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
THE ALTERNATIVE MATCHING FUNCTION THE ALTERNATIVE MATCHING FUNCTION
By default, pcre2test uses the standard PCRE2 matching function, By default, pcre2test uses the standard PCRE2 matching function,
pcre2_match() to match each subject line. PCRE2 also supports an alter- pcre2_match() to match each subject line. PCRE2 also supports an alter-
native matching function, pcre2_dfa_match(), which operates in a dif- native matching function, pcre2_dfa_match(), which operates in a dif-
ferent way, and has some restrictions. The differences between the two ferent way, and has some restrictions. The differences between the two
functions are described in the pcre2matching documentation. functions are described in the pcre2matching documentation.
If the dfa modifier is set, the alternative matching function is used. If the dfa modifier is set, the alternative matching function is used.
This function finds all possible matches at a given point in the sub- This function finds all possible matches at a given point in the sub-
ject. If, however, the dfa_shortest modifier is set, processing stops ject. If, however, the dfa_shortest modifier is set, processing stops
after the first match is found. This is always the shortest possible after the first match is found. This is always the shortest possible
match. match.
DEFAULT OUTPUT FROM pcre2test DEFAULT OUTPUT FROM pcre2test
This section describes the output when the normal matching function, This section describes the output when the normal matching function,
pcre2_match(), is being used. pcre2_match(), is being used.
When a match succeeds, pcre2test outputs the list of captured sub- When a match succeeds, pcre2test outputs the list of captured sub-
strings, starting with number 0 for the string that matched the whole strings, starting with number 0 for the string that matched the whole
pattern. Otherwise, it outputs "No match" when the return is pattern. Otherwise, it outputs "No match" when the return is
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
this is the entire substring that was inspected during the partial this is the entire substring that was inspected during the partial
match; it may include characters before the actual match start if a match; it may include characters before the actual match start if a
lookbehind assertion, \K, \b, or \B was involved.) lookbehind assertion, \K, \b, or \B was involved.)
For any other return, pcre2test outputs the PCRE2 negative error number For any other return, pcre2test outputs the PCRE2 negative error number
and a short descriptive phrase. If the error is a failed UTF string and a short descriptive phrase. If the error is a failed UTF string
check, the offset of the start of the failing character and the reason check, the offset of the start of the failing character and the reason
code are also output. Here is an example of an interactive pcre2test code are also output. Here is an example of an interactive pcre2test
run. run.
$ pcre2test $ pcre2test
@ -917,10 +988,10 @@ DEFAULT OUTPUT FROM pcre2test
No match No match
Unset capturing substrings that are not followed by one that is set are Unset capturing substrings that are not followed by one that is set are
not returned by pcre2_match(), and are not shown by pcre2test. In the not returned by pcre2_match(), and are not shown by pcre2test. In the
following example, there are two capturing substrings, but when the following example, there are two capturing substrings, but when the
first data line is matched, the second, unset substring is not shown. first data line is matched, the second, unset substring is not shown.
An "internal" unset substring is shown as "<unset>", as for the second An "internal" unset substring is shown as "<unset>", as for the second
data line. data line.
re> /(a)|(b)/ re> /(a)|(b)/
@ -932,11 +1003,11 @@ DEFAULT OUTPUT FROM pcre2test
1: <unset> 1: <unset>
2: b 2: b
If the strings contain any non-printing characters, they are output as If the strings contain any non-printing characters, they are output as
\xhh escapes if the value is less than 256 and UTF mode is not set. \xhh escapes if the value is less than 256 and UTF mode is not set.
Otherwise they are output as \x{hh...} escapes. See below for the defi- Otherwise they are output as \x{hh...} escapes. See below for the defi-
nition of non-printing characters. If the /aftertext modifier is set, nition of non-printing characters. If the /aftertext modifier is set,
the output for substring 0 is followed by the rest of the subject the output for substring 0 is followed by the rest of the subject
string, identified by "0+" like this: string, identified by "0+" like this:
re> /cat/aftertext re> /cat/aftertext
@ -944,7 +1015,7 @@ DEFAULT OUTPUT FROM pcre2test
0: cat
0+ aract
If global matching is requested, the results of successive matching
attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@ -956,8 +1027,8 @@ DEFAULT OUTPUT FROM pcre2test
0: ipp
1: pp
"No match" is output only if the first match attempt fails. Here is an
example of a failure message (the offset 4 that is specified by \>4 is
past the end of the subject string):
re> /xyz/
@ -965,7 +1036,7 @@ DEFAULT OUTPUT FROM pcre2test
Error -24 (bad offset value)
Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), subject lines may not. However
newlines can be included in a subject by means of the \n escape (or \r,
\r\n, etc., depending on the newline sequence setting).
@ -973,7 +1044,7 @@ DEFAULT OUTPUT FROM pcre2test
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

When the alternative matching function, pcre2_dfa_match(), is used, the
output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/
@ -982,11 +1053,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
(Using the normal matching function on this data finds only "tang".)
The longest matching string is always given first (and numbered zero).
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
followed by the partially matching substring. (Note that this is the
entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser-
tion, \K, \b, or \B was involved.)
@ -1002,16 +1073,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
The alternative matching function does not support substring capture,
so the modifiers that are concerned with captured substrings are not
relevant.

RESTARTING AFTER A PARTIAL MATCH

When the alternative matching function has given the PCRE2_ERROR_PAR-
TIAL return, indicating that the subject partially matched the pattern,
you can restart the match with additional subject data by means of the
dfa_restart modifier. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@ -1020,29 +1091,29 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\=dfa,dfa_restart
0: n05
For further information about partial matching, see the pcre2partial
documentation.

CALLOUTS

If the pattern contains any callout requests, pcre2test's callout func-
tion is called during matching. This works with both matching func-
tions. By default, the called function displays the callout number, the
start and current positions in the text at the callout time, and the
next pattern item to be tested. For example:

  --->pqrabcdef
    0    ^  ^     \d

This output indicates that callout number 0 occurred for a match
attempt starting at the fourth character of the subject string, when
the pointer was at the seventh character, and when the next pattern
item was \d. Just one circumflex is output if the start and current
positions are the same.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the /auto_callout pattern modifier. In this case, instead
of showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example:
@ -1056,7 +1127,7 @@ CALLOUTS
0: E*
If a pattern contains (*MARK) items, an additional line is output when-
ever a change of latest mark is passed to the callout function. For
example:
re> /a(*MARK:X)bc/auto_callout
@ -1070,30 +1141,30 @@ CALLOUTS
+12 ^ ^
0: abc
The mark changes between matching "a" and "b", but stays the same for
the rest of the match, so nothing more is output. If, as a result of
backtracking, the mark reverts to being unset, the text "<unset>" is
output.
The callout function in pcre2test returns zero (carry on matching) by
default, but you can use a callout_fail modifier in a subject line (as
described above) to change this and other parameters of the callout.
Inserting callouts can be helpful when using pcre2test to check compli-
cated regular expressions. For further information about callouts, see
the pcre2callout documentation.

NON-PRINTING CHARACTERS

When pcre2test is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.
When pcre2test is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been
set for the pattern (using the /locale modifier). In this case, the
isprint() function is used to distinguish printing and non-printing
characters.
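
The default rule described above (bytes outside 32-126 shown as hex escapes) can be sketched in C as follows. This is an illustration only: format_printable() is a hypothetical helper, not a function in pcre2test, and it covers only the non-UTF \xhh form, not the locale/isprint() case or the \x{hh...} form.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of pcre2test): copy len bytes from s into
   out, which has room for outsize chars. Bytes in the range 32..126 are
   copied unchanged; every other byte is written as a four-character \xhh
   escape. Returns the number of chars written, excluding the terminating
   zero, or -1 if out is too small. */
int format_printable(const unsigned char *s, size_t len,
                     char *out, size_t outsize)
{
  size_t used = 0;
  for (size_t i = 0; i < len; i++)
    {
    unsigned int c = s[i];
    if (c >= 32 && c <= 126)
      {
      if (used + 2 > outsize) return -1;    /* char plus terminator */
      out[used++] = (char)c;
      }
    else
      {
      if (used + 5 > outsize) return -1;    /* \xhh plus terminator */
      snprintf(out + used, outsize - used, "\\x%02x", c);
      used += 4;
      }
    }
  if (used + 1 > outsize) return -1;
  out[used] = 0;
  return (int)used;
}
```

With this sketch, the three bytes 'a', tab, 'b' come out as the six characters a\x09b, and the single byte 0xe1 as \xe1.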
@ -1112,5 +1183,5 @@ AUTHOR
REVISION
-Last updated: 09 November 2014
+Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge.


@ -69,7 +69,7 @@ Arguments:
Returns: >= 0 number of substitutions made
< 0 an error code
PCRE2_ERROR_BADREPLACEMENT means invalid use of $
*/
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
@ -84,14 +84,14 @@ uint32_t ovector_count;
uint32_t goptions = 0;
BOOL match_data_created = FALSE;
BOOL global = FALSE;
-PCRE2_SIZE buff_offset, lengthleft, endlength;
+PCRE2_SIZE buff_offset, lengthleft, fraglength;
PCRE2_SIZE *ovector;
/* Partial matching is not valid. */
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0)
  return PCRE2_ERROR_BADOPTION;
/* If no match data block is provided, create one. */
if (match_data == NULL)
@ -120,7 +120,7 @@ if ((code->overall_options & PCRE2_UTF) != 0 &&
}
}
#endif /* SUPPORT_UNICODE */
/* Notice the global option and remove it from the options that are passed to
pcre2_match(). */
@ -151,17 +151,20 @@ do
rc = pcre2_match(code, subject, length, start_offset, options|goptions,
  match_data, mcontext);
/* Any error other than no match returns the error code. No match when not
doing the special after-empty-match global rematch, or when at the end of the
-subject, breaks the global loop. Otherwise, advance the starting point and
-try again. */
+subject, breaks the global loop. Otherwise, advance the starting point by one
+character, copying it to the output, and try again. */
if (rc < 0)
{
+PCRE2_SIZE save_start;
if (rc != PCRE2_ERROR_NOMATCH) goto EXIT;
if (goptions == 0 || start_offset >= length) break;
-start_offset++;
+save_start = start_offset++;
if ((code->overall_options & PCRE2_UTF) != 0)
{
#if PCRE2_CODE_UNIT_WIDTH == 8
@ -173,20 +176,28 @@ do
start_offset++;
#endif
}
+fraglength = start_offset - save_start;
+if (lengthleft < fraglength) goto NOROOM;
+memcpy(buffer + buff_offset, subject + save_start,
+  fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
+buff_offset += fraglength;
+lengthleft -= fraglength;
goptions = 0;
continue;
}
/* Handle a successful match. */
subs++;
if (rc == 0) rc = ovector_count;
-endlength = ovector[0] - start_offset;
-if (endlength >= lengthleft) goto NOROOM;
+fraglength = ovector[0] - start_offset;
+if (fraglength >= lengthleft) goto NOROOM;
memcpy(buffer + buff_offset, subject + start_offset,
-  endlength*(PCRE2_CODE_UNIT_WIDTH/8));
-buff_offset += endlength;
-lengthleft -= endlength;
+  fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
+buff_offset += fraglength;
+lengthleft -= fraglength;
for (i = 0; i < rlength; i++)
{
@ -196,11 +207,11 @@ do
BOOL inparens;
PCRE2_SIZE sublength;
PCRE2_UCHAR next;
PCRE2_UCHAR name[33];
if (++i == rlength) goto BAD;
if ((next = replacement[i]) == CHAR_DOLLAR_SIGN) goto LITERAL;
group = -1;
n = 0;
inparens = FALSE;
@ -232,7 +243,7 @@ do
if (i == rlength) break;
next = replacement[++i];
}
if (n == 0) goto BAD;
name[n] = 0;
}
@ -241,7 +252,7 @@ do
if (i == rlength || next != CHAR_RIGHT_CURLY_BRACKET) goto BAD;
}
else i--; /* Last code unit of name/number */
/* Have found a syntactically correct group number or name. */
sublength = lengthleft;
@ -251,8 +262,8 @@ do
else
rc = pcre2_substring_copy_bynumber(match_data, group,
  buffer + buff_offset, &sublength);
if (rc < 0) goto EXIT;
buff_offset += sublength;
lengthleft -= sublength;
}
@ -279,17 +290,17 @@ do
/* Copy the rest of the subject and return the number of substitutions. */
rc = subs;
-endlength = length - start_offset;
-if (endlength + 1 > lengthleft) goto NOROOM;
+fraglength = length - start_offset;
+if (fraglength + 1 > lengthleft) goto NOROOM;
memcpy(buffer + buff_offset, subject + start_offset,
-  endlength*(PCRE2_CODE_UNIT_WIDTH/8));
-buff_offset += endlength;
+  fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
+buff_offset += fraglength;
buffer[buff_offset] = 0;
*blength = buff_offset;
EXIT:
if (match_data_created) pcre2_match_data_free(match_data);
else match_data->rc = rc;
return rc;
NOROOM:
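
The change in the no-match branch (after an empty match in global mode, step forward by one character and copy that character to the output) can be illustrated in isolation. The following is a minimal sketch for 8-bit code units only, not the code from this commit: copy_one_utf8_char() is a hypothetical helper, and the continuation-byte test (c & 0xc0) == 0x80 is an assumption about the loop condition, which the hunk above does not show.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the single UTF-8 character that starts at
   subject[*offset] to out[*out_used], advancing both counters. Assumes the
   subject is valid UTF-8, *offset < length, and *out_used <= outsize.
   Returns 0 on success, or -1 (leaving both counters unchanged) when the
   output buffer has no room for the character. */
int copy_one_utf8_char(const unsigned char *subject, size_t length,
                       size_t *offset, unsigned char *out, size_t outsize,
                       size_t *out_used)
{
  size_t start = (*offset)++;

  /* Assumed rule: bytes of the form 10xxxxxx continue the same character. */
  while (*offset < length && (subject[*offset] & 0xc0) == 0x80) (*offset)++;

  size_t fraglength = *offset - start;
  if (outsize - *out_used < fraglength)
    {
    *offset = start;
    return -1;
    }
  memcpy(out + *out_used, subject + start, fraglength);
  *out_used += fraglength;
  return 0;
}
```

Copying the skipped character matters because the bytes between two match attempts would otherwise be missing from the substituted string.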


@ -164,11 +164,12 @@ void vms_setsymbol( char *, char *, int );
#define DFA_WS_DIMENSION 1000 /* Size of DFA workspace */
#define DEFAULT_OVECCOUNT 15 /* Default ovector count */
#define JUNK_OFFSET 0xdeadbeef /* For initializing ovector */
+#define LOCALESIZE 32 /* Size of locale name */
#define LOOPREPEAT 500000 /* Default loop count for timing */
#define REPLACE_MODSIZE 96 /* Field for reading 8-bit replacement */
#define VERSION_SIZE 64 /* Size of buffer for the version strings */
/* Make sure the buffer into which replacement strings are copied is big enough
to hold them as 32-bit code units. */
#define REPLACE_BUFFSIZE (4*REPLACE_MODSIZE)
@ -263,9 +264,9 @@ these inclusions should not be changed. */
#define PCRE2_SUFFIX(a) a
/* We need to be able to check input text for UTF-8 validity, whatever code
widths are actually available, because the input to pcre2test is always in
8-bit code units. So we include the UTF validity checking function for 8-bit
code units. */
extern int valid_utf(PCRE2_SPTR8, PCRE2_SIZE, PCRE2_SIZE *);
@ -388,10 +389,10 @@ data line. */
CTL_MARK|\
CTL_MEMORY|\
CTL_STARTCHAR)
/* Structures for holding modifier information for patterns and subject strings
(data). Fields containing modifiers that can be set either for a pattern or a
subject must be at the start and in the same order in both cases so that the
same offset in the big table below works for both. */
typedef struct patctl { /* Structure for pattern modifiers. */
@ -401,7 +402,7 @@ typedef struct patctl { /* Structure for pattern modifiers. */
uint32_t jit;
uint32_t stackguard_test;
uint32_t tables_id;
-uint8_t locale[32];
+uint8_t locale[LOCALESIZE];
} patctl;
#define MAXCPYGET 10
@ -486,7 +487,7 @@ static modstruct modlist[] = {
{ "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) },
{ "jitstack", MOD_DAT, MOD_INT, 0, DO(jitstack) },
{ "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) },
-{ "locale", MOD_PAT, MOD_STR, 0, PO(locale) },
+{ "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) },
{ "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) },
{ "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) },
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
@ -512,7 +513,7 @@ static modstruct modlist[] = {
{ "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) },
{ "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) },
{ "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) },
-{ "replace", MOD_PND, MOD_STR, 0, PO(replacement) },
+{ "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) },
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
@ -3141,6 +3142,12 @@ for (;;)
break;
case MOD_STR:
+if (len + 1 > m->value)
+{
+fprintf(outfile, "** Overlong value for '%s' (max %d code units)\n",
+  m->name, m->value - 1);
+return FALSE;
+}
memcpy(field, pp, len);
((uint8_t *)field)[len] = 0;
pp = ep;
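
The added MOD_STR check rejects a string value unless the value plus its terminating zero fits in the destination field, whose size now comes from the modifier table (LOCALESIZE or REPLACE_MODSIZE). The pattern can be sketched on its own like this; store_string_field() and its parameter names are hypothetical and are not taken from pcre2test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store a len-byte value in a fixed-size field as a
   zero-terminated string. Returns false, leaving the field untouched, when
   the value plus its terminator does not fit; this corresponds to the
   "len + 1 > m->value" test in the hunk above. */
bool store_string_field(unsigned char *field, size_t fieldsize,
                        const char *value, size_t len)
{
  if (len + 1 > fieldsize) return false;
  memcpy(field, value, len);
  field[len] = 0;
  return true;
}
```

A field of size N therefore holds at most N - 1 code units, which is the limit the new error message reports as m->value - 1.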
@ -3974,8 +3981,8 @@ if (TEST(compiled_code, ==, NULL))
if (pattern_info(PCRE2_INFO_MAXLOOKBEHIND, &maxlookbehind, FALSE) != 0)
  return PR_ABEND;
/* Call the JIT compiler if requested. When timing, we must free and recompile
the pattern each time because that is the only way to free the JIT compiled
code. We know that compilation will always succeed. */
if (pat_patctl.jit != 0)
@ -3992,7 +3999,7 @@ if (pat_patctl.jit != 0)
pat_patctl.options|forbid_utf, &errorcode, &erroroffset, pat_context);
start_time = clock();
PCRE2_JIT_COMPILE(compiled_code, pat_patctl.jit);
time_taken += clock() - start_time;
}
total_jit_compile_time += time_taken;
fprintf(outfile, "JIT compile %.4f milliseconds\n",
@ -4000,9 +4007,9 @@ if (pat_patctl.jit != 0)
(double)CLOCKS_PER_SEC);
}
else
{
PCRE2_JIT_COMPILE(compiled_code, pat_patctl.jit);
}
}
/* Output code size and other information if requested. */
@ -4765,8 +4772,8 @@ else
PCRE2_MATCH_DATA_FREE(match_data);
PCRE2_MATCH_DATA_CREATE(match_data, max_oveccount, NULL);
}
/* Replacement processing is ignored for DFA matching. */
if (dat_datctl.replacement[0] != 0 && (dat_datctl.control & CTL_DFA) != 0)
{
@ -4799,7 +4806,7 @@ if (dat_datctl.replacement[0] != 0)
#endif
if (timeitm)
  fprintf(outfile, "** Timing is not supported with replace: ignored\n");
goption = ((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
  PCRE2_SUBSTITUTE_GLOBAL;
@ -4828,21 +4835,21 @@ if (dat_datctl.replacement[0] != 0)
nsize = n;
}
/* Now copy the replacement string to a buffer of the appropriate width. No
escape processing is done for replacements. In UTF mode, check for an invalid
UTF-8 input string, and if it is invalid, just copy its code units without
UTF interpretation. This provides a means of checking that an invalid string
is detected. Otherwise, UTF-8 can be used to include wide characters in a
replacement. */
if (utf) badutf = valid_utf(pr, strlen((const char *)pr), &erroroffset);
/* Not UTF or invalid UTF-8: just copy the code units. */
if (!utf || badutf)
{
while ((c = *pr++) != 0)
{
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE) *r8++ = c;
#endif
@ -4854,9 +4861,9 @@ if (dat_datctl.replacement[0] != 0)
#endif
}
}
/* Valid UTF-8 replacement string */
else while ((c = *pr++) != 0)
{
if (HASUTF8EXTRALEN(c)) { GETUTF8INC(c, pr); }
@ -6314,7 +6321,7 @@ if (INTERACTIVE(infile)) fprintf(outfile, "\n");
if (showtotaltimes)
{
const char *pad = "";
fprintf(outfile, "--------------------------------------\n");
if (timeit > 0)
{
@ -6325,7 +6332,7 @@ if (showtotaltimes)
fprintf(outfile, "Total JIT compile %.4f milliseconds\n",
  (((double)total_jit_compile_time * 1000.0) / (double)timeit) /
  (double)CLOCKS_PER_SEC);
pad = " ";
}
fprintf(outfile, "Total match time %s%.4f milliseconds\n", pad,
  (((double)total_match_time * 1000.0) / (double)timeitm) /

testdata/testinput2

@ -4073,6 +4073,9 @@ a random value. /Ix
123abc456abc789
123abc456abc789\=g
+/(?<=abc)(|def)/g,replace=<$0>
+123abcxyzabcdef789abcpqr
# End of substitute tests
# End of testinput2

testdata/testinput5

@ -1633,4 +1633,7 @@
/ábc/utf,replace=XሴZ
123ábc123
+/(?<=abc)(|def)/g,utf,replace=<$0>
+123abcáyzabcdef789abcሴqr
# End of testinput5


@ -13699,6 +13699,10 @@ Failed: error -34: bad option value
123abc456abc789\=g
2: 123xyz456xyz789
+/(?<=abc)(|def)/g,replace=<$0>
+123abcxyzabcdef789abcpqr
+4: 123abc<>xyzabc<><def>789abc<>pqr
# End of substitute tests
# End of testinput2


@ -4002,4 +4002,8 @@ Subject length lower bound = 1
123ábc123
1: 123X\x{1234}Z123
+/(?<=abc)(|def)/g,utf,replace=<$0>
+123abcáyzabcdef789abcሴqr
+4: 123abc<>\x{e1}yzabc<><def>789abc<>\x{1234}qr
# End of testinput5