Further substitution tests (code and data), and more documentation.

Philip.Hazel 2014-11-14 18:41:20 +00:00
parent adc7be2d3a
commit 07f8372202
21 changed files with 1014 additions and 693 deletions


@ -51,4 +51,6 @@ the current group as "unset". Thus, the ovector for those groups contained
whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when
matched against "abcd". matched against "abcd".
8. The pcre2_substitute() function has been implemented.
**** ****


@ -135,7 +135,7 @@ remaining sections, except for the <b>pcre2demo</b> section (which is a program
listing), and the short pages for individual functions, are concatenated in listing), and the short pages for individual functions, are concatenated in
<b>pcre2.txt</b>, for ease of searching. The sections are as follows: <b>pcre2.txt</b>, for ease of searching. The sections are as follows:
<pre> <pre>
pcre2 this document FIXME CHECK THIS LIST pcre2 this document
pcre2-config show PCRE2 installation configuration information pcre2-config show PCRE2 installation configuration information
pcre2api details of PCRE2's native C API pcre2api details of PCRE2's native C API
pcre2build building PCRE2 pcre2build building PCRE2


@ -1089,7 +1089,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
Which characters are interpreted as newlines can be specified by a setting in Which characters are interpreted as newlines can be specified by a setting in
the compile context that is passed to <b>pcre2_compile()</b> or by a special the compile context that is passed to <b>pcre2_compile()</b> or by a special
sequence at the start of the pattern, as described in the section entitled sequence at the start of the pattern, as described in the section entitled
<a href="pcrepattern.html#newlines">"Newline conventions"</a> <a href="pcre2pattern.html#newlines">"Newline conventions"</a>
in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
built. built.
<pre> <pre>
@ -1243,7 +1243,7 @@ This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII characters \w, and some of the POSIX character classes. By default, only ASCII characters
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
classify characters. More details are given in the section on classify characters. More details are given in the section on
<a href="pcre2.html#genericchartypes">generic character types</a> <a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much page. If you set PCRE2_UCP, matching one of the items it affects takes much
@ -1924,11 +1924,8 @@ documentation.
<P> <P>
When PCRE2 is built, a default newline convention is set; this is usually the When PCRE2 is built, a default newline convention is set; this is usually the
standard convention for the operating system. The default can be overridden in standard convention for the operating system. The default can be overridden in
either a a
<a href="#compilecontext">compile context</a> <a href="#compilecontext">compile context.</a>
or a
<a href="#matchcontext">match context.</a>
However, changing the newline convention at match time disables JIT matching.
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. position is advanced after a match failure for an unanchored pattern.
@ -2290,7 +2287,7 @@ subpattern <i>n</i> has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings. substrings.
<a name="extractbynname"></a></P> <a name="extractbyname"></a></P>
<br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br> <br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P> <P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b> <b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
@ -2358,7 +2355,8 @@ string in <i>outputbuffer</i>, replacing the part that was matched with the
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
</P> </P>
<P> <P>
In the replacement string, which is interpreted as a UTF string in UTF mode, a In the replacement string, which is interpreted as a UTF string in UTF mode,
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
dollar character is an escape character that can specify the insertion of dollar character is an escape character that can specify the insertion of
characters from capturing groups in the pattern. The following forms are characters from capturing groups in the pattern. The following forms are
recognized: recognized:


@ -51,11 +51,12 @@ JIT support is an optional feature of PCRE2. The "configure" option
you want to use JIT. The support is limited to the following hardware you want to use JIT. The support is limited to the following hardware
platforms: platforms:
<pre> <pre>
ARM v5, v7, and Thumb2 ARM 32-bit (v5, v7, and Thumb2)
ARM 64-bit
Intel x86 32-bit and 64-bit Intel x86 32-bit and 64-bit
MIPS 32-bit MIPS 32-bit and 64-bit
Power PC 32-bit and 64-bit Power PC 32-bit and 64-bit
SPARC 32-bit (experimental) SPARC 32-bit
</pre> </pre>
If --enable-jit is set on an unsupported platform, compilation fails. If --enable-jit is set on an unsupported platform, compilation fails.
</P> </P>
@ -73,11 +74,11 @@ To make use of the JIT support in the simplest way, all you have to do is to
call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with
<b>pcre2_compile()</b>. This function has two arguments: the first is the <b>pcre2_compile()</b>. This function has two arguments: the first is the
compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the
second is a set of option bits, which must include at least one of second is zero or more of the following option bits: PCRE2_JIT_COMPLETE,
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
</P> </P>
<P> <P>
If JIT support is not available, a call to <b>pcre2_jit_comple()</b> does If JIT support is not available, a call to <b>pcre2_jit_compile()</b> does
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern
is passed to the JIT compiler, which turns it into machine code that executes is passed to the JIT compiler, which turns it into machine code that executes
much faster than the normal interpretive code, but yields exactly the same much faster than the normal interpretive code, but yields exactly the same
@ -95,6 +96,20 @@ appropriate code is run if it is available. Otherwise, the pattern is matched
using interpretive code. using interpretive code.
</P> </P>
<P> <P>
You can call <b>pcre2_jit_compile()</b> multiple times for the same compiled
pattern. It does nothing if it has previously compiled code for any of the
option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
(perhaps later, when you find you need partial matching) again with
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
returns zero. This is an alternative way of testing if JIT is available.
</P>
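The repeated-call behaviour described above can be pictured as simple bit masking. The following is a minimal sketch, not PCRE2's actual implementation: the names JIT_COMPLETE, JIT_PARTIAL_SOFT, JIT_PARTIAL_HARD, JIT_ALL, and jit_bits_to_compile() are hypothetical stand-ins for the real option bits and internal logic, and the numeric values are chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three JIT option bits; the values are
   illustrative and are not taken from pcre2.h. */
enum {
  JIT_COMPLETE     = 1u << 0,
  JIT_PARTIAL_SOFT = 1u << 1,
  JIT_PARTIAL_HARD = 1u << 2,
  JIT_ALL          = JIT_COMPLETE | JIT_PARTIAL_SOFT | JIT_PARTIAL_HARD
};

/* Return the modes that still need compiling: the requested bits minus
   those already compiled. Updates *compiled to record the new state. */
static uint32_t jit_bits_to_compile(uint32_t *compiled, uint32_t requested)
{
  assert(compiled != NULL);
  assert((requested & ~(uint32_t)JIT_ALL) == 0);  /* only known bits */
  uint32_t todo = requested & ~*compiled;
  *compiled |= todo;
  return todo;
}
```

Under these assumptions, a first request for JIT_COMPLETE yields JIT_COMPLETE, and a later request for JIT_COMPLETE plus JIT_PARTIAL_HARD yields only JIT_PARTIAL_HARD. A request with no bits set yields nothing to compile.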
<P>
At present, it is not possible to free JIT compiled code except when the entire
compiled pattern is freed by calling <b>pcre2_free_code()</b>.
</P>
<P>
In some circumstances you may need to call additional functions. These are In some circumstances you may need to call additional functions. These are
described in the section entitled described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a> <a href="#stackcontrol">"Controlling the JIT stack"</a>
@ -167,7 +182,7 @@ memory allocation), a starting size and a maximum size, and it returns a
pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
that is no longer needed. (For the technically minded: the address space is that is no longer needed. (For the technically minded: the address space is
allocated by mmap or VirtualAlloc.) FIXME Is this right? allocated by mmap or VirtualAlloc.)
</P> </P>
<P> <P>
JIT uses far less memory for recursion than the interpretive code, JIT uses far less memory for recursion than the interpretive code,
@ -187,7 +202,8 @@ passed to a matching function, its information determines which JIT stack is
used. There are three cases for the values of the other two options: used. There are three cases for the values of the other two options:
<pre> <pre>
(1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block (1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block
on the machine stack is used. on the machine stack is used. This is the default when a match
context is created.
(2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be (2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
a pointer to a valid JIT stack, the result of calling a pointer to a valid JIT stack, the result of calling
@ -402,7 +418,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br> <br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 08 November 2014 Last updated: 12 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -100,8 +100,8 @@ page.
<P> <P>
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is set at compile time, (*UTF) is not allowed, and its appearance causes option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
an error. appearance in a pattern causes an error.
</P> </P>
<br><b> <br><b>
Unicode property support Unicode property support
@ -113,6 +113,22 @@ such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup instead of recognizing only characters with codes less than 128 via a lookup
table. table.
</P> </P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
</P>
<br><b>
Locking out empty string matching
</b><br>
<P>
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
</P>
<br><b> <br><b>
Disabling auto-possessification Disabling auto-possessification
</b><br> </b><br>
@ -133,6 +149,28 @@ PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
reaching "no match" results. For more details, see the reaching "no match" results. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>
documentation. documentation.
</P>
<br><b>
Setting match and recursion limits
</b><br>
<P>
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
internal <b>match()</b> function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
gives an error return. The limits can also be set by items at the start of the
pattern of the form
<pre>
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
<a name="newlines"></a></P> <a name="newlines"></a></P>
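The rule that a pattern can lower a limit but never raise it amounts to taking a minimum. Here is a small sketch of that rule only; effective_limit() is a hypothetical helper written for illustration and is not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine the caller's limit with any (*LIMIT_...=d) values found at the
   start of a pattern. A pattern value only takes effect if it is lower
   than the caller's limit, and the lowest of several values wins. */
static uint32_t effective_limit(uint32_t caller_limit,
                                const uint32_t *pattern_values, size_t count)
{
  assert(count == 0 || pattern_values != NULL);
  uint32_t limit = caller_limit;
  for (size_t i = 0; i < count; i++) {
    if (pattern_values[i] < limit) limit = pattern_values[i];
  }
  return limit;
}
```

With a caller limit of 1000, a pattern setting of 5000 is ignored, whereas settings of 500, 200, and 800 together give an effective limit of 200.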
<br><b> <br><b>
Newline conventions Newline conventions
@ -179,26 +217,14 @@ below. A change of \R setting can be combined with a change of newline
convention. convention.
</P> </P>
<br><b> <br><b>
Setting match and recursion limits Specifying what \R matches
</b><br> </b><br>
<P> <P>
The caller of <b>pcre2_match()</b> can set a limit on the number of times the It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
internal <b>match()</b> function is called and on the maximum depth of complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
recursive calls. These facilities are provided to catch runaway matches that at compile time. This effect can also be achieved by starting a pattern with
are provoked by patterns with huge matching trees (a typical example is a (*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
pattern with nested unlimited repeats) and to avoid running out of system stack corresponding to PCRE2_BSR_UNICODE.
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
gives an error return. The limits can also be set by items at the start of the
pattern of the form
<pre>
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
</P> </P>
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br> <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P> <P>
@ -2280,8 +2306,8 @@ complex:
</PRE> </PRE>
</P> </P>
<P> <P>
There are four kinds of condition: references to subpatterns, references to There are five kinds of condition: references to subpatterns, references to
recursion, a pseudo-condition called DEFINE, and assertions. recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
</P> </P>
<br><b> <br><b>
Checking for a used subpattern by number Checking for a used subpattern by number
@ -2389,6 +2415,23 @@ pattern uses references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at each end. components of an IPv4 address, insisting on a word boundary at each end.
</P> </P>
<br><b> <br><b>
Checking the PCRE2 version
</b><br>
<P>
Programs that link with a PCRE2 library can check the version by calling
<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
not have access to the underlying code cannot do this. A special "condition"
called VERSION exists to allow such users to discover which version of PCRE2
they are dealing with by using this condition to match a string such as
"yesno". VERSION must be followed either by "=" or "&#62;=" and a version number.
For example:
<pre>
(?(VERSION&#62;=10.4)yes|no)
</pre>
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
"no" otherwise.
</P>
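The comparison that a VERSION condition performs can be sketched as follows. This is an illustration under stated assumptions, not the library's code: version_condition() is a hypothetical helper, and both version numbers are taken as already-parsed integers, so how the digits of the minor number are read from the pattern text is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Evaluate a VERSION condition against the running library's version.
   'at_least' is true for the ">=" form and false for the "=" form.
   Parsing the version text is outside this sketch; all four numbers
   are assumed to be already-parsed, non-negative integers. */
static bool version_condition(int lib_major, int lib_minor,
                              bool at_least, int want_major, int want_minor)
{
  assert(lib_major >= 0 && lib_minor >= 0);
  assert(want_major >= 0 && want_minor >= 0);
  if (!at_least)
    return lib_major == want_major && lib_minor == want_minor;
  if (lib_major != want_major)
    return lib_major > want_major;
  return lib_minor >= want_minor;
}
```

When the function returns true the "yes" branch of the conditional group is the one that would be tried; otherwise the "no" branch is.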
<br><b>
Assertion conditions Assertion conditions
</b><br> </b><br>
<P> <P>
@ -3180,7 +3223,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3), <b>pcre216(3)</b>, <b>pcre232(3)</b>. <b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P> </P>
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br> <br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
<P> <P>
@ -3193,7 +3236,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 November 2014 Last updated: 14 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -493,17 +493,18 @@ Each top-level branch of a look behind must be of a fixed length.
(?(condition)yes-pattern) (?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern) (?(condition)yes-pattern|no-pattern)
(?(n)... absolute reference condition (?(n) absolute reference condition
(?(+n)... relative reference condition (?(+n) relative reference condition
(?(-n)... relative reference condition (?(-n) relative reference condition
(?(&#60;name&#62;)... named reference condition (Perl) (?(&#60;name&#62;) named reference condition (Perl)
(?('name')... named reference condition (Perl) (?('name') named reference condition (Perl)
(?(name)... named reference condition (PCRE2) (?(name) named reference condition (PCRE2)
(?(R)... overall recursion condition (?(R) overall recursion condition
(?(Rn)... specific group recursion condition (?(Rn) specific group recursion condition
(?(R&name)... specific recursion condition (?(R&name) specific recursion condition
(?(DEFINE)... define subpattern for reference (?(DEFINE) define subpattern for reference
(?(assert)... assertion condition (?(VERSION[&#62;]=n.m) test PCRE2 version
(?(assert) assertion condition
</PRE> </PRE>
</P> </P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br> <br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
@ -552,7 +553,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 20 October 2014 Last updated: 14 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -201,10 +201,11 @@ Behave as if each subject line contains the given modifiers.
<P> <P>
<b>-t</b> <b>-t</b>
Run each compile and match many times with a timer, and output the resulting Run each compile and match many times with a timer, and output the resulting
times per compile or match. You can control the number of iterations that are times per compile or match. When JIT is used, separate times are given for the
used for timing by following <b>-t</b> with a number (as a separate item on the initial compile and the JIT compile. You can control the number of iterations
command line). For example, "-t 1000" iterates 1000 times. The default is to that are used for timing by following <b>-t</b> with a number (as a separate
iterate 500,000 times. item on the command line). For example, "-t 1000" iterates 1000 times. The
default is to iterate 500,000 times.
</P> </P>
<P> <P>
<b>-tm</b> <b>-tm</b>
@ -490,7 +491,6 @@ about the pattern:
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
</pre> </pre>
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
</P> </P>
<br><b> <br><b>
Newline and \R handling Newline and \R handling
@ -528,7 +528,31 @@ one-off tests.
<P> <P>
The <b>info</b> modifier requests information about the compiled pattern The <b>info</b> modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The (whether it is anchored, has a fixed first character, and so on). The
information is obtained from the <b>pcre2_pattern_info()</b> function. information is obtained from the <b>pcre2_pattern_info()</b> function. Here are
some typical examples:
<pre>
re&#62; /(?i)(^a|^b)/m,info
Capturing subpattern count = 1
Compile options: multiline
Overall options: caseless multiline
First code unit at start or follows newline
Subject length lower bound = 1
re&#62; /(?i)abc/info
Capturing subpattern count = 0
Compile options: &#60;none&#62;
Overall options: caseless
First code unit = 'a' (caseless)
Last code unit = 'c' (caseless)
Subject length lower bound = 3
</pre>
"Compile options" are those specified to the compile function; "overall
options" have added options that are taken or deduced from the pattern. If both
sets of options are the same, just a single "options" line is output. "First
code unit" is where any match must start; if there is more than one they are
listed as "starting code units". "Last code unit" is the last literal code unit
that must be present in any match. This is not necessarily the last character.
These lines are omitted if no starting or ending code units are recorded.
</P> </P>
<br><b> <br><b>
Specifying a pattern in hex Specifying a pattern in hex
@ -543,8 +567,8 @@ pairs. For example:
This feature is provided as a way of creating patterns that contain binary zero This feature is provided as a way of creating patterns that contain binary zero
characters. By default, <b>pcre2test</b> passes patterns as zero-terminated characters. By default, <b>pcre2test</b> passes patterns as zero-terminated
strings to <b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED. strings to <b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED.
However, for patterns specified in hexadecimal, the length of the pattern is However, for patterns specified in hexadecimal, the actual length of the
passed. pattern is passed.
</P> </P>
<br><b> <br><b>
JIT compilation JIT compilation
@ -571,7 +595,7 @@ setting the size of the JIT stack.
</P> </P>
<P> <P>
If the <b>jitfast</b> modifier is specified, matching is done using the JIT If the <b>jitfast</b> modifier is specified, matching is done using the JIT
"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity "fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity
checks that are done by <b>pcre2_match()</b>, and of course does not work when checks that are done by <b>pcre2_match()</b>, and of course does not work when
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
assumed. assumed.
@ -604,11 +628,17 @@ character tables are mutually exclusive.
Showing pattern memory Showing pattern memory
</b><br> </b><br>
<P> <P>
The <b>/memory</b> modifier causes the size in bytes of the memory block used to The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
hold the compiled pattern to be output. This does not include the size of the the compiled pattern to be output. This does not include the size of the
<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is <b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is subsequently passed to the JIT compiler, the size of the JIT compiled code is
also output. also output. Here is an example:
<pre>
re&#62; /a(b)c/jit,memory
Memory allocation (code space): 21
Memory allocation (JIT code): 1910
</PRE>
</P> </P>
<br><b> <br><b>
Limiting nested parentheses Limiting nested parentheses
@ -650,8 +680,8 @@ enable stack availability to be checked during compilation (see the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>
documentation for details). If the number specified by the modifier is greater documentation for details). If the number specified by the modifier is greater
than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up a than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up a
callback from <b>pcre2_compile()</b> to a local function. The argument it is callback from <b>pcre2_compile()</b> to a local function. The argument it
passed is the current nesting parenthesis depth; if this is greater than the receives is the current nesting parenthesis depth; if this is greater than the
value given by the modifier, non-zero is returned, causing the compilation to value given by the modifier, non-zero is returned, causing the compilation to
be aborted. be aborted.
</P> </P>
@ -688,6 +718,7 @@ not affect the compilation process.
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
mark show mark values mark show mark values
replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
</pre> </pre>
These modifiers may not appear in a <b>#pattern</b> command. If you want them as These modifiers may not appear in a <b>#pattern</b> command. If you want them as
@ -759,11 +790,11 @@ pattern.
offset=&#60;n&#62; set starting offset offset=&#60;n&#62; set starting offset
ovector=&#60;n&#62; set size of output vector ovector=&#60;n&#62; set size of output vector
recursion_limit=&#60;n&#62; set a recursion limit recursion_limit=&#60;n&#62; set a recursion limit
replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
</pre> </pre>
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
</P> </P>
<br><b> <br><b>
Showing more text Showing more text
@ -841,6 +872,30 @@ Any value other than zero is used as a return from <b>pcre2test</b>'s callout
function. function.
</P> </P>
<br><b> <br><b>
Finding all matches in a string
</b><br>
<P>
Searching for all possible matches within a subject can be requested by the
<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between <b>global</b> and <b>altglobal</b> is that the former uses the
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
to start searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened substring. This makes a
difference to the matching process if the pattern begins with a lookbehind
assertion (including \b or \B).
</P>
<P>
If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
another, non-empty, match at the same point in the subject. If this match
fails, the start offset is advanced, and the normal match is retried. This
imitates the way Perl handles such cases when using the <b>/g</b> modifier or
the <b>split()</b> function. Normally, the start offset is advanced by one
character, but if the newline convention recognizes CRLF as a newline, and the
current character is CR followed by LF, an advance of two is used.
</P>
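The offset-advance step in the paragraph above can be sketched in a few lines. This is a simplified illustration, not the pcre2test source: advance_after_empty_match() is a hypothetical name, the function works in code units, and it ignores multi-unit UTF characters, which a real implementation would also have to step over as a whole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* After an empty match, and a failed anchored non-empty retry at the same
   point, decide where the next normal match attempt should start.
   'crlf_is_newline' says whether the newline convention recognizes CRLF. */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t offset, bool crlf_is_newline)
{
  assert(subject != NULL && offset < length);
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;   /* step over CRLF as a unit */
  return offset + 1;     /* normal case: advance by one */
}
```

For the four-character subject "a", CR, LF, "b", a failure at offset 1 moves the start to offset 3 when CRLF is a recognized newline, and to offset 2 otherwise.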
<br><b>
Testing substring extraction functions Testing substring extraction functions
</b><br> </b><br>
<P> <P>
@ -867,28 +922,46 @@ length (that is, the return from the extraction function) is given in
parentheses after each substring. parentheses after each substring.
</P> </P>
<br><b> <br><b>
Finding all matches in a string Testing the substitution function
</b><br> </b><br>
<P> <P>
Searching for all possible matches within a subject can be requested by the If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching called instead of one of the matching functions. Unlike subject strings,
function is called again to search the remainder of the subject. The difference <b>pcre2test</b> does not process replacement strings for escape sequences. In
between <b>global</b> and <b>altglobal</b> is that the former uses the UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> If so, it is correctly converted to a UTF string of the appropriate code unit
to start searching at a new point within the entire string (which is what Perl width. If it is not a valid UTF-8 string, the individual code units are copied
does), whereas the latter passes over a shortened substring. This makes a directly. This provides a means of passing an invalid UTF-8 string for testing
difference to the matching process if the pattern begins with a lookbehind purposes.
assertion (including \b or \B).
</P> </P>
<P> <P>
If an empty string is matched, the next match is done with the If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for <b>pcre2_substitute()</b>. After a successful substitution, the modified string
another, non-empty, match at the same point in the subject. If this match is output, preceded by the number of replacements. This may be zero if there
fails, the start offset is advanced, and the normal match is retried. This were no matches. Here is a simple example of a substitution test:
imitates the way Perl handles such cases when using the <b>/g</b> modifier or <pre>
the <b>split()</b> function. Normally, the start offset is advanced by one /abc/replace=xxx
character, but if the newline convention recognizes CRLF as a newline, and the =abc=abc=
current character is CR followed by LF, an advance of two is used. 1: =xxx=abc=
=abc=abc=\=global
2: =xxx=xxx=
</pre>
Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to test for
buffer overflow, if the replacement string starts with a number in square
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
output buffer, with the replacement string starting at the next character. Here
is an example that tests the edge case:
<pre>
/abc/
123abc123\=replace=[10]XYZ
1: 123XYZ123
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory
</pre>
A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from
<b>pcre2_substitute()</b>.
</P> </P>
<br><b> <br><b>
Setting the JIT stack size Setting the JIT stack size
@ -969,10 +1042,10 @@ available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
<b>regexec()</b> to be called with a NULL capture vector. When not testing the <b>regexec()</b> to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause POSIX API, a value of zero is used to cause
<b>pcre2_match_data_create_from_pattern</b> to be called, in order to create a <b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a
match block of exactly the right size for the pattern. (It is not possible to match block of exactly the right size for the pattern. (It is not possible to
create a match block with a zero-length ovector; there is always one pair of create a match block with a zero-length ovector; there is always at least one
offsets.) pair of offsets.)
</P> </P>
<br><b> <br><b>
Passing the subject as zero-terminated Passing the subject as zero-terminated
@ -985,7 +1058,7 @@ be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
this modifier has no effect, as there is no facility for passing a length.) this modifier has no effect, as there is no facility for passing a length.)
</P> </P>
<P> <P>
When testing <b>pcre2_substitute</b>, this modifier also has the effect of When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
</P> </P>
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br> <br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
@ -1233,7 +1306,7 @@ Cambridge CB2 3QH, England.
</P> </P>
<br><a name="SEC20" href="#TOC1">REVISION</a><br> <br><a name="SEC20" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 09 November 2014 Last updated: 14 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -132,7 +132,7 @@ remaining sections, except for the \fBpcre2demo\fP section (which is a program
listing), and the short pages for individual functions, are concatenated in listing), and the short pages for individual functions, are concatenated in
\fBpcre2.txt\fP, for ease of searching. The sections are as follows: \fBpcre2.txt\fP, for ease of searching. The sections are as follows:
.sp .sp
pcre2 this document FIXME CHECK THIS LIST pcre2 this document
pcre2-config show PCRE2 installation configuration information pcre2-config show PCRE2 installation configuration information
pcre2api details of PCRE2's native C API pcre2api details of PCRE2's native C API
pcre2build building PCRE2 pcre2build building PCRE2


@ -116,7 +116,7 @@ USER DOCUMENTATION
tions, are concatenated in pcre2.txt, for ease of searching. The sec- tions, are concatenated in pcre2.txt, for ease of searching. The sec-
tions are as follows: tions are as follows:
pcre2 this document FIXME CHECK THIS LIST pcre2 this document
pcre2-config show PCRE2 installation configuration information pcre2-config show PCRE2 installation configuration information
pcre2api details of PCRE2's native C API pcre2api details of PCRE2's native C API
pcre2build building PCRE2 pcre2build building PCRE2
@ -1928,12 +1928,10 @@ NEWLINE HANDLING WHEN MATCHING
When PCRE2 is built, a default newline convention is set; this is usu- When PCRE2 is built, a default newline convention is set; this is usu-
ally the standard convention for the operating system. The default can ally the standard convention for the operating system. The default can
be overridden in either a compile context or a match context. However, be overridden in a compile context. During matching, the newline
changing the newline convention at match time disables JIT matching. choice affects the behaviour of the dot, circumflex, and dollar
During matching, the newline choice affects the behaviour of the dot, metacharacters. It may also alter the way the match position is
circumflex, and dollar metacharacters. It may also alter the way the advanced after a match failure for an unanchored pattern.
match position is advanced after a match failure for an unanchored pat-
tern.
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
set, and a match attempt for an unanchored pattern fails when the cur- set, and a match attempt for an unanchored pattern fails when the cur-
@ -2320,9 +2318,10 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
given as PCRE2_ZERO_TERMINATED for a zero-terminated string. given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
In the replacement string, which is interpreted as a UTF string in UTF In the replacement string, which is interpreted as a UTF string in UTF
mode, a dollar character is an escape character that can specify the mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
insertion of characters from capturing groups in the pattern. The fol- option is set, a dollar character is an escape character that can spec-
lowing forms are recognized: ify the insertion of characters from capturing groups in the pattern.
The following forms are recognized:
$$ insert a dollar character $$ insert a dollar character
$<n> insert the contents of group <n> $<n> insert the contents of group <n>
@ -3508,11 +3507,12 @@ AVAILABILITY OF JIT SUPPORT
built if you want to use JIT. The support is limited to the following built if you want to use JIT. The support is limited to the following
hardware platforms: hardware platforms:
ARM v5, v7, and Thumb2 ARM 32-bit (v5, v7, and Thumb2)
ARM 64-bit
Intel x86 32-bit and 64-bit Intel x86 32-bit and 64-bit
MIPS 32-bit MIPS 32-bit and 64-bit
Power PC 32-bit and 64-bit Power PC 32-bit and 64-bit
SPARC 32-bit (experimental) SPARC 32-bit
If --enable-jit is set on an unsupported platform, compilation fails. If --enable-jit is set on an unsupported platform, compilation fails.
@ -3531,10 +3531,10 @@ SIMPLE USE OF JIT
is to call pcre2_jit_compile() after successfully compiling a pattern is to call pcre2_jit_compile() after successfully compiling a pattern
with pcre2_compile(). This function has two arguments: the first is the with pcre2_compile(). This function has two arguments: the first is the
compiled pattern pointer that was returned by pcre2_compile(), and the compiled pattern pointer that was returned by pcre2_compile(), and the
second is a set of option bits, which must include at least one of second is zero or more of the following option bits: PCRE2_JIT_COM-
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
If JIT support is not available, a call to pcre2_jit_comple() does If JIT support is not available, a call to pcre2_jit_compile() does
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
pattern is passed to the JIT compiler, which turns it into machine code pattern is passed to the JIT compiler, which turns it into machine code
that executes much faster than the normal interpretive code, but yields that executes much faster than the normal interpretive code, but yields
@ -3550,6 +3550,19 @@ SIMPLE USE OF JIT
pcre2_match() is called, the appropriate code is run if it is avail- pcre2_match() is called, the appropriate code is run if it is avail-
able. Otherwise, the pattern is matched using interpretive code. able. Otherwise, the pattern is matched using interpretive code.
You can call pcre2_jit_compile() multiple times for the same compiled
pattern. It does nothing if it has previously compiled code for any of
the option bits. For example, you can call it once with PCRE2_JIT_COM-
PLETE and (perhaps later, when you find you need partial matching)
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
diately returns zero. This is an alternative way of testing if JIT is
available.
At present, it is not possible to free JIT compiled code except when
the entire compiled pattern is freed by calling pcre2_free_code().
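
The repeated-call behaviour described above can be pictured with a small model. This is only a sketch of the rule as stated in the text, not PCRE2 code: `jit_state`, `model_jit_compile()` and the `JIT_*` macros are hypothetical stand-ins, and the bit values are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Stand-ins for the PCRE2 JIT option bits. The names and values here are
   assumptions made for this sketch; real code would use the macros from
   pcre2.h. */
#define JIT_COMPLETE      0x1u
#define JIT_PARTIAL_SOFT  0x2u
#define JIT_PARTIAL_HARD  0x4u

/* Hypothetical record of the modes that already have JIT code. */
typedef struct {
    uint32_t compiled_modes;
} jit_state;

/* Model of the documented rule: bits that already have compiled code are
   ignored, and only the remaining bits are compiled. Returns the bits that
   needed compiling on this call and records them in the state. */
uint32_t model_jit_compile(jit_state *state, uint32_t options)
{
    uint32_t todo = options & ~state->compiled_modes;
    state->compiled_modes |= todo;
    return todo;
}
```

In this model, a first call with `JIT_COMPLETE` compiles that mode, a later call with `JIT_COMPLETE | JIT_PARTIAL_HARD` compiles only the partial-matching mode, and a call with no bits set does no work.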
In some circumstances you may need to call additional functions. These In some circumstances you may need to call additional functions. These
are described in the section entitled "Controlling the JIT stack" are described in the section entitled "Controlling the JIT stack"
below. below.
@ -3618,7 +3631,7 @@ CONTROLLING THE JIT STACK
pcre2_jit_stack, or NULL if there is an error. The pcre2_jit_stack, or NULL if there is an error. The
pcre2_jit_stack_free() function is used to free a stack that is no pcre2_jit_stack_free() function is used to free a stack that is no
longer needed. (For the technically minded: the address space is allo- longer needed. (For the technically minded: the address space is allo-
cated by mmap or VirtualAlloc.) FIXME Is this right? cated by mmap or VirtualAlloc.)
JIT uses far less memory for recursion than the interpretive code, and JIT uses far less memory for recursion than the interpretive code, and
a maximum stack size of 512K to 1M should be more than enough for any a maximum stack size of 512K to 1M should be more than enough for any
@ -3637,7 +3650,8 @@ CONTROLLING THE JIT STACK
two options: two options:
(1) If callback is NULL and data is NULL, an internal 32K block (1) If callback is NULL and data is NULL, an internal 32K block
on the machine stack is used. on the machine stack is used. This is the default when a match
context is created.
(2) If callback is NULL and data is not NULL, data must be (2) If callback is NULL and data is not NULL, data must be
a pointer to a valid JIT stack, the result of calling a pointer to a valid JIT stack, the result of calling
@ -3840,7 +3854,7 @@ AUTHOR
REVISION REVISION
Last updated: 08 November 2014 Last updated: 12 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1063,7 +1063,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
Which characters are interpreted as newlines can be specified by a setting in Which characters are interpreted as newlines can be specified by a setting in
the compile context that is passed to \fBpcre2_compile()\fP or by a special the compile context that is passed to \fBpcre2_compile()\fP or by a special
sequence at the start of the pattern, as described in the section entitled sequence at the start of the pattern, as described in the section entitled
.\" HTML <a href="pcrepattern.html#newlines"> .\" HTML <a href="pcre2pattern.html#newlines">
.\" </a> .\" </a>
"Newline conventions" "Newline conventions"
.\" .\"
@ -1226,7 +1226,7 @@ This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
\ew, and some of the POSIX character classes. By default, only ASCII characters \ew, and some of the POSIX character classes. By default, only ASCII characters
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
classify characters. More details are given in the section on classify characters. More details are given in the section on
.\" HTML <a href="pcre2.html#genericchartypes"> .\" HTML <a href="pcre2pattern.html#genericchartypes">
.\" </a> .\" </a>
generic character types generic character types
.\" .\"
@ -1939,17 +1939,11 @@ documentation.
.sp .sp
When PCRE2 is built, a default newline convention is set; this is usually the When PCRE2 is built, a default newline convention is set; this is usually the
standard convention for the operating system. The default can be overridden in standard convention for the operating system. The default can be overridden in
either a a
.\" HTML <a href="#compilecontext"> .\" HTML <a href="#compilecontext">
.\" </a> .\" </a>
compile context compile context.
.\" .\"
or a
.\" HTML <a href="#matchcontext">
.\" </a>
match context.
.\"
However, changing the newline convention at match time disables JIT matching.
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. position is advanced after a match failure for an unanchored pattern.
@ -2322,7 +2316,7 @@ appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings. substrings.
. .
. .
.\" HTML <a name="extractbynname"></a> .\" HTML <a name="extractbyname"></a>
.SH "EXTRACTING CAPTURED SUBSTRINGS BY NAME" .SH "EXTRACTING CAPTURED SUBSTRINGS BY NAME"
.rs .rs
.sp .sp


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "03 November 2014" "PCRE2 10.00" .TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -63,8 +63,8 @@ page.
.P .P
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is set at compile time, (*UTF) is not allowed, and its appearance causes option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
an error. appearance in a pattern causes an error.
. .
. .
.SS "Unicode property support" .SS "Unicode property support"
@ -75,6 +75,21 @@ This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types, such as \ed and \ew to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup instead of recognizing only characters with codes less than 128 via a lookup
table. table.
.P
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
.
.
.SS "Locking out empty string matching"
.rs
.sp
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
. .
. .
.SS "Disabling auto-possessification" .SS "Disabling auto-possessification"
@ -102,6 +117,28 @@ reaching "no match" results. For more details, see the
documentation. documentation.
. .
. .
.SS "Setting match and recursion limits"
.rs
.sp
The caller of \fBpcre2_match()\fP can set a limit on the number of times the
internal \fBmatch()\fP function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
gives an error return. The limits can also be set by items at the start of the
pattern of the form
.sp
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
.sp
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
.
.
.\" HTML <a name="newlines"></a> .\" HTML <a name="newlines"></a>
.SS "Newline conventions" .SS "Newline conventions"
.rs .rs
@ -153,26 +190,14 @@ below. A change of \eR setting can be combined with a change of newline
convention. convention.
. .
. .
.SS "Setting match and recursion limits" .SS "Specifying what \eR matches"
.rs .rs
.sp .sp
The caller of \fBpcre2_match()\fP can set a limit on the number of times the It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
internal \fBmatch()\fP function is called and on the maximum depth of complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
recursive calls. These facilities are provided to catch runaway matches that at compile time. This effect can also be achieved by starting a pattern with
are provoked by patterns with huge matching trees (a typical example is a (*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
pattern with nested unlimited repeats) and to avoid running out of system stack corresponding to PCRE2_BSR_UNICODE.
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
gives an error return. The limits can also be set by items at the start of the
pattern of the form
.sp
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
.sp
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
. .
. .
.SH "EBCDIC CHARACTER CODES" .SH "EBCDIC CHARACTER CODES"
@ -2302,8 +2327,8 @@ complex:
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
.sp .sp
.P .P
There are four kinds of condition: references to subpatterns, references to There are five kinds of condition: references to subpatterns, references to
recursion, a pseudo-condition called DEFINE, and assertions. recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
. .
. .
.SS "Checking for a used subpattern by number" .SS "Checking for a used subpattern by number"
@ -2418,6 +2443,23 @@ pattern uses references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at each end. components of an IPv4 address, insisting on a word boundary at each end.
. .
. .
.SS "Checking the PCRE2 version"
.rs
.sp
Programs that link with a PCRE2 library can check the version by calling
\fBpcre2_config()\fP with appropriate arguments. Users of applications that do
not have access to the underlying code cannot do this. A special "condition"
called VERSION exists to allow such users to discover which version of PCRE2
they are dealing with by using this condition to match a string such as
"yesno". VERSION must be followed either by "=" or ">=" and a version number.
For example:
.sp
(?(VERSION>=10.4)yes|no)
.sp
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
"no" otherwise.
.
.
.SS "Assertion conditions" .SS "Assertion conditions"
.rs .rs
.sp .sp
@ -3219,7 +3261,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
.rs .rs
.sp .sp
\fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
\fBpcre2syntax\fP(3), \fBpcre2\fP(3), \fBpcre216(3)\fP, \fBpcre232(3)\fP. \fBpcre2syntax\fP(3), \fBpcre2\fP(3).
. .
. .
.SH AUTHOR .SH AUTHOR
@ -3236,6 +3278,6 @@ Cambridge CB2 3QH, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 November 2014 Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00" .TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -470,17 +470,18 @@ Each top-level branch of a look behind must be of a fixed length.
(?(condition)yes-pattern) (?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern) (?(condition)yes-pattern|no-pattern)
.sp .sp
(?(n)... absolute reference condition (?(n) absolute reference condition
(?(+n)... relative reference condition (?(+n) relative reference condition
(?(-n)... relative reference condition (?(-n) relative reference condition
(?(<name>)... named reference condition (Perl) (?(<name>) named reference condition (Perl)
(?('name')... named reference condition (Perl) (?('name') named reference condition (Perl)
(?(name)... named reference condition (PCRE2) (?(name) named reference condition (PCRE2)
(?(R)... overall recursion condition (?(R) overall recursion condition
(?(Rn)... specific group recursion condition (?(Rn) specific group recursion condition
(?(R&name)... specific recursion condition (?(R&name) specific recursion condition
(?(DEFINE)... define subpattern for reference (?(DEFINE) define subpattern for reference
(?(assert)... assertion condition (?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition
. .
. .
.SH "BACKTRACKING CONTROL" .SH "BACKTRACKING CONTROL"
@ -535,6 +536,6 @@ Cambridge CB2 3QH, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 20 October 2014 Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "12 November 2014" "PCRE 10.00" .TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -450,7 +450,6 @@ about the pattern:
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
.sp .sp
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
. .
. .
.SS "Newline and \eR handling" .SS "Newline and \eR handling"
@ -484,7 +483,31 @@ one-off tests.
.P .P
The \fBinfo\fP modifier requests information about the compiled pattern The \fBinfo\fP modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The (whether it is anchored, has a fixed first character, and so on). The
information is obtained from the \fBpcre2_pattern_info()\fP function. information is obtained from the \fBpcre2_pattern_info()\fP function. Here are
some typical examples:
.sp
re> /(?i)(^a|^b)/m,info
Capturing subpattern count = 1
Compile options: multiline
Overall options: caseless multiline
First code unit at start or follows newline
Subject length lower bound = 1
.sp
re> /(?i)abc/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: caseless
First code unit = 'a' (caseless)
Last code unit = 'c' (caseless)
Subject length lower bound = 3
.sp
"Compile options" are those specified to the compile function; "overall
options" have added options that are taken or deduced from the pattern. If both
sets of options are the same, just a single "options" line is output. "First
code unit" is where any match must start; if there is more than one they are
listed as "starting code units". "Last code unit" is the last literal code unit
that must be present in any match. This is not necessarily the last character.
These lines are omitted if no starting or ending code units are recorded.
. .
. .
.SS "Specifying a pattern in hex" .SS "Specifying a pattern in hex"
@ -499,8 +522,8 @@ pairs. For example:
This feature is provided as a way of creating patterns that contain binary zero This feature is provided as a way of creating patterns that contain binary zero
characters. By default, \fBpcre2test\fP passes patterns as zero-terminated characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
However, for patterns specified in hexadecimal, the length of the pattern is However, for patterns specified in hexadecimal, the actual length of the
passed. pattern is passed.
. .
. .
.SS "JIT compilation" .SS "JIT compilation"
@ -528,7 +551,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
setting the size of the JIT stack. setting the size of the JIT stack.
.P .P
If the \fBjitfast\fP modifier is specified, matching is done using the JIT If the \fBjitfast\fP modifier is specified, matching is done using the JIT
"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity "fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
checks that are done by \fBpcre2_match()\fP, and of course does not work when checks that are done by \fBpcre2_match()\fP, and of course does not work when
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
assumed. assumed.
@ -560,11 +583,16 @@ character tables are mutually exclusive.
.SS "Showing pattern memory" .SS "Showing pattern memory"
.rs .rs
.sp .sp
The \fB/memory\fP modifier causes the size in bytes of the memory block used to The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
hold the compiled pattern to be output. This does not include the size of the the compiled pattern to be output. This does not include the size of the
\fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is \fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is subsequently passed to the JIT compiler, the size of the JIT compiled code is
also output. also output. Here is an example:
.sp
re> /a(b)c/jit,memory
Memory allocation (code space): 21
Memory allocation (JIT code): 1910
.sp
. .
. .
.SS "Limiting nested parentheses" .SS "Limiting nested parentheses"
@ -608,8 +636,8 @@ enable stack availability to be checked during compilation (see the
.\" .\"
documentation for details). If the number specified by the modifier is greater documentation for details). If the number specified by the modifier is greater
than zero, \fBpcre2_set_compile_recursion_guard()\fP is called to set up than zero, \fBpcre2_set_compile_recursion_guard()\fP is called to set up a
callback from \fBpcre2_compile()\fP to a local function. The argument it is callback from \fBpcre2_compile()\fP to a local function. The argument it
passed is the current nesting parenthesis depth; if this is greater than the receives is the current nesting parenthesis depth; if this is greater than the
value given by the modifier, non-zero is returned, causing the compilation to value given by the modifier, non-zero is returned, causing the compilation to
be aborted. be aborted.
. .
@ -726,7 +754,6 @@ pattern.
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
.sp .sp
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
. .
. .
.SS "Showing more text" .SS "Showing more text"
@ -867,15 +894,22 @@ were no matches. Here is a simple example of a substitution test:
/abc/replace=xxx /abc/replace=xxx
=abc=abc= =abc=abc=
1: =xxx=abc= 1: =xxx=abc=
=abc=abc=\=global =abc=abc=\e=global
2: =xxx=xxx= 2: =xxx=xxx=
.sp .sp
Subject and replacement strings should be kept relatively short for Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to test for substitution tests, as fixed-size buffers are used. To make it easy to test for
buffer overflow, if the replacement string starts with a number in square buffer overflow, if the replacement string starts with a number in square
brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
output buffer, with the replacement string starting at the next character. output buffer, with the replacement string starting at the next character. Here
.P is an example that tests the edge case:
.sp
/abc/
123abc123\e=replace=[10]XYZ
1: 123XYZ123
123abc123\e=replace=[9]XYZ
Failed: error -47: no more memory
.sp
A replacement string is ignored with POSIX and DFA matching. Specifying partial A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from matching provokes an error return ("bad option value") from
\fBpcre2_substitute()\fP. \fBpcre2_substitute()\fP.
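
The arithmetic behind the [10] and [9] edge case can be written out as a short sketch. The helpers below are hypothetical and do not call PCRE2; they assume a single replacement and that the output buffer needs one extra code unit for a terminating zero, which is consistent with the example (a 9-character result fits in 10 units but not in 9).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: code units needed in the output
   buffer when one match of match_len units in a subject of subject_len
   units is replaced by repl_len units. The +1 is the assumed terminating
   zero. */
size_t required_output_units(size_t subject_len, size_t match_len,
                             size_t repl_len)
{
    assert(match_len <= subject_len);
    return subject_len - match_len + repl_len + 1;
}

/* Returns 1 if a buffer of buffer_units code units is large enough for the
   substituted string, 0 otherwise. */
int buffer_is_large_enough(size_t buffer_units, size_t subject_len,
                           size_t match_len, size_t repl_len)
{
    return buffer_units >=
           required_output_units(subject_len, match_len, repl_len);
}
```

For the subject "123abc123" with "abc" replaced by "XYZ", the lengths are 9, 3 and 3, so 10 units are required: a buffer of 10 is accepted and a buffer of 9 is rejected, matching the two outcomes shown in the example.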
@ -957,10 +991,10 @@ available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
\fBregexec()\fP to be called with a NULL capture vector. When not testing the \fBregexec()\fP to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause POSIX API, a value of zero is used to cause
\fBpcre2_match_data_create_from_pattern\fP to be called, in order to create a \fBpcre2_match_data_create_from_pattern()\fP to be called, in order to create a
match block of exactly the right size for the pattern. (It is not possible to match block of exactly the right size for the pattern. (It is not possible to
create a match block with a zero-length ovector; there is always one pair of create a match block with a zero-length ovector; there is always at least one
offsets.) pair of offsets.)
. .
. .
.SS "Passing the subject as zero-terminated" .SS "Passing the subject as zero-terminated"
@ -972,7 +1006,7 @@ string, the \fBzero_terminate\fP modifier is provided. It causes the length to
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface, be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
this modifier has no effect, as there is no facility for passing a length.) this modifier has no effect, as there is no facility for passing a length.)
.P .P
When testing \fBpcre2_substitute\fP, this modifier also has the effect of When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
. .
. .
@ -1237,6 +1271,6 @@ Cambridge CB2 3QH, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 12 November 2014 Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -150,11 +150,12 @@ COMMAND LINE OPTIONS
Behave as if each subject line contains the given modifiers. Behave as if each subject line contains the given modifiers.
-t Run each compile and match many times with a timer, and out- -t Run each compile and match many times with a timer, and out-
put the resulting times per compile or match. You can control put the resulting times per compile or match. When JIT is
the number of iterations that are used for timing by follow- used, separate times are given for the initial compile and
ing -t with a number (as a separate item on the command the JIT compile. You can control the number of iterations
line). For example, "-t 1000" iterates 1000 times. The that are used for timing by following -t with a number (as a
default is to iterate 500,000 times. separate item on the command line). For example, "-t 1000"
iterates 1000 times. The default is to iterate 500,000 times.
-tm This is like -t except that it times only the matching phase, -tm This is like -t except that it times only the matching phase,
not the compile phase. not the compile phase.
@ -437,7 +438,6 @@ PATTERN MODIFIERS
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
Newline and \R handling Newline and \R handling
@ -468,7 +468,32 @@ PATTERN MODIFIERS
The info modifier requests information about the compiled pattern The info modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The (whether it is anchored, has a fixed first character, and so on). The
information is obtained from the pcre2_pattern_info() function. information is obtained from the pcre2_pattern_info() function. Here
are some typical examples:
re> /(?i)(^a|^b)/m,info
Capturing subpattern count = 1
Compile options: multiline
Overall options: caseless multiline
First code unit at start or follows newline
Subject length lower bound = 1
re> /(?i)abc/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: caseless
First code unit = 'a' (caseless)
Last code unit = 'c' (caseless)
Subject length lower bound = 3
"Compile options" are those specified to the compile function; "overall
options" have added options that are taken or deduced from the pattern.
If both sets of options are the same, just a single "options" line is
output. "First code unit" is where any match must start; if there is
more than one they are listed as "starting code units". "Last code
unit" is the last literal code unit that must be present in any match.
This is not necessarily the last character. These lines are omitted if
no starting or ending code units are recorded.
Specifying a pattern in hex Specifying a pattern in hex
@ -482,7 +507,7 @@ PATTERN MODIFIERS
binary zero characters. By default, pcre2test passes patterns as zero- binary zero characters. By default, pcre2test passes patterns as zero-
terminated strings to pcre2_compile(), giving the length as terminated strings to pcre2_compile(), giving the length as
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal,
the length of the pattern is passed. the actual length of the pattern is passed.
JIT compilation JIT compilation
@ -505,7 +530,7 @@ PATTERN MODIFIERS
size of the JIT stack. size of the JIT stack.
If the jitfast modifier is specified, matching is done using the JIT If the jitfast modifier is specified, matching is done using the JIT
"fast path" interface (pcre2_jit_match()), which skips some of the san- "fast path" interface, pcre2_jit_match(), which skips some of the san-
ity checks that are done by pcre2_match(), and of course does not work ity checks that are done by pcre2_match(), and of course does not work
when JIT is not supported. If jitfast is specified without jit, jit=7 when JIT is not supported. If jitfast is specified without jit, jit=7
is assumed. is assumed.
@ -533,11 +558,16 @@ PATTERN MODIFIERS
Showing pattern memory Showing pattern memory
The /memory modifier causes the size in bytes of the memory block used The /memory modifier causes the size in bytes of the memory used to
to hold the compiled pattern to be output. This does not include the hold the compiled pattern to be output. This does not include the size
size of the pcre2_code block; it is just the actual compiled data. If of the pcre2_code block; it is just the actual compiled data. If the
the pattern is subsequently passed to the JIT compiler, the size of the pattern is subsequently passed to the JIT compiler, the size of the JIT
JIT compiled code is also output. compiled code is also output. Here is an example:
re> /a(b)c/jit,memory
Memory allocation (code space): 21
Memory allocation (JIT code): 1910
Limiting nested parentheses Limiting nested parentheses
@ -573,7 +603,7 @@ PATTERN MODIFIERS
mentation for details). If the number specified by the modifier is mentation for details). If the number specified by the modifier is
greater than zero, pcre2_set_compile_recursion_guard() is called to set greater than zero, pcre2_set_compile_recursion_guard() is called to set
up callback from pcre2_compile() to a local function. The argument it up callback from pcre2_compile() to a local function. The argument it
is passed is the current nesting parenthesis depth; if this is greater receives is the current nesting parenthesis depth; if this is greater
than the value given by the modifier, non-zero is returned, causing the than the value given by the modifier, non-zero is returned, causing the
compilation to be aborted. compilation to be aborted.
@ -606,6 +636,7 @@ PATTERN MODIFIERS
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
mark show mark values mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
These modifiers may not appear in a #pattern command. If you want them These modifiers may not appear in a #pattern command. If you want them
@ -671,11 +702,11 @@ SUBJECT MODIFIERS
offset=<n> set starting offset offset=<n> set starting offset
ovector=<n> set size of output vector ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit recursion_limit=<n> set a recursion limit
replace=<string> specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
FIXME: Give more examples.
Showing more text Showing more text
@ -745,6 +776,28 @@ SUBJECT MODIFIERS
ber. Any value other than zero is used as a return from pcre2test's ber. Any value other than zero is used as a return from pcre2test's
callout function. callout function.
Finding all matches in a string
Searching for all possible matches within a subject can be requested by
the global or /altglobal modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The
difference between global and altglobal is that the former uses the
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened substring. This makes
a difference to the matching process if the pattern begins with a look-
behind assertion (including \b or \B).
If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
for another, non-empty, match at the same point in the subject. If this
match fails, the start offset is advanced, and the normal match is
retried. This imitates the way Perl handles such cases when using the
/g modifier or the split() function. Normally, the start offset is
advanced by one character, but if the newline convention recognizes
CRLF as a newline, and the current character is CR followed by LF, an
advance of two is used.
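
The advance rule described above can be sketched in a few lines of C. This is only an illustration, not pcre2test's own code: the function name and the crlf_is_newline flag are hypothetical, and a "character" is assumed to be a single 8-bit code unit (non-UTF mode).

```c
#include <stddef.h>

/* Hypothetical helper: given the offset at which the anchored, non-empty
   retry failed, return the offset at which the normal match is retried.
   crlf_is_newline is an assumed flag meaning "the newline convention
   recognizes CRLF". */
size_t advance_after_empty(const char *subject, size_t length,
                           size_t offset, int crlf_is_newline)
{
  if (offset >= length) return length;          /* Nothing left to skip */
  if (crlf_is_newline && subject[offset] == '\r' &&
      offset + 1 < length && subject[offset + 1] == '\n')
    return offset + 2;                          /* Step over CR and LF together */
  return offset + 1;                            /* Normal case: one character */
}
```

Stepping over CRLF as a unit avoids restarting a match between the CR and the LF when the pair counts as one newline.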
Testing substring extraction functions Testing substring extraction functions
The copy and get modifiers can be used to test the pcre2_sub- The copy and get modifiers can be used to test the pcre2_sub-
@ -767,27 +820,45 @@ SUBJECT MODIFIERS
full list. The string length (that is, the return from the extraction full list. The string length (that is, the return from the extraction
function) is given in parentheses after each substring. function) is given in parentheses after each substring.
Finding all matches in a string Testing the substitution function
Searching for all possible matches within a subject can be requested by If the replace modifier is set, the pcre2_substitute() function is
the global or /altglobal modifier. After finding a match, the matching called instead of one of the matching functions. Unlike subject
function is called again to search the remainder of the subject. The strings, pcre2test does not process replacement strings for escape
difference between global and altglobal is that the former uses the sequences. In UTF mode, a replacement string is checked to see if it is
start_offset argument to pcre2_match() or pcre2_dfa_match() to start a valid UTF-8 string. If so, it is correctly converted to a UTF string
searching at a new point within the entire string (which is what Perl of the appropriate code unit width. If it is not a valid UTF-8 string,
does), whereas the latter passes over a shortened substring. This makes the individual code units are copied directly. This provides a means of
a difference to the matching process if the pattern begins with a look- passing an invalid UTF-8 string for testing purposes.
behind assertion (including \b or \B).
If an empty string is matched, the next match is done with the If the global modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search pcre2_substitute(). After a successful substitution, the modified
for another, non-empty, match at the same point in the subject. If this string is output, preceded by the number of replacements. This may be
match fails, the start offset is advanced, and the normal match is zero if there were no matches. Here is a simple example of a substitu-
retried. This imitates the way Perl handles such cases when using the tion test:
/g modifier or the split() function. Normally, the start offset is
advanced by one character, but if the newline convention recognizes /abc/replace=xxx
CRLF as a newline, and the current character is CR followed by LF, an =abc=abc=
advance of two is used. 1: =xxx=abc=
=abc=abc=\=global
2: =xxx=xxx=
Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to
test for buffer overflow, if the replacement string starts with a num-
ber in square brackets, that number is passed to pcre2_substitute() as
the size of the output buffer, with the replacement string starting at
the next character. Here is an example that tests the edge case:
/abc/
123abc123\=replace=[10]XYZ
1: 123XYZ123
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory
A replacement string is ignored with POSIX and DFA matching. Specifying
partial matching provokes an error return ("bad option value") from
pcre2_substitute().
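
As a rough illustration of the "[number]" prefix described above, the following sketch splits a replacement string into an output buffer size and the remaining text. The function name is hypothetical, and treating a malformed prefix as ordinary replacement text is an assumption; pcre2test's actual parsing may differ.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: if replacement starts with "[digits]", store the
   number in *size and return a pointer to the text after the ']'.
   Otherwise leave *size untouched and return replacement unchanged
   (an assumption about how a malformed prefix is treated). */
const char *split_size_prefix(const char *replacement, size_t *size)
{
  const char *p = replacement;
  size_t n = 0;

  if (*p != '[' || !isdigit((unsigned char)p[1])) return replacement;
  for (p++; isdigit((unsigned char)*p); p++)
    n = n * 10 + (size_t)(*p - '0');
  if (*p != ']') return replacement;            /* Not a well-formed prefix */

  *size = n;
  return p + 1;                                 /* Replacement starts here */
}
```

In the edge-case example, the result "123XYZ123" is 9 code units long, so a size of 10 leaves room for the terminating zero and a size of 9 does not.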
Setting the JIT stack size Setting the JIT stack size
@ -853,10 +924,10 @@ SUBJECT MODIFIERS
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
regexec() to be called with a NULL capture vector. When not testing the regexec() to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause pcre2_match_data_cre- POSIX API, a value of zero is used to cause pcre2_match_data_cre-
ate_from_pattern to be called, in order to create a match block of ate_from_pattern() to be called, in order to create a match block of
exactly the right size for the pattern. (It is not possible to create a exactly the right size for the pattern. (It is not possible to create a
match block with a zero-length ovector; there is always one pair of match block with a zero-length ovector; there is always at least one
offsets.) pair of offsets.)
Passing the subject as zero-terminated Passing the subject as zero-terminated
@ -867,7 +938,7 @@ SUBJECT MODIFIERS
via the POSIX interface, this modifier has no effect, as there is no via the POSIX interface, this modifier has no effect, as there is no
facility for passing a length.) facility for passing a length.)
When testing pcre2_substitute, this modifier also has the effect of When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
@ -1112,5 +1183,5 @@ AUTHOR
REVISION REVISION
Last updated: 09 November 2014 Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.

View File

@ -84,7 +84,7 @@ uint32_t ovector_count;
uint32_t goptions = 0; uint32_t goptions = 0;
BOOL match_data_created = FALSE; BOOL match_data_created = FALSE;
BOOL global = FALSE; BOOL global = FALSE;
PCRE2_SIZE buff_offset, lengthleft, endlength; PCRE2_SIZE buff_offset, lengthleft, fraglength;
PCRE2_SIZE *ovector; PCRE2_SIZE *ovector;
/* Partial matching is not valid. */ /* Partial matching is not valid. */
@ -154,14 +154,17 @@ do
/* Any error other than no match returns the error code. No match when not /* Any error other than no match returns the error code. No match when not
doing the special after-empty-match global rematch, or when at the end of the doing the special after-empty-match global rematch, or when at the end of the
subject, breaks the global loop. Otherwise, advance the starting point and subject, breaks the global loop. Otherwise, advance the starting point by one
try again. */ character, copying it to the output, and try again. */
if (rc < 0) if (rc < 0)
{ {
PCRE2_SIZE save_start;
if (rc != PCRE2_ERROR_NOMATCH) goto EXIT; if (rc != PCRE2_ERROR_NOMATCH) goto EXIT;
if (goptions == 0 || start_offset >= length) break; if (goptions == 0 || start_offset >= length) break;
start_offset++;
save_start = start_offset++;
if ((code->overall_options & PCRE2_UTF) != 0) if ((code->overall_options & PCRE2_UTF) != 0)
{ {
#if PCRE2_CODE_UNIT_WIDTH == 8 #if PCRE2_CODE_UNIT_WIDTH == 8
@ -173,6 +176,14 @@ do
start_offset++; start_offset++;
#endif #endif
} }
fraglength = start_offset - save_start;
if (lengthleft < fraglength) goto NOROOM;
memcpy(buffer + buff_offset, subject + save_start,
fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
buff_offset += fraglength;
lengthleft -= fraglength;
goptions = 0; goptions = 0;
continue; continue;
} }
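
The hunk above copies the skipped character to the output before retrying. A self-contained sketch of that step for 8-bit UTF-8 subjects might look like the following. The function name and signature are hypothetical, and the continuation-byte test is an assumption, since the corresponding lines are not shown in this hunk.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the single UTF-8 character that starts at
   subject[*start] to out + *out_offset, advancing *start past it.
   Requires *start < length. Returns 0 on success, or -1 if fewer than
   the needed code units remain in the output (*room). */
int copy_one_char(const unsigned char *subject, size_t length, size_t *start,
                  unsigned char *out, size_t *out_offset, size_t *room)
{
  size_t save_start = (*start)++;
  size_t fraglength;

  /* Assumed UTF-8 rule: bytes of the form 10xxxxxx continue a character. */
  while (*start < length && (subject[*start] & 0xc0) == 0x80) (*start)++;

  fraglength = *start - save_start;
  if (*room < fraglength)
    {
    *start = save_start;                        /* Leave the state unchanged */
    return -1;
    }
  memcpy(out + *out_offset, subject + save_start, fraglength);
  *out_offset += fraglength;
  *room -= fraglength;
  return 0;
}
```

Unlike the code in the diff, which jumps to NOROOM, this sketch reports lack of space through its return value so that it can stand alone.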
@ -181,12 +192,12 @@ do
subs++; subs++;
if (rc == 0) rc = ovector_count; if (rc == 0) rc = ovector_count;
endlength = ovector[0] - start_offset; fraglength = ovector[0] - start_offset;
if (endlength >= lengthleft) goto NOROOM; if (fraglength >= lengthleft) goto NOROOM;
memcpy(buffer + buff_offset, subject + start_offset, memcpy(buffer + buff_offset, subject + start_offset,
endlength*(PCRE2_CODE_UNIT_WIDTH/8)); fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
buff_offset += endlength; buff_offset += fraglength;
lengthleft -= endlength; lengthleft -= fraglength;
for (i = 0; i < rlength; i++) for (i = 0; i < rlength; i++)
{ {
@ -279,11 +290,11 @@ do
/* Copy the rest of the subject and return the number of substitutions. */ /* Copy the rest of the subject and return the number of substitutions. */
rc = subs; rc = subs;
endlength = length - start_offset; fraglength = length - start_offset;
if (endlength + 1 > lengthleft) goto NOROOM; if (fraglength + 1 > lengthleft) goto NOROOM;
memcpy(buffer + buff_offset, subject + start_offset, memcpy(buffer + buff_offset, subject + start_offset,
endlength*(PCRE2_CODE_UNIT_WIDTH/8)); fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
buff_offset += endlength; buff_offset += fraglength;
buffer[buff_offset] = 0; buffer[buff_offset] = 0;
*blength = buff_offset; *blength = buff_offset;

View File

@ -164,6 +164,7 @@ void vms_setsymbol( char *, char *, int );
#define DFA_WS_DIMENSION 1000 /* Size of DFA workspace */ #define DFA_WS_DIMENSION 1000 /* Size of DFA workspace */
#define DEFAULT_OVECCOUNT 15 /* Default ovector count */ #define DEFAULT_OVECCOUNT 15 /* Default ovector count */
#define JUNK_OFFSET 0xdeadbeef /* For initializing ovector */ #define JUNK_OFFSET 0xdeadbeef /* For initializing ovector */
#define LOCALESIZE 32 /* Size of locale name */
#define LOOPREPEAT 500000 /* Default loop count for timing */ #define LOOPREPEAT 500000 /* Default loop count for timing */
#define REPLACE_MODSIZE 96 /* Field for reading 8-bit replacement */ #define REPLACE_MODSIZE 96 /* Field for reading 8-bit replacement */
#define VERSION_SIZE 64 /* Size of buffer for the version strings */ #define VERSION_SIZE 64 /* Size of buffer for the version strings */
@ -401,7 +402,7 @@ typedef struct patctl { /* Structure for pattern modifiers. */
uint32_t jit; uint32_t jit;
uint32_t stackguard_test; uint32_t stackguard_test;
uint32_t tables_id; uint32_t tables_id;
uint8_t locale[32]; uint8_t locale[LOCALESIZE];
} patctl; } patctl;
#define MAXCPYGET 10 #define MAXCPYGET 10
@ -486,7 +487,7 @@ static modstruct modlist[] = {
{ "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) }, { "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) },
{ "jitstack", MOD_DAT, MOD_INT, 0, DO(jitstack) }, { "jitstack", MOD_DAT, MOD_INT, 0, DO(jitstack) },
{ "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) }, { "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) },
{ "locale", MOD_PAT, MOD_STR, 0, PO(locale) }, { "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) },
{ "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) }, { "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) },
{ "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) }, { "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) },
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) }, { "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
@ -512,7 +513,7 @@ static modstruct modlist[] = {
{ "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) }, { "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) },
{ "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, { "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) },
{ "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) }, { "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) },
{ "replace", MOD_PND, MOD_STR, 0, PO(replacement) }, { "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) },
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) }, { "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) }, { "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) }, { "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
@ -3141,6 +3142,12 @@ for (;;)
break; break;
case MOD_STR: case MOD_STR:
if (len + 1 > m->value)
{
fprintf(outfile, "** Overlong value for '%s' (max %d code units)\n",
m->name, m->value - 1);
return FALSE;
}
memcpy(field, pp, len); memcpy(field, pp, len);
((uint8_t *)field)[len] = 0; ((uint8_t *)field)[len] = 0;
pp = ep; pp = ep;
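
The new length check for string modifiers can be shown in isolation. This is a minimal sketch with a hypothetical function name. It assumes, as in the hunk above, that the stored field size (LOCALESIZE or REPLACE_MODSIZE) includes room for the terminating zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store len bytes from src in field as a
   zero-terminated string. fieldsize is the full size of field, so at most
   fieldsize - 1 bytes of text fit. Returns 1 on success, or 0 if the value
   is overlong, in which case field is left untouched. */
int store_string_modifier(char *field, size_t fieldsize,
                          const char *src, size_t len)
{
  if (len + 1 > fieldsize) return 0;            /* No room for text plus zero */
  memcpy(field, src, len);
  field[len] = 0;
  return 1;
}
```

Checking len + 1 rather than len is what keeps the terminating zero inside the field.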

testdata/testinput2 vendored
View File

@ -4073,6 +4073,9 @@ a random value. /Ix
123abc456abc789 123abc456abc789
123abc456abc789\=g 123abc456abc789\=g
/(?<=abc)(|def)/g,replace=<$0>
123abcxyzabcdef789abcpqr
# End of substitute tests # End of substitute tests
# End of testinput2 # End of testinput2

testdata/testinput5 vendored
View File

@ -1633,4 +1633,7 @@
/ábc/utf,replace=XሴZ /ábc/utf,replace=XሴZ
123ábc123 123ábc123
/(?<=abc)(|def)/g,utf,replace=<$0>
123abcáyzabcdef789abcሴqr
# End of testinput5 # End of testinput5

View File

@ -13699,6 +13699,10 @@ Failed: error -34: bad option value
123abc456abc789\=g 123abc456abc789\=g
2: 123xyz456xyz789 2: 123xyz456xyz789
/(?<=abc)(|def)/g,replace=<$0>
123abcxyzabcdef789abcpqr
4: 123abc<>xyzabc<><def>789abc<>pqr
# End of substitute tests # End of substitute tests
# End of testinput2 # End of testinput2

View File

@ -4002,4 +4002,8 @@ Subject length lower bound = 1
123ábc123 123ábc123
1: 123X\x{1234}Z123 1: 123X\x{1234}Z123
/(?<=abc)(|def)/g,utf,replace=<$0>
123abcáyzabcdef789abcሴqr
4: 123abc<>\x{e1}yzabc<><def>789abc<>\x{1234}qr
# End of testinput5 # End of testinput5