Further substitution tests (code and data), and more documentation.
parent adc7be2d3a
commit 07f8372202
|
@ -51,4 +51,6 @@ the current group as "unset". Thus, the ovector for those groups contained
|
|||
whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when
|
||||
matched against "abcd".
|
||||
|
||||
8. The pcre2_substitute() function has been implemented.
|
||||
|
||||
****
|
||||
|
|
|
@ -135,7 +135,7 @@ remaining sections, except for the <b>pcre2demo</b> section (which is a program
|
|||
listing), and the short pages for individual functions, are concatenated in
|
||||
<b>pcre2.txt</b>, for ease of searching. The sections are as follows:
|
||||
<pre>
|
||||
pcre2 this document FIXME CHECK THIS LIST
|
||||
pcre2 this document
|
||||
pcre2-config show PCRE2 installation configuration information
|
||||
pcre2api details of PCRE2's native C API
|
||||
pcre2build building PCRE2
|
||||
|
|
|
@ -1089,7 +1089,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
|
|||
Which characters are interpreted as newlines can be specified by a setting in
|
||||
the compile context that is passed to <b>pcre2_compile()</b> or by a special
|
||||
sequence at the start of the pattern, as described in the section entitled
|
||||
<a href="pcrepattern.html#newlines">"Newline conventions"</a>
|
||||
<a href="pcre2pattern.html#newlines">"Newline conventions"</a>
|
||||
in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
|
||||
built.
|
||||
<pre>
|
||||
|
@ -1243,7 +1243,7 @@ This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
|
|||
\w, and some of the POSIX character classes. By default, only ASCII characters
|
||||
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
|
||||
classify characters. More details are given in the section on
|
||||
<a href="pcre2.html#genericchartypes">generic character types</a>
|
||||
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
||||
|
@ -1924,11 +1924,8 @@ documentation.
|
|||
<P>
|
||||
When PCRE2 is built, a default newline convention is set; this is usually the
|
||||
standard convention for the operating system. The default can be overridden in
|
||||
either a
|
||||
<a href="#compilecontext">compile context</a>
|
||||
or a
|
||||
<a href="#matchcontext">match context.</a>
|
||||
However, changing the newline convention at match time disables JIT matching.
|
||||
a
|
||||
<a href="#compilecontext">compile context.</a>
|
||||
During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||
position is advanced after a match failure for an unanchored pattern.
|
||||
|
@ -2290,7 +2287,7 @@ subpattern <i>n</i> has not been used at all, it returns an empty string. This
|
|||
can be distinguished from a genuine zero-length substring by inspecting the
|
||||
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
|
||||
substrings.
|
||||
<a name="extractbynname"></a></P>
|
||||
<a name="extractbyname"></a></P>
|
||||
<br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
|
||||
|
@ -2358,7 +2355,8 @@ string in <i>outputbuffer</i>, replacing the part that was matched with the
|
|||
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
|
||||
</P>
|
||||
<P>
|
||||
In the replacement string, which is interpreted as a UTF string in UTF mode, a
|
||||
In the replacement string, which is interpreted as a UTF string in UTF mode,
|
||||
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
|
||||
dollar character is an escape character that can specify the insertion of
|
||||
characters from capturing groups in the pattern. The following forms are
|
||||
recognized:
|
||||
|
|
|
@ -51,11 +51,12 @@ JIT support is an optional feature of PCRE2. The "configure" option
|
|||
you want to use JIT. The support is limited to the following hardware
|
||||
platforms:
|
||||
<pre>
|
||||
ARM v5, v7, and Thumb2
|
||||
ARM 32-bit (v5, v7, and Thumb2)
|
||||
ARM 64-bit
|
||||
Intel x86 32-bit and 64-bit
|
||||
MIPS 32-bit
|
||||
MIPS 32-bit and 64-bit
|
||||
Power PC 32-bit and 64-bit
|
||||
SPARC 32-bit (experimental)
|
||||
SPARC 32-bit
|
||||
</pre>
|
||||
If --enable-jit is set on an unsupported platform, compilation fails.
|
||||
</P>
|
||||
|
@ -73,11 +74,11 @@ To make use of the JIT support in the simplest way, all you have to do is to
|
|||
call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with
|
||||
<b>pcre2_compile()</b>. This function has two arguments: the first is the
|
||||
compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the
|
||||
second is a set of option bits, which must include at least one of
|
||||
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
|
||||
second is zero or more of the following option bits: PCRE2_JIT_COMPLETE,
|
||||
PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
|
||||
</P>
|
||||
<P>
|
||||
If JIT support is not available, a call to <b>pcre2_jit_comple()</b> does
|
||||
If JIT support is not available, a call to <b>pcre2_jit_compile()</b> does
|
||||
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern
|
||||
is passed to the JIT compiler, which turns it into machine code that executes
|
||||
much faster than the normal interpretive code, but yields exactly the same
|
||||
|
@ -95,6 +96,20 @@ appropriate code is run if it is available. Otherwise, the pattern is matched
|
|||
using interpretive code.
|
||||
</P>
|
||||
<P>
|
||||
You can call <b>pcre2_jit_compile()</b> multiple times for the same compiled
|
||||
pattern. It does nothing if it has previously compiled code for any of the
|
||||
option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
|
||||
(perhaps later, when you find you need partial matching) again with
|
||||
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
||||
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
||||
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
|
||||
returns zero. This is an alternative way of testing if JIT is available.
|
||||
</P>
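The effect of repeated calls described above can be pictured with a small bookkeeping model. This is only a sketch: the MODE_* constants and the modes_to_compile() function are hypothetical names standing in for the PCRE2_JIT_* option bits, they are not part of the PCRE2 API, and the real bookkeeping is internal to the library.

```c
#include <stdint.h>

/* Hypothetical stand-ins for PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_SOFT and
   PCRE2_JIT_PARTIAL_HARD; the values here are illustrative only. */
#define MODE_COMPLETE      0x1u
#define MODE_PARTIAL_SOFT  0x2u
#define MODE_PARTIAL_HARD  0x4u

/* Given the modes that already have compiled code and the modes requested by
   a new call, return the modes that still need compiling and record them as
   done. A request naming only already-compiled modes (or no modes) yields 0. */
uint32_t modes_to_compile(uint32_t *already_compiled, uint32_t requested)
{
    uint32_t todo = requested & ~*already_compiled;
    *already_compiled |= todo;
    return todo;
}
```

In this model, a first call with MODE_COMPLETE and a later call with MODE_COMPLETE | MODE_PARTIAL_HARD leaves only MODE_PARTIAL_HARD to be compiled the second time, which mirrors the example in the text.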
|
||||
<P>
|
||||
At present, it is not possible to free JIT compiled code except when the entire
|
||||
compiled pattern is freed by calling <b>pcre2_free_code()</b>.
|
||||
</P>
|
||||
<P>
|
||||
In some circumstances you may need to call additional functions. These are
|
||||
described in the section entitled
|
||||
<a href="#stackcontrol">"Controlling the JIT stack"</a>
|
||||
|
@ -167,7 +182,7 @@ memory allocation), a starting size and a maximum size, and it returns a
|
|||
pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
|
||||
is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
|
||||
that is no longer needed. (For the technically minded: the address space is
|
||||
allocated by mmap or VirtualAlloc.) FIXME Is this right?
|
||||
allocated by mmap or VirtualAlloc.)
|
||||
</P>
|
||||
<P>
|
||||
JIT uses far less memory for recursion than the interpretive code,
|
||||
|
@ -187,7 +202,8 @@ passed to a matching function, its information determines which JIT stack is
|
|||
used. There are three cases for the values of the other two options:
|
||||
<pre>
|
||||
(1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block
|
||||
on the machine stack is used.
|
||||
on the machine stack is used. This is the default when a match
|
||||
context is created.
|
||||
|
||||
(2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
|
||||
a pointer to a valid JIT stack, the result of calling
|
||||
|
@ -402,7 +418,7 @@ Cambridge CB2 3QH, England.
|
|||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 08 November 2014
|
||||
Last updated: 12 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -100,8 +100,8 @@ page.
|
|||
<P>
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
|
||||
option is set at compile time, (*UTF) is not allowed, and its appearance causes
|
||||
an error.
|
||||
option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
|
||||
appearance in a pattern causes an error.
|
||||
</P>
|
||||
<br><b>
|
||||
Unicode property support
|
||||
|
@ -113,6 +113,22 @@ such as \d and \w to use Unicode properties to determine character types,
|
|||
instead of recognizing only characters with codes less than 128 via a lookup
|
||||
table.
|
||||
</P>
|
||||
<P>
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
|
||||
<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
|
||||
causes an error.
|
||||
</P>
|
||||
<br><b>
|
||||
Locking out empty string matching
|
||||
</b><br>
|
||||
<P>
|
||||
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
|
||||
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
|
||||
matching function is subsequently called to match the pattern. These options
|
||||
lock out the matching of empty strings, either entirely, or only at the start
|
||||
of the subject.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling auto-possessification
|
||||
</b><br>
|
||||
|
@ -133,6 +149,28 @@ PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
|
|||
reaching "no match" results. For more details, see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match and recursion limits
|
||||
</b><br>
|
||||
<P>
|
||||
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
|
||||
internal <b>match()</b> function is called and on the maximum depth of
|
||||
recursive calls. These facilities are provided to catch runaway matches that
|
||||
are provoked by patterns with huge matching trees (a typical example is a
|
||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
||||
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
|
||||
gives an error return. The limits can also be set by items at the start of the
|
||||
pattern of the form
|
||||
<pre>
|
||||
(*LIMIT_MATCH=d)
|
||||
(*LIMIT_RECURSION=d)
|
||||
</pre>
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
<a name="newlines"></a></P>
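The rule that a pattern can lower a limit but not raise it amounts to taking a minimum. The following is a minimal sketch of that rule, not PCRE2 code: effective_limit() is a hypothetical helper, and the numbers in the usage note are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the limit that is actually in force: start from the caller's value
   (set or defaulted) and apply each (*LIMIT_...=d) value found at the start
   of the pattern. A pattern value can lower the limit but never raise it,
   and with several settings the lowest one wins. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_values, size_t count)
{
    uint32_t limit = caller_limit;
    size_t i;
    for (i = 0; i < count; i++) {
        if (pattern_values[i] < limit)
            limit = pattern_values[i];
    }
    return limit;
}
```

With a caller limit of 10000, pattern settings of 5000 and 2000 give 2000, while a single pattern setting of 50000 has no effect and the limit stays at 10000.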
|
||||
<br><b>
|
||||
Newline conventions
|
||||
|
@ -179,26 +217,14 @@ below. A change of \R setting can be combined with a change of newline
|
|||
convention.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match and recursion limits
|
||||
Specifying what \R matches
|
||||
</b><br>
|
||||
<P>
|
||||
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
|
||||
internal <b>match()</b> function is called and on the maximum depth of
|
||||
recursive calls. These facilities are provided to catch runaway matches that
|
||||
are provoked by patterns with huge matching trees (a typical example is a
|
||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
||||
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
|
||||
gives an error return. The limits can also be set by items at the start of the
|
||||
pattern of the form
|
||||
<pre>
|
||||
(*LIMIT_MATCH=d)
|
||||
(*LIMIT_RECURSION=d)
|
||||
</pre>
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
|
||||
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
|
||||
at compile time. This effect can also be achieved by starting a pattern with
|
||||
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
|
||||
corresponding to PCRE2_BSR_UNICODE.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
||||
<P>
|
||||
|
@ -2280,8 +2306,8 @@ complex:
|
|||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
There are four kinds of condition: references to subpatterns, references to
|
||||
recursion, a pseudo-condition called DEFINE, and assertions.
|
||||
There are five kinds of condition: references to subpatterns, references to
|
||||
recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
|
||||
</P>
|
||||
<br><b>
|
||||
Checking for a used subpattern by number
|
||||
|
@ -2389,6 +2415,23 @@ pattern uses references to the named group to match the four dot-separated
|
|||
components of an IPv4 address, insisting on a word boundary at each end.
|
||||
</P>
|
||||
<br><b>
|
||||
Checking the PCRE2 version
|
||||
</b><br>
|
||||
<P>
|
||||
Programs that link with a PCRE2 library can check the version by calling
|
||||
<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
|
||||
not have access to the underlying code cannot do this. A special "condition"
|
||||
called VERSION exists to allow such users to discover which version of PCRE2
|
||||
they are dealing with by using this condition to match a string such as
|
||||
"yesno". VERSION must be followed either by "=" or ">=" and a version number.
|
||||
For example:
|
||||
<pre>
|
||||
(?(VERSION>=10.4)yes|no)
|
||||
</pre>
|
||||
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
|
||||
"no" otherwise.
|
||||
</P>
|
||||
<br><b>
|
||||
Assertion conditions
|
||||
</b><br>
|
||||
<P>
|
||||
|
@ -3180,7 +3223,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
|
|||
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
||||
<b>pcre2syntax</b>(3), <b>pcre2</b>(3), <b>pcre216(3)</b>, <b>pcre232(3)</b>.
|
||||
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
|
@ -3193,7 +3236,7 @@ Cambridge CB2 3QH, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 14 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -493,17 +493,18 @@ Each top-level branch of a look behind must be of a fixed length.
|
|||
(?(condition)yes-pattern)
|
||||
(?(condition)yes-pattern|no-pattern)
|
||||
|
||||
(?(n)... absolute reference condition
|
||||
(?(+n)... relative reference condition
|
||||
(?(-n)... relative reference condition
|
||||
(?(<name>)... named reference condition (Perl)
|
||||
(?('name')... named reference condition (Perl)
|
||||
(?(name)... named reference condition (PCRE2)
|
||||
(?(R)... overall recursion condition
|
||||
(?(Rn)... specific group recursion condition
|
||||
(?(R&name)... specific recursion condition
|
||||
(?(DEFINE)... define subpattern for reference
|
||||
(?(assert)... assertion condition
|
||||
(?(n) absolute reference condition
|
||||
(?(+n) relative reference condition
|
||||
(?(-n) relative reference condition
|
||||
(?(<name>) named reference condition (Perl)
|
||||
(?('name') named reference condition (Perl)
|
||||
(?(name) named reference condition (PCRE2)
|
||||
(?(R) overall recursion condition
|
||||
(?(Rn) specific group recursion condition
|
||||
(?(R&name) specific recursion condition
|
||||
(?(DEFINE) define subpattern for reference
|
||||
(?(VERSION[>]=n.m) test PCRE2 version
|
||||
(?(assert) assertion condition
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
|
@ -552,7 +553,7 @@ Cambridge CB2 3QH, England.
|
|||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
Last updated: 14 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -201,10 +201,11 @@ Behave as if each subject line contains the given modifiers.
|
|||
<P>
|
||||
<b>-t</b>
|
||||
Run each compile and match many times with a timer, and output the resulting
|
||||
times per compile or match. You can control the number of iterations that are
|
||||
used for timing by following <b>-t</b> with a number (as a separate item on the
|
||||
command line). For example, "-t 1000" iterates 1000 times. The default is to
|
||||
iterate 500,000 times.
|
||||
times per compile or match. When JIT is used, separate times are given for the
|
||||
initial compile and the JIT compile. You can control the number of iterations
|
||||
that are used for timing by following <b>-t</b> with a number (as a separate
|
||||
item on the command line). For example, "-t 1000" iterates 1000 times. The
|
||||
default is to iterate 500,000 times.
|
||||
</P>
|
||||
<P>
|
||||
<b>-tm</b>
|
||||
|
@ -490,7 +491,6 @@ about the pattern:
|
|||
tables=[0|1|2] select internal tables
|
||||
</pre>
|
||||
The effects of these modifiers are described in the following sections.
|
||||
FIXME: Give more examples.
|
||||
</P>
|
||||
<br><b>
|
||||
Newline and \R handling
|
||||
|
@ -528,7 +528,31 @@ one-off tests.
|
|||
<P>
|
||||
The <b>info</b> modifier requests information about the compiled pattern
|
||||
(whether it is anchored, has a fixed first character, and so on). The
|
||||
information is obtained from the <b>pcre2_pattern_info()</b> function.
|
||||
information is obtained from the <b>pcre2_pattern_info()</b> function. Here are
|
||||
some typical examples:
|
||||
<pre>
|
||||
re> /(?i)(^a|^b)/m,info
|
||||
Capturing subpattern count = 1
|
||||
Compile options: multiline
|
||||
Overall options: caseless multiline
|
||||
First code unit at start or follows newline
|
||||
Subject length lower bound = 1
|
||||
|
||||
re> /(?i)abc/info
|
||||
Capturing subpattern count = 0
|
||||
Compile options: <none>
|
||||
Overall options: caseless
|
||||
First code unit = 'a' (caseless)
|
||||
Last code unit = 'c' (caseless)
|
||||
Subject length lower bound = 3
|
||||
</pre>
|
||||
"Compile options" are those specified to the compile function; "overall
|
||||
options" have added options that are taken or deduced from the pattern. If both
|
||||
sets of options are the same, just a single "options" line is output. "First
|
||||
code unit" is where any match must start; if there is more than one they are
|
||||
listed as "starting code units". "Last code unit" is the last literal code unit
|
||||
that must be present in any match. This is not necessarily the last character.
|
||||
These lines are omitted if no starting or ending code units are recorded.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying a pattern in hex
|
||||
|
@ -543,8 +567,8 @@ pairs. For example:
|
|||
This feature is provided as a way of creating patterns that contain binary zero
|
||||
characters. By default, <b>pcre2test</b> passes patterns as zero-terminated
|
||||
strings to <b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED.
|
||||
However, for patterns specified in hexadecimal, the length of the pattern is
|
||||
passed.
|
||||
However, for patterns specified in hexadecimal, the actual length of the
|
||||
pattern is passed.
|
||||
</P>
|
||||
<br><b>
|
||||
JIT compilation
|
||||
|
@ -571,7 +595,7 @@ setting the size of the JIT stack.
|
|||
</P>
|
||||
<P>
|
||||
If the <b>jitfast</b> modifier is specified, matching is done using the JIT
|
||||
"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity
|
||||
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
|
||||
checks that are done by <b>pcre2_match()</b>, and of course does not work when
|
||||
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
|
||||
assumed.
|
||||
|
@ -604,11 +628,17 @@ character tables are mutually exclusive.
|
|||
Showing pattern memory
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/memory</b> modifier causes the size in bytes of the memory block used to
|
||||
hold the compiled pattern to be output. This does not include the size of the
|
||||
The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
|
||||
the compiled pattern to be output. This does not include the size of the
|
||||
<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
|
||||
subsequently passed to the JIT compiler, the size of the JIT compiled code is
|
||||
also output.
|
||||
also output. Here is an example:
|
||||
<pre>
|
||||
re> /a(b)c/jit,memory
|
||||
Memory allocation (code space): 21
|
||||
Memory allocation (JIT code): 1910
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
Limiting nested parentheses
|
||||
|
@ -650,8 +680,8 @@ enable stack availability to be checked during compilation (see the
|
|||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation for details). If the number specified by the modifier is greater
|
||||
than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up a
|
||||
callback from <b>pcre2_compile()</b> to a local function. The argument it is
|
||||
passed is the current nesting parenthesis depth; if this is greater than the
|
||||
callback from <b>pcre2_compile()</b> to a local function. The argument it
|
||||
receives is the current nesting parenthesis depth; if this is greater than the
|
||||
value given by the modifier, non-zero is returned, causing the compilation to
|
||||
be aborted.
|
||||
</P>
|
||||
|
@ -688,6 +718,7 @@ not affect the compilation process.
|
|||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
</pre>
|
||||
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
||||
|
@ -759,11 +790,11 @@ pattern.
|
|||
offset=<n> set starting offset
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
</pre>
|
||||
The effects of these modifiers are described in the following sections.
|
||||
FIXME: Give more examples.
|
||||
</P>
|
||||
<br><b>
|
||||
Showing more text
|
||||
|
@ -841,6 +872,30 @@ Any value other than zero is used as a return from <b>pcre2test</b>'s callout
|
|||
function.
|
||||
</P>
|
||||
<br><b>
|
||||
Finding all matches in a string
|
||||
</b><br>
|
||||
<P>
|
||||
Searching for all possible matches within a subject can be requested by the
|
||||
<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The difference
|
||||
between <b>global</b> and <b>altglobal</b> is that the former uses the
|
||||
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
|
||||
to start searching at a new point within the entire string (which is what Perl
|
||||
does), whereas the latter passes over a shortened substring. This makes a
|
||||
difference to the matching process if the pattern begins with a lookbehind
|
||||
assertion (including \b or \B).
|
||||
</P>
|
||||
<P>
|
||||
If an empty string is matched, the next match is done with the
|
||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
|
||||
another, non-empty, match at the same point in the subject. If this match
|
||||
fails, the start offset is advanced, and the normal match is retried. This
|
||||
imitates the way Perl handles such cases when using the <b>/g</b> modifier or
|
||||
the <b>split()</b> function. Normally, the start offset is advanced by one
|
||||
character, but if the newline convention recognizes CRLF as a newline, and the
|
||||
current character is CR followed by LF, an advance of two is used.
|
||||
</P>
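The advance rule in the last sentence above can be written as a small function. This is a sketch under stated assumptions rather than the pcre2test source: empty_match_advance() is a hypothetical name, and it assumes one code unit per character, so it does not cover UTF mode, where a character may occupy several code units.

```c
#include <stddef.h>

/* Return how far to advance the start offset after the anchored, non-empty
   retry at the same position has failed. crlf_is_newline is nonzero when the
   newline convention recognizes CRLF as a newline. */
size_t empty_match_advance(const char *subject, size_t length,
                           size_t offset, int crlf_is_newline)
{
    if (crlf_is_newline &&
        offset + 1 < length &&
        subject[offset] == '\r' &&
        subject[offset + 1] == '\n')
        return 2;   /* step over CR and LF together */
    return 1;       /* normal case: one character */
}
```

The bounds check on offset + 1 keeps a CR at the very end of the subject from being treated as the start of a CRLF pair.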
|
||||
<br><b>
|
||||
Testing substring extraction functions
|
||||
</b><br>
|
||||
<P>
|
||||
|
@ -867,28 +922,46 @@ length (that is, the return from the extraction function) is given in
|
|||
parentheses after each substring.
|
||||
</P>
|
||||
<br><b>
|
||||
Finding all matches in a string
|
||||
Testing the substitution function
|
||||
</b><br>
|
||||
<P>
|
||||
Searching for all possible matches within a subject can be requested by the
|
||||
<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The difference
|
||||
between <b>global</b> and <b>altglobal</b> is that the former uses the
|
||||
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
|
||||
to start searching at a new point within the entire string (which is what Perl
|
||||
does), whereas the latter passes over a shortened substring. This makes a
|
||||
difference to the matching process if the pattern begins with a lookbehind
|
||||
assertion (including \b or \B).
|
||||
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
|
||||
called instead of one of the matching functions. Unlike subject strings,
|
||||
<b>pcre2test</b> does not process replacement strings for escape sequences. In
|
||||
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
|
||||
If so, it is correctly converted to a UTF string of the appropriate code unit
|
||||
width. If it is not a valid UTF-8 string, the individual code units are copied
|
||||
directly. This provides a means of passing an invalid UTF-8 string for testing
|
||||
purposes.
|
||||
</P>
|
||||
<P>
|
||||
If an empty string is matched, the next match is done with the
|
||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
|
||||
another, non-empty, match at the same point in the subject. If this match
|
||||
fails, the start offset is advanced, and the normal match is retried. This
|
||||
imitates the way Perl handles such cases when using the <b>/g</b> modifier or
|
||||
the <b>split()</b> function. Normally, the start offset is advanced by one
|
||||
character, but if the newline convention recognizes CRLF as a newline, and the
|
||||
current character is CR followed by LF, an advance of two is used.
|
||||
If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
<b>pcre2_substitute()</b>. After a successful substitution, the modified string
|
||||
is output, preceded by the number of replacements. This may be zero if there
|
||||
were no matches. Here is a simple example of a substitution test:
|
||||
<pre>
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
1: =xxx=abc=
|
||||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
</pre>
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
||||
buffer overflow, if the replacement string starts with a number in square
|
||||
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
|
||||
output buffer, with the replacement string starting at the next character. Here
|
||||
is an example that tests the edge case:
|
||||
<pre>
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
1: 123XYZ123
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
</pre>
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||
matching provokes an error return ("bad option value") from
|
||||
<b>pcre2_substitute()</b>.
|
||||
</P>
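The edge case in the second example above can be checked with simple arithmetic. The sketch below is not part of PCRE2 or pcre2test: substitute_buffer_needed() is a hypothetical helper, it covers a single replacement with no dollar escapes, and the extra unit for a terminating zero is an assumption inferred from the fact that a size of 10 succeeds and 9 fails.

```c
#include <stddef.h>

/* Code units needed in the output buffer for one substitution: the subject
   with the matched part swapped for the replacement, plus one unit for a
   terminating zero (an assumption drawn from the [10]/[9] example). */
size_t substitute_buffer_needed(size_t subject_len, size_t match_len,
                                size_t replacement_len)
{
    return subject_len - match_len + replacement_len + 1;
}
```

For the subject "123abc123" (9 code units), a 3-unit match and the 3-unit replacement "XYZ", the result is 10, which is consistent with [10] succeeding and [9] giving the "no more memory" error.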
|
||||
<br><b>
|
||||
Setting the JIT stack size
|
||||
|
@ -969,10 +1042,10 @@ available for storing matching information. The default is 15.
|
|||
A value of zero is useful when testing the POSIX API because it causes
|
||||
<b>regexec()</b> to be called with a NULL capture vector. When not testing the
|
||||
POSIX API, a value of zero is used to cause
|
||||
<b>pcre2_match_data_create_from_pattern</b> to be called, in order to create a
|
||||
<b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a
|
||||
match block of exactly the right size for the pattern. (It is not possible to
|
||||
create a match block with a zero-length ovector; there is always one pair of
|
||||
offsets.)
|
||||
create a match block with a zero-length ovector; there is always at least one
|
||||
pair of offsets.)
|
||||
</P>
|
||||
<br><b>
|
||||
Passing the subject as zero-terminated
|
||||
|
@ -985,7 +1058,7 @@ be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
|
|||
this modifier has no effect, as there is no facility for passing a length.)
|
||||
</P>
|
||||
<P>
|
||||
When testing <b>pcre2_substitute</b>, this modifier also has the effect of
|
||||
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
|
||||
|
@ -1233,7 +1306,7 @@ Cambridge CB2 3QH, England.
|
|||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 09 November 2014
|
||||
Last updated: 14 November 2014
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -132,7 +132,7 @@ remaining sections, except for the \fBpcre2demo\fP section (which is a program
|
|||
listing), and the short pages for individual functions, are concatenated in
|
||||
\fBpcre2.txt\fP, for ease of searching. The sections are as follows:
|
||||
.sp
|
||||
pcre2 this document FIXME CHECK THIS LIST
|
||||
pcre2 this document
|
||||
pcre2-config show PCRE2 installation configuration information
|
||||
pcre2api details of PCRE2's native C API
|
||||
pcre2build building PCRE2
|
||||
|
|
|
@ -116,7 +116,7 @@ USER DOCUMENTATION
|
|||
tions, are concatenated in pcre2.txt, for ease of searching. The sec-
|
||||
tions are as follows:
|
||||
|
||||
pcre2 this document FIXME CHECK THIS LIST
|
||||
pcre2 this document
|
||||
pcre2-config show PCRE2 installation configuration information
|
||||
pcre2api details of PCRE2's native C API
|
||||
pcre2build building PCRE2
|
||||
|
@ -1928,12 +1928,10 @@ NEWLINE HANDLING WHEN MATCHING
|
|||
|
||||
When PCRE2 is built, a default newline convention is set; this is usu-
|
||||
ally the standard convention for the operating system. The default can
|
||||
be overridden in either a compile context or a match context. However,
|
||||
changing the newline convention at match time disables JIT matching.
|
||||
During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the
|
||||
match position is advanced after a match failure for an unanchored pat-
|
||||
tern.
|
||||
be overridden in a compile context. During matching, the newline
|
||||
choice affects the behaviour of the dot, circumflex, and dollar
|
||||
metacharacters. It may also alter the way the match position is
|
||||
advanced after a match failure for an unanchored pattern.
|
||||
|
||||
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
|
||||
set, and a match attempt for an unanchored pattern fails when the cur-
|
||||
|
@ -2320,9 +2318,10 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
|||
given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
|
||||
|
||||
In the replacement string, which is interpreted as a UTF string in UTF
|
||||
mode, a dollar character is an escape character that can specify the
|
||||
insertion of characters from capturing groups in the pattern. The fol-
|
||||
lowing forms are recognized:
|
||||
mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
|
||||
option is set, a dollar character is an escape character that can spec-
|
||||
ify the insertion of characters from capturing groups in the pattern.
|
||||
The following forms are recognized:
|
||||
|
||||
$$ insert a dollar character
|
||||
$<n> insert the contents of group <n>
|
||||
|
@ -3508,11 +3507,12 @@ AVAILABILITY OF JIT SUPPORT
|
|||
built if you want to use JIT. The support is limited to the following
|
||||
hardware platforms:
|
||||
|
||||
ARM v5, v7, and Thumb2
|
||||
ARM 32-bit (v5, v7, and Thumb2)
|
||||
ARM 64-bit
|
||||
Intel x86 32-bit and 64-bit
|
||||
MIPS 32-bit
|
||||
MIPS 32-bit and 64-bit
|
||||
Power PC 32-bit and 64-bit
|
||||
SPARC 32-bit (experimental)
|
||||
SPARC 32-bit
|
||||
|
||||
If --enable-jit is set on an unsupported platform, compilation fails.
|
||||
|
||||
|
@ -3531,10 +3531,10 @@ SIMPLE USE OF JIT
|
|||
is to call pcre2_jit_compile() after successfully compiling a pattern
|
||||
with pcre2_compile(). This function has two arguments: the first is the
|
||||
compiled pattern pointer that was returned by pcre2_compile(), and the
|
||||
second is a set of option bits, which must include at least one of
|
||||
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
|
||||
second is zero or more of the following option bits: PCRE2_JIT_COM-
|
||||
PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
|
||||
|
||||
If JIT support is not available, a call to pcre2_jit_comple() does
|
||||
If JIT support is not available, a call to pcre2_jit_compile() does
|
||||
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
|
||||
pattern is passed to the JIT compiler, which turns it into machine code
|
||||
that executes much faster than the normal interpretive code, but yields
|
||||
|
@ -3550,6 +3550,19 @@ SIMPLE USE OF JIT
|
|||
pcre2_match() is called, the appropriate code is run if it is avail-
|
||||
able. Otherwise, the pattern is matched using interpretive code.
|
||||
|
||||
You can call pcre2_jit_compile() multiple times for the same compiled
|
||||
pattern. It does nothing if it has previously compiled code for any of
|
||||
the option bits. For example, you can call it once with PCRE2_JIT_COM-
|
||||
PLETE and (perhaps later, when you find you need partial matching)
|
||||
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
|
||||
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
|
||||
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
|
||||
diately returns zero. This is an alternative way of testing if JIT is
|
||||
available.
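
The repeated-call behaviour described above can be pictured as simple bit bookkeeping. The following is a minimal sketch, not PCRE2 code: the MODEL_* names, their values, the model_pattern type, and model_jit_compile() are all hypothetical, and unlike the real function this model returns the set of modes it had to compile so that the effect is visible.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the three JIT option bits. The values are
   chosen for this sketch only and are not taken from pcre2.h. */
#define MODEL_JIT_COMPLETE      0x1u
#define MODEL_JIT_PARTIAL_SOFT  0x2u
#define MODEL_JIT_PARTIAL_HARD  0x4u

/* Hypothetical record of the modes for which JIT code already exists. */
typedef struct {
    uint32_t compiled_modes;
} model_pattern;

/* Returns the modes that this call actually had to compile. Modes that
   were compiled by an earlier call are skipped, and a call with no
   option bits compiles nothing and returns zero. */
uint32_t model_jit_compile(model_pattern *p, uint32_t options)
{
    uint32_t todo = options & ~p->compiled_modes;
    p->compiled_modes |= todo;
    return todo;
}
```

In this model, a first call with MODEL_JIT_COMPLETE followed by a call with MODEL_JIT_COMPLETE | MODEL_JIT_PARTIAL_HARD compiles only the partial-matching mode the second time, which mirrors the example in the text.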
|
||||
|
||||
At present, it is not possible to free JIT compiled code except when
|
||||
the entire compiled pattern is freed by calling pcre2_free_code().
|
||||
|
||||
In some circumstances you may need to call additional functions. These
|
||||
are described in the section entitled "Controlling the JIT stack"
|
||||
below.
|
||||
|
@ -3618,7 +3631,7 @@ CONTROLLING THE JIT STACK
|
|||
pcre2_jit_stack, or NULL if there is an error. The
|
||||
pcre2_jit_stack_free() function is used to free a stack that is no
|
||||
longer needed. (For the technically minded: the address space is allo-
|
||||
cated by mmap or VirtualAlloc.) FIXME Is this right?
|
||||
cated by mmap or VirtualAlloc.)
|
||||
|
||||
JIT uses far less memory for recursion than the interpretive code, and
|
||||
a maximum stack size of 512K to 1M should be more than enough for any
|
||||
|
@ -3637,7 +3650,8 @@ CONTROLLING THE JIT STACK
|
|||
two options:
|
||||
|
||||
(1) If callback is NULL and data is NULL, an internal 32K block
|
||||
on the machine stack is used.
|
||||
on the machine stack is used. This is the default when a match
|
||||
context is created.
|
||||
|
||||
(2) If callback is NULL and data is not NULL, data must be
|
||||
a pointer to a valid JIT stack, the result of calling
|
||||
|
@ -3840,7 +3854,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 08 November 2014
|
||||
Last updated: 12 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1063,7 +1063,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
|
|||
Which characters are interpreted as newlines can be specified by a setting in
|
||||
the compile context that is passed to \fBpcre2_compile()\fP or by a special
|
||||
sequence at the start of the pattern, as described in the section entitled
|
||||
.\" HTML <a href="pcrepattern.html#newlines">
|
||||
.\" HTML <a href="pcre2pattern.html#newlines">
|
||||
.\" </a>
|
||||
"Newline conventions"
|
||||
.\"
|
||||
|
@ -1226,7 +1226,7 @@ This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
|
|||
\ew, and some of the POSIX character classes. By default, only ASCII characters
|
||||
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
|
||||
classify characters. More details are given in the section on
|
||||
.\" HTML <a href="pcre2.html#genericchartypes">
|
||||
.\" HTML <a href="pcre2pattern.html#genericchartypes">
|
||||
.\" </a>
|
||||
generic character types
|
||||
.\"
|
||||
|
@ -1939,17 +1939,11 @@ documentation.
|
|||
.sp
|
||||
When PCRE2 is built, a default newline convention is set; this is usually the
|
||||
standard convention for the operating system. The default can be overridden in
|
||||
either a
|
||||
a
|
||||
.\" HTML <a href="#compilecontext">
|
||||
.\" </a>
|
||||
compile context
|
||||
compile context.
|
||||
.\"
|
||||
or a
|
||||
.\" HTML <a href="#matchcontext">
|
||||
.\" </a>
|
||||
match context.
|
||||
.\"
|
||||
However, changing the newline convention at match time disables JIT matching.
|
||||
During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||
position is advanced after a match failure for an unanchored pattern.
|
||||
|
@ -2322,7 +2316,7 @@ appropriate offset in the ovector, which contains PCRE2_UNSET for unset
|
|||
substrings.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="extractbynname"></a>
|
||||
.\" HTML <a name="extractbyname"></a>
|
||||
.SH "EXTRACTING CAPTURED SUBSTRINGS BY NAME"
|
||||
.rs
|
||||
.sp
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "03 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -63,8 +63,8 @@ page.
|
|||
.P
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
|
||||
option is set at compile time, (*UTF) is not allowed, and its appearance causes
|
||||
an error.
|
||||
option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
|
||||
appearance in a pattern causes an error.
|
||||
.
|
||||
.
|
||||
.SS "Unicode property support"
|
||||
|
@ -75,6 +75,21 @@ This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
|||
such as \ed and \ew to use Unicode properties to determine character types,
|
||||
instead of recognizing only characters with codes less than 128 via a lookup
|
||||
table.
|
||||
.P
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
|
||||
\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
|
||||
causes an error.
|
||||
.
|
||||
.
|
||||
.SS "Locking out empty string matching"
|
||||
.rs
|
||||
.sp
|
||||
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
|
||||
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
|
||||
matching function is subsequently called to match the pattern. These options
|
||||
lock out the matching of empty strings, either entirely, or only at the start
|
||||
of the subject.
|
||||
.
|
||||
.
|
||||
.SS "Disabling auto-possessification"
|
||||
|
@ -102,6 +117,28 @@ reaching "no match" results. For more details, see the
|
|||
documentation.
|
||||
.
|
||||
.
|
||||
.SS "Setting match and recursion limits"
|
||||
.rs
|
||||
.sp
|
||||
The caller of \fBpcre2_match()\fP can set a limit on the number of times the
|
||||
internal \fBmatch()\fP function is called and on the maximum depth of
|
||||
recursive calls. These facilities are provided to catch runaway matches that
|
||||
are provoked by patterns with huge matching trees (a typical example is a
|
||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
||||
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
|
||||
gives an error return. The limits can also be set by items at the start of the
|
||||
pattern of the form
|
||||
.sp
|
||||
(*LIMIT_MATCH=d)
|
||||
(*LIMIT_RECURSION=d)
|
||||
.sp
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
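
The rule that a pattern can lower a limit but never raise it amounts to taking a minimum. The sketch below is an illustration only: effective_limit() is a hypothetical helper and not part of the PCRE2 API, and it assumes the (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) values have already been parsed into an array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine the limit set (or defaulted) by the caller
   with any number of settings found at the start of the pattern. A pattern
   setting only takes effect if it is lower than the current limit, so the
   result is the smallest of all the values. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_settings, size_t count)
{
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        if (pattern_settings[i] < limit)
            limit = pattern_settings[i];
    }
    return limit;
}
```

For example, with a caller limit of 1000 and pattern settings of 5000 and 200, the helper yields 200; with only the 5000 setting it yields 1000, because the pattern cannot raise the limit.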
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="newlines"></a>
|
||||
.SS "Newline conventions"
|
||||
.rs
|
||||
|
@ -153,26 +190,14 @@ below. A change of \eR setting can be combined with a change of newline
|
|||
convention.
|
||||
.
|
||||
.
|
||||
.SS "Setting match and recursion limits"
|
||||
.SS "Specifying what \eR matches"
|
||||
.rs
|
||||
.sp
|
||||
The caller of \fBpcre2_match()\fP can set a limit on the number of times the
|
||||
internal \fBmatch()\fP function is called and on the maximum depth of
|
||||
recursive calls. These facilities are provided to catch runaway matches that
|
||||
are provoked by patterns with huge matching trees (a typical example is a
|
||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
||||
by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
|
||||
gives an error return. The limits can also be set by items at the start of the
|
||||
pattern of the form
|
||||
.sp
|
||||
(*LIMIT_MATCH=d)
|
||||
(*LIMIT_RECURSION=d)
|
||||
.sp
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
|
||||
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
|
||||
at compile time. This effect can also be achieved by starting a pattern with
|
||||
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
|
||||
corresponding to PCRE2_BSR_UNICODE.
|
||||
.
|
||||
.
|
||||
.SH "EBCDIC CHARACTER CODES"
|
||||
|
@ -2302,8 +2327,8 @@ complex:
|
|||
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
|
||||
.sp
|
||||
.P
|
||||
There are four kinds of condition: references to subpatterns, references to
|
||||
recursion, a pseudo-condition called DEFINE, and assertions.
|
||||
There are five kinds of condition: references to subpatterns, references to
|
||||
recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
|
||||
.
|
||||
.
|
||||
.SS "Checking for a used subpattern by number"
|
||||
|
@ -2418,6 +2443,23 @@ pattern uses references to the named group to match the four dot-separated
|
|||
components of an IPv4 address, insisting on a word boundary at each end.
|
||||
.
|
||||
.
|
||||
.SS "Checking the PCRE2 version"
|
||||
.rs
|
||||
.sp
|
||||
Programs that link with a PCRE2 library can check the version by calling
|
||||
\fBpcre2_config()\fP with appropriate arguments. Users of applications that do
|
||||
not have access to the underlying code cannot do this. A special "condition"
|
||||
called VERSION exists to allow such users to discover which version of PCRE2
|
||||
they are dealing with by using this condition to match a string such as
|
||||
"yesno". VERSION must be followed either by "=" or ">=" and a version number.
|
||||
For example:
|
||||
.sp
|
||||
(?(VERSION>=10.4)yes|no)
|
||||
.sp
|
||||
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
|
||||
"no" otherwise.
|
||||
.
|
||||
.
|
||||
.SS "Assertion conditions"
|
||||
.rs
|
||||
.sp
|
||||
|
@ -3219,7 +3261,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
|
|||
.rs
|
||||
.sp
|
||||
\fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
|
||||
\fBpcre2syntax\fP(3), \fBpcre2\fP(3), \fBpcre216(3)\fP, \fBpcre232(3)\fP.
|
||||
\fBpcre2syntax\fP(3), \fBpcre2\fP(3).
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
|
@ -3236,6 +3278,6 @@ Cambridge CB2 3QH, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 03 November 2014
|
||||
Last updated: 14 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -470,17 +470,18 @@ Each top-level branch of a look behind must be of a fixed length.
|
|||
(?(condition)yes-pattern)
|
||||
(?(condition)yes-pattern|no-pattern)
|
||||
.sp
|
||||
(?(n)... absolute reference condition
|
||||
(?(+n)... relative reference condition
|
||||
(?(-n)... relative reference condition
|
||||
(?(<name>)... named reference condition (Perl)
|
||||
(?('name')... named reference condition (Perl)
|
||||
(?(name)... named reference condition (PCRE2)
|
||||
(?(R)... overall recursion condition
|
||||
(?(Rn)... specific group recursion condition
|
||||
(?(R&name)... specific recursion condition
|
||||
(?(DEFINE)... define subpattern for reference
|
||||
(?(assert)... assertion condition
|
||||
(?(n) absolute reference condition
|
||||
(?(+n) relative reference condition
|
||||
(?(-n) relative reference condition
|
||||
(?(<name>) named reference condition (Perl)
|
||||
(?('name') named reference condition (Perl)
|
||||
(?(name) named reference condition (PCRE2)
|
||||
(?(R) overall recursion condition
|
||||
(?(Rn) specific group recursion condition
|
||||
(?(R&name) specific recursion condition
|
||||
(?(DEFINE) define subpattern for reference
|
||||
(?(VERSION[>]=n.m) test PCRE2 version
|
||||
(?(assert) assertion condition
|
||||
.
|
||||
.
|
||||
.SH "BACKTRACKING CONTROL"
|
||||
|
@ -535,6 +536,6 @@ Cambridge CB2 3QH, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Last updated: 14 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2TEST 1 "12 November 2014" "PCRE 10.00"
|
||||
.TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
|
||||
.SH NAME
|
||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
|
@ -450,7 +450,6 @@ about the pattern:
|
|||
tables=[0|1|2] select internal tables
|
||||
.sp
|
||||
The effects of these modifiers are described in the following sections.
|
||||
FIXME: Give more examples.
|
||||
.
|
||||
.
|
||||
.SS "Newline and \eR handling"
|
||||
|
@ -484,7 +483,31 @@ one-off tests.
|
|||
.P
|
||||
The \fBinfo\fP modifier requests information about the compiled pattern
|
||||
(whether it is anchored, has a fixed first character, and so on). The
|
||||
information is obtained from the \fBpcre2_pattern_info()\fP function.
|
||||
information is obtained from the \fBpcre2_pattern_info()\fP function. Here are
|
||||
some typical examples:
|
||||
.sp
|
||||
re> /(?i)(^a|^b)/m,info
|
||||
Capturing subpattern count = 1
|
||||
Compile options: multiline
|
||||
Overall options: caseless multiline
|
||||
First code unit at start or follows newline
|
||||
Subject length lower bound = 1
|
||||
.sp
|
||||
re> /(?i)abc/info
|
||||
Capturing subpattern count = 0
|
||||
Compile options: <none>
|
||||
Overall options: caseless
|
||||
First code unit = 'a' (caseless)
|
||||
Last code unit = 'c' (caseless)
|
||||
Subject length lower bound = 3
|
||||
.sp
|
||||
"Compile options" are those specified to the compile function; "overall
|
||||
options" have added options that are taken or deduced from the pattern. If both
|
||||
sets of options are the same, just a single "options" line is output. "First
|
||||
code unit" is where any match must start; if there is more than one they are
|
||||
listed as "starting code units". "Last code unit" is the last literal code unit
|
||||
that must be present in any match. This is not necessarily the last character.
|
||||
These lines are omitted if no starting or ending code units are recorded.
|
||||
.
|
||||
.
|
||||
.SS "Specifying a pattern in hex"
|
||||
|
@ -499,8 +522,8 @@ pairs. For example:
|
|||
This feature is provided as a way of creating patterns that contain binary zero
|
||||
characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
|
||||
strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
|
||||
However, for patterns specified in hexadecimal, the length of the pattern is
|
||||
passed.
|
||||
However, for patterns specified in hexadecimal, the actual length of the
|
||||
pattern is passed.
|
||||
.
|
||||
.
|
||||
.SS "JIT compilation"
|
||||
|
@ -528,7 +551,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
|
|||
setting the size of the JIT stack.
|
||||
.P
|
||||
If the \fBjitfast\fP modifier is specified, matching is done using the JIT
|
||||
"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity
|
||||
"fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
|
||||
checks that are done by \fBpcre2_match()\fP, and of course does not work when
|
||||
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
|
||||
assumed.
|
||||
|
@ -560,11 +583,16 @@ character tables are mutually exclusive.
|
|||
.SS "Showing pattern memory"
|
||||
.rs
|
||||
.sp
|
||||
The \fB/memory\fP modifier causes the size in bytes of the memory block used to
|
||||
hold the compiled pattern to be output. This does not include the size of the
|
||||
The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
|
||||
the compiled pattern to be output. This does not include the size of the
|
||||
\fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
|
||||
subsequently passed to the JIT compiler, the size of the JIT compiled code is
|
||||
also output.
|
||||
also output. Here is an example:
|
||||
.sp
|
||||
re> /a(b)c/jit,memory
|
||||
Memory allocation (code space): 21
|
||||
Memory allocation (JIT code): 1910
|
||||
.sp
|
||||
.
|
||||
.
|
||||
.SS "Limiting nested parentheses"
|
||||
|
@ -608,8 +636,8 @@ enable stack availability to be checked during compilation (see the
|
|||
.\"
|
||||
documentation for details). If the number specified by the modifier is greater
|
||||
than zero, \fBpcre2_set_compile_recursion_guard()\fP is called to set up a
|
||||
callback from \fBpcre2_compile()\fP to a local function. The argument it is
|
||||
passed is the current nesting parenthesis depth; if this is greater than the
|
||||
callback from \fBpcre2_compile()\fP to a local function. The argument it
|
||||
receives is the current nesting parenthesis depth; if this is greater than the
|
||||
value given by the modifier, non-zero is returned, causing the compilation to
|
||||
be aborted.
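
A guard of the kind described here can be sketched as follows. This is a hedged illustration rather than the code used by pcre2test: the name depth_guard is hypothetical, the (uint32_t, void *) signature is an assumption about the callback type, and the user data is assumed to point at the maximum depth given by the modifier.

```c
#include <stdint.h>

/* Hypothetical recursion guard. The first argument is the current
   parenthesis nesting depth; user_data is assumed to point at the
   maximum depth allowed. A non-zero return asks for the compilation
   to be aborted. */
int depth_guard(uint32_t depth, void *user_data)
{
    const uint32_t *max_depth = (const uint32_t *)user_data;
    return depth > *max_depth;
}
```

With a maximum of 3, a depth of 3 is accepted and a depth of 4 is rejected, matching the "greater than the value given by the modifier" wording.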
|
||||
.
|
||||
|
@ -726,7 +754,6 @@ pattern.
|
|||
zero_terminate pass the subject as zero-terminated
|
||||
.sp
|
||||
The effects of these modifiers are described in the following sections.
|
||||
FIXME: Give more examples.
|
||||
.
|
||||
.
|
||||
.SS "Showing more text"
|
||||
|
@ -867,15 +894,22 @@ were no matches. Here is a simple example of a substitution test:
|
|||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
1: =xxx=abc=
|
||||
=abc=abc=\=global
|
||||
=abc=abc=\e=global
|
||||
2: =xxx=xxx=
|
||||
.sp
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
||||
buffer overflow, if the replacement string starts with a number in square
|
||||
brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
|
||||
output buffer, with the replacement string starting at the next character.
|
||||
.P
|
||||
output buffer, with the replacement string starting at the next character. Here
|
||||
is an example that tests the edge case:
|
||||
.sp
|
||||
/abc/
|
||||
123abc123\e=replace=[10]XYZ
|
||||
1: 123XYZ123
|
||||
123abc123\e=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
.sp
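
The edge case above is consistent with the output buffer needing room for the nine-character result plus one extra code unit, presumably a terminating zero; that reading is an inference from the example, not a statement from the text. The sketch below models it with a plain literal search. model_substitute() is hypothetical, handles only the first occurrence of a literal string, and returns -1 where the real function reports its "no more memory" error.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a single substitution of a literal string.
   out_size counts the whole buffer, including room for the terminating
   zero. Returns the length of the result, or -1 if it does not fit. */
long model_substitute(const char *subject, const char *literal,
                      const char *replacement, char *out, size_t out_size)
{
    const char *hit = strstr(subject, literal);
    size_t subject_len = strlen(subject);
    size_t lit_len = strlen(literal);
    size_t rep_len = strlen(replacement);
    size_t result_len = hit ? subject_len - lit_len + rep_len : subject_len;

    /* One extra unit is needed for the terminating zero. */
    if (result_len + 1 > out_size)
        return -1;

    if (hit == NULL) {
        memcpy(out, subject, subject_len + 1);
        return (long)result_len;
    }

    size_t prefix = (size_t)(hit - subject);
    memcpy(out, subject, prefix);                    /* text before the match */
    memcpy(out + prefix, replacement, rep_len);      /* the replacement       */
    memcpy(out + prefix + rep_len, hit + lit_len,    /* tail plus the zero    */
           subject_len - prefix - lit_len + 1);
    return (long)result_len;
}
```

Under this model, replacing "abc" with "XYZ" in "123abc123" succeeds with a buffer of 10 and fails with a buffer of 9, as in the test output shown.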
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||
matching provokes an error return ("bad option value") from
|
||||
\fBpcre2_substitute()\fP.
|
||||
|
@ -957,10 +991,10 @@ available for storing matching information. The default is 15.
|
|||
A value of zero is useful when testing the POSIX API because it causes
|
||||
\fBregexec()\fP to be called with a NULL capture vector. When not testing the
|
||||
POSIX API, a value of zero is used to cause
|
||||
\fBpcre2_match_data_create_from_pattern\fP to be called, in order to create a
|
||||
\fBpcre2_match_data_create_from_pattern()\fP to be called, in order to create a
|
||||
match block of exactly the right size for the pattern. (It is not possible to
|
||||
create a match block with a zero-length ovector; there is always one pair of
|
||||
offsets.)
|
||||
create a match block with a zero-length ovector; there is always at least one
|
||||
pair of offsets.)
|
||||
.
|
||||
.
|
||||
.SS "Passing the subject as zero-terminated"
|
||||
|
@ -972,7 +1006,7 @@ string, the \fBzero_terminate\fP modifier is provided. It causes the length to
|
|||
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
|
||||
this modifier has no effect, as there is no facility for passing a length.)
|
||||
.P
|
||||
When testing \fBpcre2_substitute\fP, this modifier also has the effect of
|
||||
When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
.
|
||||
.
|
||||
|
@ -1237,6 +1271,6 @@ Cambridge CB2 3QH, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 12 November 2014
|
||||
Last updated: 14 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -150,11 +150,12 @@ COMMAND LINE OPTIONS
|
|||
Behave as if each subject line contains the given modifiers.
|
||||
|
||||
-t Run each compile and match many times with a timer, and out-
|
||||
put the resulting times per compile or match. You can control
|
||||
the number of iterations that are used for timing by follow-
|
||||
ing -t with a number (as a separate item on the command
|
||||
line). For example, "-t 1000" iterates 1000 times. The
|
||||
default is to iterate 500,000 times.
|
||||
put the resulting times per compile or match. When JIT is
|
||||
used, separate times are given for the initial compile and
|
||||
the JIT compile. You can control the number of iterations
|
||||
that are used for timing by following -t with a number (as a
|
||||
separate item on the command line). For example, "-t 1000"
|
||||
iterates 1000 times. The default is to iterate 500,000 times.
|
||||
|
||||
-tm This is like -t except that it times only the matching phase,
|
||||
not the compile phase.
|
||||
|
@ -437,7 +438,6 @@ PATTERN MODIFIERS
|
|||
tables=[0|1|2] select internal tables
|
||||
|
||||
The effects of these modifiers are described in the following sections.
|
||||
FIXME: Give more examples.
|
||||
|
||||
Newline and \R handling
|
||||
|
||||
|
@ -468,7 +468,32 @@ PATTERN MODIFIERS
|
|||
|
||||
The info modifier requests information about the compiled pattern
|
||||
(whether it is anchored, has a fixed first character, and so on). The
|
||||
information is obtained from the pcre2_pattern_info() function.
|
||||
information is obtained from the pcre2_pattern_info() function. Here
|
||||
are some typical examples:
|
||||
|
||||
re> /(?i)(^a|^b)/m,info
|
||||
Capturing subpattern count = 1
|
||||
Compile options: multiline
|
||||
Overall options: caseless multiline
|
||||
First code unit at start or follows newline
|
||||
Subject length lower bound = 1
|
||||
|
||||
re> /(?i)abc/info
|
||||
Capturing subpattern count = 0
|
||||
Compile options: <none>
|
||||
Overall options: caseless
|
||||
First code unit = 'a' (caseless)
|
||||
Last code unit = 'c' (caseless)
|
||||
Subject length lower bound = 3
|
||||
|
||||
"Compile options" are those specified to the compile function; "overall
|
||||
options" have added options that are taken or deduced from the pattern.
|
||||
If both sets of options are the same, just a single "options" line is
|
||||
output. "First code unit" is where any match must start; if there is
|
||||
more than one they are listed as "starting code units". "Last code
|
||||
unit" is the last literal code unit that must be present in any match.
|
||||
This is not necessarily the last character. These lines are omitted if
|
||||
no starting or ending code units are recorded.
|
||||
|
||||
Specifying a pattern in hex
|
||||
|
||||
|
@ -482,7 +507,7 @@ PATTERN MODIFIERS
|
|||
binary zero characters. By default, pcre2test passes patterns as zero-
|
||||
terminated strings to pcre2_compile(), giving the length as
|
||||
PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal,
|
||||
the length of the pattern is passed.
|
||||
the actual length of the pattern is passed.
|
||||
|
||||
JIT compilation
|
||||
|
||||
|
@ -505,7 +530,7 @@ PATTERN MODIFIERS
|
|||
size of the JIT stack.
|
||||
|
||||
If the jitfast modifier is specified, matching is done using the JIT
|
||||
"fast path" interface (pcre2_jit_match()), which skips some of the san-
|
||||
"fast path" interface, pcre2_jit_match(), which skips some of the san-
|
||||
ity checks that are done by pcre2_match(), and of course does not work
|
||||
when JIT is not supported. If jitfast is specified without jit, jit=7
|
||||
is assumed.
|
||||
|
@ -533,11 +558,16 @@ PATTERN MODIFIERS
|
|||
|
||||
Showing pattern memory
|
||||
|
||||
The /memory modifier causes the size in bytes of the memory block used
|
||||
to hold the compiled pattern to be output. This does not include the
|
||||
size of the pcre2_code block; it is just the actual compiled data. If
|
||||
the pattern is subsequently passed to the JIT compiler, the size of the
|
||||
JIT compiled code is also output.
|
||||
The /memory modifier causes the size in bytes of the memory used to
|
||||
hold the compiled pattern to be output. This does not include the size
|
||||
of the pcre2_code block; it is just the actual compiled data. If the
|
||||
pattern is subsequently passed to the JIT compiler, the size of the JIT
|
||||
compiled code is also output. Here is an example:
|
||||
|
||||
re> /a(b)c/jit,memory
|
||||
Memory allocation (code space): 21
|
||||
Memory allocation (JIT code): 1910
|
||||
|
||||
|
||||
Limiting nested parentheses
|
||||
|
||||
|
@ -573,7 +603,7 @@ PATTERN MODIFIERS
|
|||
mentation for details). If the number specified by the modifier is
|
||||
greater than zero, pcre2_set_compile_recursion_guard() is called to set
|
||||
up a callback from pcre2_compile() to a local function. The argument it
|
||||
is passed is the current nesting parenthesis depth; if this is greater
|
||||
receives is the current nesting parenthesis depth; if this is greater
|
||||
than the value given by the modifier, non-zero is returned, causing the
|
||||
compilation to be aborted.
|
||||
|
||||
|
@ -606,6 +636,7 @@ PATTERN MODIFIERS
|
|||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
|
||||
These modifiers may not appear in a #pattern command. If you want them
|
||||
|
@ -671,11 +702,11 @@ SUBJECT MODIFIERS
|
|||
offset=<n> set starting offset
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
|
||||
The effects of these modifiers are described in the following sections.
|
||||
FIXME: Give more examples.
|
||||
|
||||
Showing more text
|
||||
|
||||
|
@ -745,6 +776,28 @@ SUBJECT MODIFIERS
|
|||
ber. Any value other than zero is used as a return from pcre2test's
|
||||
callout function.
|
||||
|
||||
Finding all matches in a string
|
||||
|
||||
Searching for all possible matches within a subject can be requested by
|
||||
the global or /altglobal modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The
|
||||
difference between global and altglobal is that the former uses the
|
||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||
searching at a new point within the entire string (which is what Perl
|
||||
does), whereas the latter passes over a shortened substring. This makes
|
||||
a difference to the matching process if the pattern begins with a look-
|
||||
behind assertion (including \b or \B).
|
||||
|
||||
If an empty string is matched, the next match is done with the
|
||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
||||
for another, non-empty, match at the same point in the subject. If this
|
||||
match fails, the start offset is advanced, and the normal match is
|
||||
retried. This imitates the way Perl handles such cases when using the
|
||||
/g modifier or the split() function. Normally, the start offset is
|
||||
advanced by one character, but if the newline convention recognizes
|
||||
CRLF as a newline, and the current character is CR followed by LF, an
|
||||
advance of two is used.
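
The advance step in the paragraph above can be sketched as a small helper. This is an illustration under stated assumptions, not pcre2test's code: advance_after_empty_match() is a hypothetical name, the flag crlf_is_newline stands for "the newline convention recognizes CRLF", and every character is assumed to occupy one code unit (so UTF multi-unit characters are not handled).

```c
#include <stddef.h>

/* Hypothetical helper: given the offset at which an empty match was found
   and the anchored, non-empty retry has failed, return the offset at which
   the normal match should be retried. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    /* Step over CR LF as a unit when CRLF counts as a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    /* Otherwise advance by one character (one code unit in this model). */
    return offset + 1;
}
```

For the subject "a\r\nb", an empty match at offset 1 leads to a retry at offset 3 when CRLF is a recognized newline, and at offset 2 otherwise.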
|
||||
|
||||
Testing substring extraction functions
|
||||
|
||||
The copy and get modifiers can be used to test the pcre2_sub-
|
||||
|
@ -767,27 +820,45 @@ SUBJECT MODIFIERS
|
|||
full list. The string length (that is, the return from the extraction
|
||||
function) is given in parentheses after each substring.
|
||||
|
||||
Finding all matches in a string
|
||||
Testing the substitution function
|
||||
|
||||
Searching for all possible matches within a subject can be requested by
|
||||
the global or /altglobal modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The
|
||||
difference between global and altglobal is that the former uses the
|
||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||
searching at a new point within the entire string (which is what Perl
|
||||
does), whereas the latter passes over a shortened substring. This makes
|
||||
a difference to the matching process if the pattern begins with a look-
|
||||
behind assertion (including \b or \B).
|
||||
If the replace modifier is set, the pcre2_substitute() function is
|
||||
called instead of one of the matching functions. Unlike subject
|
||||
strings, pcre2test does not process replacement strings for escape
|
||||
sequences. In UTF mode, a replacement string is checked to see if it is
|
||||
a valid UTF-8 string. If so, it is correctly converted to a UTF string
|
||||
of the appropriate code unit width. If it is not a valid UTF-8 string,
|
||||
the individual code units are copied directly. This provides a means of
|
||||
passing an invalid UTF-8 string for testing purposes.
|
||||
|
||||
If an empty string is matched, the next match is done with the
|
||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
||||
for another, non-empty, match at the same point in the subject. If this
|
||||
match fails, the start offset is advanced, and the normal match is
|
||||
retried. This imitates the way Perl handles such cases when using the
|
||||
/g modifier or the split() function. Normally, the start offset is
|
||||
advanced by one character, but if the newline convention recognizes
|
||||
CRLF as a newline, and the current character is CR followed by LF, an
|
||||
advance of two is used.
|
||||
If the global modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
pcre2_substitute(). After a successful substitution, the modified
|
||||
string is output, preceded by the number of replacements. This may be
|
||||
zero if there were no matches. Here is a simple example of a substitu-
|
||||
tion test:
|
||||
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
1: =xxx=abc=
|
||||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to
|
||||
test for buffer overflow, if the replacement string starts with a num-
|
||||
ber in square brackets, that number is passed to pcre2_substitute() as
|
||||
the size of the output buffer, with the replacement string starting at
|
||||
the next character. Here is an example that tests the edge case:
|
||||
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
1: 123XYZ123
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
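
The edge case in this example comes from the terminating zero: "123XYZ123" is nine code units, so the output needs ten. The sketch below reproduces that arithmetic with a plain literal replacement. It is a stand-in written for illustration only; replace_first_bounded is a hypothetical name, it does no regular expression matching, and it is not how pcre2_substitute() is implemented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the size check, not the PCRE2 implementation:
   replaces the first literal occurrence of `needle` in `subject` with
   `repl`, writing at most `outsize` bytes (including the terminating zero)
   to `out`. Returns the number of replacements (0 or 1), or -1 when the
   result does not fit. */
static int replace_first_bounded(const char *subject, const char *needle,
                                 const char *repl, char *out, size_t outsize)
{
  const char *hit = strstr(subject, needle);
  size_t slen = strlen(subject);
  size_t nlen = strlen(needle);
  size_t rlen = strlen(repl);
  size_t need;
  size_t prefix;

  if (hit == NULL || nlen == 0) {
    need = slen + 1;                    /* unchanged subject plus the zero */
    if (need > outsize) return -1;
    memcpy(out, subject, need);
    return 0;
  }

  need = slen - nlen + rlen + 1;        /* result length plus the zero */
  if (need > outsize) return -1;

  prefix = (size_t)(hit - subject);
  memcpy(out, subject, prefix);                   /* text before the match */
  memcpy(out + prefix, repl, rlen);               /* the replacement */
  memcpy(out + prefix + rlen, hit + nlen,
         slen - prefix - nlen + 1);               /* tail and terminating zero */
  return 1;
}
```

Called with subject "123abc123", needle "abc" and replacement "XYZ", the helper succeeds with an output size of 10 and returns -1 with an output size of 9, mirroring the two test lines above.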
|
||||
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying
|
||||
partial matching provokes an error return ("bad option value") from
|
||||
pcre2_substitute().
|
||||
|
||||
Setting the JIT stack size
|
||||
|
||||
|
@ -853,10 +924,10 @@ SUBJECT MODIFIERS
|
|||
A value of zero is useful when testing the POSIX API because it causes
|
||||
regexec() to be called with a NULL capture vector. When not testing the
|
||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||
ate_from_pattern to be called, in order to create a match block of
|
||||
ate_from_pattern() to be called, in order to create a match block of
|
||||
exactly the right size for the pattern. (It is not possible to create a
|
||||
match block with a zero-length ovector; there is always one pair of
|
||||
offsets.)
|
||||
match block with a zero-length ovector; there is always at least one
|
||||
pair of offsets.)
|
||||
|
||||
Passing the subject as zero-terminated
|
||||
|
||||
|
@ -867,7 +938,7 @@ SUBJECT MODIFIERS
|
|||
via the POSIX interface, this modifier has no effect, as there is no
|
||||
facility for passing a length.)
|
||||
|
||||
When testing pcre2_substitute, this modifier also has the effect of
|
||||
When testing pcre2_substitute(), this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
|
||||
|
||||
|
@ -1112,5 +1183,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 09 November 2014
|
||||
Last updated: 14 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
|
|
|
@ -84,7 +84,7 @@ uint32_t ovector_count;
|
|||
uint32_t goptions = 0;
|
||||
BOOL match_data_created = FALSE;
|
||||
BOOL global = FALSE;
|
||||
PCRE2_SIZE buff_offset, lengthleft, endlength;
|
||||
PCRE2_SIZE buff_offset, lengthleft, fraglength;
|
||||
PCRE2_SIZE *ovector;
|
||||
|
||||
/* Partial matching is not valid. */
|
||||
|
@ -154,14 +154,17 @@ do
|
|||
|
||||
/* Any error other than no match returns the error code. No match when not
|
||||
doing the special after-empty-match global rematch, or when at the end of the
|
||||
subject, breaks the global loop. Otherwise, advance the starting point and
|
||||
try again. */
|
||||
subject, breaks the global loop. Otherwise, advance the starting point by one
|
||||
character, copying it to the output, and try again. */
|
||||
|
||||
if (rc < 0)
|
||||
{
|
||||
PCRE2_SIZE save_start;
|
||||
|
||||
if (rc != PCRE2_ERROR_NOMATCH) goto EXIT;
|
||||
if (goptions == 0 || start_offset >= length) break;
|
||||
start_offset++;
|
||||
|
||||
save_start = start_offset++;
|
||||
if ((code->overall_options & PCRE2_UTF) != 0)
|
||||
{
|
||||
#if PCRE2_CODE_UNIT_WIDTH == 8
|
||||
|
@ -173,6 +176,14 @@ do
|
|||
start_offset++;
|
||||
#endif
|
||||
}
|
||||
|
||||
fraglength = start_offset - save_start;
|
||||
if (lengthleft < fraglength) goto NOROOM;
|
||||
memcpy(buffer + buff_offset, subject + save_start,
|
||||
fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
|
||||
buff_offset += fraglength;
|
||||
lengthleft -= fraglength;
|
||||
|
||||
goptions = 0;
|
||||
continue;
|
||||
}
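
The change in this hunk (skip one character after a failed rematch and copy it to the output) can be read in isolation as the sketch below. It assumes 8-bit code units only; skip_and_copy_char and its parameter list are hypothetical, and unlike the hunk it leaves the start offset untouched when there is no room.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper modelled on the hunk above, 8-bit code units only.
   Copies the character starting at *start_offset to out + *out_offset and
   advances both offsets. Returns 0 on success, or -1 if fewer than the
   needed code units remain in *lengthleft. Assumes *start_offset < length. */
static int skip_and_copy_char(const unsigned char *subject, size_t length,
                              size_t *start_offset, int utf,
                              unsigned char *out, size_t *out_offset,
                              size_t *lengthleft)
{
  size_t save_start = *start_offset;
  size_t next = save_start + 1;
  size_t fraglength;

  /* In UTF-8 mode, also step over continuation bytes (10xxxxxx). */
  if (utf)
    while (next < length && (subject[next] & 0xc0) == 0x80) next++;

  fraglength = next - save_start;
  if (*lengthleft < fraglength) return -1;      /* the NOROOM case */

  memcpy(out + *out_offset, subject + save_start, fraglength);
  *out_offset += fraglength;
  *lengthleft -= fraglength;
  *start_offset = next;
  return 0;
}
```

For a subject beginning with the two-byte UTF-8 sequence for U+00E1, the helper copies both bytes in UTF mode and only the first byte otherwise.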
|
||||
|
@ -181,12 +192,12 @@ do
|
|||
|
||||
subs++;
|
||||
if (rc == 0) rc = ovector_count;
|
||||
endlength = ovector[0] - start_offset;
|
||||
if (endlength >= lengthleft) goto NOROOM;
|
||||
fraglength = ovector[0] - start_offset;
|
||||
if (fraglength >= lengthleft) goto NOROOM;
|
||||
memcpy(buffer + buff_offset, subject + start_offset,
|
||||
endlength*(PCRE2_CODE_UNIT_WIDTH/8));
|
||||
buff_offset += endlength;
|
||||
lengthleft -= endlength;
|
||||
fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
|
||||
buff_offset += fraglength;
|
||||
lengthleft -= fraglength;
|
||||
|
||||
for (i = 0; i < rlength; i++)
|
||||
{
|
||||
|
@ -279,11 +290,11 @@ do
|
|||
/* Copy the rest of the subject and return the number of substitutions. */
|
||||
|
||||
rc = subs;
|
||||
endlength = length - start_offset;
|
||||
if (endlength + 1 > lengthleft) goto NOROOM;
|
||||
fraglength = length - start_offset;
|
||||
if (fraglength + 1 > lengthleft) goto NOROOM;
|
||||
memcpy(buffer + buff_offset, subject + start_offset,
|
||||
endlength*(PCRE2_CODE_UNIT_WIDTH/8));
|
||||
buff_offset += endlength;
|
||||
fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
|
||||
buff_offset += fraglength;
|
||||
buffer[buff_offset] = 0;
|
||||
*blength = buff_offset;
|
||||
|
||||
|
|
|
@ -164,6 +164,7 @@ void vms_setsymbol( char *, char *, int );
|
|||
#define DFA_WS_DIMENSION 1000 /* Size of DFA workspace */
|
||||
#define DEFAULT_OVECCOUNT 15 /* Default ovector count */
|
||||
#define JUNK_OFFSET 0xdeadbeef /* For initializing ovector */
|
||||
#define LOCALESIZE 32 /* Size of locale name */
|
||||
#define LOOPREPEAT 500000 /* Default loop count for timing */
|
||||
#define REPLACE_MODSIZE 96 /* Field for reading 8-bit replacement */
|
||||
#define VERSION_SIZE 64 /* Size of buffer for the version strings */
|
||||
|
@ -401,7 +402,7 @@ typedef struct patctl { /* Structure for pattern modifiers. */
|
|||
uint32_t jit;
|
||||
uint32_t stackguard_test;
|
||||
uint32_t tables_id;
|
||||
uint8_t locale[32];
|
||||
uint8_t locale[LOCALESIZE];
|
||||
} patctl;
|
||||
|
||||
#define MAXCPYGET 10
|
||||
|
@ -486,7 +487,7 @@ static modstruct modlist[] = {
|
|||
{ "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) },
|
||||
{ "jitstack", MOD_DAT, MOD_INT, 0, DO(jitstack) },
|
||||
{ "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) },
|
||||
{ "locale", MOD_PAT, MOD_STR, 0, PO(locale) },
|
||||
{ "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) },
|
||||
{ "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) },
|
||||
{ "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) },
|
||||
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
|
||||
|
@ -512,7 +513,7 @@ static modstruct modlist[] = {
|
|||
{ "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) },
|
||||
{ "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) },
|
||||
{ "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) },
|
||||
{ "replace", MOD_PND, MOD_STR, 0, PO(replacement) },
|
||||
{ "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) },
|
||||
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
|
||||
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
|
||||
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
|
||||
|
@ -3141,6 +3142,12 @@ for (;;)
|
|||
break;
|
||||
|
||||
case MOD_STR:
|
||||
if (len + 1 > m->value)
|
||||
{
|
||||
fprintf(outfile, "** Overlong value for '%s' (max %d code units)\n",
|
||||
m->name, m->value - 1);
|
||||
return FALSE;
|
||||
}
|
||||
memcpy(field, pp, len);
|
||||
((uint8_t *)field)[len] = 0;
|
||||
pp = ep;
|
||||
|
|
|
@ -4073,6 +4073,9 @@ a random value. /Ix
|
|||
123abc456abc789
|
||||
123abc456abc789\=g
|
||||
|
||||
/(?<=abc)(|def)/g,replace=<$0>
|
||||
123abcxyzabcdef789abcpqr
|
||||
|
||||
# End of substitute tests
|
||||
|
||||
# End of testinput2
|
||||
|
|
|
@ -1633,4 +1633,7 @@
|
|||
/ábc/utf,replace=XሴZ
|
||||
123ábc123
|
||||
|
||||
/(?<=abc)(|def)/g,utf,replace=<$0>
|
||||
123abcáyzabcdef789abcሴqr
|
||||
|
||||
# End of testinput5
|
||||
|
|
|
@ -13699,6 +13699,10 @@ Failed: error -34: bad option value
|
|||
123abc456abc789\=g
|
||||
2: 123xyz456xyz789
|
||||
|
||||
/(?<=abc)(|def)/g,replace=<$0>
|
||||
123abcxyzabcdef789abcpqr
|
||||
4: 123abc<>xyzabc<><def>789abc<>pqr
|
||||
|
||||
# End of substitute tests
|
||||
|
||||
# End of testinput2
|
||||
|
|
|
@ -4002,4 +4002,8 @@ Subject length lower bound = 1
|
|||
123ábc123
|
||||
1: 123X\x{1234}Z123
|
||||
|
||||
/(?<=abc)(|def)/g,utf,replace=<$0>
|
||||
123abcáyzabcdef789abcሴqr
|
||||
4: 123abc<>\x{e1}yzabc<><def>789abc<>\x{1234}qr
|
||||
|
||||
# End of testinput5
|
||||
|
|