File tidies, version updates, etc. for 10.21-RC1
This commit is contained in:
parent
293da188aa
commit
dffd559601
41
NEWS
|
@ -1,6 +1,47 @@
|
|||
News about PCRE2 releases
|
||||
-------------------------
|
||||
|
||||
Version 10.21 15-December-2015
|
||||
------------------------------
|
||||
|
||||
1. Many bugs have been fixed. A large number of them were provoked only by very
|
||||
strange pattern input, and were discovered by fuzzers. Some others were
|
||||
discovered by code auditing. See ChangeLog for details.
|
||||
|
||||
2. The Unicode tables have been updated to Unicode version 8.0.0.
|
||||
|
||||
3. For Perl compatibility in EBCDIC environments, ranges such as a-z in a
|
||||
class, where both values are literal letters in the same case, omit the
|
||||
non-letter EBCDIC code points within the range.
|
||||
|
||||
4. There have been a number of enhancements to the pcre2_substitute() function,
|
||||
giving more flexibility to replacement facilities. It is now also possible to
|
||||
cause the function to return the needed buffer size if the one given is too
|
||||
small.
|
||||
|
||||
5. The PCRE2_ALT_VERBNAMES option causes the "name" parts of special verbs such
|
||||
as (*THEN:name) to be processed for backslashes and to take note of
|
||||
PCRE2_EXTENDED.
|
||||
|
||||
6. PCRE2_INFO_HASBACKSLASHC makes it possible for a client to find out if a
|
||||
pattern uses \C, and --never-backslash-C makes it possible to compile a version
|
||||
of PCRE2 in which the use of \C is always forbidden.
|
||||
|
||||
7. A limit to the length of pattern that can be handled can now be set by
|
||||
calling pcre2_set_max_pattern_length().
|
||||
|
||||
8. When matching an unanchored pattern, a match can be required to begin within
|
||||
a given number of code units after the start of the subject by calling
|
||||
pcre2_set_offset_limit().
|
||||
|
||||
9. The pcre2test program has been extended to test new facilities, and it can
|
||||
now run the tests when LF on its own is not a valid newline sequence.
|
||||
|
||||
10. The RunTest script has also been updated to enable more tests to be run.
|
||||
|
||||
11. There have been some minor performance enhancements.
|
||||
|
||||
|
||||
Version 10.20 30-June-2015
|
||||
--------------------------
|
||||
|
||||
|
|
10
configure.ac
|
@ -11,16 +11,16 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
|
|||
m4_define(pcre2_major, [10])
|
||||
m4_define(pcre2_minor, [21])
|
||||
m4_define(pcre2_prerelease, [-RC1])
|
||||
m4_define(pcre2_date, [2015-07-06])
|
||||
m4_define(pcre2_date, [2015-12-15])
|
||||
|
||||
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
||||
# 50 lines of this file. Please update that if the variables above are moved.
|
||||
|
||||
# Libtool shared library interface versions (current:revision:age)
|
||||
m4_define(libpcre2_8_version, [2:0:2])
|
||||
m4_define(libpcre2_16_version, [2:0:2])
|
||||
m4_define(libpcre2_32_version, [2:0:2])
|
||||
m4_define(libpcre2_posix_version, [0:0:0])
|
||||
m4_define(libpcre2_8_version, [3:0:3])
|
||||
m4_define(libpcre2_16_version, [3:0:3])
|
||||
m4_define(libpcre2_32_version, [3:0:3])
|
||||
m4_define(libpcre2_posix_version, [0:1:0])
|
||||
|
||||
AC_PREREQ(2.57)
|
||||
AC_INIT(PCRE2, pcre2_major.pcre2_minor[]pcre2_prerelease, , pcre2)
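The libtool triples above are interface versions in the form current:revision:age, not release numbers. As a rough illustration, and assuming the conventional GNU/Linux libtool naming scheme (other platforms number their libraries differently), the sketch below derives the numeric suffix of the installed shared library from such a triple. The type and function names are hypothetical and are not part of the PCRE2 build.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type: a libtool current:revision:age triple. */
typedef struct {
    unsigned current, revision, age;
} lt_version;

/* Format the suffix conventionally used on GNU/Linux,
   "<current - age>.<age>.<revision>", into buf.
   Returns the number of characters written, as snprintf() does. */
static int lt_linux_suffix(lt_version v, char *buf, size_t size)
{
    assert(v.age <= v.current);   /* libtool requires age <= current */
    return snprintf(buf, size, "%u.%u.%u",
                    v.current - v.age, v.age, v.revision);
}
```

Under that assumption, the new 3:0:3 setting gives the suffix 0.3.0 where the old 2:0:2 gave 0.2.0. The leading number stays at 0, which is what one would expect when an interface gains functions without breaking existing callers.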
|
||||
|
|
|
@ -42,19 +42,20 @@ request are as follows:
|
|||
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
|
||||
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
|
||||
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
|
||||
0 nothing set
|
||||
1 first code unit is set
|
||||
2 start of string or after newline
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \C
|
||||
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
|
||||
exist in the pattern
|
||||
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
|
||||
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
|
||||
0 nothing set
|
||||
1 code unit is set
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
|
||||
empty string, 0 otherwise
|
||||
PCRE2_INFO_MATCHLIMIT Match limit if set,
|
||||
|
@ -62,8 +63,8 @@ request are as follows:
|
|||
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
|
||||
lookbehind assertion
|
||||
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMECOUNT Number of named subpatterns
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMETABLE Pointer to name table
|
||||
PCRE2_CONFIG_NEWLINE Code for the newline sequence:
|
||||
PCRE2_NEWLINE_CR
|
||||
|
|
|
@ -70,6 +70,9 @@ The options are:
|
|||
PCRE2_UTF was set at compile time)
|
||||
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
|
||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
|
||||
</pre>
|
||||
The function returns the number of substitutions, which may be zero if there
|
||||
were no matches. The result can be greater than one only when
|
||||
|
|
|
@ -716,8 +716,8 @@ of the following match-time parameters:
|
|||
<pre>
|
||||
A callout function
|
||||
The offset limit for matching an unanchored pattern
|
||||
The limit for calling <i>match()</i>
|
||||
The limit for calling <i>match()</i> recursively
|
||||
The limit for calling <b>match()</b> (see below)
|
||||
The limit for calling <b>match()</b> recursively
|
||||
</pre>
|
||||
A match context is also required if you are using custom memory management.
|
||||
If none of these apply, just pass NULL as the context argument of
|
||||
|
@ -771,7 +771,9 @@ PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
|
|||
<P>
|
||||
The offset limit facility can be used to track progress when searching large
|
||||
subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
|
||||
start within the first line of the subject.
|
||||
start within the first line of the subject. If this is set with an offset
|
||||
limit, a match must occur in the first line and also within the offset limit.
|
||||
In other words, whichever limit comes first is used.
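The "whichever limit comes first" rule can be pictured with a small model. This is only a sketch of the rule as described above, not PCRE2 code: it assumes an 8-bit subject in which LF is the only newline, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model only: the greatest offset at which a match may start when both
   PCRE2_FIRSTLINE and an offset limit are in force. A match may start
   at or before the first newline, and at or before offset_limit. */
static size_t last_start_offset(const char *subject, size_t length,
                                size_t offset_limit)
{
    assert(subject != NULL);
    const char *nl = memchr(subject, '\n', length);
    /* With no newline in the subject, the first line is the whole subject. */
    size_t firstline_limit = (nl != NULL) ? (size_t)(nl - subject) : length;
    return (firstline_limit < offset_limit) ? firstline_limit : offset_limit;
}
```

For the subject "abc\ndef" the first-line limit is 3, so an offset limit of 10 has no further effect, while an offset limit of 1 is the tighter bound.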
|
||||
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
|
@ -1212,7 +1214,9 @@ built.
|
|||
If this option is set, an unanchored pattern is required to match before or at
|
||||
the first newline in the subject string, though the matched text may continue
|
||||
over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
|
||||
general limiting facility.
|
||||
general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a
|
||||
match must occur in the first line and also within the offset limit. In other
|
||||
words, whichever limit comes first is used.
|
||||
<pre>
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
</pre>
|
||||
|
@ -1563,11 +1567,10 @@ are as follows:
|
|||
Return a copy of the pattern's options. The third argument should point to a
|
||||
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
|
||||
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
|
||||
the compile options as modified by any top-level option settings at the start
|
||||
of the pattern itself. In other words, they are the options that will be in
|
||||
force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is
|
||||
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
||||
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
||||
the compile options as modified by any top-level option settings such as (*UTF)
|
||||
at the start of the pattern itself. For example, if the pattern /(*UTF)abc/ is
|
||||
compiled with the PCRE2_EXTENDED option, the result is PCRE2_EXTENDED and
|
||||
PCRE2_UTF.
|
||||
</P>
|
||||
<P>
|
||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||
|
@ -1609,18 +1612,27 @@ matches only CR, LF, or CRLF.
|
|||
<pre>
|
||||
PCRE2_INFO_CAPTURECOUNT
|
||||
</pre>
|
||||
Return the number of capturing subpatterns in the pattern. The third argument
|
||||
should point to an <b>uint32_t</b> variable.
|
||||
Return the highest capturing subpattern number in the pattern. In patterns
|
||||
where (?| is not used, this is also the total number of capturing subpatterns.
|
||||
The third argument should point to an <b>uint32_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
</pre>
|
||||
In the absence of a single first code unit for a non-anchored pattern,
|
||||
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
|
||||
values for the first code unit in any match. For example, a pattern that starts
|
||||
with [abc] results in a table with three bits set. When code unit values
|
||||
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
||||
value 255 or above". If such a table was constructed, a pointer to it is
|
||||
returned. Otherwise NULL is returned. The third argument should point to an
|
||||
<b>const uint8_t *</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTCODETYPE
|
||||
</pre>
|
||||
Return information about the first code unit of any matched string, for a
|
||||
non-anchored pattern. The third argument should point to an <b>uint32_t</b>
|
||||
variable.
|
||||
</P>
|
||||
<P>
|
||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
variable. If there is a fixed first value, for example, the letter "c" from a
|
||||
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||
it is known that a match can occur only at the start of the subject or
|
||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||
|
@ -1635,16 +1647,10 @@ value is always less than 256. In the 16-bit library the value can be up to
|
|||
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
|
||||
and up to 0xffffffff when not using UTF-32 mode.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
PCRE2_INFO_HASBACKSLASHC
|
||||
</pre>
|
||||
In the absence of a single first code unit for a non-anchored pattern,
|
||||
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
|
||||
values for the first code unit in any match. For example, a pattern that starts
|
||||
with [abc] results in a table with three bits set. When code unit values
|
||||
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
||||
value 255 or above". If such a table was constructed, a pointer to it is
|
||||
returned. Otherwise NULL is returned. The third argument should point to an
|
||||
<b>const uint8_t *</b> variable.
|
||||
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
|
||||
argument should point to an <b>uint32_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_HASCRORLF
|
||||
</pre>
|
||||
|
@ -1670,13 +1676,10 @@ Returns 1 if there is a rightmost literal code unit that must exist in any
|
|||
matched string, other than at its start. The third argument should point to an
|
||||
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
|
||||
returned, the code unit value itself can be retrieved using
|
||||
PCRE2_INFO_LASTCODEUNIT.
|
||||
</P>
|
||||
<P>
|
||||
For anchored patterns, a last literal value is recorded only if it follows
|
||||
something of variable length. For example, for the pattern /^a\d+z\d+/ the
|
||||
returned value is 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for
|
||||
/^a\dz\d/ the returned value is 0.
|
||||
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
||||
recorded only if it follows something of variable length. For example, for the
|
||||
pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
|
||||
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
|
||||
<pre>
|
||||
PCRE2_INFO_LASTCODEUNIT
|
||||
</pre>
|
||||
|
@ -1687,8 +1690,11 @@ value, 0 is returned.
|
|||
<pre>
|
||||
PCRE2_INFO_MATCHEMPTY
|
||||
</pre>
|
||||
Return 1 if the pattern can match an empty string, otherwise 0. The third
|
||||
argument should point to an <b>uint32_t</b> variable.
|
||||
Return 1 if the pattern might match an empty string, otherwise 0. The third
|
||||
argument should point to an <b>uint32_t</b> variable. When a pattern contains
|
||||
recursive subroutine calls it is not always possible to determine whether or
|
||||
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
|
||||
in such cases.
|
||||
<pre>
|
||||
PCRE2_INFO_MATCHLIMIT
|
||||
</pre>
|
||||
|
@ -2142,8 +2148,13 @@ documentation.
|
|||
When PCRE2 is built, a default newline convention is set; this is usually the
|
||||
standard convention for the operating system. The default can be overridden in
|
||||
a
|
||||
<a href="#compilecontext">compile context.</a>
|
||||
During matching, the newline choice affects the behaviour of the dot,
|
||||
<a href="#compilecontext">compile context</a>
|
||||
by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a
|
||||
pattern string with, for example, (*CRLF), as described in the
|
||||
<a href="pcre2pattern.html#newlines">section on newline conventions</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page. During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||
starting position is advanced after a match failure for an unanchored pattern.
|
||||
</P>
|
||||
|
@ -2191,19 +2202,20 @@ function can be used to find out how many capturing subpatterns there are in a
|
|||
compiled pattern.
|
||||
</P>
|
||||
<P>
|
||||
A successful match returns the overall matched string and any captured
|
||||
substrings to the caller via a vector of PCRE2_SIZE values. This is called the
|
||||
<b>ovector</b>, and is contained within the
|
||||
<a href="#matchdatablock">match data block.</a>
|
||||
You can obtain direct access to the ovector by calling
|
||||
<b>pcre2_get_ovector_pointer()</b> to find its address, and
|
||||
<b>pcre2_get_ovector_count()</b> to find the number of pairs of values it
|
||||
contains. Alternatively, you can use the auxiliary functions for accessing
|
||||
captured substrings
|
||||
You can use auxiliary functions for accessing captured substrings
|
||||
<a href="#extractbynumber">by number</a>
|
||||
or
|
||||
<a href="#extractbyname">by name</a>
|
||||
(see below).
|
||||
<a href="#extractbyname">by name,</a>
|
||||
as described in sections below.
|
||||
</P>
|
||||
<P>
|
||||
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
|
||||
called the <b>ovector</b>, which contains the offsets of captured strings. It is
|
||||
part of the
|
||||
<a href="#matchdatablock">match data block.</a>
|
||||
The function <b>pcre2_get_ovector_pointer()</b> returns the address of the
|
||||
ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of
|
||||
values it contains.
|
||||
</P>
|
||||
<P>
|
||||
Within the ovector, the first in each pair of values is set to the offset of
|
||||
|
@ -2292,7 +2304,13 @@ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
|
|||
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
||||
<b>pcre2_get_mark()</b> can be called. It returns a pointer to the
|
||||
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
||||
returned. After a successful match, the (*MARK) name that is returned is the
|
||||
returned. The length of the (*MARK) name (excluding the terminating zero) is
|
||||
stored in the code unit that precedes the name. You should use this instead of
|
||||
relying on the terminating zero if the (*MARK) name might contain a binary
|
||||
zero.
|
||||
</P>
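As a sketch of what reading that preceding code unit looks like in the 8-bit library, consider the fragment below. The storage array is a hand-built stand-in for the memory that <b>pcre2_get_mark()</b> points into, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for stored (*MARK) data: a length code unit, then the name
   "a\0b" (which contains a binary zero), then the terminating zero. */
static const uint8_t example_storage[] = { 3, 'a', 0, 'b', 0 };

/* What pcre2_get_mark() would return: a pointer to the name itself. */
static const uint8_t *example_mark = example_storage + 1;

/* Length of a (*MARK) name, taken from the code unit before the name
   rather than by scanning for the terminating zero. */
static size_t mark_length8(const uint8_t *mark)
{
    assert(mark != NULL);
    return mark[-1];
}
```

Here mark_length8(example_mark) gives 3, whereas strlen() would stop at the embedded zero and report 1.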
|
||||
<P>
|
||||
After a successful match, the (*MARK) name that is returned is the
|
||||
last one encountered on the matching path through the pattern. After a "no
|
||||
match" or a partial match, the last encountered (*MARK) name is returned. For
|
||||
example, consider this pattern:
|
||||
|
@ -2313,7 +2331,7 @@ escape sequence. After a partial match, however, this value is always the same
|
|||
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
|
||||
</P>
|
||||
<P>
|
||||
After a UTF check failure, \fBpcre2_get_startchar()\fB can be used to obtain
|
||||
After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
|
||||
the code unit offset of the invalid UTF character. Details are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
|
@ -2650,12 +2668,21 @@ allocate memory for the compiled code.
|
|||
</P>
|
||||
<P>
|
||||
The <i>outlengthptr</i> argument must point to a variable that contains the
|
||||
length, in code units, of the output buffer. If the function is successful,
|
||||
the value is updated to contain the length of the new string, excluding the
|
||||
trailing zero that is automatically added. If the function is not successful,
|
||||
the value is set to PCRE2_UNSET for general errors (such as output buffer too
|
||||
small). For syntax errors in the replacement string, the value is set to the
|
||||
offset in the replacement string where the error was detected.
|
||||
length, in code units, of the output buffer. If the function is successful, the
|
||||
value is updated to contain the length of the new string, excluding the
|
||||
trailing zero that is automatically added.
|
||||
</P>
|
||||
<P>
|
||||
If the function is not successful, the value set via <i>outlengthptr</i> depends
|
||||
on the type of error. For syntax errors in the replacement string, the value is
|
||||
the offset in the replacement string where the error was detected. For other
|
||||
errors, the value is PCRE2_UNSET by default. This includes the case of the
|
||||
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
|
||||
(see below), in which case the value is the minimum length needed, including
|
||||
space for the trailing zero. Note that in order to compute the required length,
|
||||
<b>pcre2_substitute()</b> has to simulate all the matching and copying, instead
|
||||
of giving an error return as soon as the buffer overflows. Note also that the
|
||||
length is in code units, not bytes.
|
||||
</P>
|
||||
<P>
|
||||
In the replacement string, which is interpreted as a UTF string in UTF mode,
|
||||
|
@ -2682,15 +2709,53 @@ simultaneous substitutions, as this <b>pcre2test</b> example shows:
|
|||
apple lemon
|
||||
2: pear orange
|
||||
</pre>
|
||||
There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes the
|
||||
function to iterate over the subject string, replacing every matching
|
||||
substring. If this is not set, only the first matching substring is replaced.
|
||||
As well as the usual options for <b>pcre2_match()</b>, a number of additional
|
||||
options can be set in the <i>options</i> argument.
|
||||
</P>
|
||||
<P>
|
||||
A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra processing
|
||||
to be applied to the replacement string. Without this option, only the dollar
|
||||
character is special, and only the group insertion forms listed above are
|
||||
valid. When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
|
||||
replacing every matching substring. If this is not set, only the first matching
|
||||
substring is replaced. If any matched substring has zero length, after the
|
||||
substitution has happened, an attempt to find a non-empty match at the same
|
||||
position is performed. If this is not successful, the current position is
|
||||
advanced by one character except when CRLF is a valid newline sequence and the
|
||||
next two characters are CR, LF. In this case, the current position is advanced
|
||||
by two characters.
|
||||
</P>
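The advance rule for empty matches can be sketched as follows. This is a model of the rule as stated, assuming a non-UTF 8-bit subject in which one character is one code unit; the function name and the crlf_is_newline flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Model only: the next position to try after an empty match at pos when
   no non-empty match was found there. Advances by one character, or by
   two when CRLF is a valid newline and the next two characters are CR, LF. */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t pos, int crlf_is_newline)
{
    assert(subject != NULL && pos < length);
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over CRLF as a unit */
    return pos + 1;
}
```

Stepping over CR and LF together avoids trying a match in the middle of a newline sequence.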
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
|
||||
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
|
||||
this option is set, however, <b>pcre2_substitute()</b> continues to go through
|
||||
the motions of matching and substituting (without, of course, writing anything)
|
||||
in order to compute the size of buffer that is needed. This value is passed
|
||||
back via the <i>outlengthptr</i> variable, with the result of the function still
|
||||
being PCRE2_ERROR_NOMEMORY.
|
||||
</P>
|
||||
<P>
|
||||
Passing a buffer size of zero is a permitted way of finding out how much memory
|
||||
is needed for a given substitution. However, this does mean that the entire
|
||||
operation is carried out twice. Depending on the application, it may be more
|
||||
efficient to allocate a large buffer and free the excess afterwards, instead of
|
||||
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups that do
|
||||
not appear in the pattern to be treated as unset groups. This option should be
|
||||
used with care, because it means that a typo in a group name or number no
|
||||
longer causes the PCRE2_ERROR_NOSUBSTRING error.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown
|
||||
groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
|
||||
strings when inserted as described above. If this option is not set, an attempt
|
||||
to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
|
||||
not influence the extended substitution syntax described below.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
||||
replacement string. Without this option, only the dollar character is special,
|
||||
and only the group insertion forms listed above are valid. When
|
||||
PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||
</P>
|
||||
<P>
|
||||
Firstly, backslash in a replacement string is interpreted as an escape
|
||||
|
@ -2740,22 +2805,46 @@ string remains in force afterwards, as shown in this <b>pcre2test</b> example:
|
|||
somebody
|
||||
1: HELLO
|
||||
</pre>
|
||||
If successful, the function returns the number of replacements that were made.
|
||||
This may be zero if no matches were found, and is never greater than 1 unless
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
||||
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
|
||||
groups in the extended syntax forms to be treated as unset.
|
||||
</P>
|
||||
<P>
|
||||
If successful, <b>pcre2_substitute()</b> returns the number of replacements that
|
||||
were made. This may be zero if no matches were found, and is never greater than
|
||||
1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
</P>
|
||||
<P>
|
||||
In the event of an error, a negative error code is returned. Except for
|
||||
PCRE2_ERROR_NOMATCH (which is never returned), errors from <b>pcre2_match()</b>
|
||||
are passed straight back. PCRE2_ERROR_NOMEMORY is returned if the output buffer
|
||||
is not big enough. PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax
|
||||
errors in the replacement string, with more particular errors being
|
||||
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
|
||||
PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket not found),
|
||||
PCRE2_BADSUBSTITUTION (syntax error in extended group substitution), and
|
||||
PCRE2_BADSUBPATTERN (the pattern match ended before it started). As for all
|
||||
PCRE2 errors, a text message that describes the error can be obtained by
|
||||
calling <b>pcre2_get_error_message()</b>.
|
||||
are passed straight back.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
|
||||
unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
|
||||
unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
|
||||
(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
|
||||
needed is returned via <i>outlengthptr</i>. Note that this does not happen by
|
||||
default.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
|
||||
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
|
||||
(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket
|
||||
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
|
||||
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it
|
||||
started, which can happen if \K is used in an assertion).
|
||||
</P>
|
||||
<P>
|
||||
As for all PCRE2 errors, a text message that describes the error can be
|
||||
obtained by calling <b>pcre2_get_error_message()</b>.
|
||||
</P>
|
||||
<br><a name="SEC35" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
|
||||
<P>
|
||||
|
@ -2796,11 +2885,11 @@ function returns the length of each entry in code units. In both cases,
|
|||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||
</P>
|
||||
<P>
|
||||
The format of the name table is described above in the section entitled
|
||||
<i>Information about a pattern</i>
|
||||
<a href="#infoaboutpattern">above.</a>
|
||||
Given all the relevant entries for the name, you can extract each of their
|
||||
numbers, and hence the captured data.
|
||||
The format of the name table is described
|
||||
<a href="#infoaboutpattern">above</a>
|
||||
in the section entitled <i>Information about a pattern</i>. Given all the
|
||||
relevant entries for the name, you can extract each of their numbers, and hence
|
||||
the captured data.
|
||||
</P>
|
||||
<br><a name="SEC36" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
|
||||
<P>
|
||||
|
@ -3032,7 +3121,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC40" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 05 November 2015
|
||||
Last updated: 16 December 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -86,6 +86,13 @@ results. The returned value from <b>pcre2_jit_compile()</b> is zero on success,
|
|||
or a negative error code.
|
||||
</P>
|
||||
<P>
|
||||
There is a limit to the size of pattern that JIT supports, imposed by the size
|
||||
of machine stack that it uses. The exact rules are not documented because they
|
||||
may change at any time, in particular, when new optimizations are introduced.
|
||||
If a pattern is too big, a call to <b>pcre2_jit_compile()</b> returns
|
||||
PCRE2_ERROR_NOMEMORY.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
|
||||
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
|
||||
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
|
||||
|
@ -425,7 +432,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 July 2015
|
||||
Last updated: 14 November 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -669,8 +669,8 @@ This is an example of an "atomic group", details of which are given
|
|||
This particular group matches either the two-character sequence CR followed by
|
||||
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
||||
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
||||
line, U+0085). The two-character sequence is treated as a single unit that
|
||||
cannot be split.
|
||||
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||
treated as a single unit that cannot be split.
|
||||
</P>
|
||||
<P>
|
||||
In other modes, two additional characters whose codepoints are greater than 255
|
||||
|
@ -1186,6 +1186,16 @@ when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
|
|||
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
||||
</P>
|
||||
<P>
|
||||
When the newline convention (see
|
||||
<a href="#newlines">"Newline conventions"</a>
|
||||
below) recognizes the two-character sequence CRLF as a newline, this is
|
||||
preferred, even if the single characters CR and LF are also recognized as
|
||||
newlines. For example, if the newline convention is "any", a multiline mode
|
||||
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
|
||||
CR, even though CR on its own is a valid newline. (It also matches at the very
|
||||
start of the string, of course.)
|
||||
</P>
|
||||
<P>
|
||||
Note that the sequences \A, \Z, and \z can be used to match the start and
|
||||
end of the subject in both modes, and if all branches of a pattern start with
|
||||
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
|
||||
|
@ -1672,6 +1682,10 @@ first one in the pattern with the given number. The following pattern matches
|
|||
<pre>
|
||||
/(?|(abc)|(def))(?1)/
|
||||
</pre>
|
||||
A relative reference such as (?-1) is no different: it is just a convenient way
|
||||
of computing an absolute group number.
|
||||
</P>
|
||||
<P>
|
||||
If a
|
||||
<a href="#conditions">condition test</a>
|
||||
for a subpattern's having matched refers to a non-unique number, the test is
|
||||
|
@ -2626,6 +2640,21 @@ parentheses preceding the recursion. In other words, a negative number counts
|
|||
capturing parentheses leftwards from the point at which it is encountered.
|
||||
</P>
|
||||
<P>
|
||||
Be aware, however, that if
|
||||
<a href="#dupsubpatternnumber">duplicate subpattern numbers</a>
|
||||
are in use, relative references refer to the earliest subpattern with the
|
||||
appropriate number. Consider, for example:
|
||||
<pre>
|
||||
(?|(a)|(b)) (c) (?-2)
|
||||
</pre>
|
||||
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
|
||||
is number 2. When the reference (?-2) is encountered, the second most recently
|
||||
opened set of parentheses has the number 1, but it is the first such group (the (a)
|
||||
group) to which the recursion refers. This would be the same if an absolute
|
||||
reference (?1) was used. In other words, relative references are just a
|
||||
shorthand for computing a group number.
|
||||
</P>
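The arithmetic behind a relative reference can be sketched as below. The function name is hypothetical, and group_count stands for the capturing group count at the point where the reference appears (2 in the example above, after (c) has been opened).

```c
#include <assert.h>

/* Convert a relative reference such as (?-2) or (?+2) to an absolute
   group number. Negative values count leftwards from the reference;
   positive values count groups that have not been opened yet. */
static int absolute_group_number(int group_count, int relative)
{
    assert(relative != 0);
    if (relative < 0) {
        assert(-relative <= group_count);   /* must refer to an existing group */
        return group_count + relative + 1;
    }
    return group_count + relative;
}
```

With a count of 2, (?-2) resolves to group 1, which is why it behaves exactly like (?1) and refers to the (a) group.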
|
||||
<P>
|
||||
It is also possible to refer to subsequently opened parentheses, by writing
|
||||
references such as (?+2). However, these cannot be recursive because the
|
||||
reference is not inside the parentheses that are referenced. They are always
|
||||
|
@ -3359,7 +3388,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 01 November 2015
|
||||
Last updated: 13 November 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -235,7 +235,8 @@ to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
|
|||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||
how it is matched.
|
||||
how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
|
||||
mutually exclusive; the error REG_INVARG is returned.
|
||||
</P>
|
||||
<P>
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||
|
@ -289,7 +290,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 30 October 2015
|
||||
Last updated: 29 November 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -892,14 +892,18 @@ are applied to every subject line that is processed with that pattern. They may
|
|||
not appear in <b>#pattern</b> commands. These modifiers do not affect the
|
||||
compilation process.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
</pre>
|
||||
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
||||
defaults, set them in a <b>#subject</b> command.
|
||||
|
@ -964,33 +968,38 @@ information. Some of them may also be specified on a pattern line (see above),
|
|||
in which case they apply to every subject line that is matched against that
|
||||
pattern.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use <b>pcre2_dfa_match()</b>
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
null_context match with a NULL context
|
||||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use <b>pcre2_dfa_match()</b>
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
null_context match with a NULL context
|
||||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
</pre>
|
||||
The effects of these modifiers are described in the following sections.
|
||||
</P>
|
||||
|
@ -1129,19 +1138,34 @@ Testing the substitution function
|
|||
</b><br>
|
||||
<P>
|
||||
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
|
||||
called instead of one of the matching functions. Unlike subject strings,
|
||||
<b>pcre2test</b> does not process replacement strings for escape sequences. In
|
||||
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
|
||||
If so, it is correctly converted to a UTF string of the appropriate code unit
|
||||
width. If it is not a valid UTF-8 string, the individual code units are copied
|
||||
directly. This provides a means of passing an invalid UTF-8 string for testing
|
||||
purposes.
|
||||
called instead of one of the matching functions. Note that replacement strings
|
||||
cannot contain commas, because a comma signifies the end of a modifier. This is
|
||||
not thought to be an issue in a test program.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
<b>pcre2_substitute()</b>. After a successful substitution, the modified string
|
||||
is output, preceded by the number of replacements. This may be zero if there
|
||||
were no matches. Here is a simple example of a substitution test:
|
||||
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to see if it
|
||||
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
|
||||
the appropriate code unit width. If it is not a valid UTF-8 string, the
|
||||
individual code units are copied directly. This provides a means of passing an
|
||||
invalid UTF-8 string for testing purposes.
|
||||
</P>
|
||||
<P>
|
||||
The following modifiers set options (in addition to the normal match options)
|
||||
for <b>pcre2_substitute()</b>:
|
||||
<pre>
|
||||
global PCRE2_SUBSTITUTE_GLOBAL
|
||||
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
After a successful substitution, the modified string is output, preceded by the
|
||||
number of replacements. This may be zero if there were no matches. Here is a
|
||||
simple example of a substitution test:
|
||||
<pre>
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
|
@ -1149,12 +1173,12 @@ were no matches. Here is a simple example of a substitution test:
|
|||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
</pre>
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
||||
buffer overflow, if the replacement string starts with a number in square
|
||||
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
|
||||
output buffer, with the replacement string starting at the next character. Here
|
||||
is an example that tests the edge case:
|
||||
Subject and replacement strings should be kept relatively short (fewer than 256
|
||||
characters) for substitution tests, as fixed-size buffers are used. To make it
|
||||
easy to test for buffer overflow, if the replacement string starts with a
|
||||
number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
|
||||
the size of the output buffer, with the replacement string starting at the next
|
||||
character. Here is an example that tests the edge case:
|
||||
<pre>
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
|
@ -1162,6 +1186,19 @@ is an example that tests the edge case:
|
|||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
</pre>
|
||||
The default action of <b>pcre2_substitute()</b> is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
|
||||
<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
|
||||
to go through the motions of matching and substituting, in order to compute the
|
||||
size of buffer that is required. When this happens, <b>pcre2test</b> shows the
|
||||
required buffer length (which includes space for the trailing zero) as part of
|
||||
the error message. For example:
|
||||
<pre>
|
||||
/abc/substitute_overflow_length
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory: 10 code units are needed
|
||||
</pre>
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||
matching provokes an error return ("bad option value") from
|
||||
<b>pcre2_substitute()</b>.
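The bracketed size prefix and the buffer arithmetic described in this hunk can be sketched in C. This is only an illustration: `split_size_prefix` and `units_needed` are hypothetical helpers written for this note, not functions from pcre2test or the PCRE2 API, and the default size of 256 is an assumption based on the "fewer than 256 characters" guidance above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if `replacement` starts with "[digits]", store the
   number in *size_out and return a pointer to the text after the bracket.
   Otherwise store `default_size` and return `replacement` unchanged. */
static const char *split_size_prefix(const char *replacement,
                                     size_t default_size, size_t *size_out)
{
    char *end;
    unsigned long n;

    *size_out = default_size;
    if (replacement[0] != '[' || !isdigit((unsigned char)replacement[1]))
        return replacement;

    n = strtoul(replacement + 1, &end, 10);
    if (*end != ']')
        return replacement;          /* not a well-formed "[digits]" prefix */

    *size_out = (size_t)n;
    return end + 1;                  /* replacement text starts here */
}

/* Hypothetical helper: code units needed to hold `result`, counting the
   trailing zero, which is how the documented overflow length is reported. */
static size_t units_needed(const char *result)
{
    return strlen(result) + 1;
}
```

For the documented example, `[9]XYZ` gives a buffer size of 9 and the replacement `XYZ`. The substituted text `123XYZ123` needs 10 code units including the trailing zero, which is consistent with `[10]` succeeding and `[9]` failing with "10 code units are needed".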
|
||||
|
@ -1623,7 +1660,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 05 November 2015
|
||||
Last updated: 12 December 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
1761
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "12 December 2015" "PCRE2 10.21"
|
||||
.TH PCRE2API 3 "16 December 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -678,8 +678,8 @@ of the following match-time parameters:
|
|||
.sp
|
||||
A callout function
|
||||
The offset limit for matching an unanchored pattern
|
||||
The limit for calling \fImatch()\fP
|
||||
The limit for calling \fImatch()\fP recursively
|
||||
The limit for calling \fBmatch()\fP (see below)
|
||||
The limit for calling \fBmatch()\fP recursively
|
||||
.sp
|
||||
A match context is also required if you are using custom memory management.
|
||||
If none of these apply, just pass NULL as the context argument of
|
||||
|
@ -1611,8 +1611,9 @@ matches only CR, LF, or CRLF.
|
|||
.sp
|
||||
PCRE2_INFO_CAPTURECOUNT
|
||||
.sp
|
||||
Return the number of capturing subpatterns in the pattern. The third argument
|
||||
should point to an \fBuint32_t\fP variable.
|
||||
Return the highest capturing subpattern number in the pattern. In patterns
|
||||
where (?| is not used, this is also the total number of capturing subpatterns.
|
||||
The third argument should point to an \fBuint32_t\fP variable.
|
||||
.sp
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
.sp
|
||||
|
@ -1629,10 +1630,8 @@ returned. Otherwise NULL is returned. The third argument should point to an
|
|||
.sp
|
||||
Return information about the first code unit of any matched string, for a
|
||||
non-anchored pattern. The third argument should point to an \fBuint32_t\fP
|
||||
variable.
|
||||
.P
|
||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
variable. If there is a fixed first value, for example, the letter "c" from a
|
||||
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||
it is known that a match can occur only at the start of the subject or
|
||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||
|
@ -1676,12 +1675,10 @@ Returns 1 if there is a rightmost literal code unit that must exist in any
|
|||
matched string, other than at its start. The third argument should point to an
|
||||
\fBuint32_t\fP variable. If there is no such value, 0 is returned. When 1 is
|
||||
returned, the code unit value itself can be retrieved using
|
||||
PCRE2_INFO_LASTCODEUNIT.
|
||||
.P
|
||||
For anchored patterns, a last literal value is recorded only if it follows
|
||||
something of variable length. For example, for the pattern /^a\ed+z\ed+/ the
|
||||
returned value is 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for
|
||||
/^a\edz\ed/ the returned value is 0.
|
||||
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
||||
recorded only if it follows something of variable length. For example, for the
|
||||
pattern /^a\ed+z\ed+/ the returned value is 1 (with "z" returned from
|
||||
PCRE2_INFO_LASTCODEUNIT), but for /^a\edz\ed/ the returned value is 0.
|
||||
.sp
|
||||
PCRE2_INFO_LASTCODEUNIT
|
||||
.sp
|
||||
|
@ -2181,9 +2178,19 @@ standard convention for the operating system. The default can be overridden in
|
|||
a
|
||||
.\" HTML <a href="#compilecontext">
|
||||
.\" </a>
|
||||
compile context.
|
||||
compile context
|
||||
.\"
|
||||
During matching, the newline choice affects the behaviour of the dot,
|
||||
by calling \fBpcre2_set_newline()\fP. It can also be overridden by starting a
|
||||
pattern string with, for example, (*CRLF), as described in the
|
||||
.\" HTML <a href="pcre2pattern.html#newlines">
|
||||
.\" </a>
|
||||
section on newline conventions
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
page. During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||
starting position is advanced after a match failure for an unanchored pattern.
|
||||
.P
|
||||
|
@ -2229,18 +2236,7 @@ that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
|
|||
function can be used to find out how many capturing subpatterns there are in a
|
||||
compiled pattern.
|
||||
.P
|
||||
A successful match returns the overall matched string and any captured
|
||||
substrings to the caller via a vector of PCRE2_SIZE values. This is called the
|
||||
\fBovector\fP, and is contained within the
|
||||
.\" HTML <a href="#matchdatablock">
|
||||
.\" </a>
|
||||
match data block.
|
||||
.\"
|
||||
You can obtain direct access to the ovector by calling
|
||||
\fBpcre2_get_ovector_pointer()\fP to find its address, and
|
||||
\fBpcre2_get_ovector_count()\fP to find the number of pairs of values it
|
||||
contains. Alternatively, you can use the auxiliary functions for accessing
|
||||
captured substrings
|
||||
You can use auxiliary functions for accessing captured substrings
|
||||
.\" HTML <a href="#extractbynumber">
|
||||
.\" </a>
|
||||
by number
|
||||
|
@ -2248,9 +2244,20 @@ by number
|
|||
or
|
||||
.\" HTML <a href="#extractbyname">
|
||||
.\" </a>
|
||||
by name
|
||||
by name,
|
||||
.\"
|
||||
(see below).
|
||||
as described in sections below.
|
||||
.P
|
||||
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
|
||||
called the \fBovector\fP, which contains the offsets of captured strings. It is
|
||||
part of the
|
||||
.\" HTML <a href="#matchdatablock">
|
||||
.\" </a>
|
||||
match data block.
|
||||
.\"
|
||||
The function \fBpcre2_get_ovector_pointer()\fP returns the address of the
|
||||
ovector, and \fBpcre2_get_ovector_count()\fP returns the number of pairs of
|
||||
values it contains.
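As a rough illustration of the pair layout, the sketch below copies one captured substring out of a subject using an ovector-style array. It is self-contained and does not call PCRE2: `copy_capture` and `SKETCH_UNSET` are hypothetical names, the array stands in for what \fBpcre2_get_ovector_pointer()\fP would return, and `pair_count` for the value from \fBpcre2_get_ovector_count()\fP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the "unset" marker; an assumption for this sketch. */
#define SKETCH_UNSET ((size_t)-1)

/* Copy capture `n` into `out` as a zero-terminated string.
   ovector[2*n] is the offset of the first code unit of the substring and
   ovector[2*n+1] is the offset just past its end.
   Returns the substring length, or SKETCH_UNSET if `n` is out of range,
   the group is unset, or `out` is too small. */
static size_t copy_capture(const char *subject, const size_t *ovector,
                           size_t pair_count, size_t n,
                           char *out, size_t out_size)
{
    size_t start, end, len;

    if (n >= pair_count)
        return SKETCH_UNSET;

    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET || start > end)
        return SKETCH_UNSET;

    len = end - start;
    if (len + 1 > out_size)          /* need room for the trailing zero */
        return SKETCH_UNSET;

    memcpy(out, subject + start, len);
    out[len] = '\0';
    return len;
}
```

With the subject `123abc123` and a first pair of 3 and 6, capture 0 is `abc`. In real code the auxiliary substring functions mentioned above are usually the more convenient route.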
|
||||
.P
|
||||
Within the ovector, the first in each pair of values is set to the offset of
|
||||
the first code unit of a substring, and the second is set to the offset of the
|
||||
|
@ -2334,7 +2341,12 @@ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
|
|||
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
||||
\fBpcre2_get_mark()\fP can be called. It returns a pointer to the
|
||||
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
||||
returned. After a successful match, the (*MARK) name that is returned is the
|
||||
returned. The length of the (*MARK) name (excluding the terminating zero) is
|
||||
stored in the code unit that precedes the name. You should use this instead of
|
||||
relying on the terminating zero if the (*MARK) name might contain a binary
|
||||
zero.
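The point about binary zeros can be shown with a small self-contained sketch. It simulates the stored layout (a length code unit immediately before the name) in a local array rather than calling \fBpcre2_get_mark()\fP; `code_unit` and `mark_name_length` are hypothetical names, and an 8-bit code unit width is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char code_unit;     /* stand-in for an 8-bit code unit */

/* Given a pointer to the first code unit of a (*MARK) name, return the
   length kept in the code unit just before it. The caller must pass a
   pointer that really has such a preceding unit. */
static size_t mark_name_length(const code_unit *name)
{
    return (size_t)name[-1];
}
```

For a simulated name consisting of `a`, a binary zero, and `b`, the stored length is 3, whereas `strlen()` would stop at the embedded zero and report 1.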
|
||||
.P
|
||||
After a successful match, the (*MARK) name that is returned is the
|
||||
last one encountered on the matching path through the pattern. After a "no
|
||||
match" or a partial match, the last encountered (*MARK) name is returned. For
|
||||
example, consider this pattern:
|
||||
|
@ -2353,7 +2365,7 @@ different to the value of \fIovector[0]\fP if the pattern contains the \eK
|
|||
escape sequence. After a partial match, however, this value is always the same
|
||||
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
||||
.P
|
||||
After a UTF check failure, \fBpcre2_get_startchar()\fB can be used to obtain
|
||||
After a UTF check failure, \fBpcre2_get_startchar()\fP can be used to obtain
|
||||
the code unit offset of the invalid UTF character. Details are given in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
|
@ -2901,14 +2913,14 @@ first and last entries in the name-to-number table for the given name, and the
|
|||
function returns the length of each entry in code units. In both cases,
|
||||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||
.P
|
||||
The format of the name table is described above in the section entitled
|
||||
\fIInformation about a pattern\fP
|
||||
The format of the name table is described
|
||||
.\" HTML <a href="#infoaboutpattern">
|
||||
.\" </a>
|
||||
above.
|
||||
above
|
||||
.\"
|
||||
Given all the relevant entries for the name, you can extract each of their
|
||||
numbers, and hence the captured data.
|
||||
in the section entitled \fIInformation about a pattern\fP. Given all the
|
||||
relevant entries for the name, you can extract each of their numbers, and hence
|
||||
the captured data.
|
||||
.
|
||||
.
|
||||
.SH "FINDING ALL POSSIBLE MATCHES AT ONE POSITION"
|
||||
|
@ -3154,6 +3166,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 21 December 2015
|
||||
Last updated: 16 December 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -797,14 +797,18 @@ PATTERN MODIFIERS
|
|||
with that pattern. They may not appear in #pattern commands. These mod-
|
||||
ifiers do not affect the compilation process.
|
||||
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
These modifiers may not appear in a #pattern command. If you want them
|
||||
as defaults, set them in a #subject command.
|
||||
|
@ -860,33 +864,38 @@ SUBJECT MODIFIERS
|
|||
line (see above), in which case they apply to every subject line that
|
||||
is matched against that pattern.
|
||||
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use pcre2_dfa_match()
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
null_context match with a NULL context
|
||||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text (non-JIT only)
|
||||
altglobal alternative global matching
|
||||
callout_capture show captures at callout time
|
||||
callout_data=<n> set a value to pass via callouts
|
||||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
dfa use pcre2_dfa_match()
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
memory show memory usage
|
||||
null_context match with a NULL context
|
||||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
|
||||
The effects of these modifiers are described in the following sections.
|
||||
|
||||
|
@ -1011,19 +1020,30 @@ SUBJECT MODIFIERS
|
|||
Testing the substitution function
|
||||
|
||||
If the replace modifier is set, the pcre2_substitute() function is
|
||||
called instead of one of the matching functions. Unlike subject
|
||||
strings, pcre2test does not process replacement strings for escape
|
||||
sequences. In UTF mode, a replacement string is checked to see if it is
|
||||
a valid UTF-8 string. If so, it is correctly converted to a UTF string
|
||||
of the appropriate code unit width. If it is not a valid UTF-8 string,
|
||||
the individual code units are copied directly. This provides a means of
|
||||
passing an invalid UTF-8 string for testing purposes.
|
||||
called instead of one of the matching functions. Note that replacement
|
||||
strings cannot contain commas, because a comma signifies the end of a
|
||||
modifier. This is not thought to be an issue in a test program.
|
||||
|
||||
If the global modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
pcre2_substitute(). After a successful substitution, the modified
|
||||
string is output, preceded by the number of replacements. This may be
|
||||
zero if there were no matches. Here is a simple example of a substitu-
|
||||
tion test:
|
||||
Unlike subject strings, pcre2test does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to
|
||||
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||
a UTF string of the appropriate code unit width. If it is not a valid
|
||||
UTF-8 string, the individual code units are copied directly. This pro-
|
||||
vides a means of passing an invalid UTF-8 string for testing purposes.
|
||||
|
||||
The following modifiers set options (in addition to the normal match
|
||||
options) for pcre2_substitute():
|
||||
|
||||
global PCRE2_SUBSTITUTE_GLOBAL
|
||||
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
|
||||
After a successful substitution, the modified string is output, pre-
|
||||
ceded by the number of replacements. This may be zero if there were no
|
||||
matches. Here is a simple example of a substitution test:
|
||||
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
|
@ -1031,12 +1051,13 @@ SUBJECT MODIFIERS
|
|||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to
|
||||
test for buffer overflow, if the replacement string starts with a num-
|
||||
ber in square brackets, that number is passed to pcre2_substitute() as
|
||||
the size of the output buffer, with the replacement string starting at
|
||||
the next character. Here is an example that tests the edge case:
|
||||
Subject and replacement strings should be kept relatively short (fewer
|
||||
than 256 characters) for substitution tests, as fixed-size buffers are
|
||||
used. To make it easy to test for buffer overflow, if the replacement
|
||||
string starts with a number in square brackets, that number is passed
|
||||
to pcre2_substitute() as the size of the output buffer, with the
|
||||
replacement string starting at the next character. Here is an example
|
||||
that tests the edge case:
|
||||
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
|
@ -1044,6 +1065,19 @@ SUBJECT MODIFIERS
|
|||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
|
||||
The default action of pcre2_substitute() is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
||||
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
||||
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
||||
through the motions of matching and substituting, in order to compute
|
||||
the size of buffer that is required. When this happens, pcre2test shows
|
||||
the required buffer length (which includes space for the trailing zero)
|
||||
as part of the error message. For example:
|
||||
|
||||
/abc/substitute_overflow_length
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory: 10 code units are needed
|
||||
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying
|
||||
partial matching provokes an error return ("bad option value") from
|
||||
pcre2_substitute().
|
||||
|
@ -1471,5 +1505,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 05 November 2015
|
||||
Last updated: 12 December 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
|
|
|
@ -44,7 +44,7 @@ POSSIBILITY OF SUCH DAMAGE.
|
|||
#define PCRE2_MAJOR 10
|
||||
#define PCRE2_MINOR 21
|
||||
#define PCRE2_PRERELEASE -RC1
|
||||
#define PCRE2_DATE 2015-07-06
|
||||
#define PCRE2_DATE 2015-12-15
|
||||
|
||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||
imported have to be identified as such. When building PCRE2, the appropriate
|
||||
|
|