File tidies, version updates, etc. for 10.21-RC1
parent
293da188aa
commit
dffd559601
41
NEWS
|
@ -1,6 +1,47 @@
|
|||
News about PCRE2 releases
|
||||
-------------------------
|
||||
|
||||
Version 10.21 15-December-2015
|
||||
------------------------------
|
||||
|
||||
1. Many bugs have been fixed. A large number of them were provoked only by very
|
||||
strange pattern input, and were discovered by fuzzers. Some others were
|
||||
discovered by code auditing. See ChangeLog for details.
|
||||
|
||||
2. The Unicode tables have been updated to Unicode version 8.0.0.
|
||||
|
||||
3. For Perl compatibility in EBCDIC environments, ranges such as a-z in a
|
||||
class, where both values are literal letters in the same case, omit the
|
||||
non-letter EBCDIC code points within the range.
|
||||
|
||||
4. There have been a number of enhancements to the pcre2_substitute() function,
|
||||
giving more flexibility to replacement facilities. It is now also possible to
|
||||
cause the function to return the needed buffer size if the one given is too
|
||||
small.
|
||||
|
||||
5. The PCRE2_ALT_VERBNAMES option causes the "name" parts of special verbs such
|
||||
as (*THEN:name) to be processed for backslashes and to take note of
|
||||
PCRE2_EXTENDED.
|
||||
|
||||
6. PCRE2_INFO_HASBACKSLASHC makes it possible for a client to find out if a
|
||||
pattern uses \C, and --never-backslash-C makes it possible to compile a version
|
||||
of PCRE2 in which the use of \C is always forbidden.
|
||||
|
||||
7. A limit to the length of pattern that can be handled can now be set by
|
||||
calling pcre2_set_max_pattern_length().
|
||||
|
||||
8. When matching an unanchored pattern, a match can be required to begin within
|
||||
a given number of code units after the start of the subject by calling
|
||||
pcre2_set_offset_limit().
|
||||
|
||||
9. The pcre2test program has been extended to test new facilities, and it can
|
||||
now run the tests when LF on its own is not a valid newline sequence.
|
||||
|
||||
10. The RunTest script has also been updated to enable more tests to be run.
|
||||
|
||||
11. There have been some minor performance enhancements.
|
||||
|
||||
|
||||
Version 10.20 30-June-2015
|
||||
--------------------------
|
||||
|
||||
|
|
10
configure.ac
|
@ -11,16 +11,16 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
|
|||
m4_define(pcre2_major, [10])
|
||||
m4_define(pcre2_minor, [21])
|
||||
m4_define(pcre2_prerelease, [-RC1])
|
||||
m4_define(pcre2_date, [2015-07-06])
|
||||
m4_define(pcre2_date, [2015-12-15])
|
||||
|
||||
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
||||
# 50 lines of this file. Please update that if the variables above are moved.
|
||||
|
||||
# Libtool shared library interface versions (current:revision:age)
|
||||
m4_define(libpcre2_8_version, [2:0:2])
|
||||
m4_define(libpcre2_16_version, [2:0:2])
|
||||
m4_define(libpcre2_32_version, [2:0:2])
|
||||
m4_define(libpcre2_posix_version, [0:0:0])
|
||||
m4_define(libpcre2_8_version, [3:0:3])
|
||||
m4_define(libpcre2_16_version, [3:0:3])
|
||||
m4_define(libpcre2_32_version, [3:0:3])
|
||||
m4_define(libpcre2_posix_version, [0:1:0])
|
||||
|
||||
AC_PREREQ(2.57)
|
||||
AC_INIT(PCRE2, pcre2_major.pcre2_minor[]pcre2_prerelease, , pcre2)
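The libtool interface versions above are written as current:revision:age. As a rough illustration of what the bump from 2:0:2 to 3:0:3 means for the installed library name, here is a minimal sketch of the numeric suffix that libtool conventionally derives on Linux, namely (current - age).age.revision. The struct and function names are hypothetical, and other platforms use different naming schemes.

```c
#include <stdio.h>

/* Libtool version triple as written in configure.ac: current:revision:age.
   This type is a hypothetical helper for illustration only. */
typedef struct {
    unsigned current;
    unsigned revision;
    unsigned age;
} libtool_version;

/* Format the shared-object suffix that libtool conventionally uses on Linux:
   (current - age).age.revision. Returns the value returned by snprintf(). */
static int linux_so_suffix(libtool_version v, char *buf, size_t size)
{
    return snprintf(buf, size, "%u.%u.%u",
                    v.current - v.age, v.age, v.revision);
}
```

Under this convention 3:0:3 gives the suffix 0.3.0. Because current and age rise together, the leading number stays at 0, which is how a release can add interfaces without changing the major part of the library name.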
|
||||
|
|
|
@ -42,19 +42,20 @@ request are as follows:
|
|||
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
|
||||
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
|
||||
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
|
||||
0 nothing set
|
||||
1 first code unit is set
|
||||
2 start of string or after newline
|
||||
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
|
||||
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \C
|
||||
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
|
||||
exist in the pattern
|
||||
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
|
||||
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
|
||||
0 nothing set
|
||||
1 code unit is set
|
||||
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
|
||||
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
|
||||
empty string, 0 otherwise
|
||||
PCRE2_INFO_MATCHLIMIT Match limit if set,
|
||||
|
@ -62,8 +63,8 @@ request are as follows:
|
|||
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
|
||||
lookbehind assertion
|
||||
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMECOUNT Number of named subpatterns
|
||||
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
|
||||
PCRE2_INFO_NAMETABLE Pointer to name table
|
||||
PCRE2_CONFIG_NEWLINE Code for the newline sequence:
|
||||
PCRE2_NEWLINE_CR
|
||||
|
|
|
@ -70,6 +70,9 @@ The options are:
|
|||
PCRE2_UTF was set at compile time)
|
||||
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
|
||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
|
||||
</pre>
|
||||
The function returns the number of substitutions, which may be zero if there
|
||||
were no matches. The result can be greater than one only when
|
||||
|
|
|
@ -716,8 +716,8 @@ of the following match-time parameters:
|
|||
<pre>
|
||||
A callout function
|
||||
The offset limit for matching an unanchored pattern
|
||||
The limit for calling <i>match()</i>
|
||||
The limit for calling <i>match()</i> recursively
|
||||
The limit for calling <b>match()</b> (see below)
|
||||
The limit for calling <b>match()</b> recursively
|
||||
</pre>
|
||||
A match context is also required if you are using custom memory management.
|
||||
If none of these apply, just pass NULL as the context argument of
|
||||
|
@ -771,7 +771,9 @@ PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
|
|||
<P>
|
||||
The offset limit facility can be used to track progress when searching large
|
||||
subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
|
||||
start within the first line of the subject.
|
||||
start within the first line of the subject. If this is set with an offset
|
||||
limit, a match must occur in the first line and also within the offset limit.
|
||||
In other words, whichever limit comes first is used.
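The "whichever limit comes first" rule can be pictured with a small stand-alone sketch. This is not PCRE2 code: the function name is hypothetical, only LF is treated as a newline, and it assumes that a match may start at any offset up to and including the limit.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Returns the largest subject offset
   at which a match may start when both a first-line restriction and an
   offset limit apply. Only LF counts as a newline in this sketch. */
static size_t last_allowed_start(const char *subject, size_t length,
                                 size_t offset_limit)
{
    size_t first_newline = length;   /* no newline: the whole subject */
    for (size_t i = 0; i < length; i++) {
        if (subject[i] == '\n') {
            first_newline = i;
            break;
        }
    }
    /* Whichever restriction is reached first wins. */
    return first_newline < offset_limit ? first_newline : offset_limit;
}
```

For the subject "abc\ndef", an offset limit of 10 leaves the first-line rule in charge (offset 3), whereas an offset limit of 1 is the tighter bound.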
|
||||
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
|
@ -1212,7 +1214,9 @@ built.
|
|||
If this option is set, an unanchored pattern is required to match before or at
|
||||
the first newline in the subject string, though the matched text may continue
|
||||
over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
|
||||
general limiting facility.
|
||||
general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a
|
||||
match must occur in the first line and also within the offset limit. In other
|
||||
words, whichever limit comes first is used.
|
||||
<pre>
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
</pre>
|
||||
|
@ -1563,11 +1567,10 @@ are as follows:
|
|||
Return a copy of the pattern's options. The third argument should point to a
|
||||
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
|
||||
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
|
||||
the compile options as modified by any top-level option settings at the start
|
||||
of the pattern itself. In other words, they are the options that will be in
|
||||
force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is
|
||||
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
||||
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
||||
the compile options as modified by any top-level option settings such as (*UTF)
|
||||
at the start of the pattern itself. For example, if the pattern /(*UTF)abc/ is
|
||||
compiled with the PCRE2_EXTENDED option, the result is PCRE2_EXTENDED and
|
||||
PCRE2_UTF.
|
||||
</P>
|
||||
<P>
|
||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||
|
@ -1609,18 +1612,27 @@ matches only CR, LF, or CRLF.
|
|||
<pre>
|
||||
PCRE2_INFO_CAPTURECOUNT
|
||||
</pre>
|
||||
Return the number of capturing subpatterns in the pattern. The third argument
|
||||
should point to an <b>uint32_t</b> variable.
|
||||
Return the highest capturing subpattern number in the pattern. In patterns
|
||||
where (?| is not used, this is also the total number of capturing subpatterns.
|
||||
The third argument should point to an <b>uint32_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
</pre>
|
||||
In the absence of a single first code unit for a non-anchored pattern,
|
||||
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
|
||||
values for the first code unit in any match. For example, a pattern that starts
|
||||
with [abc] results in a table with three bits set. When code unit values
|
||||
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
||||
value 255 or above". If such a table was constructed, a pointer to it is
|
||||
returned. Otherwise NULL is returned. The third argument should point to an
|
||||
<b>const uint8_t *</b> variable.
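To make the 256-bit table more concrete, the following sketch builds and queries such a table by hand for a pattern starting with [abc]. The bit layout used here (bit c & 7 of byte c / 8) is an assumption made for illustration, and the helper names are hypothetical. Only the clamping of values above 255 onto bit 255 is taken from the description above.

```c
#include <stdint.h>
#include <string.h>

/* Set the flag for code unit c. Values above 255 share the bit for 255.
   Assumed layout: bit (c & 7) of byte (c / 8). */
static void bitmap_set(uint8_t map[32], unsigned c)
{
    if (c > 255) c = 255;
    map[c / 8] |= (uint8_t)(1u << (c & 7));
}

/* Return 1 if code unit c could start a match according to the table. */
static int bitmap_test(const uint8_t map[32], unsigned c)
{
    if (c > 255) c = 255;
    return (map[c / 8] >> (c & 7)) & 1;
}

/* Build the table that a pattern beginning with [abc] would produce. */
static void build_abc_bitmap(uint8_t map[32])
{
    memset(map, 0, 32);
    bitmap_set(map, 'a');
    bitmap_set(map, 'b');
    bitmap_set(map, 'c');
}

/* Count how many of the 256 bits are set. */
static int bitmap_count(const uint8_t map[32])
{
    int n = 0;
    for (unsigned c = 0; c < 256; c++) n += bitmap_test(map, c);
    return n;
}
```

A caller that has obtained a non-NULL table could use a test like bitmap_test() to skip subject positions that cannot begin a match, provided the real layout matches the assumption.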
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTCODETYPE
|
||||
</pre>
|
||||
Return information about the first code unit of any matched string, for a
|
||||
non-anchored pattern. The third argument should point to an <b>uint32_t</b>
|
||||
variable.
|
||||
</P>
|
||||
<P>
|
||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
variable. If there is a fixed first value, for example, the letter "c" from a
|
||||
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||
it is known that a match can occur only at the start of the subject or
|
||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||
|
@ -1635,16 +1647,10 @@ value is always less than 256. In the 16-bit library the value can be up to
|
|||
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
|
||||
and up to 0xffffffff when not using UTF-32 mode.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
PCRE2_INFO_HASBACKSLASHC
|
||||
</pre>
|
||||
In the absence of a single first code unit for a non-anchored pattern,
|
||||
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
|
||||
values for the first code unit in any match. For example, a pattern that starts
|
||||
with [abc] results in a table with three bits set. When code unit values
|
||||
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
||||
value 255 or above". If such a table was constructed, a pointer to it is
|
||||
returned. Otherwise NULL is returned. The third argument should point to an
|
||||
<b>const uint8_t *</b> variable.
|
||||
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
|
||||
argument should point to an <b>uint32_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_HASCRORLF
|
||||
</pre>
|
||||
|
@ -1670,13 +1676,10 @@ Returns 1 if there is a rightmost literal code unit that must exist in any
|
|||
matched string, other than at its start. The third argument should point to an
|
||||
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
|
||||
returned, the code unit value itself can be retrieved using
|
||||
PCRE2_INFO_LASTCODEUNIT.
|
||||
</P>
|
||||
<P>
|
||||
For anchored patterns, a last literal value is recorded only if it follows
|
||||
something of variable length. For example, for the pattern /^a\d+z\d+/ the
|
||||
returned value is 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for
|
||||
/^a\dz\d/ the returned value is 0.
|
||||
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
||||
recorded only if it follows something of variable length. For example, for the
|
||||
pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
|
||||
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
|
||||
<pre>
|
||||
PCRE2_INFO_LASTCODEUNIT
|
||||
</pre>
|
||||
|
@ -1687,8 +1690,11 @@ value, 0 is returned.
|
|||
<pre>
|
||||
PCRE2_INFO_MATCHEMPTY
|
||||
</pre>
|
||||
Return 1 if the pattern can match an empty string, otherwise 0. The third
|
||||
argument should point to an <b>uint32_t</b> variable.
|
||||
Return 1 if the pattern might match an empty string, otherwise 0. The third
|
||||
argument should point to an <b>uint32_t</b> variable. When a pattern contains
|
||||
recursive subroutine calls it is not always possible to determine whether or
|
||||
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
|
||||
in such cases.
|
||||
<pre>
|
||||
PCRE2_INFO_MATCHLIMIT
|
||||
</pre>
|
||||
|
@ -2142,8 +2148,13 @@ documentation.
|
|||
When PCRE2 is built, a default newline convention is set; this is usually the
|
||||
standard convention for the operating system. The default can be overridden in
|
||||
a
|
||||
<a href="#compilecontext">compile context.</a>
|
||||
During matching, the newline choice affects the behaviour of the dot,
|
||||
<a href="#compilecontext">compile context</a>
|
||||
by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a
|
||||
pattern string with, for example, (*CRLF), as described in the
|
||||
<a href="pcre2pattern.html#newlines">section on newline conventions</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page. During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||
starting position is advanced after a match failure for an unanchored pattern.
|
||||
</P>
|
||||
|
@ -2191,19 +2202,20 @@ function can be used to find out how many capturing subpatterns there are in a
|
|||
compiled pattern.
|
||||
</P>
|
||||
<P>
|
||||
A successful match returns the overall matched string and any captured
|
||||
substrings to the caller via a vector of PCRE2_SIZE values. This is called the
|
||||
<b>ovector</b>, and is contained within the
|
||||
<a href="#matchdatablock">match data block.</a>
|
||||
You can obtain direct access to the ovector by calling
|
||||
<b>pcre2_get_ovector_pointer()</b> to find its address, and
|
||||
<b>pcre2_get_ovector_count()</b> to find the number of pairs of values it
|
||||
contains. Alternatively, you can use the auxiliary functions for accessing
|
||||
captured substrings
|
||||
You can use auxiliary functions for accessing captured substrings
|
||||
<a href="#extractbynumber">by number</a>
|
||||
or
|
||||
<a href="#extractbyname">by name</a>
|
||||
(see below).
|
||||
<a href="#extractbyname">by name,</a>
|
||||
as described in sections below.
|
||||
</P>
|
||||
<P>
|
||||
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
|
||||
called the <b>ovector</b>, which contains the offsets of captured strings. It is
|
||||
part of the
|
||||
<a href="#matchdatablock">match data block.</a>
|
||||
The function <b>pcre2_get_ovector_pointer()</b> returns the address of the
|
||||
ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of
|
||||
values it contains.
|
||||
</P>
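As an illustration of direct ovector use, the sketch below copies one captured substring out of a subject, given an array of offset pairs. It deliberately avoids the PCRE2 headers so that it stands alone: size_t stands in for PCRE2_SIZE, SKETCH_UNSET stands in for PCRE2_UNSET, and copy_group() is a hypothetical helper. In real code the array and pair count would come from pcre2_get_ovector_pointer() and pcre2_get_ovector_count().

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET, which marks a group that did not take part. */
#define SKETCH_UNSET (~(size_t)0)

/* Copy capture group `group` out of `subject`. Pair n of the ovector holds
   the start offset in ovector[2*n] and the offset just past the end in
   ovector[2*n+1]. Returns the substring length, or -1 if the group is out
   of range, unset, or does not fit in the buffer. */
static long copy_group(const char *subject, const size_t *ovector,
                       size_t pair_count, size_t group,
                       char *buf, size_t bufsize)
{
    if (group >= pair_count) return -1;
    size_t start = ovector[2 * group];
    size_t end = ovector[2 * group + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET || end < start) return -1;
    size_t len = end - start;
    if (len + 1 > bufsize) return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

Pair 0 always describes the whole match, so passing group 0 yields the overall matched string.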
|
||||
<P>
|
||||
Within the ovector, the first in each pair of values is set to the offset of
|
||||
|
@ -2292,7 +2304,13 @@ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
|
|||
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
||||
<b>pcre2_get_mark()</b> can be called. It returns a pointer to the
|
||||
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
||||
returned. After a successful match, the (*MARK) name that is returned is the
|
||||
returned. The length of the (*MARK) name (excluding the terminating zero) is
|
||||
stored in the code unit that precedes the name. You should use this instead of
|
||||
relying on the terminating zero if the (*MARK) name might contain a binary
|
||||
zero.
|
||||
</P>
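The length-before-name layout can be sketched without the library. The example assumes the 8-bit library, where names are sequences of uint8_t code units, and uses a hand-built buffer in place of the pointer that pcre2_get_mark() would return. The function name mark_length() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Given a pointer to a (*MARK) name as described above, read its length
   from the code unit stored immediately before the name. This sketch
   assumes 8-bit code units. */
static size_t mark_length(const uint8_t *mark)
{
    return (size_t)mark[-1];
}
```

For a name consisting of 'a', a binary zero, and 'b', the stored length is 3, whereas scanning for the terminating zero would stop after one code unit.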
|
||||
<P>
|
||||
After a successful match, the (*MARK) name that is returned is the
|
||||
last one encountered on the matching path through the pattern. After a "no
|
||||
match" or a partial match, the last encountered (*MARK) name is returned. For
|
||||
example, consider this pattern:
|
||||
|
@ -2313,7 +2331,7 @@ escape sequence. After a partial match, however, this value is always the same
|
|||
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
|
||||
</P>
|
||||
<P>
|
||||
After a UTF check failure, \fBpcre2_get_startchar()\fB can be used to obtain
|
||||
After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
|
||||
the code unit offset of the invalid UTF character. Details are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
|
@ -2650,12 +2668,21 @@ allocate memory for the compiled code.
|
|||
</P>
|
||||
<P>
|
||||
The <i>outlengthptr</i> argument must point to a variable that contains the
|
||||
length, in code units, of the output buffer. If the function is successful,
|
||||
the value is updated to contain the length of the new string, excluding the
|
||||
trailing zero that is automatically added. If the function is not successful,
|
||||
the value is set to PCRE2_UNSET for general errors (such as output buffer too
|
||||
small). For syntax errors in the replacement string, the value is set to the
|
||||
offset in the replacement string where the error was detected.
|
||||
length, in code units, of the output buffer. If the function is successful, the
|
||||
value is updated to contain the length of the new string, excluding the
|
||||
trailing zero that is automatically added.
|
||||
</P>
|
||||
<P>
|
||||
If the function is not successful, the value set via <i>outlengthptr</i> depends
|
||||
on the type of error. For syntax errors in the replacement string, the value is
|
||||
the offset in the replacement string where the error was detected. For other
|
||||
errors, the value is PCRE2_UNSET by default. This includes the case of the
|
||||
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
|
||||
(see below), in which case the value is the minimum length needed, including
|
||||
space for the trailing zero. Note that in order to compute the required length,
|
||||
<b>pcre2_substitute()</b> has to simulate all the matching and copying, instead
|
||||
of giving an error return as soon as the buffer overflows. Note also that the
|
||||
length is in code units, not bytes.
|
||||
</P>
|
||||
<P>
|
||||
In the replacement string, which is interpreted as a UTF string in UTF mode,
|
||||
|
@ -2682,15 +2709,53 @@ simultaneous substitutions, as this <b>pcre2test</b> example shows:
|
|||
apple lemon
|
||||
2: pear orange
|
||||
</pre>
|
||||
There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes the
|
||||
function to iterate over the subject string, replacing every matching
|
||||
substring. If this is not set, only the first matching substring is replaced.
|
||||
As well as the usual options for <b>pcre2_match()</b>, a number of additional
|
||||
options can be set in the <i>options</i> argument.
|
||||
</P>
|
||||
<P>
|
||||
A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra processing
|
||||
to be applied to the replacement string. Without this option, only the dollar
|
||||
character is special, and only the group insertion forms listed above are
|
||||
valid. When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
|
||||
replacing every matching substring. If this is not set, only the first matching
|
||||
substring is replaced. If any matched substring has zero length, after the
|
||||
substitution has happened, an attempt to find a non-empty match at the same
|
||||
position is performed. If this is not successful, the current position is
|
||||
advanced by one character except when CRLF is a valid newline sequence and the
|
||||
next two characters are CR, LF. In this case, the current position is advanced
|
||||
by two characters.
|
||||
</P>
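The advance rule for empty matches can be written down as a small function. This is a sketch of the documented behaviour only, not the library's implementation: it assumes non-UTF 8-bit subjects, so that one character is one code unit, and advance_after_empty_match() is a hypothetical name.

```c
#include <stddef.h>

/* After an empty match at `pos`, where no non-empty match was found at the
   same position, return the position from which searching resumes. The
   caller is expected to stop once the result exceeds `length`. */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t pos, int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n') {
        return pos + 2;   /* step over CR LF as a single unit */
    }
    return pos + 1;       /* otherwise advance by one character */
}
```

Stepping over CR and LF together avoids reporting an extra empty match between the two halves of a CRLF newline.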
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
|
||||
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
|
||||
this option is set, however, <b>pcre2_substitute()</b> continues to go through
|
||||
the motions of matching and substituting (without, of course, writing anything)
|
||||
in order to compute the size of buffer that is needed. This value is passed
|
||||
back via the <i>outlengthptr</i> variable, with the result of the function still
|
||||
being PCRE2_ERROR_NOMEMORY.
|
||||
</P>
|
||||
<P>
|
||||
Passing a buffer size of zero is a permitted way of finding out how much memory
|
||||
is needed for a given substitution. However, this does mean that the entire
|
||||
operation is carried out twice. Depending on the application, it may be more
|
||||
efficient to allocate a large buffer and free the excess afterwards, instead of
|
||||
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups that do
|
||||
not appear in the pattern to be treated as unset groups. This option should be
|
||||
used with care, because it means that a typo in a group name or number no
|
||||
longer causes the PCRE2_ERROR_NOSUBSTRING error.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown
|
||||
groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
|
||||
strings when inserted as described above. If this option is not set, an attempt
|
||||
to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
|
||||
not influence the extended substitution syntax described below.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
||||
replacement string. Without this option, only the dollar character is special,
|
||||
and only the group insertion forms listed above are valid. When
|
||||
PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||
</P>
|
||||
<P>
|
||||
Firstly, backslash in a replacement string is interpreted as an escape
|
||||
|
@ -2740,22 +2805,46 @@ string remains in force afterwards, as shown in this <b>pcre2test</b> example:
|
|||
somebody
|
||||
1: HELLO
|
||||
</pre>
|
||||
If successful, the function returns the number of replacements that were made.
|
||||
This may be zero if no matches were found, and is never greater than 1 unless
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
||||
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
|
||||
groups in the extended syntax forms to be treated as unset.
|
||||
</P>
|
||||
<P>
|
||||
If successful, <b>pcre2_substitute()</b> returns the number of replacements that
|
||||
were made. This may be zero if no matches were found, and is never greater than
|
||||
1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
</P>
|
||||
<P>
|
||||
In the event of an error, a negative error code is returned. Except for
|
||||
PCRE2_ERROR_NOMATCH (which is never returned), errors from <b>pcre2_match()</b>
|
||||
are passed straight back. PCRE2_ERROR_NOMEMORY is returned if the output buffer
|
||||
is not big enough. PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax
|
||||
errors in the replacement string, with more particular errors being
|
||||
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
|
||||
PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket not found),
|
||||
PCRE2_BADSUBSTITUTION (syntax error in extended group substitution), and
|
||||
PCRE2_BADSUBPATTERN (the pattern match ended before it started). As for all
|
||||
PCRE2 errors, a text message that describes the error can be obtained by
|
||||
calling <b>pcre2_get_error_message()</b>.
|
||||
are passed straight back.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
|
||||
unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
|
||||
unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
|
||||
(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
|
||||
needed is returned via <i>outlengthptr</i>. Note that this does not happen by
|
||||
default.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
|
||||
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
|
||||
(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket
|
||||
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
|
||||
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it
|
||||
started, which can happen if \K is used in an assertion).
|
||||
</P>
|
||||
<P>
|
||||
As for all PCRE2 errors, a text message that describes the error can be
|
||||
obtained by calling <b>pcre2_get_error_message()</b>.
|
||||
</P>
|
||||
<br><a name="SEC35" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
|
||||
<P>
|
||||
|
@ -2796,11 +2885,11 @@ function returns the length of each entry in code units. In both cases,
|
|||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||
</P>
|
||||
<P>
|
||||
The format of the name table is described above in the section entitled
|
||||
<i>Information about a pattern</i>
|
||||
<a href="#infoaboutpattern">above.</a>
|
||||
Given all the relevant entries for the name, you can extract each of their
|
||||
numbers, and hence the captured data.
|
||||
The format of the name table is described
|
||||
<a href="#infoaboutpattern">above</a>
|
||||
in the section entitled <i>Information about a pattern</i>. Given all the
|
||||
relevant entries for the name, you can extract each of their numbers, and hence
|
||||
the captured data.
|
||||
</P>
|
||||
<br><a name="SEC36" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
|
||||
<P>
|
||||
|
@ -3032,7 +3121,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC40" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 05 November 2015
|
||||
Last updated: 16 December 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -86,6 +86,13 @@ results. The returned value from <b>pcre2_jit_compile()</b> is zero on success,
|
|||
or a negative error code.
|
||||
</P>
|
||||
<P>
|
||||
There is a limit to the size of pattern that JIT supports, imposed by the size
|
||||
of machine stack that it uses. The exact rules are not documented because they
|
||||
may change at any time, in particular, when new optimizations are introduced.
|
||||
If a pattern is too big, a call to <b>pcre2_jit_compile()</b> returns
|
||||
PCRE2_ERROR_NOMEMORY.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
|
||||
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
|
||||
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
|
||||
|
@ -425,7 +432,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 July 2015
|
||||
Last updated: 14 November 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -669,8 +669,8 @@ This is an example of an "atomic group", details of which are given
|
|||
This particular group matches either the two-character sequence CR followed by
|
||||
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
||||
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
||||
line, U+0085). The two-character sequence is treated as a single unit that
|
||||
cannot be split.
|
||||
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||
treated as a single unit that cannot be split.
|
||||
</P>
|
||||
<P>
|
||||
In other modes, two additional characters whose codepoints are greater than 255
|
||||
|
@ -1186,6 +1186,16 @@ when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
|
|||
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
||||
</P>
|
||||
<P>
|
||||
When the newline convention (see
|
||||
<a href="#newlines">"Newline conventions"</a>
|
||||
below) recognizes the two-character sequence CRLF as a newline, this is
|
||||
preferred, even if the single characters CR and LF are also recognized as
|
||||
newlines. For example, if the newline convention is "any", a multiline mode
|
||||
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
|
||||
CR, even though CR on its own is a valid newline. (It also matches at the very
|
||||
start of the string, of course.)
|
||||
</P>
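The preference for CRLF can be modelled with a short scanner that lists the offsets at which a multiline circumflex would match. It is a simplified sketch, not PCRE2's matching code: only CR, LF, and CRLF are recognized (the real "any" convention accepts further characters), a position at the very end of the subject is not reported, and circumflex_positions() is a hypothetical name.

```c
#include <stddef.h>

/* Store in `out` (capacity `max`) the offsets at which a multiline-mode
   circumflex would match, preferring CRLF over a lone CR. Returns the
   number of offsets stored. */
static size_t circumflex_positions(const char *s, size_t len,
                                   size_t *out, size_t max)
{
    size_t n = 0;
    size_t i = 0;

    if (n < max) out[n++] = 0;            /* the very start of the subject */
    while (i < len) {
        size_t nl = 0;                    /* length of newline at i, if any */
        if (s[i] == '\r') {
            nl = (i + 1 < len && s[i + 1] == '\n') ? 2 : 1;
        } else if (s[i] == '\n') {
            nl = 1;
        }
        if (nl == 0) {
            i++;
            continue;
        }
        i += nl;                          /* jump past the whole newline */
        if (i < len && n < max) out[n++] = i;
    }
    return n;
}
```

For "abc\r\nxyz" the sketch reports offsets 0 and 5 only. Offset 4, between CR and LF, is skipped because the two characters are consumed as one newline.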
|
||||
<P>
|
||||
Note that the sequences \A, \Z, and \z can be used to match the start and
|
||||
end of the subject in both modes, and if all branches of a pattern start with
|
||||
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
|
||||
|
@ -1672,6 +1682,10 @@ first one in the pattern with the given number. The following pattern matches
|
|||
<pre>
|
||||
/(?|(abc)|(def))(?1)/
|
||||
</pre>
|
||||
A relative reference such as (?-1) is no different: it is just a convenient way
|
||||
of computing an absolute group number.
|
||||
</P>
|
||||
<P>
|
||||
If a
|
||||
<a href="#conditions">condition test</a>
|
||||
for a subpattern's having matched refers to a non-unique number, the test is
|
||||
|
@ -2626,6 +2640,21 @@ parentheses preceding the recursion. In other words, a negative number counts
|
|||
capturing parentheses leftwards from the point at which it is encountered.
|
||||
</P>
|
||||
<P>
|
||||
Be aware, however, that if
|
||||
<a href="#dupsubpatternnumber">duplicate subpattern numbers</a>
|
||||
are in use, relative references refer to the earliest subpattern with the
|
||||
appropriate number. Consider, for example:
|
||||
<pre>
|
||||
(?|(a)|(b)) (c) (?-2)
|
||||
</pre>
|
||||
The first two capturing groups (a) and (b) are both numbered 1, and group (c)
|
||||
is number 2. When the reference (?-2) is encountered, the second most recently
|
||||
opened set of parentheses has the number 1, but it is the first such group (the (a)
|
||||
group) to which the recursion refers. This would be the same if an absolute
|
||||
reference (?1) was used. In other words, relative references are just a
|
||||
shorthand for computing a group number.
|
||||
</P>
|
||||
<P>
|
||||
It is also possible to refer to subsequently opened parentheses, by writing
|
||||
references such as (?+2). However, these cannot be recursive because the
|
||||
reference is not inside the parentheses that are referenced. They are always
|
||||
|
@ -3359,7 +3388,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 01 November 2015
|
||||
Last updated: 13 November 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -235,7 +235,8 @@ to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
|
|||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
||||
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||
how it is matched.
|
||||
how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
|
||||
mutually exclusive; the error REG_INVARG is returned.
|
||||
</P>
|
||||
<P>
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||
|
@ -289,7 +290,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 30 October 2015
|
||||
Last updated: 29 November 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -900,6 +900,10 @@ compilation process.
|
|||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
</pre>
|
||||
These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
||||
defaults, set them in a <b>#subject</b> command.
|
||||
|
@ -990,6 +994,11 @@ pattern.
|
|||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
</pre>
|
||||
The effects of these modifiers are described in the following sections.
|
||||
|
@ -1129,19 +1138,34 @@ Testing the substitution function
|
|||
</b><br>
|
||||
<P>
|
||||
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
|
||||
called instead of one of the matching functions. Unlike subject strings,
|
||||
<b>pcre2test</b> does not process replacement strings for escape sequences. In
|
||||
UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
|
||||
If so, it is correctly converted to a UTF string of the appropriate code unit
|
||||
width. If it is not a valid UTF-8 string, the individual code units are copied
|
||||
directly. This provides a means of passing an invalid UTF-8 string for testing
|
||||
purposes.
|
||||
called instead of one of the matching functions. Note that replacement strings
|
||||
cannot contain commas, because a comma signifies the end of a modifier. This is
|
||||
not thought to be an issue in a test program.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
<b>pcre2_substitute()</b>. After a successful substitution, the modified string
|
||||
is output, preceded by the number of replacements. This may be zero if there
|
||||
were no matches. Here is a simple example of a substitution test:
|
||||
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to see if it
|
||||
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
|
||||
the appropriate code unit width. If it is not a valid UTF-8 string, the
|
||||
individual code units are copied directly. This provides a means of passing an
|
||||
invalid UTF-8 string for testing purposes.
|
||||
</P>
|
||||
<P>
|
||||
The following modifiers set options (in addition to the normal match options)
|
||||
for <b>pcre2_substitute()</b>:
|
||||
<pre>
|
||||
global PCRE2_SUBSTITUTE_GLOBAL
|
||||
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
After a successful substitution, the modified string is output, preceded by the
|
||||
number of replacements. This may be zero if there were no matches. Here is a
|
||||
simple example of a substitution test:
|
||||
<pre>
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
|
@ -1149,12 +1173,12 @@ were no matches. Here is a simple example of a substitution test:
|
|||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
</pre>
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to test for
|
||||
buffer overflow, if the replacement string starts with a number in square
|
||||
brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
|
||||
output buffer, with the replacement string starting at the next character. Here
|
||||
is an example that tests the edge case:
|
||||
Subject and replacement strings should be kept relatively short (fewer than 256
|
||||
characters) for substitution tests, as fixed-size buffers are used. To make it
|
||||
easy to test for buffer overflow, if the replacement string starts with a
|
||||
number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
|
||||
the size of the output buffer, with the replacement string starting at the next
|
||||
character. Here is an example that tests the edge case:
|
||||
<pre>
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
|
@ -1162,6 +1186,19 @@ is an example that tests the edge case:
|
|||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
</pre>
|
||||
The default action of <b>pcre2_substitute()</b> is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
|
||||
<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
|
||||
to go through the motions of matching and substituting, in order to compute the
|
||||
size of buffer that is required. When this happens, <b>pcre2test</b> shows the
|
||||
required buffer length (which includes space for the trailing zero) as part of
|
||||
the error message. For example:
|
||||
<pre>
|
||||
/abc/substitute_overflow_length
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory: 10 code units are needed
|
||||
</pre>
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying partial
|
||||
matching provokes an error return ("bad option value") from
|
||||
<b>pcre2_substitute()</b>.
|
||||
|
@ -1623,7 +1660,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 05 November 2015
|
||||
Last updated: 12 December 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
275
doc/pcre2.txt
|
@ -774,7 +774,7 @@ PCRE2 CONTEXTS
|
|||
|
||||
A callout function
|
||||
The offset limit for matching an unanchored pattern
|
||||
The limit for calling match()
|
||||
The limit for calling match() (see below)
|
||||
The limit for calling match() recursively
|
||||
|
||||
A match context is also required if you are using custom memory manage-
|
||||
|
@ -824,7 +824,10 @@ PCRE2 CONTEXTS
|
|||
|
||||
The offset limit facility can be used to track progress when searching
|
||||
large subject strings. See also the PCRE2_FIRSTLINE option, which
|
||||
requires a match to start within the first line of the subject.
|
||||
requires a match to start within the first line of the subject. If this
|
||||
is set with an offset limit, a match must occur in the first line and
|
||||
also within the offset limit. In other words, whichever limit comes
|
||||
first is used.
|
||||
|
||||
int pcre2_set_match_limit(pcre2_match_context *mcontext,
|
||||
uint32_t value);
|
||||
|
@ -1251,7 +1254,10 @@ COMPILING A PATTERN
|
|||
If this option is set, an unanchored pattern is required to match
|
||||
before or at the first newline in the subject string, though the
|
||||
matched text may continue over the newline. See also PCRE2_USE_OFF-
|
||||
SET_LIMIT, which provides a more general limiting facility.
|
||||
SET_LIMIT, which provides a more general limiting facility. If
|
||||
PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the
|
||||
first line and also within the offset limit. In other words, whichever
|
||||
limit comes first is used.
|
||||
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
|
||||
|
@ -1590,11 +1596,9 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
|
||||
options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
|
||||
TIONS returns the compile options as modified by any top-level option
|
||||
settings at the start of the pattern itself. In other words, they are
|
||||
the options that will be in force when matching starts. For example, if
|
||||
the pattern /(?im)abc(?-i)d/ is compiled with the PCRE2_EXTENDED
|
||||
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
|
||||
PCRE2_EXTENDED.
|
||||
settings such as (*UTF) at the start of the pattern itself. For exam-
|
||||
ple, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
|
||||
option, the result is PCRE2_EXTENDED and PCRE2_UTF.
|
||||
|
||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
|
||||
PCRE2 if the first significant item in every top-level branch is one of
|
||||
|
@ -1638,20 +1642,30 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
|
||||
PCRE2_INFO_CAPTURECOUNT
|
||||
|
||||
Return the number of capturing subpatterns in the pattern. The third
|
||||
argument should point to an uint32_t variable.
|
||||
Return the highest capturing subpattern number in the pattern. In pat-
|
||||
terns where (?| is not used, this is also the total number of capturing
|
||||
subpatterns. The third argument should point to an uint32_t variable.
|
||||
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
|
||||
In the absence of a single first code unit for a non-anchored pattern,
|
||||
pcre2_compile() may construct a 256-bit table that defines a fixed set
|
||||
of values for the first code unit in any match. For example, a pattern
|
||||
that starts with [abc] results in a table with three bits set. When
|
||||
code unit values greater than 255 are supported, the flag bit for 255
|
||||
means "any code unit of value 255 or above". If such a table was con-
|
||||
structed, a pointer to it is returned. Otherwise NULL is returned. The
|
||||
third argument should point to a const uint8_t * variable.
|
||||
|
||||
PCRE2_INFO_FIRSTCODETYPE
|
||||
|
||||
Return information about the first code unit of any matched string, for
|
||||
a non-anchored pattern. The third argument should point to an uint32_t
|
||||
variable.
|
||||
|
||||
If there is a fixed first value, for example, the letter "c" from a
|
||||
pattern such as (cat|cow|coyote), 1 is returned, and the character
|
||||
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
|
||||
fixed first value, but it is known that a match can occur only at the
|
||||
start of the subject or following a newline in the subject, 2 is
|
||||
variable. If there is a fixed first value, for example, the letter "c"
|
||||
from a pattern such as (cat|cow|coyote), 1 is returned, and the charac-
|
||||
ter value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is
|
||||
no fixed first value, but it is known that a match can occur only at
|
||||
the start of the subject or following a newline in the subject, 2 is
|
||||
returned. Otherwise, and for anchored patterns, 0 is returned.
|
||||
|
||||
PCRE2_INFO_FIRSTCODEUNIT
|
||||
|
@ -1664,16 +1678,10 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
|
||||
mode.
|
||||
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
PCRE2_INFO_HASBACKSLASHC
|
||||
|
||||
In the absence of a single first code unit for a non-anchored pattern,
|
||||
pcre2_compile() may construct a 256-bit table that defines a fixed set
|
||||
of values for the first code unit in any match. For example, a pattern
|
||||
that starts with [abc] results in a table with three bits set. When
|
||||
code unit values greater than 255 are supported, the flag bit for 255
|
||||
means "any code unit of value 255 or above". If such a table was con-
|
||||
structed, a pointer to it is returned. Otherwise NULL is returned. The
|
||||
third argument should point to a const uint8_t * variable.
|
||||
Return 1 if the pattern contains any instances of \C, otherwise 0. The
|
||||
third argument should point to an uint32_t variable.
|
||||
|
||||
PCRE2_INFO_HASCRORLF
|
||||
|
||||
|
@ -1701,12 +1709,11 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
any matched string, other than at its start. The third argument should
|
||||
point to an uint32_t variable. If there is no such value, 0 is
|
||||
returned. When 1 is returned, the code unit value itself can be
|
||||
retrieved using PCRE2_INFO_LASTCODEUNIT.
|
||||
|
||||
For anchored patterns, a last literal value is recorded only if it fol-
|
||||
lows something of variable length. For example, for the pattern
|
||||
/^a\d+z\d+/ the returned value is 1 (with "z" returned from
|
||||
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
|
||||
retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
|
||||
literal value is recorded only if it follows something of variable
|
||||
length. For example, for the pattern /^a\d+z\d+/ the returned value is
|
||||
1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
|
||||
the returned value is 0.
|
||||
|
||||
PCRE2_INFO_LASTCODEUNIT
|
||||
|
||||
|
@ -1717,8 +1724,11 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
|
||||
PCRE2_INFO_MATCHEMPTY
|
||||
|
||||
Return 1 if the pattern can match an empty string, otherwise 0. The
|
||||
third argument should point to an uint32_t variable.
|
||||
Return 1 if the pattern might match an empty string, otherwise 0. The
|
||||
third argument should point to an uint32_t variable. When a pattern
|
||||
contains recursive subroutine calls it is not always possible to deter-
|
||||
mine whether or not it can match an empty string. PCRE2 takes a cau-
|
||||
tious approach and returns 1 in such cases.
|
||||
|
||||
PCRE2_INFO_MATCHLIMIT
|
||||
|
||||
|
@ -2142,10 +2152,13 @@ NEWLINE HANDLING WHEN MATCHING
|
|||
|
||||
When PCRE2 is built, a default newline convention is set; this is usu-
|
||||
ally the standard convention for the operating system. The default can
|
||||
be overridden in a compile context. During matching, the newline
|
||||
choice affects the behaviour of the dot, circumflex, and dollar
|
||||
metacharacters. It may also alter the way the match starting position
|
||||
is advanced after a match failure for an unanchored pattern.
|
||||
be overridden in a compile context by calling pcre2_set_newline(). It
|
||||
can also be overridden by starting a pattern string with, for example,
|
||||
(*CRLF), as described in the section on newline conventions in the
|
||||
pcre2pattern page. During matching, the newline choice affects the be-
|
||||
haviour of the dot, circumflex, and dollar metacharacters. It may also
|
||||
alter the way the match starting position is advanced after a match
|
||||
failure for an unanchored pattern.
|
||||
|
||||
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
|
||||
set as the newline convention, and a match attempt for an unanchored
|
||||
|
@ -2188,14 +2201,15 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
|
|||
be captured. The pcre2_pattern_info() function can be used to find out
|
||||
how many capturing subpatterns there are in a compiled pattern.
|
||||
|
||||
A successful match returns the overall matched string and any captured
|
||||
substrings to the caller via a vector of PCRE2_SIZE values. This is
|
||||
called the ovector, and is contained within the match data block. You
|
||||
can obtain direct access to the ovector by calling pcre2_get_ovec-
|
||||
tor_pointer() to find its address, and pcre2_get_ovector_count() to
|
||||
find the number of pairs of values it contains. Alternatively, you can
|
||||
use the auxiliary functions for accessing captured substrings by number
|
||||
or by name (see below).
|
||||
You can use auxiliary functions for accessing captured substrings by
|
||||
number or by name, as described in sections below.
|
||||
|
||||
Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
|
||||
ues, called the ovector, which contains the offsets of captured
|
||||
strings. It is part of the match data block. The function
|
||||
pcre2_get_ovector_pointer() returns the address of the ovector, and
|
||||
pcre2_get_ovector_count() returns the number of pairs of values it con-
|
||||
tains.
|
||||
|
||||
Within the ovector, the first in each pair of values is set to the off-
|
||||
set of the first code unit of a substring, and the second is set to the
|
||||
|
@ -2274,10 +2288,15 @@ OTHER INFORMATION ABOUT A MATCH
|
|||
failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
|
||||
able, and pcre2_get_mark() can be called. It returns a pointer to the
|
||||
zero-terminated name, which is within the compiled pattern. Otherwise
|
||||
NULL is returned. After a successful match, the (*MARK) name that is
|
||||
returned is the last one encountered on the matching path through the
|
||||
pattern. After a "no match" or a partial match, the last encountered
|
||||
(*MARK) name is returned. For example, consider this pattern:
|
||||
NULL is returned. The length of the (*MARK) name (excluding the
|
||||
terminating zero) is stored in the code unit that precedes the name. You
|
||||
should use this instead of relying on the terminating zero if the
|
||||
(*MARK) name might contain a binary zero.
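
As a minimal sketch of reading that length, assuming the 8-bit library
(where a code unit is one byte) and a non-NULL pointer of the kind that
pcre2_get_mark() returns. The helper name mark_name_length is
hypothetical and is not part of the PCRE2 API:

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Assumes the 8-bit library,
   so each code unit is one byte, and that `mark` is non-NULL and points
   at the first code unit of a (*MARK) name. */
size_t mark_name_length(const unsigned char *mark)
{
    /* The length (excluding the terminating zero) is stored in the
       code unit immediately before the name. */
    return (size_t)mark[-1];
}
```

With this length a caller can copy the name with memcpy() rather than
strcpy(), so a binary zero inside the name does not truncate it.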
|
||||
|
||||
After a successful match, the (*MARK) name that is returned is the last
|
||||
one encountered on the matching path through the pattern. After a "no
|
||||
match" or a partial match, the last encountered (*MARK) name is
|
||||
returned. For example, consider this pattern:
|
||||
|
||||
^(*MARK:A)((*MARK:B)a|b)c
|
||||
|
||||
|
@ -2609,11 +2628,19 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
|||
The outlengthptr argument must point to a variable that contains the
|
||||
length, in code units, of the output buffer. If the function is suc-
|
||||
cessful, the value is updated to contain the length of the new string,
|
||||
excluding the trailing zero that is automatically added. If the func-
|
||||
tion is not successful, the value is set to PCRE2_UNSET for general
|
||||
errors (such as output buffer too small). For syntax errors in the
|
||||
replacement string, the value is set to the offset in the replacement
|
||||
string where the error was detected.
|
||||
excluding the trailing zero that is automatically added.
|
||||
|
||||
If the function is not successful, the value set via outlengthptr
|
||||
depends on the type of error. For syntax errors in the replacement
|
||||
string, the value is the offset in the replacement string where the
|
||||
error was detected. For other errors, the value is PCRE2_UNSET by
|
||||
default. This includes the case of the output buffer being too small,
|
||||
unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
|
||||
case the value is the minimum length needed, including space for the
|
||||
trailing zero. Note that in order to compute the required length,
|
||||
pcre2_substitute() has to simulate all the matching and copying,
|
||||
instead of giving an error return as soon as the buffer overflows. Note
|
||||
also that the length is in code units, not bytes.
|
||||
|
||||
In the replacement string, which is interpreted as a UTF string in UTF
|
||||
mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
|
||||
|
@ -2639,16 +2666,51 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
|||
apple lemon
|
||||
2: pear orange
|
||||
|
||||
There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes
|
||||
the function to iterate over the subject string, replacing every match-
|
||||
ing substring. If this is not set, only the first matching substring is
|
||||
replaced.
|
||||
As well as the usual options for pcre2_match(), a number of additional
|
||||
options can be set in the options argument.
|
||||
|
||||
A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra
|
||||
processing to be applied to the replacement string. Without this
|
||||
option, only the dollar character is special, and only the group inser-
|
||||
tion forms listed above are valid. When PCRE2_SUBSTITUTE_EXTENDED is
|
||||
set, two things change:
|
||||
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
|
||||
string, replacing every matching substring. If this is not set, only
|
||||
the first matching substring is replaced. If any matched substring has
|
||||
zero length, after the substitution has happened, an attempt to find a
|
||||
non-empty match at the same position is performed. If this is not suc-
|
||||
cessful, the current position is advanced by one character except when
|
||||
CRLF is a valid newline sequence and the next two characters are CR,
|
||||
LF. In this case, the current position is advanced by two characters.
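
The advance rule for empty matches can be sketched as follows. This is
an illustration only, not the library's code: the function name is
hypothetical, and it assumes a non-UTF 8-bit subject, so that one
character is one code unit.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Assumes a non-UTF 8-bit
   subject (one character == one code unit) and pos < length.
   Returns the position at which the next match attempt would start
   after an empty match at `pos` with no non-empty match found there. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t pos, int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over CR, LF as a single unit */
    return pos + 1;       /* otherwise advance by one character */
}
```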
|
||||
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
|
||||
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
|
||||
ORY immediately. If this option is set, however, pcre2_substitute()
|
||||
continues to go through the motions of matching and substituting (with-
|
||||
out, of course, writing anything) in order to compute the size of buf-
|
||||
fer that is needed. This value is passed back via the outlengthptr
|
||||
variable, with the result of the function still being
|
||||
PCRE2_ERROR_NOMEMORY.
|
||||
|
||||
Passing a buffer size of zero is a permitted way of finding out how
|
||||
much memory is needed for a given substitution. However, this does mean
|
||||
that the entire operation is carried out twice. Depending on the appli-
|
||||
cation, it may be more efficient to allocate a large buffer and free
|
||||
the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
|
||||
FLOW_LENGTH.
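
The two-pass pattern described above can be sketched with a toy
stand-in. Everything named toy_* or TOY_* below is hypothetical and is
not part of PCRE2: toy_substitute() replaces only the first occurrence
of a literal string, and exists purely to mimic the outlengthptr
contract (needed size, including the trailing zero, reported only when
the overflow option is set).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins, NOT PCRE2 names or values. */
#define TOY_ERROR_NOMEMORY  (-1)
#define TOY_OVERFLOW_LENGTH 1u
#define TOY_UNSET           ((size_t)-1)

/* Replace the first occurrence of the literal `find` in `subject` with
   `repl`. On entry *outlen is the capacity of `out`; on success it is
   set to the length of the result, excluding the trailing zero. */
int toy_substitute(const char *subject, const char *find, const char *repl,
                   unsigned options, char *out, size_t *outlen)
{
    const char *hit = strstr(subject, find);
    size_t flen = strlen(find), rlen = strlen(repl);
    size_t newlen = strlen(subject) - (hit ? flen : 0) + (hit ? rlen : 0);

    if (newlen + 1 > *outlen) {
        /* Too small: report the needed size only when asked to. */
        *outlen = (options & TOY_OVERFLOW_LENGTH) ? newlen + 1 : TOY_UNSET;
        return TOY_ERROR_NOMEMORY;
    }
    if (hit != NULL) {
        size_t prefix = (size_t)(hit - subject);
        memcpy(out, subject, prefix);
        memcpy(out + prefix, repl, rlen);
        strcpy(out + prefix + rlen, hit + flen);
    } else {
        strcpy(out, subject);
    }
    *outlen = newlen;
    return hit != NULL ? 1 : 0;   /* number of replacements made */
}

/* Two-pass caller: query the size with a zero-length buffer, allocate,
   then substitute for real. Returns a malloc'd string or NULL. */
char *substitute_alloc(const char *subject, const char *find,
                       const char *repl)
{
    size_t size = 0;
    char *out;

    /* Pass 1: a zero-sized buffer is always too small, so nothing is
       written and `size` receives the required length. */
    if (toy_substitute(subject, find, repl, TOY_OVERFLOW_LENGTH,
                       NULL, &size) != TOY_ERROR_NOMEMORY)
        return NULL;

    out = malloc(size);
    if (out == NULL)
        return NULL;

    /* Pass 2: the buffer is now exactly large enough. */
    if (toy_substitute(subject, find, repl, 0, out, &size) < 0) {
        free(out);
        return NULL;
    }
    return out;
}
```

With the real library the same shape applies: the first call passes
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH in the options argument and reads the
required size back through outlengthptr. As noted above, the work is
done twice, so a single generously sized buffer may be cheaper.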
|
||||
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
|
||||
that do not appear in the pattern to be treated as unset groups. This
|
||||
option should be used with care, because it means that a typo in a
|
||||
group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
|
||||
error.
|
||||
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
|
||||
unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
|
||||
treated as empty strings when inserted as described above. If this
|
||||
option is not set, an attempt to insert an unset group causes the
|
||||
PCRE2_ERROR_UNSET error. This option does not influence the extended
|
||||
substitution syntax described below.
|
||||
|
||||
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
||||
replacement string. Without this option, only the dollar character is
|
||||
special, and only the group insertion forms listed above are valid.
|
||||
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||
|
||||
Firstly, backslash in a replacement string is interpreted as an escape
|
||||
character. The usual forms such as \n or \x{ddd} can be used to specify
|
||||
|
@ -2698,22 +2760,41 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
|||
somebody
|
||||
1: HELLO
|
||||
|
||||
If successful, the function returns the number of replacements that
|
||||
were made. This may be zero if no matches were found, and is never
|
||||
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
||||
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
|
||||
unknown groups in the extended syntax forms to be treated as unset.
|
||||
|
||||
If successful, pcre2_substitute() returns the number of replacements
|
||||
that were made. This may be zero if no matches were found, and is never
|
||||
greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
|
||||
In the event of an error, a negative error code is returned. Except for
|
||||
PCRE2_ERROR_NOMATCH (which is never returned), errors from
|
||||
pcre2_match() are passed straight back. PCRE2_ERROR_NOMEMORY is
|
||||
returned if the output buffer is not big enough.
|
||||
pcre2_match() are passed straight back.
|
||||
|
||||
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
|
||||
tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
|
||||
|
||||
PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
|
||||
ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
|
||||
when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
|
||||
TUTE_UNSET_EMPTY is not set.
|
||||
|
||||
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
|
||||
enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
|
||||
of buffer that is needed is returned via outlengthptr. Note that this
|
||||
does not happen by default.
|
||||
|
||||
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
|
||||
the replacement string, with more particular errors being
|
||||
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
|
||||
MISSING_BRACE (closing curly bracket not found), PCRE2_BADSUBSTITUTION
|
||||
(syntax error in extended group substitution), and PCRE2_BADSUBPATTERN
|
||||
(the pattern match ended before it started). As for all PCRE2 errors, a
|
||||
text message that describes the error can be obtained by calling
|
||||
pcre2_get_error_message().
|
||||
(the pattern match ended before it started, which can happen if \K is
|
||||
used in an assertion).
|
||||
|
||||
As for all PCRE2 errors, a text message that describes the error can be
|
||||
obtained by calling pcre2_get_error_message().
|
||||
|
||||
|
||||
DUPLICATE SUBPATTERN NAMES
|
||||
|
@ -2752,8 +2833,8 @@ DUPLICATE SUBPATTERN NAMES
|
|||
no entries for the given name.
|
||||
|
||||
The format of the name table is described above in the section entitled
|
||||
Information about a pattern above. Given all the relevant entries for
|
||||
the name, you can extract each of their numbers, and hence the captured
|
||||
Information about a pattern. Given all the relevant entries for the
|
||||
name, you can extract each of their numbers, and hence the captured
|
||||
data.
|
||||
|
||||
|
||||
|
@ -2974,7 +3055,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 05 November 2015
|
||||
Last updated: 16 December 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -4078,6 +4159,12 @@ SIMPLE USE OF JIT
|
|||
exactly the same results. The returned value from pcre2_jit_compile()
|
||||
is zero on success, or a negative error code.
|
||||
|
||||
There is a limit to the size of pattern that JIT supports, imposed by
|
||||
the size of machine stack that it uses. The exact rules are not docu-
|
||||
mented because they may change at any time, in particular, when new
|
||||
optimizations are introduced. If a pattern is too big, a call to
|
||||
pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
|
||||
|
||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
|
||||
plete matches. If you want to run partial matches using the PCRE2_PAR-
|
||||
TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
|
||||
|
@ -4394,7 +4481,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 28 July 2015
|
||||
Last updated: 14 November 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -5706,8 +5793,9 @@ BACKSLASH
|
|||
below. This particular group matches either the two-character sequence
|
||||
CR followed by LF, or one of the single characters LF (linefeed,
|
||||
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
|
||||
riage return, U+000D), or NEL (next line, U+0085). The two-character
|
||||
sequence is treated as a single unit that cannot be split.
|
||||
riage return, U+000D), or NEL (next line, U+0085). Because this is an
|
||||
atomic group, the two-character sequence is treated as a single unit
|
||||
that cannot be split.
|
||||
|
||||
In other modes, two additional characters whose codepoints are greater
|
||||
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
|
||||
|
@ -6076,6 +6164,14 @@ CIRCUMFLEX AND DOLLAR
|
|||
pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
|
||||
if PCRE2_MULTILINE is set.
|
||||
|
||||
When the newline convention (see "Newline conventions" below) recog-
|
||||
nizes the two-character sequence CRLF as a newline, this is preferred,
|
||||
even if the single characters CR and LF are also recognized as new-
|
||||
lines. For example, if the newline convention is "any", a multiline
|
||||
mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
|
||||
than after CR, even though CR on its own is a valid newline. (It also
|
||||
matches at the very start of the string, of course.)
|
||||
|
||||
Note that the sequences \A, \Z, and \z can be used to match the start
|
||||
and end of the subject in both modes, and if all branches of a pattern
|
||||
start with \A it is always anchored, whether or not PCRE2_MULTILINE is
|
||||
|
@ -6545,6 +6641,9 @@ DUPLICATE SUBPATTERN NUMBERS
|
|||
|
||||
/(?|(abc)|(def))(?1)/
|
||||
|
||||
A relative reference such as (?-1) is no different: it is just a conve-
|
||||
nient way of computing an absolute group number.
|
||||
|
||||
If a condition test for a subpattern's having matched refers to a non-
|
||||
unique number, the test is true if any of the subpatterns of that num-
|
||||
ber have matched.
|
||||
|
@ -7444,6 +7543,20 @@ RECURSIVE PATTERNS
|
|||
words, a negative number counts capturing parentheses leftwards from
|
||||
the point at which it is encountered.
|
||||
|
||||
Be aware, however, that if duplicate subpattern numbers are in use,
|
||||
relative references refer to the earliest subpattern with the appropriate
|
||||
number. Consider, for example:
|
||||
|
||||
(?|(a)|(b)) (c) (?-2)
|
||||
|
||||
The first two capturing groups (a) and (b) are both numbered 1, and
|
||||
group (c) is number 2. When the reference (?-2) is encountered, the
|
||||
second most recently opened parenthesis has the number 1, but it is the
|
||||
||||
first such group (the (a) group) to which the recursion refers. This
|
||||
would be the same if an absolute reference (?1) were used. In other
|
||||
words, relative references are just a shorthand for computing a group
|
||||
number.
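
The arithmetic behind that shorthand can be sketched as below. This is
an illustration under stated assumptions, not PCRE2's implementation:
the function name is hypothetical, "opened" is taken to be the highest
group number allocated before the reference, and the reference is
assumed to lie outside any (?| group.

```c
/* Hypothetical helper, not a PCRE2 function.
   opened:   highest capturing group number allocated before the
             reference (2 in the example above, after (c))
   relative: the signed n from (?-n) or (?+n)
   Returns the absolute group number, or 0 if the reference is invalid.
   A forward reference cannot be validated here, because the group it
   names has not been opened yet. */
int relative_to_absolute(int opened, int relative)
{
    if (relative < 0) {
        int n = -relative;
        return (n > opened) ? 0 : opened - n + 1;
    }
    if (relative > 0)
        return opened + relative;
    return 0;   /* (?-0) and (?+0) are not relative references */
}
```

For the pattern above, relative_to_absolute(2, -2) gives 1, which is
why (?-2) behaves exactly like (?1) and refers to the (a) group.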
|
||||
|
||||
It is also possible to refer to subsequently opened parentheses, by
|
||||
writing references such as (?+2). However, these cannot be recursive
|
||||
because the reference is not inside the parentheses that are refer-
|
||||
|
@ -8141,7 +8254,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 01 November 2015
|
||||
Last updated: 13 November 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -8534,7 +8647,9 @@ MATCHING A PATTERN
|
|||
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
|
||||
software intended to be portable to other systems. Note that a non-zero
|
||||
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
|
||||
of the string, not how it is matched.
|
||||
of the string, not how it is matched. Setting REG_STARTEND and passing
|
||||
pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
||||
returned.
|
||||
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
||||
matched strings is returned. The nmatch and pmatch arguments of
|
||||
|
@ -8587,7 +8702,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 30 October 2015
|
||||
Last updated: 29 November 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "12 December 2015" "PCRE2 10.21"
|
||||
.TH PCRE2API 3 "16 December 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -678,8 +678,8 @@ of the following match-time parameters:
|
|||
.sp
|
||||
A callout function
|
||||
The offset limit for matching an unanchored pattern
|
||||
The limit for calling \fImatch()\fP
|
||||
The limit for calling \fImatch()\fP recursively
|
||||
The limit for calling \fBmatch()\fP (see below)
|
||||
The limit for calling \fBmatch()\fP recursively
|
||||
.sp
|
||||
A match context is also required if you are using custom memory management.
|
||||
If none of these apply, just pass NULL as the context argument of
|
||||
|
@ -1611,8 +1611,9 @@ matches only CR, LF, or CRLF.
|
|||
.sp
|
||||
PCRE2_INFO_CAPTURECOUNT
|
||||
.sp
|
||||
Return the number of capturing subpatterns in the pattern. The third argument
|
||||
should point to an \fBuint32_t\fP variable.
|
||||
Return the highest capturing subpattern number in the pattern. In patterns
|
||||
where (?| is not used, this is also the total number of capturing subpatterns.
|
||||
The third argument should point to an \fBuint32_t\fP variable.
|
||||
.sp
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
.sp
|
||||
|
@ -1629,10 +1630,8 @@ returned. Otherwise NULL is returned. The third argument should point to an
|
|||
.sp
|
||||
Return information about the first code unit of any matched string, for a
|
||||
non-anchored pattern. The third argument should point to an \fBuint32_t\fP
|
||||
variable.
|
||||
.P
|
||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
variable. If there is a fixed first value, for example, the letter "c" from a
|
||||
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||
it is known that a match can occur only at the start of the subject or
|
||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||
|
@ -1676,12 +1675,10 @@ Returns 1 if there is a rightmost literal code unit that must exist in any
|
|||
matched string, other than at its start. The third argument should point to an
|
||||
\fBuint32_t\fP variable. If there is no such value, 0 is returned. When 1 is
|
||||
returned, the code unit value itself can be retrieved using
|
||||
PCRE2_INFO_LASTCODEUNIT.
|
||||
.P
|
||||
For anchored patterns, a last literal value is recorded only if it follows
|
||||
something of variable length. For example, for the pattern /^a\ed+z\ed+/ the
|
||||
returned value is 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for
|
||||
/^a\edz\ed/ the returned value is 0.
|
||||
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
||||
recorded only if it follows something of variable length. For example, for the
|
||||
pattern /^a\ed+z\ed+/ the returned value is 1 (with "z" returned from
|
||||
PCRE2_INFO_LASTCODEUNIT), but for /^a\edz\ed/ the returned value is 0.
|
||||
.sp
|
||||
PCRE2_INFO_LASTCODEUNIT
|
||||
.sp
|
||||
|
@ -2181,9 +2178,19 @@ standard convention for the operating system. The default can be overridden in
|
|||
a
|
||||
.\" HTML <a href="#compilecontext">
|
||||
.\" </a>
|
||||
compile context.
|
||||
compile context
|
||||
.\"
|
||||
During matching, the newline choice affects the behaviour of the dot,
|
||||
by calling \fBpcre2_set_newline()\fP. It can also be overridden by starting a
|
||||
pattern string with, for example, (*CRLF), as described in the
|
||||
.\" HTML <a href="pcre2pattern.html#newlines">
|
||||
.\" </a>
|
||||
section on newline conventions
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
page. During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||
starting position is advanced after a match failure for an unanchored pattern.
|
||||
.P
|
||||
|
@ -2229,18 +2236,7 @@ that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
|
|||
function can be used to find out how many capturing subpatterns there are in a
|
||||
compiled pattern.
|
||||
.P
|
||||
A successful match returns the overall matched string and any captured
|
||||
substrings to the caller via a vector of PCRE2_SIZE values. This is called the
|
||||
\fBovector\fP, and is contained within the
|
||||
.\" HTML <a href="#matchdatablock">
|
||||
.\" </a>
|
||||
match data block.
|
||||
.\"
|
||||
You can obtain direct access to the ovector by calling
|
||||
\fBpcre2_get_ovector_pointer()\fP to find its address, and
|
||||
\fBpcre2_get_ovector_count()\fP to find the number of pairs of values it
|
||||
contains. Alternatively, you can use the auxiliary functions for accessing
|
||||
captured substrings
|
||||
You can use auxiliary functions for accessing captured substrings
|
||||
.\" HTML <a href="#extractbynumber">
|
||||
.\" </a>
|
||||
by number
|
||||
|
@ -2248,9 +2244,20 @@ by number
|
|||
or
|
||||
.\" HTML <a href="#extractbyname">
|
||||
.\" </a>
|
||||
by name
|
||||
by name,
|
||||
.\"
|
||||
(see below).
|
||||
as described in sections below.
|
||||
.P
|
||||
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
|
||||
called the \fBovector\fP, which contains the offsets of captured strings. It is
|
||||
part of the
|
||||
.\" HTML <a href="#matchdatablock">
|
||||
.\" </a>
|
||||
match data block.
|
||||
.\"
|
||||
The function \fBpcre2_get_ovector_pointer()\fP returns the address of the
|
||||
ovector, and \fBpcre2_get_ovector_count()\fP returns the number of pairs of
|
||||
values it contains.
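As a rough illustration of working with the ovector directly, the sketch below copies one captured substring out of a subject, given an array of offset pairs. It does not call PCRE2: `copy_group` and its parameters are hypothetical, `size_t` stands in for PCRE2_SIZE, and the array is assumed to have the same layout as the one obtained from pcre2_get_ovector_pointer(), with each pair holding a start offset and the offset just past the end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function.
 * ovector holds pair_count pairs: ovector[2*n] is the start offset of
 * group n and ovector[2*n+1] is the offset just past its end.
 * Copies group n into out as a zero-terminated string.
 * Returns the substring length, or (size_t)-1 on any failure. */
size_t copy_group(const char *subject, const size_t *ovector,
                  size_t pair_count, size_t n, char *out, size_t out_size)
{
    assert(subject != NULL && ovector != NULL && out != NULL);
    if (n >= pair_count) return (size_t)-1;      /* no such pair */

    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];

    /* PCRE2 marks unset groups with PCRE2_UNSET, which is ~(PCRE2_SIZE)0;
     * this sketch treats the same value as "unset". */
    if (start == (size_t)-1 || end < start) return (size_t)-1;

    size_t len = end - start;
    if (len + 1 > out_size) return (size_t)-1;   /* output buffer too small */

    memcpy(out, subject + start, len);
    out[len] = '\0';
    return len;
}
```

In real use, the pair count would come from pcre2_get_ovector_count() and pair 0 would describe the overall match.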
|
||||
.P
|
||||
Within the ovector, the first in each pair of values is set to the offset of
|
||||
the first code unit of a substring, and the second is set to the offset of the
|
||||
|
@ -2334,7 +2341,12 @@ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
|
|||
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
|
||||
\fBpcre2_get_mark()\fP can be called. It returns a pointer to the
|
||||
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
|
||||
returned. After a successful match, the (*MARK) name that is returned is the
|
||||
returned. The length of the (*MARK) name (excluding the terminating zero) is
|
||||
stored in the code unit that precedes the name. You should use this instead of
|
||||
relying on the terminating zero if the (*MARK) name might contain a binary
|
||||
zero.
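A minimal sketch of reading that length, assuming the 8-bit library, where a code unit is one byte. The helper name `mark_name_length` is hypothetical, and the pointer is assumed to be one that pcre2_get_mark() returned, so that the byte before it holds the length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for 8-bit code units, not a PCRE2 function.
 * mark must point at the first code unit of a (*MARK) name whose length
 * is stored in the code unit immediately before it. */
size_t mark_name_length(const unsigned char *mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```

For a name made of the three code units 'a', binary zero, 'b', this returns 3, whereas scanning for the terminating zero would stop after the first code unit.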
|
||||
.P
|
||||
After a successful match, the (*MARK) name that is returned is the
|
||||
last one encountered on the matching path through the pattern. After a "no
|
||||
match" or a partial match, the last encountered (*MARK) name is returned. For
|
||||
example, consider this pattern:
|
||||
|
@ -2353,7 +2365,7 @@ different to the value of \fIovector[0]\fP if the pattern contains the \eK
|
|||
escape sequence. After a partial match, however, this value is always the same
|
||||
as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
||||
.P
|
||||
After a UTF check failure, \fBpcre2_get_startchar()\fB can be used to obtain
|
||||
After a UTF check failure, \fBpcre2_get_startchar()\fP can be used to obtain
|
||||
the code unit offset of the invalid UTF character. Details are given in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
|
@ -2901,14 +2913,14 @@ first and last entries in the name-to-number table for the given name, and the
|
|||
function returns the length of each entry in code units. In both cases,
|
||||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||
.P
|
||||
The format of the name table is described above in the section entitled
|
||||
\fIInformation about a pattern\fP
|
||||
The format of the name table is described
|
||||
.\" HTML <a href="#infoaboutpattern">
|
||||
.\" </a>
|
||||
above.
|
||||
above
|
||||
.\"
|
||||
Given all the relevant entries for the name, you can extract each of their
|
||||
numbers, and hence the captured data.
|
||||
in the section entitled \fIInformation about a pattern\fP. Given all the
|
||||
relevant entries for the name, you can extract each of their numbers, and hence
|
||||
the captured data.
|
||||
.
|
||||
.
|
||||
.SH "FINDING ALL POSSIBLE MATCHES AT ONE POSITION"
|
||||
|
@ -3154,6 +3166,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 21 December 2015
|
||||
Last updated: 16 December 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -805,6 +805,10 @@ PATTERN MODIFIERS
|
|||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
These modifiers may not appear in a #pattern command. If you want them
|
||||
as defaults, set them in a #subject command.
|
||||
|
@ -886,6 +890,11 @@ SUBJECT MODIFIERS
|
|||
recursion_limit=<n> set a recursion limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
|
||||
The effects of these modifiers are described in the following sections.
|
||||
|
@ -1011,19 +1020,30 @@ SUBJECT MODIFIERS
|
|||
Testing the substitution function
|
||||
|
||||
If the replace modifier is set, the pcre2_substitute() function is
|
||||
called instead of one of the matching functions. Unlike subject
|
||||
strings, pcre2test does not process replacement strings for escape
|
||||
sequences. In UTF mode, a replacement string is checked to see if it is
|
||||
a valid UTF-8 string. If so, it is correctly converted to a UTF string
|
||||
of the appropriate code unit width. If it is not a valid UTF-8 string,
|
||||
the individual code units are copied directly. This provides a means of
|
||||
passing an invalid UTF-8 string for testing purposes.
|
||||
called instead of one of the matching functions. Note that replacement
|
||||
strings cannot contain commas, because a comma signifies the end of a
|
||||
modifier. This is not thought to be an issue in a test program.
|
||||
|
||||
If the global modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
pcre2_substitute(). After a successful substitution, the modified
|
||||
string is output, preceded by the number of replacements. This may be
|
||||
zero if there were no matches. Here is a simple example of a substitu-
|
||||
tion test:
|
||||
Unlike subject strings, pcre2test does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to
|
||||
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||
a UTF string of the appropriate code unit width. If it is not a valid
|
||||
UTF-8 string, the individual code units are copied directly. This pro-
|
||||
vides a means of passing an invalid UTF-8 string for testing purposes.
|
||||
|
||||
The following modifiers set options (in addition to the normal match
|
||||
options) for pcre2_substitute():
|
||||
|
||||
global PCRE2_SUBSTITUTE_GLOBAL
|
||||
substitute_extended PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
|
||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
|
||||
After a successful substitution, the modified string is output, pre-
|
||||
ceded by the number of replacements. This may be zero if there were no
|
||||
matches. Here is a simple example of a substitution test:
|
||||
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
|
@ -1031,12 +1051,13 @@ SUBJECT MODIFIERS
|
|||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
|
||||
Subject and replacement strings should be kept relatively short for
|
||||
substitution tests, as fixed-size buffers are used. To make it easy to
|
||||
test for buffer overflow, if the replacement string starts with a num-
|
||||
ber in square brackets, that number is passed to pcre2_substitute() as
|
||||
the size of the output buffer, with the replacement string starting at
|
||||
the next character. Here is an example that tests the edge case:
|
||||
Subject and replacement strings should be kept relatively short (fewer
|
||||
than 256 characters) for substitution tests, as fixed-size buffers are
|
||||
used. To make it easy to test for buffer overflow, if the replacement
|
||||
string starts with a number in square brackets, that number is passed
|
||||
to pcre2_substitute() as the size of the output buffer, with the
|
||||
replacement string starting at the next character. Here is an example
|
||||
that tests the edge case:
|
||||
|
||||
/abc/
|
||||
123abc123\=replace=[10]XYZ
|
||||
|
@ -1044,6 +1065,19 @@ SUBJECT MODIFIERS
|
|||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
|
||||
The default action of pcre2_substitute() is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
||||
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
||||
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
||||
through the motions of matching and substituting, in order to compute
|
||||
the size of buffer that is required. When this happens, pcre2test shows
|
||||
the required buffer length (which includes space for the trailing zero)
|
||||
as part of the error message. For example:
|
||||
|
||||
/abc/substitute_overflow_length
|
||||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory: 10 code units are needed
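The figure of 10 code units in this example can be checked by hand. The sketch below does that arithmetic under simplifying assumptions: every match has the same length and the replacement is a literal string. The function `substitute_buffer_units` is hypothetical and is not how pcre2_substitute() computes the value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function.
 * Assumes match_count matches, each match_len code units long, are each
 * replaced by a literal of replacement_len code units.
 * Returns the output size in code units, including the trailing zero. */
size_t substitute_buffer_units(size_t subject_len, size_t match_len,
                               size_t replacement_len, size_t match_count)
{
    assert(match_len * match_count <= subject_len);
    return subject_len - match_len * match_count
         + replacement_len * match_count
         + 1; /* trailing zero */
}
```

With the subject 123abc123 (9 code units), one match of abc and the replacement XYZ, the result is 9 - 3 + 3 + 1 = 10, which is why a buffer of 10 succeeds and a buffer of 9 fails.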
|
||||
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying
|
||||
partial matching provokes an error return ("bad option value") from
|
||||
pcre2_substitute().
|
||||
|
@ -1471,5 +1505,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 05 November 2015
|
||||
Last updated: 12 December 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
|
|
|
@ -44,7 +44,7 @@ POSSIBILITY OF SUCH DAMAGE.
|
|||
#define PCRE2_MAJOR 10
|
||||
#define PCRE2_MINOR 21
|
||||
#define PCRE2_PRERELEASE -RC1
|
||||
#define PCRE2_DATE 2015-07-06
|
||||
#define PCRE2_DATE 2015-12-15
|
||||
|
||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||
imported have to be identified as such. When building PCRE2, the appropriate
|
||||
|
|