More documentation and file tidies.
parent ba1e2e0cbb
commit eb4fffbbf4
|
@ -25,9 +25,10 @@ PCRE2 is the name used for a revised API for the PCRE library, which is a set
|
||||||
of functions, written in C, that implement regular expression pattern matching
|
of functions, written in C, that implement regular expression pattern matching
|
||||||
using the same syntax and semantics as Perl, with just a few differences. Some
|
using the same syntax and semantics as Perl, with just a few differences. Some
|
||||||
features that appeared in Python and the original PCRE before they appeared in
|
features that appeared in Python and the original PCRE before they appeared in
|
||||||
Perl are also available using the Python syntax, there is some support for one
|
Perl are also available using the Python syntax. There is also some support for
|
||||||
or two .NET and Oniguruma syntax items, and there are options for requesting
|
one or two .NET and Oniguruma syntax items, and there are options for
|
||||||
some minor changes that give better ECMAScript (aka JavaScript) compatibility.
|
requesting some minor changes that give better ECMAScript (aka JavaScript)
|
||||||
|
compatibility.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
|
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
|
||||||
|
@ -36,7 +37,7 @@ The original work to extend PCRE to 16-bit and 32-bit code units was done by
|
||||||
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
|
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
|
||||||
can be interpreted either as one character per code unit, or as UTF-encoded
|
can be interpreted either as one character per code unit, or as UTF-encoded
|
||||||
Unicode, with support for Unicode general category properties. Unicode support
|
Unicode, with support for Unicode general category properties. Unicode support
|
||||||
is optional at build time (but is the default); however, processing strings as
|
is optional at build time (but is the default). However, processing strings as
|
||||||
UTF code units must be enabled explicitly at run time. The version of Unicode
|
UTF code units must be enabled explicitly at run time. The version of Unicode
|
||||||
in use can be discovered by running
|
in use can be discovered by running
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -143,17 +144,17 @@ listing), and the short pages for individual functions, are concatenated in
|
||||||
pcre2compat discussion of Perl compatibility
|
pcre2compat discussion of Perl compatibility
|
||||||
pcre2demo a demonstration C program that uses PCRE2
|
pcre2demo a demonstration C program that uses PCRE2
|
||||||
pcre2grep description of the <b>pcre2grep</b> command (8-bit only)
|
pcre2grep description of the <b>pcre2grep</b> command (8-bit only)
|
||||||
pcre2jit discussion of the just-in-time optimization support
|
pcre2jit discussion of just-in-time optimization support
|
||||||
pcre2limits details of size and other limits
|
pcre2limits details of size and other limits
|
||||||
pcre2matching discussion of the two matching algorithms
|
pcre2matching discussion of the two matching algorithms
|
||||||
pcre2partial details of the partial matching facility
|
pcre2partial details of the partial matching facility
|
||||||
pcre2pattern syntax and semantics of supported regular expressions
|
pcre2pattern syntax and semantics of supported regular expression patterns
|
||||||
pcre2perform discussion of performance issues
|
pcre2perform discussion of performance issues
|
||||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||||
pcre2sample discussion of the pcre2demo program
|
pcre2sample discussion of the pcre2demo program
|
||||||
pcre2stack discussion of stack usage
|
pcre2stack discussion of stack usage
|
||||||
pcre2syntax quick syntax reference
|
pcre2syntax quick syntax reference
|
||||||
pcre2test description of the <b>pcre2test</b> testing command
|
pcre2test description of the <b>pcre2test</b> command
|
||||||
pcre2unicode discussion of Unicode and UTF support
|
pcre2unicode discussion of Unicode and UTF support
|
||||||
</pre>
|
</pre>
|
||||||
In the "man" and HTML formats, there is also a short page for each C library
|
In the "man" and HTML formats, there is also a short page for each C library
|
||||||
|
@ -165,7 +166,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -174,7 +175,7 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 03 November 2014
|
Last updated: 18 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -37,16 +37,18 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC22" href="#SEC22">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
|
<li><a name="TOC22" href="#SEC22">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
|
||||||
<li><a name="TOC23" href="#SEC23">NEWLINE HANDLING WHEN MATCHING</a>
|
<li><a name="TOC23" href="#SEC23">NEWLINE HANDLING WHEN MATCHING</a>
|
||||||
<li><a name="TOC24" href="#SEC24">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
|
<li><a name="TOC24" href="#SEC24">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
|
||||||
<li><a name="TOC25" href="#SEC25">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
|
<li><a name="TOC25" href="#SEC25">OTHER INFORMATION ABOUT A MATCH</a>
|
||||||
<li><a name="TOC26" href="#SEC26">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
|
<li><a name="TOC26" href="#SEC26">ERROR RETURNS FROM <b>pcre2_match()</b></a>
|
||||||
<li><a name="TOC27" href="#SEC27">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
|
<li><a name="TOC27" href="#SEC27">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
|
||||||
<li><a name="TOC28" href="#SEC28">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
|
<li><a name="TOC28" href="#SEC28">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
|
||||||
<li><a name="TOC29" href="#SEC29">DUPLICATE SUBPATTERN NAMES</a>
|
<li><a name="TOC29" href="#SEC29">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
|
||||||
<li><a name="TOC30" href="#SEC30">FINDING ALL POSSIBLE MATCHES</a>
|
<li><a name="TOC30" href="#SEC30">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
|
||||||
<li><a name="TOC31" href="#SEC31">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
|
<li><a name="TOC31" href="#SEC31">DUPLICATE SUBPATTERN NAMES</a>
|
||||||
<li><a name="TOC32" href="#SEC32">SEE ALSO</a>
|
<li><a name="TOC32" href="#SEC32">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
|
||||||
<li><a name="TOC33" href="#SEC33">AUTHOR</a>
|
<li><a name="TOC33" href="#SEC33">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
|
||||||
<li><a name="TOC34" href="#SEC34">REVISION</a>
|
<li><a name="TOC34" href="#SEC34">SEE ALSO</a>
|
||||||
|
<li><a name="TOC35" href="#SEC35">AUTHOR</a>
|
||||||
|
<li><a name="TOC36" href="#SEC36">REVISION</a>
|
||||||
</ul>
|
</ul>
|
||||||
<P>
|
<P>
|
||||||
<b>#include <pcre2.h></b>
|
<b>#include <pcre2.h></b>
|
||||||
|
@ -436,13 +438,9 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
|
||||||
<P>
|
<P>
|
||||||
Each of the first three conventions is used by at least one operating system as
|
Each of the first three conventions is used by at least one operating system as
|
||||||
its standard newline sequence. When PCRE2 is built, a default can be specified.
|
its standard newline sequence. When PCRE2 is built, a default can be specified.
|
||||||
The default default is LF, which is the Unix standard. When PCRE2 is run, the
|
The default default is LF, which is the Unix standard. However, the newline
|
||||||
default can be overridden, either when a pattern is compiled, or when it is
|
convention can be changed by an application when calling <b>pcre2_compile()</b>,
|
||||||
matched.
|
or it can be specified by special text at the start of the pattern itself; this
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
The newline convention can be changed when calling <b>pcre2_compile()</b>, or it
|
|
||||||
can be specified by special text at the start of the pattern itself; this
|
|
||||||
overrides any other settings. See the
|
overrides any other settings. See the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
page for details of the special character sequences.
|
page for details of the special character sequences.
|
||||||
|
@ -459,8 +457,8 @@ below.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The choice of newline convention does not affect the interpretation of
|
The choice of newline convention does not affect the interpretation of
|
||||||
the \n or \r escape sequences, nor does it affect what \R matches, which has
|
the \n or \r escape sequences, nor does it affect what \R matches; this has
|
||||||
its own separate control.
|
its own separate convention.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC13" href="#TOC1">MULTITHREADING</a><br>
|
<br><a name="SEC13" href="#TOC1">MULTITHREADING</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -472,7 +470,7 @@ time ensuring that multithreaded applications can use it.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There are several different blocks of data that are used to pass information
|
There are several different blocks of data that are used to pass information
|
||||||
between the application and the PCRE libraries.
|
between the application and the PCRE2 libraries.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
(1) A pointer to the compiled form of a pattern is returned to the user when
|
(1) A pointer to the compiled form of a pattern is returned to the user when
|
||||||
|
@ -572,11 +570,11 @@ The compile context
|
||||||
A compile context is required if you want to change the default values of any
|
A compile context is required if you want to change the default values of any
|
||||||
of the following compile-time parameters:
|
of the following compile-time parameters:
|
||||||
<pre>
|
<pre>
|
||||||
What \R matches (Unicode newlines or CR, LF, CRLF only);
|
What \R matches (Unicode newlines or CR, LF, CRLF only)
|
||||||
PCRE2's character tables;
|
PCRE2's character tables
|
||||||
The newline character sequence;
|
The newline character sequence
|
||||||
The compile time nested parentheses limit;
|
The compile time nested parentheses limit
|
||||||
An external function for stack checking.
|
An external function for stack checking
|
||||||
</pre>
|
</pre>
|
||||||
A compile context is also required if you are using custom memory management.
|
A compile context is also required if you are using custom memory management.
|
||||||
If none of these apply, just pass NULL as the context argument of
|
If none of these apply, just pass NULL as the context argument of
|
||||||
|
@ -604,9 +602,8 @@ PCRE2_ERROR_BADDATA if invalid data is detected.
|
||||||
<br>
|
<br>
|
||||||
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
|
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
|
||||||
or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
|
or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
|
||||||
ending sequence. The value of this parameter does not affect what is compiled;
|
ending sequence. The value is used by the JIT compiler and by the two
|
||||||
it is just saved with the compiled pattern. The value is used by the JIT
|
interpreted matching functions, <i>pcre2_match()</i> and
|
||||||
compiler and by the two interpreted matching functions, <i>pcre2_match()</i> and
|
|
||||||
<i>pcre2_dfa_match()</i>.
|
<i>pcre2_dfa_match()</i>.
|
||||||
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
|
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||||
<b> const unsigned char *<i>tables</i>);</b>
|
<b> const unsigned char *<i>tables</i>);</b>
|
||||||
|
@ -709,12 +706,12 @@ in the subject string. This limit is not relevant to <b>pcre2_dfa_match()</b>,
|
||||||
which ignores it.
|
which ignores it.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When <b>pcre2_match()</b> is called with a pattern that was successfully studied
|
When <b>pcre2_match()</b> is called with a pattern that was successfully
|
||||||
with <b>pcre2_jit_compile()</b>, the way that the matching is executed is
|
processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed
|
||||||
entirely different. However, there is still the possibility of runaway matching
|
is entirely different. However, there is still the possibility of runaway
|
||||||
that goes on for a very long time, and so the <i>match_limit</i> value is also
|
matching that goes on for a very long time, and so the <i>match_limit</i> value
|
||||||
used in this case (but in a different way) to limit how long the matching can
|
is also used in this case (but in a different way) to limit how long the
|
||||||
continue.
|
matching can continue.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The default value for the limit can be set when PCRE2 is built; the default
|
The default value for the limit can be set when PCRE2 is built; the default
|
||||||
|
@ -770,15 +767,17 @@ stack. There is a discussion about PCRE2's stack usage in the
|
||||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
||||||
documentation. See the
|
documentation. See the
|
||||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||||
documentation for details of how to build PCRE2. Using the heap for recursion
|
documentation for details of how to build PCRE2.
|
||||||
is a non-standard way of building PCRE2, for use in environments that have
|
</P>
|
||||||
limited stacks. Because of the greater use of memory management,
|
<P>
|
||||||
<b>pcre2_match()</b> runs more slowly. Functions that are different to the
|
Using the heap for recursion is a non-standard way of building PCRE2, for use
|
||||||
general custom memory functions are provided so that special-purpose external
|
in environments that have limited stacks. Because of the greater use of memory
|
||||||
code can be used for this case, because the memory blocks are all the same
|
management, <b>pcre2_match()</b> runs more slowly. Functions that are different
|
||||||
size. The blocks are retained by <b>pcre2_match()</b> until it is about to exit
|
to the general custom memory functions are provided so that special-purpose
|
||||||
so that they can be re-used when possible during the match. In the absence of
|
external code can be used for this case, because the memory blocks are all the
|
||||||
these functions, the normal custom memory management functions are used, if
|
same size. The blocks are retained by <b>pcre2_match()</b> until it is about to
|
||||||
|
exit so that they can be re-used when possible during the match. In the absence
|
||||||
|
of these functions, the normal custom memory management functions are used, if
|
||||||
supplied, otherwise the system functions.
|
supplied, otherwise the system functions.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC15" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
|
<br><a name="SEC15" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
|
||||||
|
@ -809,9 +808,10 @@ available:
|
||||||
PCRE2_CONFIG_BSR
|
PCRE2_CONFIG_BSR
|
||||||
</pre>
|
</pre>
|
||||||
The output is an integer whose value indicates what character sequences the \R
|
The output is an integer whose value indicates what character sequences the \R
|
||||||
escape sequence matches by default. A value of 0 means that \R matches any
|
escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \R
|
||||||
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF,
|
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
|
||||||
or CRLF. The default can be overridden when a pattern is compiled or matched.
|
that \R matches only CR, LF, or CRLF. The default can be overridden when a
|
||||||
|
pattern is compiled.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_CONFIG_JIT
|
PCRE2_CONFIG_JIT
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -821,7 +821,7 @@ compiling is available; otherwise it is set to zero.
|
||||||
PCRE2_CONFIG_JITTARGET
|
PCRE2_CONFIG_JITTARGET
|
||||||
</pre>
|
</pre>
|
||||||
The <i>where</i> argument should point to a buffer that is at least 48 code
|
The <i>where</i> argument should point to a buffer that is at least 48 code
|
||||||
units long. (The exact length needed can be found by calling
|
units long. (The exact length required can be found by calling
|
||||||
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a
|
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a
|
||||||
string that contains the name of the architecture for which the JIT compiler is
|
string that contains the name of the architecture for which the JIT compiler is
|
||||||
configured, for example "x86 32bit (little endian + unaligned)". If JIT support
|
configured, for example "x86 32bit (little endian + unaligned)". If JIT support
|
||||||
|
@ -855,11 +855,11 @@ Further details are given with <b>pcre2_match()</b> below.
|
||||||
The output is an integer whose value specifies the default character sequence
|
The output is an integer whose value specifies the default character sequence
|
||||||
that is recognized as meaning "newline". The values are:
|
that is recognized as meaning "newline". The values are:
|
||||||
<pre>
|
<pre>
|
||||||
1 Carriage return (CR)
|
PCRE2_NEWLINE_CR Carriage return (CR)
|
||||||
2 Linefeed (LF)
|
PCRE2_NEWLINE_LF Linefeed (LF)
|
||||||
3 Carriage return, linefeed (CRLF)
|
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
||||||
4 Any Unicode line ending
|
PCRE2_NEWLINE_ANY Any Unicode line ending
|
||||||
5 Any of CR, LF, or CRLF
|
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
||||||
</pre>
|
</pre>
|
||||||
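The five newline settings listed above can be illustrated with a small stand-alone sketch. It does not call PCRE2 and does not reproduce the library's internal code; the enum `newline_convention` and the function `newline_length_at()` are hypothetical names invented for this illustration. The "any Unicode line ending" setting is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for four of the conventions in the table above.
   These are NOT the PCRE2_NEWLINE_xxx macros from pcre2.h. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } newline_convention;

/* Return the length in bytes of the newline that starts at subject[pos]
   under the given convention, or 0 if no newline starts there. */
static size_t newline_length_at(const char *subject, size_t length,
                                size_t pos, newline_convention nl)
{
    assert(subject != NULL);
    if (pos >= length) return 0;

    char c = subject[pos];
    /* True when a CR at pos is immediately followed by an LF. */
    int crlf = (c == '\r' && pos + 1 < length && subject[pos + 1] == '\n');

    switch (nl) {
    case NL_CR:      return c == '\r' ? 1 : 0;
    case NL_LF:      return c == '\n' ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: /* CRLF counts as one newline, else a lone CR or LF */
        if (crlf) return 2;
        return (c == '\r' || c == '\n') ? 1 : 0;
    }
    return 0;
}
```

With the subject "a\r\nb", the two-character sequence at offset 1 is a newline under the CRLF and ANYCRLF settings, whereas under the LF setting only the character at offset 2 is a newline.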
The default should normally correspond to the standard sequence for your
|
The default should normally correspond to the standard sequence for your
|
||||||
operating system.
|
operating system.
|
||||||
|
@ -891,7 +891,7 @@ heap instead of recursive function calls.
|
||||||
PCRE2_CONFIG_UNICODE_VERSION
|
PCRE2_CONFIG_UNICODE_VERSION
|
||||||
</pre>
|
</pre>
|
||||||
The <i>where</i> argument should point to a buffer that is at least 24 code
|
The <i>where</i> argument should point to a buffer that is at least 24 code
|
||||||
units long. (The exact length needed can be found by calling
|
units long. (The exact length required can be found by calling
|
||||||
<b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled
|
<b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled
|
||||||
without Unicode support, the buffer is filled with the text "Unicode not
|
without Unicode support, the buffer is filled with the text "Unicode not
|
||||||
supported". Otherwise, the Unicode version string (for example, "7.0.0") is
|
supported". Otherwise, the Unicode version string (for example, "7.0.0") is
|
||||||
|
@ -906,7 +906,7 @@ otherwise it is set to zero. Unicode support implies UTF support.
|
||||||
PCRE2_CONFIG_VERSION
|
PCRE2_CONFIG_VERSION
|
||||||
</pre>
|
</pre>
|
||||||
The <i>where</i> argument should point to a buffer that is at least 12 code
|
The <i>where</i> argument should point to a buffer that is at least 12 code
|
||||||
units long. (The exact length needed can be found by calling
|
units long. (The exact length required can be found by calling
|
||||||
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with
|
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with
|
||||||
the PCRE2 version string, zero-terminated. The number of code units used is
|
the PCRE2 version string, zero-terminated. The number of code units used is
|
||||||
returned. This is the length of the string plus one unit for the terminating
|
returned. This is the length of the string plus one unit for the terminating
|
||||||
|
@ -922,17 +922,17 @@ zero.
|
||||||
<b>pcre2_code_free(pcre2_code *<i>code</i>);</b>
|
<b>pcre2_code_free(pcre2_code *<i>code</i>);</b>
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
This function compiles a pattern, defined by a pointer to a string of code
|
The <b>pcre2_compile()</b> function compiles a pattern into an internal form.
|
||||||
units and a length, into an internal form. If the pattern is zero-terminated,
|
The pattern is defined by a pointer to a string of code units and a length. If
|
||||||
the length should be specified as PCRE2_ZERO_TERMINATED. The function returns a
|
the pattern is zero-terminated, the length can be specified as
|
||||||
pointer to a block of memory that contains the compiled pattern and related
|
PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
|
||||||
data. The caller must free the memory by calling <b>pcre2_code_free()</b> when
|
contains the compiled pattern and related data. The caller must free the memory
|
||||||
it is no longer needed.
|
by calling <b>pcre2_code_free()</b> when it is no longer needed.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the compile context argument <i>ccontext</i> is NULL, the memory is obtained
|
If the compile context argument <i>ccontext</i> is NULL, memory for the compiled
|
||||||
by calling <b>malloc()</b>. Otherwise, it is obtained from the same memory
|
pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from
|
||||||
function that was used for the compile context.
|
the same memory function that was used for the compile context.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <i>options</i> argument contains various bit settings that affect the
|
The <i>options</i> argument contains various bit settings that affect the
|
||||||
|
@ -1247,7 +1247,7 @@ classify characters. More details are given in the section on
|
||||||
in the
|
in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
||||||
longer. The option is available only if PCRE2 has been compiled with UTF
|
longer. The option is available only if PCRE2 has been compiled with Unicode
|
||||||
support.
|
support.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_UNGREEDY
|
PCRE2_UNGREEDY
|
||||||
|
@ -1260,9 +1260,10 @@ with Perl. It can also be set by a (?U) option setting within the pattern.
|
||||||
</pre>
|
</pre>
|
||||||
This option causes PCRE2 to regard both the pattern and the subject strings
|
This option causes PCRE2 to regard both the pattern and the subject strings
|
||||||
that are subsequently processed as strings of UTF characters instead of
|
that are subsequently processed as strings of UTF characters instead of
|
||||||
single-code-unit strings. However, it is available only when PCRE2 is built to
|
single-code-unit strings. It is available when PCRE2 is built to include
|
||||||
include UTF support. If not, the use of this option provokes an error. Details
|
Unicode support (which is the default). If Unicode support is not available,
|
||||||
of how this option changes the behaviour of PCRE2 are given in the
|
the use of this option provokes an error. Details of how this option changes
|
||||||
|
the behaviour of PCRE2 are given in the
|
||||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||||
page.
|
page.
|
||||||
</P>
|
</P>
|
||||||
|
@ -1318,13 +1319,12 @@ Most, but not all patterns can be optimized by the JIT compiler.
|
||||||
<P>
|
<P>
|
||||||
PCRE2 handles caseless matching, and determines whether characters are letters,
|
PCRE2 handles caseless matching, and determines whether characters are letters,
|
||||||
digits, or whatever, by reference to a set of tables, indexed by character code
|
digits, or whatever, by reference to a set of tables, indexed by character code
|
||||||
point. When running in UTF-8 mode, or using the 16-bit or 32-bit libraries,
|
point. This applies only to characters whose code points are less than 256. By
|
||||||
this applies only to characters with code points less than 256. By default,
|
default, higher-valued code points never match escapes such as \w or \d.
|
||||||
higher-valued code points never match escapes such as \w or \d. However, if
|
However, if PCRE2 is built with UTF support, all characters can be tested with
|
||||||
PCRE2 is built with UTF support, all characters can be tested with \p and \P,
|
\p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern
|
||||||
or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled;
|
is compiled; this causes \w and friends to use Unicode property support
|
||||||
this causes \w and friends to use Unicode property support instead of the
|
instead of the built-in tables.
|
||||||
built-in tables.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The use of locales with Unicode is discouraged. If you are handling characters
|
The use of locales with Unicode is discouraged. If you are handling characters
|
||||||
|
@ -1437,9 +1437,9 @@ are no back references.
|
||||||
PCRE2_INFO_BSR
|
PCRE2_INFO_BSR
|
||||||
</pre>
|
</pre>
|
||||||
The output is a uint32_t whose value indicates what character sequences the \R
|
The output is a uint32_t whose value indicates what character sequences the \R
|
||||||
escape sequence matches by default. A value of 0 means that \R matches any
|
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
|
||||||
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF,
|
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
|
||||||
or CRLF. The default can be overridden when a pattern is matched.
|
matches only CR, LF, or CRLF.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_CAPTURECOUNT
|
PCRE2_INFO_CAPTURECOUNT
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1581,15 +1581,18 @@ values.
|
||||||
<P>
|
<P>
|
||||||
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
|
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
|
||||||
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
|
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
|
||||||
entry; both of these return a <b>uint32_t</b> value. The entry size depends on
|
entry in code units; both of these return a <b>uint32_t</b> value. The entry
|
||||||
the length of the longest name. PCRE2_INFO_NAMETABLE returns a pointer to the
|
size depends on the length of the longest name.
|
||||||
first entry of the table. This is a PCRE2_SPTR pointer to a block of code
|
</P>
|
||||||
units. In the 8-bit library, the first two bytes of each entry are the number
|
<P>
|
||||||
of the capturing parenthesis, most significant byte first. In the 16-bit
|
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
|
||||||
library, the pointer points to 16-bit data units, the first of which contains
|
a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
|
||||||
the parenthesis number. In the 32-bit library, the pointer points to 32-bit
|
two bytes of each entry are the number of the capturing parenthesis, most
|
||||||
data units, the first of which contains the parenthesis number. The rest of the
|
significant byte first. In the 16-bit library, the pointer points to 16-bit
|
||||||
entry is the corresponding name, zero terminated.
|
code units, the first of which contains the parenthesis number. In the 32-bit
|
||||||
|
library, the pointer points to 32-bit code units, the first of which contains
|
||||||
|
the parenthesis number. The rest of the entry is the corresponding name, zero
|
||||||
|
terminated.
|
||||||
</P>
|
</P>
|
||||||
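The 8-bit entry layout described above (two bytes of group number, most significant byte first, followed by the zero-terminated name) can be decoded as in the following sketch. It is a minimal illustration that works on a plain byte buffer and does not call PCRE2. The type `name_entry` and the functions `decode_name_entry()` and `find_group_by_name()` are hypothetical helpers, not part of the library. In a real program the table pointer, entry size, and entry count would come from PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMEENTRYSIZE, and PCRE2_INFO_NAMECOUNT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: one decoded name table entry. */
typedef struct {
    uint32_t group_number;  /* number of the capturing parenthesis */
    const char *name;       /* zero-terminated name inside the table */
} name_entry;

/* Decode entry `index` of an 8-bit name table whose entries are each
   `entry_size` code units (bytes) long. */
static name_entry decode_name_entry(const unsigned char *table,
                                    uint32_t entry_size, uint32_t index)
{
    /* An entry needs the two number bytes plus at least a terminator. */
    assert(table != NULL && entry_size > 2);

    const unsigned char *entry = table + (size_t)index * entry_size;
    name_entry result;
    /* Most significant byte first. */
    result.group_number = ((uint32_t)entry[0] << 8) | entry[1];
    result.name = (const char *)(entry + 2);
    return result;
}

/* Linear search for `name`. Returns its group number, or 0 if the name
   is not in the table (named groups are numbered from 1). */
static uint32_t find_group_by_name(const unsigned char *table,
                                   uint32_t entry_size, uint32_t count,
                                   const char *name)
{
    for (uint32_t i = 0; i < count; i++) {
        name_entry e = decode_name_entry(table, entry_size, i);
        if (strcmp(e.name, name) == 0) return e.group_number;
    }
    return 0;
}
```

Because every entry has the same size, entry `i` starts at offset `i * entry_size`, and names shorter than the longest one are followed by unused space after their terminating zero.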
<P>
|
<P>
|
||||||
The names are in alphabetical order. If (?| is used to create multiple groups
|
The names are in alphabetical order. If (?| is used to create multiple groups
|
||||||
|
@ -1629,17 +1632,16 @@ different for each compiled pattern.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_NEWLINE
|
PCRE2_INFO_NEWLINE
|
||||||
</pre>
|
</pre>
|
||||||
The output is a <b>uint32_t</b> whose value specifies the default character
|
The output is a <b>uint32_t</b> with one of the following values:
|
||||||
sequence that will be recognized as meaning "newline" while matching. The
|
|
||||||
values are:
|
|
||||||
<pre>
|
<pre>
|
||||||
1 Carriage return (CR)
|
PCRE2_NEWLINE_CR Carriage return (CR)
|
||||||
2 Linefeed (LF)
|
PCRE2_NEWLINE_LF Linefeed (LF)
|
||||||
3 Carriage return, linefeed (CRLF)
|
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
||||||
4 Any Unicode line ending
|
PCRE2_NEWLINE_ANY Any Unicode line ending
|
||||||
5 Any of CR, LF, or CRLF
|
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
||||||
</pre>
|
</pre>
|
||||||
The default can be overridden when a pattern is matched.
|
This specifies the default character sequence that will be recognized as
|
||||||
|
meaning "newline" while matching.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_RECURSIONLIMIT
|
PCRE2_INFO_RECURSIONLIMIT
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1675,18 +1677,19 @@ Information about successful and unsuccessful matches is placed in a match
|
||||||
data block, which is an opaque structure that is accessed by function calls. In
|
data block, which is an opaque structure that is accessed by function calls. In
|
||||||
particular, the match data block contains a vector of offsets into the subject
|
particular, the match data block contains a vector of offsets into the subject
|
||||||
string that define the matched part of the subject and any substrings that were
|
string that define the matched part of the subject and any substrings that were
|
||||||
capured. This is know as the <i>ovector</i>.
|
captured. This is known as the <i>ovector</i>.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Before calling <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> you must create a
|
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
|
||||||
match data block by calling one of the creation functions above. For
|
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
|
||||||
<b>pcre2_match_data_create()</b>, the first argument is the number of pairs of
|
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
|
||||||
offsets in the <i>ovector</i>. One pair of offsets is required to identify the
|
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
|
||||||
string that matched the whole pattern, with another pair for each captured
|
offsets is required to identify the string that matched the whole pattern, with
|
||||||
substring. For example, a value of 4 creates enough space to record the matched
|
another pair for each captured substring. For example, a value of 4 creates
|
||||||
portion of the subject plus three captured substrings. A minimum of at least 1
|
enough space to record the matched portion of the subject plus three captured
|
||||||
pair is imposed by <b>pcre2_match_data_create()</b>, so it is always possible to
|
substrings. A minimum of 1 pair is imposed by
|
||||||
return the overall matched string.
|
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall
|
||||||
|
matched string.
|
||||||
</P>
|
</P>
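The pair arithmetic described above can be sketched in plain C. This is an illustration only and does not call PCRE2: <code>offset_t</code> stands in for PCRE2_SIZE, and the helper names <code>pairs_for_captures()</code>, <code>ovector_elements()</code>, and <code>substring_length()</code> are hypothetical, not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: size_t stands in for PCRE2_SIZE (normally from pcre2.h). */
typedef size_t offset_t;

/* Pairs needed for the overall match plus `captures` captured substrings. */
size_t pairs_for_captures(size_t captures)
{
    return captures + 1;
}

/* Number of offsets held by an ovector created for `pairs` pairs.
   A request for 0 pairs is raised to 1, mirroring the minimum that the
   text says pcre2_match_data_create() imposes. */
size_t ovector_elements(size_t pairs)
{
    if (pairs < 1) pairs = 1;
    return 2 * pairs;
}

/* Length in code units of substring n (0 is the overall match), read from
   the pair ovector[2n], ovector[2n+1]. Assumes that pair has been set. */
size_t substring_length(const offset_t *ovector, size_t n)
{
    assert(ovector != NULL);
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

With these helpers, three capturing groups need <code>pairs_for_captures(3)</code>, that is 4 pairs, which corresponds to an ovector of 8 offsets.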
|
||||||
<P>
|
<P>
|
||||||
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
|
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
|
||||||
|
@ -1694,15 +1697,16 @@ pointer to a compiled pattern. In this case the ovector is created to be
|
||||||
exactly the right size to hold all the substrings a pattern might capture.
|
exactly the right size to hold all the substrings a pattern might capture.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The second argument of both these functions ia a pointer to a general context,
|
The second argument of both these functions is a pointer to a general context,
|
||||||
which can specify custom memory management for obtaining the memory for the
|
which can specify custom memory management for obtaining the memory for the
|
||||||
match data block. If you are not using custom memory management, pass NULL.
|
match data block. If you are not using custom memory management, pass NULL.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A match data block can be used many times, with the same or different compiled
|
A match data block can be used many times, with the same or different compiled
|
||||||
patterns. When it is no longer needed, it should be freed by calling
|
patterns. When it is no longer needed, it should be freed by calling
|
||||||
<b>pcre2_match_data_free()</b>. How to extract information from a match data
|
<b>pcre2_match_data_free()</b>. You can extract information from a match data
|
||||||
block after a match operation is described in the sections on
|
block after a match operation has finished, using functions that are described
|
||||||
|
in the sections on
|
||||||
<a href="#matchedstrings">matched strings</a>
|
<a href="#matchedstrings">matched strings</a>
|
||||||
and
|
and
|
||||||
<a href="#matchotherdata">other match data</a>
|
<a href="#matchotherdata">other match data</a>
|
||||||
|
@ -1816,12 +1820,10 @@ PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
|
||||||
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
|
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the pattern was successfully processed by the just-in-time (JIT) compiler,
|
Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
|
||||||
the only supported options for matching using the JIT code are PCRE2_NOTBOL,
|
compiler. If it is set, JIT matching is disabled and the normal interpretive
|
||||||
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
|
code in <b>pcre2_match()</b> is run. The remaining options are supported for JIT
|
||||||
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an unsupported option is used,
|
matching.
|
||||||
JIT matching is disabled and the normal interpretive code in
|
|
||||||
<b>pcre2_match()</b> is run.
|
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ANCHORED
|
PCRE2_ANCHORED
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1835,17 +1837,18 @@ matching.
|
||||||
</pre>
|
</pre>
|
||||||
This option specifies that the first character of the subject string is not the
|
This option specifies that the first character of the subject string is not the
|
||||||
beginning of a line, so the circumflex metacharacter should not match before
|
beginning of a line, so the circumflex metacharacter should not match before
|
||||||
it. Setting this without PCRE2_MULTILINE (at compile time) causes circumflex
|
it. Setting this without having set PCRE2_MULTILINE at compile time causes
|
||||||
never to match. This option affects only the behaviour of the circumflex
|
circumflex never to match. This option affects only the behaviour of the
|
||||||
metacharacter. It does not affect \A.
|
circumflex metacharacter. It does not affect \A.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_NOTEOL
|
PCRE2_NOTEOL
|
||||||
</pre>
|
</pre>
|
||||||
This option specifies that the end of the subject string is not the end of a
|
This option specifies that the end of the subject string is not the end of a
|
||||||
line, so the dollar metacharacter should not match it nor (except in multiline
|
line, so the dollar metacharacter should not match it nor (except in multiline
|
||||||
mode) a newline immediately before it. Setting this without PCRE2_MULTILINE (at
|
mode) a newline immediately before it. Setting this without having set
|
||||||
compile time) causes dollar never to match. This option affects only the
|
PCRE2_MULTILINE at compile time causes dollar never to match. This option
|
||||||
behaviour of the dollar metacharacter. It does not affect \Z or \z.
|
affects only the behaviour of the dollar metacharacter. It does not affect \Z
|
||||||
|
or \z.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_NOTEMPTY
|
PCRE2_NOTEMPTY
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1857,13 +1860,16 @@ match the empty string, the entire match fails. For example, if the pattern
|
||||||
</pre>
|
</pre>
|
||||||
is applied to a string not beginning with "a" or "b", it matches an empty
|
is applied to a string not beginning with "a" or "b", it matches an empty
|
||||||
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
|
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
|
||||||
valid, so PCRE2 searches further into the string for occurrences of "a" or "b".
|
valid, so <b>pcre2_match()</b> searches further into the string for occurrences
|
||||||
|
of "a" or "b".
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_NOTEMPTY_ATSTART
|
PCRE2_NOTEMPTY_ATSTART
|
||||||
</pre>
|
</pre>
|
||||||
This is like PCRE2_NOTEMPTY, except that an empty string match that is not at
|
This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
|
||||||
the start of the subject is permitted. If the pattern is anchored, such a match
|
only at the first matching position, that is, at the start of the subject plus
|
||||||
can occur only if the pattern contains \K.
|
the starting offset. An empty string match later in the subject is permitted.
|
||||||
|
If the pattern is anchored, such a match can occur only if the pattern contains
|
||||||
|
\K.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_NO_UTF_CHECK
|
PCRE2_NO_UTF_CHECK
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1904,8 +1910,8 @@ subject characters to complete the match. If this happens when
|
||||||
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
|
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
|
||||||
testing any remaining alternatives. Only if no complete match can be found is
|
testing any remaining alternatives. Only if no complete match can be found is
|
||||||
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
|
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
|
||||||
PCRE2_PARTIAL_SOFT says that the caller is prepared to handle a partial match,
|
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
|
||||||
but only if no complete match can be found.
|
match, but only if no complete match can be found.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
|
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
|
||||||
|
@ -1928,14 +1934,14 @@ a
|
||||||
<a href="#compilecontext">compile context.</a>
|
<a href="#compilecontext">compile context.</a>
|
||||||
During matching, the newline choice affects the behaviour of the dot,
|
During matching, the newline choice affects the behaviour of the dot,
|
||||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||||
position is advanced after a match failure for an unanchored pattern.
|
starting position is advanced after a match failure for an unanchored pattern.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set,
|
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
|
||||||
and a match attempt for an unanchored pattern fails when the current position
|
the newline convention, and a match attempt for an unanchored pattern fails
|
||||||
is at a CRLF sequence, and the pattern contains no explicit matches for CR or
|
when the current starting position is at a CRLF sequence, and the pattern
|
||||||
LF characters, the match position is advanced by two characters instead of one,
|
contains no explicit matches for CR or LF characters, the match position is
|
||||||
in other words, to after the CRLF.
|
advanced by two characters instead of one, in other words, to after the CRLF.
|
||||||
</P>
|
</P>
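The advance rule in the preceding paragraph can be expressed as a small decision function. This is a simplified sketch, not PCRE2's internal code: it assumes a non-UTF subject with one character per code unit, and the function name and both boolean parameters are hypothetical stand-ins for state that the library tracks itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the next starting offset to try after a match attempt for an
   unanchored pattern has failed at `pos`.
   newline_recognizes_crlf: the newline convention is CRLF, ANYCRLF, or ANY.
   pattern_has_explicit_cr_or_lf: the pattern contains a literal CR or LF,
   or a \r or \n escape. */
size_t next_start_offset(const char *subject, size_t length, size_t pos,
                         bool newline_recognizes_crlf,
                         bool pattern_has_explicit_cr_or_lf)
{
    assert(subject != NULL && pos < length);

    if (newline_recognizes_crlf && !pattern_has_explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF sequence */

    return pos + 1;       /* ordinary one-character advance */
}
```

For the subject "a\r\nb" and a failure at offset 1, the sketch returns 3 when the CRLF rule applies and 2 otherwise.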
|
||||||
<P>
|
<P>
|
||||||
The above rule is a compromise that makes the most common cases work as
|
The above rule is a compromise that makes the most common cases work as
|
||||||
|
@ -1948,8 +1954,8 @@ reference, and so advances only by one character after the first failure.
|
||||||
<P>
|
<P>
|
||||||
An explicit match for CR or LF is either a literal appearance of one of those
|
An explicit match for CR or LF is either a literal appearance of one of those
|
||||||
characters in the pattern, or one of the \r or \n escape sequences. Implicit
|
characters in the pattern, or one of the \r or \n escape sequences. Implicit
|
||||||
matches such as [^X] do not count, nor does \s (which includes CR and LF in
|
matches such as [^X] do not count, nor does \s, even though it includes CR and
|
||||||
the characters that it matches).
|
LF in the characters that it matches.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Notwithstanding the above, anomalous effects may still occur when CRLF is a
|
Notwithstanding the above, anomalous effects may still occur when CRLF is a
|
||||||
|
@ -1967,16 +1973,16 @@ In general, a pattern matches a certain portion of the subject, and in
|
||||||
addition, further substrings from the subject may be picked out by
|
addition, further substrings from the subject may be picked out by
|
||||||
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
|
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
|
||||||
book, this is called "capturing" in what follows, and the phrase "capturing
|
book, this is called "capturing" in what follows, and the phrase "capturing
|
||||||
subpattern" is used for a fragment of a pattern that picks out a substring.
|
subpattern" or "capturing group" is used for a fragment of a pattern that picks
|
||||||
PCRE2 supports several other kinds of parenthesized subpattern that do not
|
out a substring. PCRE2 supports several other kinds of parenthesized subpattern
|
||||||
cause substrings to be captured. The <b>pcre2_pattern_info()</b> function can be
|
that do not cause substrings to be captured. The <b>pcre2_pattern_info()</b>
|
||||||
used to find out how many capturing subpatterns there are in a compiled
|
function can be used to find out how many capturing subpatterns there are in a
|
||||||
pattern.
|
compiled pattern.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The overall matched string and any captured substrings are returned to the
|
The overall matched string and any captured substrings are returned to the
|
||||||
caller via a vector of PCRE2_SIZE values, called the <b>ovector</b>. This is
|
caller via a vector of PCRE2_SIZE values. This is called the <b>ovector</b>, and
|
||||||
contained within the
|
is contained within the
|
||||||
<a href="#matchdatablock">match data block.</a>
|
<a href="#matchdatablock">match data block.</a>
|
||||||
You can obtain direct access to the ovector by calling
|
You can obtain direct access to the ovector by calling
|
||||||
<b>pcre2_get_ovector_pointer()</b> to find its address, and
|
<b>pcre2_get_ovector_pointer()</b> to find its address, and
|
||||||
|
@ -2045,9 +2051,7 @@ parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
|
||||||
<b>pcre2_match()</b>. The other elements retain whatever values they previously
|
<b>pcre2_match()</b>. The other elements retain whatever values they previously
|
||||||
had.
|
had.
|
||||||
<a name="matchotherdata"></a></P>
|
<a name="matchotherdata"></a></P>
|
||||||
<br><b>
|
<br><a name="SEC25" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
|
||||||
Other information about the match
|
|
||||||
</b><br>
|
|
||||||
<P>
|
<P>
|
||||||
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
|
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
|
||||||
<br>
|
<br>
|
||||||
|
@ -2055,7 +2059,7 @@ Other information about the match
|
||||||
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
|
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In addition to the offsets in the ovector, other information about a match is
|
As well as the offsets in the ovector, other information about a match is
|
||||||
retained in the match data block and can be retrieved by the above functions.
|
retained in the match data block and can be retrieved by the above functions.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2071,9 +2075,7 @@ different to the value of <i>ovector[0]</i> if the pattern contains the \K
|
||||||
escape sequence. After a partial match, however, this value is always the same
|
escape sequence. After a partial match, however, this value is always the same
|
||||||
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
|
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
|
||||||
<a name="errorlist"></a></P>
|
<a name="errorlist"></a></P>
|
||||||
<br><b>
|
<br><a name="SEC26" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
|
||||||
Error return values from <b>pcre2_match()</b>
|
|
||||||
</b><br>
|
|
||||||
<P>
|
<P>
|
||||||
If <b>pcre2_match()</b> fails, it returns a negative number. This can be
|
If <b>pcre2_match()</b> fails, it returns a negative number. This can be
|
||||||
converted to a text string by calling <b>pcre2_get_error_message()</b>. Negative
|
converted to a text string by calling <b>pcre2_get_error_message()</b>. Negative
|
||||||
|
@ -2108,7 +2110,7 @@ passed to a 16-bit or 32-bit library function, or vice versa.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ERROR_BADOFFSET
|
PCRE2_ERROR_BADOFFSET
|
||||||
</pre>
|
</pre>
|
||||||
The value of <i>startoffset</i> greater than the length of the subject.
|
The value of <i>startoffset</i> was greater than the length of the subject.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ERROR_BADOPTION
|
PCRE2_ERROR_BADOPTION
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -2175,14 +2177,14 @@ the pattern. Specifically, it means that either the whole pattern or a
|
||||||
subpattern has been called recursively for the second time at the same position
|
subpattern has been called recursively for the second time at the same position
|
||||||
in the subject string. Some simple patterns that might do this are detected and
|
in the subject string. Some simple patterns that might do this are detected and
|
||||||
faulted at compile time, but more complicated cases, in particular mutual
|
faulted at compile time, but more complicated cases, in particular mutual
|
||||||
recursions between two different subpatterns, cannot be detected until run
|
recursions between two different subpatterns, cannot be detected until matching
|
||||||
time.
|
is attempted.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ERROR_RECURSIONLIMIT
|
PCRE2_ERROR_RECURSIONLIMIT
|
||||||
</pre>
|
</pre>
|
||||||
The internal recursion limit was reached.
|
The internal recursion limit was reached.
|
||||||
<a name="extractbynumber"></a></P>
|
<a name="extractbynumber"></a></P>
|
||||||
<br><a name="SEC25" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
|
<br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||||
<b> unsigned int <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
|
<b> unsigned int <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
|
||||||
|
@ -2228,8 +2230,8 @@ extract the captured substrings.
|
||||||
<P>
|
<P>
|
||||||
The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
|
The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
|
||||||
the buffer and a pointer to a variable that contains its length in code units.
|
the buffer and a pointer to a variable that contains its length in code units.
|
||||||
This is updated to contain the actual number of code units used, excluding the
|
This is updated to contain the actual number of code units used for the
|
||||||
terminating zero.
|
extracted substring, excluding the terminating zero.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
|
For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
|
||||||
|
@ -2254,7 +2256,7 @@ no capturing group of that number in the pattern, or because the group with
|
||||||
that number did not participate in the match, or because the ovector was too
|
that number did not participate in the match, or because the ovector was too
|
||||||
small to capture that group.
|
small to capture that group.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC26" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
|
<br><a name="SEC28" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
|
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
|
||||||
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
|
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
|
||||||
|
@ -2264,10 +2266,11 @@ small to capture that group.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <b>pcre2_substring_list_get()</b> function extracts all available substrings
|
The <b>pcre2_substring_list_get()</b> function extracts all available substrings
|
||||||
and builds a list of pointers to them, and a second list that contains their
|
and builds a list of pointers to them. It also (optionally) builds a second
|
||||||
lengths (in code units), excluding a terminating zero that is added to each of
|
list that contains their lengths (in code units), excluding a terminating zero
|
||||||
them. All this is done in a single block of memory that is obtained using the
|
that is added to each of them. All this is done in a single block of memory
|
||||||
same memory allocation function that was used to get the match data block.
|
that is obtained using the same memory allocation function that was used to get
|
||||||
|
the match data block.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The address of the memory block is returned via <i>listptr</i>, which is also
|
The address of the memory block is returned via <i>listptr</i>, which is also
|
||||||
|
@ -2285,10 +2288,10 @@ If this function encounters a substring that is unset, which can happen when
|
||||||
capturing subpattern number <i>n+1</i> matches some part of the subject, but
|
capturing subpattern number <i>n+1</i> matches some part of the subject, but
|
||||||
subpattern <i>n</i> has not been used at all, it returns an empty string. This
|
subpattern <i>n</i> has not been used at all, it returns an empty string. This
|
||||||
can be distinguished from a genuine zero-length substring by inspecting the
|
can be distinguished from a genuine zero-length substring by inspecting the
|
||||||
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
|
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
|
||||||
substrings.
|
substrings.
|
||||||
<a name="extractbyname"></a></P>
|
<a name="extractbyname"></a></P>
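The distinction between an unset group and a genuine zero-length match can be checked directly on the offsets. The sketch below is standalone and uses assumptions: <code>size_t</code> plays the role of PCRE2_SIZE, <code>SKETCH_UNSET</code> is a local stand-in for PCRE2_UNSET (the all-ones value of that type), and the two function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for PCRE2_UNSET; with pcre2.h included you would use the
   real macro instead. */
#define SKETCH_UNSET (~(size_t)0)

/* True if group n did not participate in the match. */
bool group_is_unset(const size_t *ovector, size_t n)
{
    assert(ovector != NULL);
    return ovector[2 * n] == SKETCH_UNSET;
}

/* True if group n matched, but matched an empty string. */
bool group_is_empty_match(const size_t *ovector, size_t n)
{
    return !group_is_unset(ovector, n) &&
           ovector[2 * n] == ovector[2 * n + 1];
}
```

For example, after matching (a)|(b) against "b", group 1 is unset while group 2 spans offsets 0 to 1, so <code>group_is_unset()</code> is true for group 1 and false for group 2.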
|
||||||
<br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
|
<br><a name="SEC29" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
|
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
|
||||||
<b> PCRE2_SPTR <i>name</i>);</b>
|
<b> PCRE2_SPTR <i>name</i>);</b>
|
||||||
|
@ -2324,11 +2327,10 @@ that name.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Given the number, you can extract the substring directly, or use one of the
|
Given the number, you can extract the substring directly, or use one of the
|
||||||
functions described in the previous section. For convenience, there are also
|
functions described above. For convenience, there are also "byname" functions
|
||||||
"byname" functions that correspond to the "bynumber" functions, the only
|
that correspond to the "bynumber" functions, the only difference being that the
|
||||||
difference being that the second argument is a name instead of a number.
|
second argument is a name instead of a number. However, if PCRE2_DUPNAMES is
|
||||||
However, if PCRE2_DUPNAMES is set and there are duplicate names,
|
set and there are duplicate names, the behaviour may not be what you want.
|
||||||
the behaviour may not be what you want (see the next section).
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
|
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
|
||||||
|
@ -2341,7 +2343,7 @@ names are not included in the compiled code. The matching process uses only
|
||||||
numbers. For this reason, the use of different names for subpatterns of the
|
numbers. For this reason, the use of different names for subpatterns of the
|
||||||
same number causes an error at compile time.
|
same number causes an error at compile time.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC28" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
|
<br><a name="SEC30" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||||
|
@ -2368,8 +2370,8 @@ recognized:
|
||||||
Either a group number or a group name can be given for <n>. Curly brackets are
|
Either a group number or a group name can be given for <n>. Curly brackets are
|
||||||
required only if the following character would be interpreted as part of the
|
required only if the following character would be interpreted as part of the
|
||||||
number or name. The number may be zero to include the entire matched string.
|
number or name. The number may be zero to include the entire matched string.
|
||||||
For example, if the pattern a(b)c is matched with "[abc]" and the replacement
|
For example, if the pattern a(b)c is matched with "=abc=" and the replacement
|
||||||
string "+$1$0$1+", the result is "[+babcb+]". Group insertion is done by
|
string "+$1$0$1+", the result is "=+babcb+=". Group insertion is done by
|
||||||
calling <b>pcre2_substring_copy_byname()</b> or <b>pcre2_substring_copy_bynumber()</b> as
|
calling <b>pcre2_substring_copy_byname()</b> or <b>pcre2_substring_copy_bynumber()</b> as
|
||||||
appropriate.
|
appropriate.
|
||||||
</P>
|
</P>
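The worked example above ("=abc=" with replacement "+$1$0$1+") can be reproduced with a small standalone sketch. It does not call <b>pcre2_substitute()</b>: it takes the offsets a successful match would have produced and expands only numeric references of the form $n and ${n}. The function names and the -1 error value are hypothetical, and group names and other replacement syntax are deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Expands $n and ${n} in `replacement` using offset pairs in `ovector`
   (pair 0 is the whole match). Writes a zero-terminated result to `out`.
   Returns 0 on success, -1 on a malformed reference, an out-of-range
   group number, or insufficient space. */
int expand_replacement(const char *subject, const size_t *ovector,
                       size_t pairs, const char *replacement,
                       char *out, size_t outsize)
{
    size_t used = 0;
    assert(subject && ovector && replacement && out);

    for (const char *p = replacement; *p != '\0'; ) {
        if (*p != '$') {                       /* literal character */
            if (used + 1 >= outsize) return -1;
            out[used++] = *p++;
            continue;
        }
        p++;                                   /* skip the dollar sign */
        int braced = (*p == '{');
        if (braced) p++;
        if (!isdigit((unsigned char)*p)) return -1;
        size_t group = 0;
        while (isdigit((unsigned char)*p)) {
            group = group * 10 + (size_t)(*p++ - '0');
            if (group >= pairs) return -1;     /* no such group */
        }
        if (braced && *p++ != '}') return -1;

        size_t start = ovector[2 * group];
        size_t len = ovector[2 * group + 1] - start;
        if (used + len >= outsize) return -1;
        memcpy(out + used, subject + start, len);
        used += len;
    }
    if (used >= outsize) return -1;
    out[used] = '\0';
    return 0;
}

/* Copies the text before the match, the expanded replacement, and the text
   after the match into `out`, for a single match. */
int substitute_once(const char *subject, const size_t *ovector, size_t pairs,
                    const char *replacement, char *out, size_t outsize)
{
    size_t match_start = ovector[0], match_end = ovector[1];
    if (match_start >= outsize) return -1;
    memcpy(out, subject, match_start);
    if (expand_replacement(subject, ovector, pairs, replacement,
                           out + match_start, outsize - match_start) != 0)
        return -1;
    size_t used = strlen(out);
    size_t tail = strlen(subject) - match_end;
    if (used + tail >= outsize) return -1;
    memcpy(out + used, subject + match_end, tail + 1);  /* includes the zero */
    return 0;
}
```

Called with the subject "=abc=", the offsets {1, 4, 2, 3} (the overall match "abc" and group 1 "b"), and the replacement "+$1$0$1+", <code>substitute_once()</code> produces "=+babcb+=", matching the result stated in the text.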
|
||||||
|
@ -2402,7 +2404,7 @@ straight back. PCRE2_ERROR_BADREPLACEMENT is returned for an invalid
|
||||||
replacement string (unrecognized sequence following a dollar sign), and
|
replacement string (unrecognized sequence following a dollar sign), and
|
||||||
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough.
|
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC29" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
|
<br><a name="SEC31" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
|
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
|
||||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
|
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
|
||||||
|
@ -2423,19 +2425,21 @@ documentation.
|
||||||
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
|
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
|
||||||
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to
|
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to
|
||||||
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
|
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
|
||||||
returned. The <b>pcre2_substring_number_from_name()</b> function returns one of
|
returned. The <b>pcre2_substring_number_from_name()</b> function returns
|
||||||
the numbers that are associated with the name, but it is not defined which it
|
the error PCRE2_ERROR_NOUNIQUESUBSTRING.
|
||||||
is.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If you want to get full details of all captured substrings for a given name,
|
If you want to get full details of all captured substrings for a given name,
|
||||||
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
|
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
|
||||||
argument is the compiled pattern, and the second is the name. If the third and
|
argument is the compiled pattern, and the second is the name. If the third and
|
||||||
fourth arguments are NULL, the function returns a group number (it is not
|
fourth arguments are NULL, the function returns a group number for a unique
|
||||||
defined which). Otherwise, the third and fourth arguments must be pointers to
|
name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
When the third and fourth arguments are not NULL, they must be pointers to
|
||||||
variables that are updated by the function. After it has run, they point to the
|
variables that are updated by the function. After it has run, they point to the
|
||||||
first and last entries in the name-to-number table for the given name, and the
|
first and last entries in the name-to-number table for the given name, and the
|
||||||
function returns the length of each entry. In both cases,
|
function returns the length of each entry in code units. In both cases,
|
||||||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2445,14 +2449,14 @@ The format of the name table is described above in the section entitled
|
||||||
Given all the relevant entries for the name, you can extract each of their
|
Given all the relevant entries for the name, you can extract each of their
|
||||||
numbers, and hence the captured data.
|
numbers, and hence the captured data.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">FINDING ALL POSSIBLE MATCHES</a><br>
|
<br><a name="SEC32" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
|
||||||
<P>
|
<P>
|
||||||
The traditional matching function uses a similar algorithm to Perl, which stops
|
The traditional matching function uses a similar algorithm to Perl, which stops
|
||||||
when it finds the first match, starting at a given point in the subject. If you
|
when it finds the first match at a given point in the subject. If you want to
|
||||||
want to find all possible matches, or the longest possible match at a given
|
find all possible matches, or the longest possible match at a given position,
|
||||||
position, consider using the alternative matching function (see below) instead.
|
consider using the alternative matching function (see below) instead. If you
|
||||||
If you cannot use the alternative function, you can kludge it up by making use
|
cannot use the alternative function, you can kludge it up by making use of the
|
||||||
of the callout facility, which is described in the
|
callout facility, which is described in the
|
||||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||||
documentation.
|
documentation.
|
||||||
</P>
|
</P>
|
||||||
|
@ -2463,7 +2467,7 @@ substring. Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
|
||||||
other alternatives. Ultimately, when it runs out of matches,
|
other alternatives. Ultimately, when it runs out of matches,
|
||||||
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
|
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
|
||||||
<a name="dfamatch"></a></P>
|
<a name="dfamatch"></a></P>
|
||||||
<br><a name="SEC31" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
|
<br><a name="SEC33" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||||
|
@ -2591,11 +2595,10 @@ the longest matches.
|
||||||
<P>
|
<P>
|
||||||
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
|
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
|
||||||
repeats at the end of a pattern (as well as internally). For example, the
|
repeats at the end of a pattern (as well as internally). For example, the
|
||||||
pattern "a\d+" is compiled as if it were "a\d++" because there is no point in
|
pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
|
||||||
backtracking into the repeated digits. For DFA matching, this means that only
|
means that only one possible match is found. If you really do want multiple
|
||||||
one possible match is found. If you really do want multiple matches in such
|
matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
|
||||||
cases, either use an ungreedy repeat ("a\d+?") or set the
|
the PCRE2_NO_AUTO_POSSESS option when compiling.
|
||||||
PCRE2_NO_AUTO_POSSESS option when compiling.
|
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Error returns from <b>pcre2_dfa_match()</b>
|
Error returns from <b>pcre2_dfa_match()</b>
|
||||||
|
@ -2633,29 +2636,29 @@ extremely rare, as a vector of size 1000 is used.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ERROR_DFA_BADRESTART
|
PCRE2_ERROR_DFA_BADRESTART
|
||||||
</pre>
|
</pre>
|
||||||
When <b>pcre2_dfa_match()</b> is called with the <b>pcre2_dfa_RESTART</b> option,
|
When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
|
||||||
some plausibility checks are made on the contents of the workspace, which
|
some plausibility checks are made on the contents of the workspace, which
|
||||||
should contain data about the previous partial match. If any of these checks
|
should contain data about the previous partial match. If any of these checks
|
||||||
fail, this error is given.
|
fail, this error is given.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC32" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC34" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2build</b>(3), <b>pcre2libs</b>(3), <b>pcre2callout</b>(3),
|
<b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo(3)</b>,
|
||||||
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
|
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
|
||||||
<b>pcre2demo(3)</b>, <b>pcre2sample</b>(3), <b>pcre2stack</b>(3).
|
<b>pcre2sample</b>(3), <b>pcre2stack</b>(3), <b>pcre2unicode</b>(3).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC33" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC35" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC34" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC36" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 11 November 2014
|
Last updated: 21 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -461,7 +461,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -256,7 +256,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -207,7 +207,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
|
|
@ -745,7 +745,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -413,7 +413,7 @@ Philip Hazel (FAQ by Zoltan Herczeg)
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -73,7 +73,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
|
|
@ -227,7 +227,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -450,7 +450,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -3231,7 +3231,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -180,7 +180,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
|
|
@ -278,7 +278,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -90,7 +90,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
|
|
@ -33,6 +33,13 @@ the recursive call would immediately be passed back as the result of the
|
||||||
current call (a "tail recursion"), the function is just restarted instead.
|
current call (a "tail recursion"), the function is just restarted instead.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
Each time the internal <b>match()</b> function is called recursively, it uses
|
||||||
|
memory from the process stack. For certain kinds of pattern and data, very
|
||||||
|
large amounts of stack may be needed, despite the recognition of "tail
|
||||||
|
recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
|
||||||
|
of the GCC compiler, the stack requirements are greatly increased.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
The above comments apply when <b>pcre2_match()</b> is run in its normal
|
The above comments apply when <b>pcre2_match()</b> is run in its normal
|
||||||
interpretive manner. If the compiled pattern was processed by
|
interpretive manner. If the compiled pattern was processed by
|
||||||
<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
|
<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
|
||||||
|
@ -61,10 +68,7 @@ relevant only for <b>pcre2_match()</b> without the JIT optimization.
|
||||||
Reducing <b>pcre2_match()</b>'s stack usage
|
Reducing <b>pcre2_match()</b>'s stack usage
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Each time that the internal <b>match()</b> function is called recursively, it
|
You can often reduce the amount of recursion, and therefore the
|
||||||
uses memory from the process stack. For certain kinds of pattern and data, very
|
|
||||||
large amounts of stack may be needed, despite the recognition of "tail
|
|
||||||
recursion". You can often reduce the amount of recursion, and therefore the
|
|
||||||
amount of stack used, by modifying the pattern that is being matched. Consider,
|
amount of stack used, by modifying the pattern that is being matched. Consider,
|
||||||
for example, this pattern:
|
for example, this pattern:
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -187,14 +191,14 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 20 October 2014
|
Last updated: 21 November 2014
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -548,7 +548,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -1301,7 +1301,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC20" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC20" href="#TOC1">REVISION</a><br>
|
||||||
|
|
|
@ -254,7 +254,7 @@ Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge CB2 3QH, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
|
1333
doc/pcre2.txt
File diff suppressed because it is too large
162
doc/pcre2api.3
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "18 November 2014" "PCRE2 10.00"
|
.TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -1569,15 +1569,17 @@ values.
|
||||||
.P
|
.P
|
||||||
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
|
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
|
||||||
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
|
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
|
||||||
entry; both of these return a \fBuint32_t\fP value. The entry size depends on
|
entry in code units; both of these return a \fBuint32_t\fP value. The entry
|
||||||
the length of the longest name. PCRE2_INFO_NAMETABLE returns a pointer to the
|
size depends on the length of the longest name.
|
||||||
first entry of the table. This is a PCRE2_SPTR pointer to a block of code
|
.P
|
||||||
units. In the 8-bit library, the first two bytes of each entry are the number
|
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
|
||||||
of the capturing parenthesis, most significant byte first. In the 16-bit
|
a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
|
||||||
library, the pointer points to 16-bit data units, the first of which contains
|
two bytes of each entry are the number of the capturing parenthesis, most
|
||||||
the parenthesis number. In the 32-bit library, the pointer points to 32-bit
|
significant byte first. In the 16-bit library, the pointer points to 16-bit
|
||||||
data units, the first of which contains the parenthesis number. The rest of the
|
code units, the first of which contains the parenthesis number. In the 32-bit
|
||||||
entry is the corresponding name, zero terminated.
|
library, the pointer points to 32-bit code units, the first of which contains
|
||||||
|
the parenthesis number. The rest of the entry is the corresponding name, zero
|
||||||
|
terminated.
|
||||||
.P
|
.P
|
||||||
The names are in alphabetical order. If (?| is used to create multiple groups
|
The names are in alphabetical order. If (?| is used to create multiple groups
|
||||||
with the same number, as described in the
|
with the same number, as described in the
|
||||||
|
@ -1835,17 +1837,18 @@ matching.
|
||||||
.sp
|
.sp
|
||||||
This option specifies that the first character of the subject string is not the
|
This option specifies that the first character of the subject string is not the
|
||||||
beginning of a line, so the circumflex metacharacter should not match before
|
beginning of a line, so the circumflex metacharacter should not match before
|
||||||
it. Setting this without PCRE2_MULTILINE (at compile time) causes circumflex
|
it. Setting this without having set PCRE2_MULTILINE at compile time causes
|
||||||
never to match. This option affects only the behaviour of the circumflex
|
circumflex never to match. This option affects only the behaviour of the
|
||||||
metacharacter. It does not affect \eA.
|
circumflex metacharacter. It does not affect \eA.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_NOTEOL
|
PCRE2_NOTEOL
|
||||||
.sp
|
.sp
|
||||||
This option specifies that the end of the subject string is not the end of a
|
This option specifies that the end of the subject string is not the end of a
|
||||||
line, so the dollar metacharacter should not match it nor (except in multiline
|
line, so the dollar metacharacter should not match it nor (except in multiline
|
||||||
mode) a newline immediately before it. Setting this without PCRE2_MULTILINE (at
|
mode) a newline immediately before it. Setting this without having set
|
||||||
compile time) causes dollar never to match. This option affects only the
|
PCRE2_MULTILINE at compile time causes dollar never to match. This option
|
||||||
behaviour of the dollar metacharacter. It does not affect \eZ or \ez.
|
affects only the behaviour of the dollar metacharacter. It does not affect \eZ
|
||||||
|
or \ez.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_NOTEMPTY
|
PCRE2_NOTEMPTY
|
||||||
.sp
|
.sp
|
||||||
|
@ -1857,13 +1860,16 @@ match the empty string, the entire match fails. For example, if the pattern
|
||||||
.sp
|
.sp
|
||||||
is applied to a string not beginning with "a" or "b", it matches an empty
|
is applied to a string not beginning with "a" or "b", it matches an empty
|
||||||
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
|
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
|
||||||
valid, so PCRE2 searches further into the string for occurrences of "a" or "b".
|
valid, so \fBpcre2_match()\fP searches further into the string for occurrences
|
||||||
|
of "a" or "b".
|
||||||
.sp
|
.sp
|
||||||
PCRE2_NOTEMPTY_ATSTART
|
PCRE2_NOTEMPTY_ATSTART
|
||||||
.sp
|
.sp
|
||||||
This is like PCRE2_NOTEMPTY, except that an empty string match that is not at
|
This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
|
||||||
the start of the subject is permitted. If the pattern is anchored, such a match
|
only at the first matching position, that is, at the start of the subject plus
|
||||||
can occur only if the pattern contains \eK.
|
the starting offset. An empty string match later in the subject is permitted.
|
||||||
|
If the pattern is anchored, such a match can occur only if the pattern contains
|
||||||
|
\eK.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_NO_UTF_CHECK
|
PCRE2_NO_UTF_CHECK
|
||||||
.sp
|
.sp
|
||||||
|
@ -1913,8 +1919,8 @@ subject characters to complete the match. If this happens when
|
||||||
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
|
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
|
||||||
testing any remaining alternatives. Only if no complete match can be found is
|
testing any remaining alternatives. Only if no complete match can be found is
|
||||||
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
|
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
|
||||||
PCRE2_PARTIAL_SOFT says that the caller is prepared to handle a partial match,
|
PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
|
||||||
but only if no complete match can be found.
|
match, but only if no complete match can be found.
|
||||||
.P
|
.P
|
||||||
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
|
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
|
||||||
a partial match is found, \fBpcre2_match()\fP immediately returns
|
a partial match is found, \fBpcre2_match()\fP immediately returns
|
||||||
|
@ -1943,13 +1949,13 @@ compile context.
|
||||||
.\"
|
.\"
|
||||||
During matching, the newline choice affects the behaviour of the dot,
|
During matching, the newline choice affects the behaviour of the dot,
|
||||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||||
position is advanced after a match failure for an unanchored pattern.
|
starting position is advanced after a match failure for an unanchored pattern.
|
||||||
.P
|
.P
|
||||||
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set,
|
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
|
||||||
and a match attempt for an unanchored pattern fails when the current position
|
the newline convention, and a match attempt for an unanchored pattern fails
|
||||||
is at a CRLF sequence, and the pattern contains no explicit matches for CR or
|
when the current starting position is at a CRLF sequence, and the pattern
|
||||||
LF characters, the match position is advanced by two characters instead of one,
|
contains no explicit matches for CR or LF characters, the match position is
|
||||||
in other words, to after the CRLF.
|
advanced by two characters instead of one, in other words, to after the CRLF.
|
||||||
.P
|
.P
|
||||||
The above rule is a compromise that makes the most common cases work as
|
The above rule is a compromise that makes the most common cases work as
|
||||||
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
|
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
|
||||||
|
@ -1960,8 +1966,8 @@ reference, and so advances only by one character after the first failure.
|
||||||
.P
|
.P
|
||||||
An explicit match for CR or LF is either a literal appearance of one of those
|
An explicit match for CR or LF is either a literal appearance of one of those
|
||||||
characters in the pattern, or one of the \er or \en escape sequences. Implicit
|
characters in the pattern, or one of the \er or \en escape sequences. Implicit
|
||||||
matches such as [^X] do not count, nor does \es (which includes CR and LF in
|
matches such as [^X] do not count, nor does \es, even though it includes CR and
|
||||||
the characters that it matches).
|
LF in the characters that it matches.
|
||||||
.P
|
.P
|
||||||
Notwithstanding the above, anomalous effects may still occur when CRLF is a
|
Notwithstanding the above, anomalous effects may still occur when CRLF is a
|
||||||
valid newline sequence and explicit \er or \en escapes appear in the pattern.
|
valid newline sequence and explicit \er or \en escapes appear in the pattern.
|
||||||
|
@ -1981,15 +1987,15 @@ In general, a pattern matches a certain portion of the subject, and in
|
||||||
addition, further substrings from the subject may be picked out by
|
addition, further substrings from the subject may be picked out by
|
||||||
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
|
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
|
||||||
book, this is called "capturing" in what follows, and the phrase "capturing
|
book, this is called "capturing" in what follows, and the phrase "capturing
|
||||||
subpattern" is used for a fragment of a pattern that picks out a substring.
|
subpattern" or "capturing group" is used for a fragment of a pattern that picks
|
||||||
PCRE2 supports several other kinds of parenthesized subpattern that do not
|
out a substring. PCRE2 supports several other kinds of parenthesized subpattern
|
||||||
cause substrings to be captured. The \fBpcre2_pattern_info()\fP function can be
|
that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
|
||||||
used to find out how many capturing subpatterns there are in a compiled
|
function can be used to find out how many capturing subpatterns there are in a
|
||||||
pattern.
|
compiled pattern.
|
||||||
.P
|
.P
|
||||||
The overall matched string and any captured substrings are returned to the
|
The overall matched string and any captured substrings are returned to the
|
||||||
caller via a vector of PCRE2_SIZE values, called the \fBovector\fP. This is
|
caller via a vector of PCRE2_SIZE values. This is called the \fBovector\fP, and
|
||||||
contained within the
|
is contained within the
|
||||||
.\" HTML <a href="#matchdatablock">
|
.\" HTML <a href="#matchdatablock">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
match data block.
|
match data block.
|
||||||
|
@ -2062,7 +2068,7 @@ had.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="matchotherdata"></a>
|
.\" HTML <a name="matchotherdata"></a>
|
||||||
.SS "Other information about the match"
|
.SH "OTHER INFORMATION ABOUT A MATCH"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
|
@ -2071,7 +2077,7 @@ had.
|
||||||
.B PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *\fImatch_data\fP);
|
.B PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *\fImatch_data\fP);
|
||||||
.fi
|
.fi
|
||||||
.P
|
.P
|
||||||
In addition to the offsets in the ovector, other information about a match is
|
As well as the offsets in the ovector, other information about a match is
|
||||||
retained in the match data block and can be retrieved by the above functions.
|
retained in the match data block and can be retrieved by the above functions.
|
||||||
.P
|
.P
|
||||||
When a (*MARK) name is to be passed back, \fBpcre2_get_mark()\fP returns a
|
When a (*MARK) name is to be passed back, \fBpcre2_get_mark()\fP returns a
|
||||||
|
@ -2087,7 +2093,7 @@ as \fIovector[0]\fP because \eK does not affect the result of a partial match.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="errorlist"></a>
|
.\" HTML <a name="errorlist"></a>
|
||||||
.SS "Error return values from \fBpcre2_match()\fP"
|
.SH "ERROR RETURNS FROM \fBpcre2_match()\fP"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
If \fBpcre2_match()\fP fails, it returns a negative number. This can be
|
If \fBpcre2_match()\fP fails, it returns a negative number. This can be
|
||||||
|
@ -2127,7 +2133,7 @@ passed to a 16-bit or 32-bit library function, or vice versa.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ERROR_BADOFFSET
|
PCRE2_ERROR_BADOFFSET
|
||||||
.sp
|
.sp
|
||||||
The value of \fIstartoffset\fP greater than the length of the subject.
|
The value of \fIstartoffset\fP was greater than the length of the subject.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ERROR_BADOPTION
|
PCRE2_ERROR_BADOPTION
|
||||||
.sp
|
.sp
|
||||||
|
@ -2200,8 +2206,8 @@ the pattern. Specifically, it means that either the whole pattern or a
|
||||||
subpattern has been called recursively for the second time at the same position
|
subpattern has been called recursively for the second time at the same position
|
||||||
in the subject string. Some simple patterns that might do this are detected and
|
in the subject string. Some simple patterns that might do this are detected and
|
||||||
faulted at compile time, but more complicated cases, in particular mutual
|
faulted at compile time, but more complicated cases, in particular mutual
|
||||||
recursions between two different subpatterns, cannot be detected until run
|
recursions between two different subpatterns, cannot be detected until matching
|
||||||
time.
|
is attempted.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ERROR_RECURSIONLIMIT
|
PCRE2_ERROR_RECURSIONLIMIT
|
||||||
.sp
|
.sp
|
||||||
|
@ -2254,8 +2260,8 @@ extract the captured substrings.
|
||||||
.P
|
.P
|
||||||
The final arguments of \fBpcre2_substring_copy_bynumber()\fP are a pointer to
|
The final arguments of \fBpcre2_substring_copy_bynumber()\fP are a pointer to
|
||||||
the buffer and a pointer to a variable that contains its length in code units.
|
the buffer and a pointer to a variable that contains its length in code units.
|
||||||
This is updated to contain the actual number of code units used, excluding the
|
This is updated to contain the actual number of code units used for the
|
||||||
terminating zero.
|
extracted substring, excluding the terminating zero.
|
||||||
.P
|
.P
|
||||||
For \fBpcre2_substring_get_bynumber()\fP the third and fourth arguments point
|
For \fBpcre2_substring_get_bynumber()\fP the third and fourth arguments point
|
||||||
to variables that are updated with a pointer to the new memory and the number
|
to variables that are updated with a pointer to the new memory and the number
|
||||||
|
@ -2290,10 +2296,11 @@ small to capture that group.
|
||||||
.fi
|
.fi
|
||||||
.P
|
.P
|
||||||
The \fBpcre2_substring_list_get()\fP function extracts all available substrings
|
The \fBpcre2_substring_list_get()\fP function extracts all available substrings
|
||||||
and builds a list of pointers to them, and a second list that contains their
|
and builds a list of pointers to them. It also (optionally) builds a second
|
||||||
lengths (in code units), excluding a terminating zero that is added to each of
|
list that contains their lengths (in code units), excluding a terminating zero
|
||||||
them. All this is done in a single block of memory that is obtained using the
|
that is added to each of them. All this is done in a single block of memory
|
||||||
same memory allocation function that was used to get the match data block.
|
that is obtained using the same memory allocation function that was used to get
|
||||||
|
the match data block.
|
||||||
.P
|
.P
|
||||||
The address of the memory block is returned via \fIlistptr\fP, which is also
|
The address of the memory block is returned via \fIlistptr\fP, which is also
|
||||||
the start of the list of string pointers. The end of the list is marked by a
|
the start of the list of string pointers. The end of the list is marked by a
|
||||||
|
@ -2309,7 +2316,7 @@ If this function encounters a substring that is unset, which can happen when
|
||||||
capturing subpattern number \fIn+1\fP matches some part of the subject, but
|
capturing subpattern number \fIn+1\fP matches some part of the subject, but
|
||||||
subpattern \fIn\fP has not been used at all, it returns an empty string. This
|
subpattern \fIn\fP has not been used at all, it returns an empty string. This
|
||||||
can be distinguished from a genuine zero-length substring by inspecting the
|
can be distinguished from a genuine zero-length substring by inspecting the
|
||||||
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
|
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
|
||||||
substrings.
|
substrings.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@ -2347,11 +2354,10 @@ name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
|
||||||
that name.
|
that name.
|
||||||
.P
|
.P
|
||||||
Given the number, you can extract the substring directly, or use one of the
|
Given the number, you can extract the substring directly, or use one of the
|
||||||
functions described in the previous section. For convenience, there are also
|
functions described above. For convenience, there are also "byname" functions
|
||||||
"byname" functions that correspond to the "bynumber" functions, the only
|
that correspond to the "bynumber" functions, the only difference being that the
|
||||||
difference being that the second argument is a name instead of a number.
|
second argument is a name instead of a number. However, if PCRE2_DUPNAMES is
|
||||||
However, if PCRE2_DUPNAMES is set and there are duplicate names,
|
set and there are duplicate names, the behaviour may not be what you want.
|
||||||
the behaviour may not be what you want (see the next section).
|
|
||||||
.P
|
.P
|
||||||
\fBWarning:\fP If the pattern uses the (?| feature to set up multiple
|
\fBWarning:\fP If the pattern uses the (?| feature to set up multiple
|
||||||
subpatterns with the same number, as described in the
|
subpatterns with the same number, as described in the
|
||||||
|
@ -2398,8 +2404,8 @@ recognized:
|
||||||
Either a group number or a group name can be given for <n>. Curly brackets are
|
Either a group number or a group name can be given for <n>. Curly brackets are
|
||||||
required only if the following character would be interpreted as part of the
|
required only if the following character would be interpreted as part of the
|
||||||
number or name. The number may be zero to include the entire matched string.
|
number or name. The number may be zero to include the entire matched string.
|
||||||
For example, if the pattern a(b)c is matched with "[abc]" and the replacement
|
For example, if the pattern a(b)c is matched with "=abc=" and the replacement
|
||||||
string "+$1$0$1+", the result is "[+babcb+]". Group insertion is done by
|
string "+$1$0$1+", the result is "=+babcb+=". Group insertion is done by
|
||||||
calling \fBpcre2_copy_byname()\fP or \fBpcre2_copy_bynumber()\fP as
|
calling \fBpcre2_copy_byname()\fP or \fBpcre2_copy_bynumber()\fP as
|
||||||
appropriate.
|
appropriate.
|
||||||
.P
|
.P
|
||||||
|
@ -2452,18 +2458,19 @@ documentation.
|
||||||
When duplicates are present, \fBpcre2_substring_copy_byname()\fP and
|
When duplicates are present, \fBpcre2_substring_copy_byname()\fP and
|
||||||
\fBpcre2_substring_get_byname()\fP return the first substring corresponding to
|
\fBpcre2_substring_get_byname()\fP return the first substring corresponding to
|
||||||
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
|
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
|
||||||
returned. The \fBpcre2_substring_number_from_name()\fP function returns one of
|
returned. The \fBpcre2_substring_number_from_name()\fP function returns
|
||||||
the numbers that are associated with the name, but it is not defined which it
|
the error PCRE2_ERROR_NOUNIQUESUBSTRING.
|
||||||
is.
|
|
||||||
.P
|
.P
|
||||||
If you want to get full details of all captured substrings for a given name,
|
If you want to get full details of all captured substrings for a given name,
|
||||||
you must use the \fBpcre2_substring_nametable_scan()\fP function. The first
|
you must use the \fBpcre2_substring_nametable_scan()\fP function. The first
|
||||||
argument is the compiled pattern, and the second is the name. If the third and
|
argument is the compiled pattern, and the second is the name. If the third and
|
||||||
fourth arguments are NULL, the function returns a group number (it is not
|
fourth arguments are NULL, the function returns a group number for a unique
|
||||||
defined which). Otherwise, the third and fourth arguments must be pointers to
|
name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
|
||||||
|
.P
|
||||||
|
When the third and fourth arguments are not NULL, they must be pointers to
|
||||||
variables that are updated by the function. After it has run, they point to the
|
variables that are updated by the function. After it has run, they point to the
|
||||||
first and last entries in the name-to-number table for the given name, and the
|
first and last entries in the name-to-number table for the given name, and the
|
||||||
function returns the length of each entry. In both cases,
|
function returns the length of each entry in code units. In both cases,
|
||||||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||||
.P
|
.P
|
||||||
The format of the name table is described above in the section entitled
|
The format of the name table is described above in the section entitled
|
||||||
|
@ -2476,15 +2483,15 @@ Given all the relevant entries for the name, you can extract each of their
|
||||||
numbers, and hence the captured data.
|
numbers, and hence the captured data.
|
||||||
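The paragraph above says that, given the first and last entries for a name, you can extract each group number. A minimal sketch of that step follows, assuming the 8-bit library layout, where each entry starts with a two-byte group number (most significant byte first) followed by the zero-terminated name. The helper `collect_group_numbers` and its parameters are hypothetical and not part of the PCRE2 API. In real use, `first`, `last` and `entry_size` would come from a successful call to pcre2_substring_nametable_scan().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): walk the name-table entries
   from 'first' to 'last' inclusive, stepping by 'entry_size' code units,
   and decode the group number stored at the start of each entry.
   Assumes the 8-bit layout: two number bytes, most significant first.
   Stores up to 'out_max' numbers in 'out' and returns the total number
   of entries seen, which may exceed 'out_max'. */
size_t collect_group_numbers(const uint8_t *first, const uint8_t *last,
                             size_t entry_size,
                             uint32_t *out, size_t out_max)
{
    size_t count = 0;
    const uint8_t *entry = first;

    /* Two number bytes plus at least the terminating zero of the name. */
    assert(entry_size >= 3);

    for (;;) {
        if (count < out_max)
            out[count] = ((uint32_t)entry[0] << 8) | entry[1];
        count++;
        if (entry == last)
            break;              /* 'last' is itself a valid entry */
        entry += entry_size;
    }
    return count;
}
```

Each number obtained this way identifies a capture group, so it can be used to look up the corresponding pair of offsets in the match data.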
.
|
.
|
||||||
.
|
.
|
||||||
.SH "FINDING ALL POSSIBLE MATCHES"
|
.SH "FINDING ALL POSSIBLE MATCHES AT ONE POSITION"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The traditional matching function uses a similar algorithm to Perl, which stops
|
The traditional matching function uses a similar algorithm to Perl, which stops
|
||||||
when it finds the first match, starting at a given point in the subject. If you
|
when it finds the first match at a given point in the subject. If you want to
|
||||||
want to find all possible matches, or the longest possible match at a given
|
find all possible matches, or the longest possible match at a given position,
|
||||||
position, consider using the alternative matching function (see below) instead.
|
consider using the alternative matching function (see below) instead. If you
|
||||||
If you cannot use the alternative function, you can kludge it up by making use
|
cannot use the alternative function, you can kludge it up by making use of the
|
||||||
of the callout facility, which is described in the
|
callout facility, which is described in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2callout\fP
|
\fBpcre2callout\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -2628,11 +2635,10 @@ the longest matches.
|
||||||
.P
|
.P
|
||||||
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
|
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
|
||||||
repeats at the end of a pattern (as well as internally). For example, the
|
repeats at the end of a pattern (as well as internally). For example, the
|
||||||
pattern "a\ed+" is compiled as if it were "a\ed++" because there is no point in
|
pattern "a\ed+" is compiled as if it were "a\ed++". For DFA matching, this
|
||||||
backtracking into the repeated digits. For DFA matching, this means that only
|
means that only one possible match is found. If you really do want multiple
|
||||||
one possible match is found. If you really do want multiple matches in such
|
matches in such cases, either use an ungreedy repeat such as "a\ed+?" or set
|
||||||
cases, either use an ungreedy repeat ("a\ed+?") or set the
|
the PCRE2_NO_AUTO_POSSESS option when compiling.
|
||||||
PCRE2_NO_AUTO_POSSESS option when compiling.
|
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Error returns from \fBpcre2_dfa_match()\fP"
|
.SS "Error returns from \fBpcre2_dfa_match()\fP"
|
||||||
|
@ -2673,7 +2679,7 @@ extremely rare, as a vector of size 1000 is used.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ERROR_DFA_BADRESTART
|
PCRE2_ERROR_DFA_BADRESTART
|
||||||
.sp
|
.sp
|
||||||
When \fBpcre2_dfa_match()\fP is called with the \fBpcre2_dfa_RESTART\fP option,
|
When \fBpcre2_dfa_match()\fP is called with the \fBPCRE2_DFA_RESTART\fP option,
|
||||||
some plausibility checks are made on the contents of the workspace, which
|
some plausibility checks are made on the contents of the workspace, which
|
||||||
should contain data about the previous partial match. If any of these checks
|
should contain data about the previous partial match. If any of these checks
|
||||||
fail, this error is given.
|
fail, this error is given.
|
||||||
|
@ -2682,9 +2688,9 @@ fail, this error is given.
|
||||||
.SH "SEE ALSO"
|
.SH "SEE ALSO"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
\fBpcre2build\fP(3), \fBpcre2libs\fP(3), \fBpcre2callout\fP(3),
|
\fBpcre2build\fP(3), \fBpcre2callout\fP(3), \fBpcre2demo\fP(3),
|
||||||
\fBpcre2matching\fP(3), \fBpcre2partial\fP(3), \fBpcre2posix\fP(3),
|
\fBpcre2matching\fP(3), \fBpcre2partial\fP(3), \fBpcre2posix\fP(3),
|
||||||
\fBpcre2demo(3)\fP, \fBpcre2sample\fP(3), \fBpcre2stack\fP(3).
|
\fBpcre2sample\fP(3), \fBpcre2stack\fP(3), \fBpcre2unicode\fP(3).
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH AUTHOR
|
.SH AUTHOR
|
||||||
|
@ -2701,6 +2707,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 18 November 2014
|
Last updated: 21 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|