More documentation and file tidies.

Philip.Hazel 2014-11-21 16:45:06 +00:00
parent ba1e2e0cbb
commit eb4fffbbf4
25 changed files with 1002 additions and 987 deletions

@ -25,9 +25,10 @@ PCRE2 is the name used for a revised API for the PCRE library, which is a set
of functions, written in C, that implement regular expression pattern matching of functions, written in C, that implement regular expression pattern matching
using the same syntax and semantics as Perl, with just a few differences. Some using the same syntax and semantics as Perl, with just a few differences. Some
features that appeared in Python and the original PCRE before they appeared in features that appeared in Python and the original PCRE before they appeared in
Perl are also available using the Python syntax, there is some support for one Perl are also available using the Python syntax. There is also some support for
or two .NET and Oniguruma syntax items, and there are options for requesting one or two .NET and Oniguruma syntax items, and there are options for
some minor changes that give better ECMAScript (aka JavaScript) compatibility. requesting some minor changes that give better ECMAScript (aka JavaScript)
compatibility.
</P> </P>
<P> <P>
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
@ -36,7 +37,7 @@ The original work to extend PCRE to 16-bit and 32-bit code units was done by
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
can be interpreted either as one character per code unit, or as UTF-encoded can be interpreted either as one character per code unit, or as UTF-encoded
Unicode, with support for Unicode general category properties. Unicode support Unicode, with support for Unicode general category properties. Unicode support
is optional at build time (but is the default); however, processing strings as is optional at build time (but is the default). However, processing strings as
UTF code units must be enabled explicitly at run time. The version of Unicode UTF code units must be enabled explicitly at run time. The version of Unicode
in use can be discovered by running in use can be discovered by running
<pre> <pre>
@ -143,17 +144,17 @@ listing), and the short pages for individual functions, are concatenated in
pcre2compat discussion of Perl compatibility pcre2compat discussion of Perl compatibility
pcre2demo a demonstration C program that uses PCRE2 pcre2demo a demonstration C program that uses PCRE2
pcre2grep description of the <b>pcre2grep</b> command (8-bit only) pcre2grep description of the <b>pcre2grep</b> command (8-bit only)
pcre2jit discussion of the just-in-time optimization support pcre2jit discussion of just-in-time optimization support
pcre2limits details of size and other limits pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility pcre2partial details of the partial matching facility
pcre2pattern syntax and semantics of supported regular expressions pcre2pattern syntax and semantics of supported regular expression patterns
pcre2perform discussion of performance issues pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage pcre2stack discussion of stack usage
pcre2syntax quick syntax reference pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> testing command pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support pcre2unicode discussion of Unicode and UTF support
</pre> </pre>
In the "man" and HTML formats, there is also a short page for each C library In the "man" and HTML formats, there is also a short page for each C library
@ -165,7 +166,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<P> <P>
@ -174,7 +175,7 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P> </P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br> <br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 November 2014 Last updated: 18 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

@ -37,16 +37,18 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC22" href="#SEC22">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a> <li><a name="TOC22" href="#SEC22">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
<li><a name="TOC23" href="#SEC23">NEWLINE HANDLING WHEN MATCHING</a> <li><a name="TOC23" href="#SEC23">NEWLINE HANDLING WHEN MATCHING</a>
<li><a name="TOC24" href="#SEC24">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a> <li><a name="TOC24" href="#SEC24">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
<li><a name="TOC25" href="#SEC25">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a> <li><a name="TOC25" href="#SEC25">OTHER INFORMATION ABOUT A MATCH</a>
<li><a name="TOC26" href="#SEC26">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a> <li><a name="TOC26" href="#SEC26">ERROR RETURNS FROM <b>pcre2_match()</b></a>
<li><a name="TOC27" href="#SEC27">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a> <li><a name="TOC27" href="#SEC27">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
<li><a name="TOC28" href="#SEC28">CREATING A NEW STRING WITH SUBSTITUTIONS</a> <li><a name="TOC28" href="#SEC28">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
<li><a name="TOC29" href="#SEC29">DUPLICATE SUBPATTERN NAMES</a> <li><a name="TOC29" href="#SEC29">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
<li><a name="TOC30" href="#SEC30">FINDING ALL POSSIBLE MATCHES</a> <li><a name="TOC30" href="#SEC30">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
<li><a name="TOC31" href="#SEC31">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a> <li><a name="TOC31" href="#SEC31">DUPLICATE SUBPATTERN NAMES</a>
<li><a name="TOC32" href="#SEC32">SEE ALSO</a> <li><a name="TOC32" href="#SEC32">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
<li><a name="TOC33" href="#SEC33">AUTHOR</a> <li><a name="TOC33" href="#SEC33">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
<li><a name="TOC34" href="#SEC34">REVISION</a> <li><a name="TOC34" href="#SEC34">SEE ALSO</a>
<li><a name="TOC35" href="#SEC35">AUTHOR</a>
<li><a name="TOC36" href="#SEC36">REVISION</a>
</ul> </ul>
<P> <P>
<b>#include &#60;pcre2.h&#62;</b> <b>#include &#60;pcre2.h&#62;</b>
@ -436,13 +438,9 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
<P> <P>
Each of the first three conventions is used by at least one operating system as Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified. its standard newline sequence. When PCRE2 is built, a default can be specified.
The default default is LF, which is the Unix standard. When PCRE2 is run, the The default default is LF, which is the Unix standard. However, the newline
default can be overridden, either when a pattern is compiled, or when it is convention can be changed by an application when calling <b>pcre2_compile()</b>,
matched. or it can be specified by special text at the start of the pattern itself; this
</P>
<P>
The newline convention can be changed when calling <b>pcre2_compile()</b>, or it
can be specified by special text at the start of the pattern itself; this
overrides any other settings. See the overrides any other settings. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page for details of the special character sequences. page for details of the special character sequences.
@ -459,8 +457,8 @@ below.
</P> </P>
<P> <P>
The choice of newline convention does not affect the interpretation of The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches, which has the \n or \r escape sequences, nor does it affect what \R matches; this has
its own separate control. its own separate convention.
</P> </P>
<br><a name="SEC13" href="#TOC1">MULTITHREADING</a><br> <br><a name="SEC13" href="#TOC1">MULTITHREADING</a><br>
<P> <P>
@ -472,7 +470,7 @@ time ensuring that multithreaded applications can use it.
</P> </P>
<P> <P>
There are several different blocks of data that are used to pass information There are several different blocks of data that are used to pass information
between the application and the PCRE libraries. between the application and the PCRE2 libraries.
</P> </P>
<P> <P>
(1) A pointer to the compiled form of a pattern is returned to the user when (1) A pointer to the compiled form of a pattern is returned to the user when
@ -572,11 +570,11 @@ The compile context
A compile context is required if you want to change the default values of any A compile context is required if you want to change the default values of any
of the following compile-time parameters: of the following compile-time parameters:
<pre> <pre>
What \R matches (Unicode newlines or CR, LF, CRLF only); What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables; PCRE2's character tables
The newline character sequence; The newline character sequence
The compile time nested parentheses limit; The compile time nested parentheses limit
An external function for stack checking. An external function for stack checking
</pre> </pre>
A compile context is also required if you are using custom memory management. A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of If none of these apply, just pass NULL as the context argument of
@ -604,9 +602,8 @@ PCRE2_ERROR_BADDATA if invalid data is detected.
<br> <br>
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF, The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
ending sequence. The value of this parameter does not affect what is compiled; ending sequence. The value is used by the JIT compiler and by the two
it is just saved with the compiled pattern. The value is used by the JIT interpreted matching functions, <i>pcre2_match()</i> and
compiler and by the two interpreted matching functions, <i>pcre2_match()</i> and
<i>pcre2_dfa_match()</i>. <i>pcre2_dfa_match()</i>.
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b> <b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b> const unsigned char *<i>tables</i>);</b> <b> const unsigned char *<i>tables</i>);</b>
@ -709,12 +706,12 @@ in the subject string. This limit is not relevant to <b>pcre2_dfa_match()</b>,
which ignores it. which ignores it.
</P> </P>
<P> <P>
When <b>pcre2_match()</b> is called with a pattern that was successfully studied When <b>pcre2_match()</b> is called with a pattern that was successfully
with <b>pcre2_jit_compile()</b>, the way that the matching is executed is processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed
entirely different. However, there is still the possibility of runaway matching is entirely different. However, there is still the possibility of runaway
that goes on for a very long time, and so the <i>match_limit</i> value is also matching that goes on for a very long time, and so the <i>match_limit</i> value
used in this case (but in a different way) to limit how long the matching can is also used in this case (but in a different way) to limit how long the
continue. matching can continue.
</P> </P>
<P> <P>
The default value for the limit can be set when PCRE2 is built; the default The default value for the limit can be set when PCRE2 is built; the default
@ -770,15 +767,17 @@ stack. There is a discussion about PCRE2's stack usage in the
<a href="pcre2stack.html"><b>pcre2stack</b></a> <a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation. See the documentation. See the
<a href="pcre2build.html"><b>pcre2build</b></a> <a href="pcre2build.html"><b>pcre2build</b></a>
documentation for details of how to build PCRE2. Using the heap for recursion documentation for details of how to build PCRE2.
is a non-standard way of building PCRE2, for use in environments that have </P>
limited stacks. Because of the greater use of memory management, <P>
<b>pcre2_match()</b> runs more slowly. Functions that are different to the Using the heap for recursion is a non-standard way of building PCRE2, for use
general custom memory functions are provided so that special-purpose external in environments that have limited stacks. Because of the greater use of memory
code can be used for this case, because the memory blocks are all the same management, <b>pcre2_match()</b> runs more slowly. Functions that are different
size. The blocks are retained by <b>pcre2_match()</b> until it is about to exit to the general custom memory functions are provided so that special-purpose
so that they can be re-used when possible during the match. In the absence of external code can be used for this case, because the memory blocks are all the
these functions, the normal custom memory management functions are used, if same size. The blocks are retained by <b>pcre2_match()</b> until it is about to
exit so that they can be re-used when possible during the match. In the absence
of these functions, the normal custom memory management functions are used, if
supplied, otherwise the system functions. supplied, otherwise the system functions.
</P> </P>
<br><a name="SEC15" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br> <br><a name="SEC15" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
@ -809,9 +808,10 @@ available:
PCRE2_CONFIG_BSR PCRE2_CONFIG_BSR
</pre> </pre>
The output is an integer whose value indicates what character sequences the \R The output is an integer whose value indicates what character sequences the \R
escape sequence matches by default. A value of 0 means that \R matches any escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \R
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF, matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
or CRLF. The default can be overridden when a pattern is compiled or matched. that \R matches only CR, LF, or CRLF. The default can be overridden when a
pattern is compiled.
<pre> <pre>
PCRE2_CONFIG_JIT PCRE2_CONFIG_JIT
</pre> </pre>
@ -821,7 +821,7 @@ compiling is available; otherwise it is set to zero.
PCRE2_CONFIG_JITTARGET PCRE2_CONFIG_JITTARGET
</pre> </pre>
The <i>where</i> argument should point to a buffer that is at least 48 code The <i>where</i> argument should point to a buffer that is at least 48 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a <b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a
string that contains the name of the architecture for which the JIT compiler is string that contains the name of the architecture for which the JIT compiler is
configured, for example "x86 32bit (little endian + unaligned)". If JIT support configured, for example "x86 32bit (little endian + unaligned)". If JIT support
@ -855,11 +855,11 @@ Further details are given with <b>pcre2_match()</b> below.
The output is an integer whose value specifies the default character sequence The output is an integer whose value specifies the default character sequence
that is recognized as meaning "newline". The values are: that is recognized as meaning "newline". The values are:
<pre> <pre>
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
</pre> </pre>
The default should normally correspond to the standard sequence for your The default should normally correspond to the standard sequence for your
operating system. operating system.
@ -891,7 +891,7 @@ heap instead of recursive function calls.
PCRE2_CONFIG_UNICODE_VERSION PCRE2_CONFIG_UNICODE_VERSION
</pre> </pre>
The <i>where</i> argument should point to a buffer that is at least 24 code The <i>where</i> argument should point to a buffer that is at least 24 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled <b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled
without Unicode support, the buffer is filled with the text "Unicode not without Unicode support, the buffer is filled with the text "Unicode not
supported". Otherwise, the Unicode version string (for example, "7.0.0") is supported". Otherwise, the Unicode version string (for example, "7.0.0") is
@ -906,7 +906,7 @@ otherwise it is set to zero. Unicode support implies UTF support.
PCRE2_CONFIG_VERSION PCRE2_CONFIG_VERSION
</pre> </pre>
The <i>where</i> argument should point to a buffer that is at least 12 code The <i>where</i> argument should point to a buffer that is at least 12 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with <b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with
the PCRE2 version string, zero-terminated. The number of code units used is the PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the terminating returned. This is the length of the string plus one unit for the terminating
@ -922,17 +922,17 @@ zero.
<b>pcre2_code_free(pcre2_code *<i>code</i>);</b> <b>pcre2_code_free(pcre2_code *<i>code</i>);</b>
</P> </P>
<P> <P>
This function compiles a pattern, defined by a pointer to a string of code The <b>pcre2_compile()</b> function compiles a pattern into an internal form.
units and a length, into an internal form. If the pattern is zero-terminated, The pattern is defined by a pointer to a string of code units and a length. If
the length should be specified as PCRE2_ZERO_TERMINATED. The function returns a the pattern is zero-terminated, the length can be specified as
pointer to a block of memory that contains the compiled pattern and related PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
data. The caller must free the memory by calling <b>pcre2_code_free()</b> when contains the compiled pattern and related data. The caller must free the memory
it is no longer needed. by calling <b>pcre2_code_free()</b> when it is no longer needed.
</P> </P>
<P> <P>
If the compile context argument <i>ccontext</i> is NULL, the memory is obtained If the compile context argument <i>ccontext</i> is NULL, memory for the compiled
by calling <b>malloc()</b>. Otherwise, it is obtained from the same memory pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from
function that was used for the compile context. the same memory function that was used for the compile context.
</P> </P>
<P> <P>
The <i>options</i> argument contains various bit settings that affect the The <i>options</i> argument contains various bit settings that affect the
@ -1247,7 +1247,7 @@ classify characters. More details are given in the section on
in the in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with UTF longer. The option is available only if PCRE2 has been compiled with Unicode
support. support.
<pre> <pre>
PCRE2_UNGREEDY PCRE2_UNGREEDY
@ -1260,9 +1260,10 @@ with Perl. It can also be set by a (?U) option setting within the pattern.
</pre> </pre>
This option causes PCRE2 to regard both the pattern and the subject strings This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. However, it is available only when PCRE2 is built to single-code-unit strings. It is available when PCRE2 is built to include
include UTF support. If not, the use of this option provokes an error. Details Unicode support (which is the default). If Unicode support is not available,
of how this option changes the behaviour of PCRE2 are given in the the use of this option provokes an error. Details of how this option changes
the behaviour of PCRE2 are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a> <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. page.
</P> </P>
@ -1318,13 +1319,12 @@ Most, but not all patterns can be optimized by the JIT compiler.
<P> <P>
PCRE2 handles caseless matching, and determines whether characters are letters, PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code digits, or whatever, by reference to a set of tables, indexed by character code
point. When running in UTF-8 mode, or using the 16-bit or 32-bit libraries, point. This applies only to characters whose code points are less than 256. By
this applies only to characters with code points less than 256. By default, default, higher-valued code points never match escapes such as \w or \d.
higher-valued code points never match escapes such as \w or \d. However, if However, if PCRE2 is built with UTF support, all characters can be tested with
PCRE2 is built with UTF support, all characters can be tested with \p and \P, \p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern
or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled; is compiled; this causes \w and friends to use Unicode property support
this causes \w and friends to use Unicode property support instead of the instead of the built-in tables.
built-in tables.
</P> </P>
<P> <P>
The use of locales with Unicode is discouraged. If you are handling characters The use of locales with Unicode is discouraged. If you are handling characters
@ -1437,9 +1437,9 @@ are no back references.
PCRE2_INFO_BSR PCRE2_INFO_BSR
</pre> </pre>
The output is a uint32_t whose value indicates what character sequences the \R The output is a uint32_t whose value indicates what character sequences the \R
escape sequence matches by default. A value of 0 means that \R matches any escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF, any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
or CRLF. The default can be overridden when a pattern is matched. matches only CR, LF, or CRLF.
<pre> <pre>
PCRE2_INFO_CAPTURECOUNT PCRE2_INFO_CAPTURECOUNT
</pre> </pre>
@ -1581,15 +1581,18 @@ values.
<P> <P>
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
entry; both of these return a <b>uint32_t</b> value. The entry size depends on entry in code units; both of these return a <b>uint32_t</b> value. The entry
the length of the longest name. PCRE2_INFO_NAMETABLE returns a pointer to the size depends on the length of the longest name.
first entry of the table. This is a PCRE2_SPTR pointer to a block of code </P>
units. In the 8-bit library, the first two bytes of each entry are the number <P>
of the capturing parenthesis, most significant byte first. In the 16-bit PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
library, the pointer points to 16-bit data units, the first of which contains a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
the parenthesis number. In the 32-bit library, the pointer points to 32-bit two bytes of each entry are the number of the capturing parenthesis, most
data units, the first of which contains the parenthesis number. The rest of the significant byte first. In the 16-bit library, the pointer points to 16-bit
entry is the corresponding name, zero terminated. code units, the first of which contains the parenthesis number. In the 32-bit
library, the pointer points to 32-bit code units, the first of which contains
the parenthesis number. The rest of the entry is the corresponding name, zero
terminated.
</P> </P>
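The 8-bit name table layout described above can be sketched as follows. This is only an illustration: it does not call PCRE2, and the helper names (entry_group_number, lookup_group) and the hand-built sample_table are hypothetical. In a real program the table pointer, entry count, and entry size would come from PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-bit name table with three fixed-size entries of 8 code
   units each: two bytes of group number (most significant byte first),
   then the zero-terminated name, padded to the entry size. The names are
   in alphabetical order, as the documentation states. */
static const uint8_t sample_table[] = {
    0, 3, 'd', 'a', 'y', 0, 0, 0,
    0, 2, 'm', 'o', 'n', 't', 'h', 0,
    0, 1, 'y', 'e', 'a', 'r', 0, 0
};

/* Decode the group number held in the first two bytes of one entry. */
static unsigned int entry_group_number(const uint8_t *entry)
{
    return ((unsigned int)entry[0] << 8) | entry[1];
}

/* Scan name_count entries of entry_size code units each and return the
   group number for name, or -1 if the name is not in the table. */
static int lookup_group(const uint8_t *table, uint32_t name_count,
                        uint32_t entry_size, const char *name)
{
    /* Each entry needs two number bytes plus at least a terminating zero. */
    assert(entry_size >= 3);
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (int)entry_group_number(entry);
    }
    return -1;
}
```

A linear scan keeps the sketch short; because the names are sorted, a binary search would also work for large tables.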
<P> <P>
The names are in alphabetical order. If (?| is used to create multiple groups The names are in alphabetical order. If (?| is used to create multiple groups
@ -1629,17 +1632,16 @@ different for each compiled pattern.
<pre> <pre>
PCRE2_INFO_NEWLINE PCRE2_INFO_NEWLINE
</pre> </pre>
The output is a <b>uint32_t</b> whose value specifies the default character The output is a <b>uint32_t</b> with one of the following values:
sequence that will be recognized as meaning "newline" while matching. The
values are:
<pre> <pre>
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
</pre> </pre>
The default can be overridden when a pattern is matched. This specifies the default character sequence that will be recognized as
meaning "newline" while matching.
<pre> <pre>
PCRE2_INFO_RECURSIONLIMIT PCRE2_INFO_RECURSIONLIMIT
</pre> </pre>
@ -1675,18 +1677,19 @@ Information about successful and unsuccessful matches is placed in a match
data block, which is an opaque structure that is accessed by function calls. In data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were string that define the matched part of the subject and any substrings that were
capured. This is know as the <i>ovector</i>. captured. This is known as the <i>ovector</i>.
</P> </P>
<P> <P>
Before calling <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> you must create a Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
match data block by calling one of the creation functions above. For <b>pcre2_jit_match()</b> you must create a match data block by calling one of
<b>pcre2_match_data_create()</b>, the first argument is the number of pairs of the creation functions above. For <b>pcre2_match_data_create()</b>, the first
offsets in the <i>ovector</i>. One pair of offsets is required to identify the argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
string that matched the whole pattern, with another pair for each captured offsets is required to identify the string that matched the whole pattern, with
substring. For example, a value of 4 creates enough space to record the matched another pair for each captured substring. For example, a value of 4 creates
portion of the subject plus three captured substrings. A minimum of at least 1 enough space to record the matched portion of the subject plus three captured
pair is imposed by <b>pcre2_match_data_create()</b>, so it is always possible to substrings. A minimum of at least 1 pair is imposed by
return the overall matched string. <b>pcre2_match_data_create()</b>, so it is always possible to return the overall
matched string.
</P> </P>
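To make the pairs of offsets concrete, here is a minimal sketch that reads one pair from an ovector-style array and copies out the corresponding substring. It does not call PCRE2: the array is hand-built, copy_group and SKETCH_UNSET are hypothetical names, and the layout (pair n at indexes 2n and 2n+1, holding start and end offsets) together with the all-ones value for an unset group are stated here as assumptions about how the ovector is filled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: an unset group is marked by the largest size_t value. */
#define SKETCH_UNSET ((size_t)-1)

/* Hypothetical result of matching a three-group date pattern against
   "2014-11-21": pair 0 is the whole match, pairs 1 to 3 the captures. */
static const char sample_subject[] = "2014-11-21";
static const size_t sample_ovector[] = { 0, 10,  0, 4,  5, 7,  8, 10 };

/* Copy group n into buf as a zero-terminated string. Returns the length
   copied, or -1 if n is out of range, the group is unset, or buf is too
   small to hold the substring plus its terminator. */
static long copy_group(const char *subject, const size_t *ovector,
                       size_t pair_count, size_t n,
                       char *buf, size_t buf_size)
{
    assert(pair_count >= 1);   /* at least the whole-match pair exists */
    if (n >= pair_count)
        return -1;

    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET || end < start)
        return -1;

    size_t len = end - start;
    if (len + 1 > buf_size)
        return -1;

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

With four pairs, as in the example value of 4 above, groups 0 to 3 can be read; asking for group 4 is rejected rather than reading past the end of the array.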
<P> <P>
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
@ -1694,15 +1697,16 @@ pointer to a compiled pattern. In this case the ovector is created to be
exactly the right size to hold all the substrings a pattern might capture. exactly the right size to hold all the substrings a pattern might capture.
</P> </P>
<P> <P>
The second argument of both these functions ia a pointer to a general context, The second argument of both these functions is a pointer to a general context,
which can specify custom memory management for obtaining the memory for the which can specify custom memory management for obtaining the memory for the
match data block. If you are not using custom memory management, pass NULL. match data block. If you are not using custom memory management, pass NULL.
</P> </P>
<P> <P>
A match data block can be used many times, with the same or different compiled A match data block can be used many times, with the same or different compiled
patterns. When it is no longer needed, it should be freed by calling patterns. When it is no longer needed, it should be freed by calling
<b>pcre2_match_data_free()</b>. How to extract information from a match data <b>pcre2_match_data_free()</b>. You can extract information from a match data
block after a match operation is described in the sections on block after a match operation has finished, using functions that are described
in the sections on
<a href="#matchedstrings">matched strings</a> <a href="#matchedstrings">matched strings</a>
and and
<a href="#matchotherdata">other match data</a> <a href="#matchotherdata">other match data</a>
@ -1816,12 +1820,10 @@ PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below. PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
</P> </P>
<P> <P>
If the pattern was successfully processed by the just-in-time (JIT) compiler, Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
the only supported options for matching using the JIT code are PCRE2_NOTBOL, compiler. If it is set, JIT matching is disabled and the normal interpretive
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, code in <b>pcre2_match()</b> is run. The remaining options are supported for JIT
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an unsupported option is used, matching.
JIT matching is disabled and the normal interpretive code in
<b>pcre2_match()</b> is run.
<pre> <pre>
PCRE2_ANCHORED PCRE2_ANCHORED
</pre> </pre>
@ -1835,17 +1837,18 @@ matching.
</pre> </pre>
This option specifies that the first character of the subject string is not the This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without PCRE2_MULTILINE (at compile time) causes circumflex it. Setting this without having set PCRE2_MULTILINE at compile time causes
never to match. This option affects only the behaviour of the circumflex circumflex never to match. This option affects only the behaviour of the
metacharacter. It does not affect \A. circumflex metacharacter. It does not affect \A.
<pre> <pre>
PCRE2_NOTEOL PCRE2_NOTEOL
</pre> </pre>
This option specifies that the end of the subject string is not the end of a This option specifies that the end of the subject string is not the end of a
line, so the dollar metacharacter should not match it nor (except in multiline line, so the dollar metacharacter should not match it nor (except in multiline
mode) a newline immediately before it. Setting this without PCRE2_MULTILINE (at mode) a newline immediately before it. Setting this without having set
compile time) causes dollar never to match. This option affects only the PCRE2_MULTILINE at compile time causes dollar never to match. This option
behaviour of the dollar metacharacter. It does not affect \Z or \z. affects only the behaviour of the dollar metacharacter. It does not affect \Z
or \z.
<pre> <pre>
PCRE2_NOTEMPTY PCRE2_NOTEMPTY
</pre> </pre>
@ -1857,13 +1860,16 @@ match the empty string, the entire match fails. For example, if the pattern
</pre> </pre>
is applied to a string not beginning with "a" or "b", it matches an empty is applied to a string not beginning with "a" or "b", it matches an empty
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
valid, so PCRE2 searches further into the string for occurrences of "a" or "b". valid, so <b>pcre2_match()</b> searches further into the string for occurrences
of "a" or "b".
<pre> <pre>
PCRE2_NOTEMPTY_ATSTART PCRE2_NOTEMPTY_ATSTART
</pre> </pre>
This is like PCRE2_NOTEMPTY, except that an empty string match that is not at This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
the start of the subject is permitted. If the pattern is anchored, such a match only at the first matching position, that is, at the start of the subject plus
can occur only if the pattern contains \K. the starting offset. An empty string match later in the subject is permitted.
If the pattern is anchored, such a match can occur only if the pattern contains
\K.
<pre> <pre>
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
</pre> </pre>
@ -1904,8 +1910,8 @@ subject characters to complete the match. If this happens when
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
testing any remaining alternatives. Only if no complete match can be found is testing any remaining alternatives. Only if no complete match can be found is
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
PCRE2_PARTIAL_SOFT says that the caller is prepared to handle a partial match, PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
but only if no complete match can be found. match, but only if no complete match can be found.
</P> </P>
<P> <P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
@ -1928,14 +1934,14 @@ a
<a href="#compilecontext">compile context.</a> <a href="#compilecontext">compile context.</a>
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. starting position is advanced after a match failure for an unanchored pattern.
</P> </P>
<P> <P>
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set, When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
and a match attempt for an unanchored pattern fails when the current position the newline convention, and a match attempt for an unanchored pattern fails
is at a CRLF sequence, and the pattern contains no explicit matches for CR or when the current starting position is at a CRLF sequence, and the pattern
LF characters, the match position is advanced by two characters instead of one, contains no explicit matches for CR or LF characters, the match position is
in other words, to after the CRLF. advanced by two characters instead of one, in other words, to after the CRLF.
</P> </P>
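<P>
The start-position rule just described can be sketched as a small C function.
This is only an illustration under stated assumptions, not the code that
<b>pcre2_match()</b> uses: the function name and both flag parameters are
hypothetical, and it assumes one code unit per character (no UTF handling).
</P>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API. Returns the next starting
   position to try after a match attempt at `pos` has failed for an
   unanchored pattern.
   crlf_is_newline: true when the newline convention is CRLF, ANYCRLF or ANY.
   pattern_has_explicit_cr_or_lf: true when the pattern contains a literal
   CR or LF, or one of the \r or \n escapes. */
size_t next_start_position(const char *subject, size_t length, size_t pos,
                           bool crlf_is_newline,
                           bool pattern_has_explicit_cr_or_lf)
{
    assert(subject != NULL && pos < length);
    if (crlf_is_newline && !pattern_has_explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF sequence */
    return pos + 1;       /* the usual one-character advance */
}
```

<P>
With the subject "a\r\nb" and a failure at offset 1, the sketch returns 3 when
CRLF is a recognized newline and the pattern has no explicit CR or LF, and 2
otherwise.
</P>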
<P> <P>
The above rule is a compromise that makes the most common cases work as The above rule is a compromise that makes the most common cases work as
@ -1948,8 +1954,8 @@ reference, and so advances only by one character after the first failure.
<P> <P>
An explicit match for CR or LF is either a literal appearance of one of those An explicit match for CR or LF is either a literal appearance of one of those
characters in the pattern, or one of the \r or \n escape sequences. Implicit characters in the pattern, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s (which includes CR and LF in matches such as [^X] do not count, nor does \s, even though it includes CR and
the characters that it matches). LF in the characters that it matches.
</P> </P>
<P> <P>
Notwithstanding the above, anomalous effects may still occur when CRLF is a Notwithstanding the above, anomalous effects may still occur when CRLF is a
@ -1967,16 +1973,16 @@ In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by addition, further substrings from the subject may be picked out by
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
book, this is called "capturing" in what follows, and the phrase "capturing book, this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a substring. subpattern" or "capturing group" is used for a fragment of a pattern that picks
PCRE2 supports several other kinds of parenthesized subpattern that do not out a substring. PCRE2 supports several other kinds of parenthesized subpattern
cause substrings to be captured. The <b>pcre2_pattern_info()</b> function can be that do not cause substrings to be captured. The <b>pcre2_pattern_info()</b>
used to find out how many capturing subpatterns there are in a compiled function can be used to find out how many capturing subpatterns there are in a
pattern. compiled pattern.
</P> </P>
<P> <P>
The overall matched string and any captured substrings are returned to the The overall matched string and any captured substrings are returned to the
caller via a vector of PCRE2_SIZE values, called the <b>ovector</b>. This is caller via a vector of PCRE2_SIZE values. This is called the <b>ovector</b>, and
contained within the is contained within the
<a href="#matchdatablock">match data block.</a> <a href="#matchdatablock">match data block.</a>
You can obtain direct access to the ovector by calling You can obtain direct access to the ovector by calling
<b>pcre2_get_ovector_pointer()</b> to find its address, and <b>pcre2_get_ovector_pointer()</b> to find its address, and
@ -2045,9 +2051,7 @@ parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
<b>pcre2_match()</b>. The other elements retain whatever values they previously <b>pcre2_match()</b>. The other elements retain whatever values they previously
had. had.
<a name="matchotherdata"></a></P> <a name="matchotherdata"></a></P>
<br><b> <br><a name="SEC25" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
Other information about the match
</b><br>
<P> <P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b> <b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br> <br>
@ -2055,7 +2059,7 @@ Other information about the match
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b> <b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P> </P>
<P> <P>
In addition to the offsets in the ovector, other information about a match is As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions. retained in the match data block and can be retrieved by the above functions.
</P> </P>
<P> <P>
@ -2071,9 +2075,7 @@ different to the value of <i>ovector[0]</i> if the pattern contains the \K
escape sequence. After a partial match, however, this value is always the same escape sequence. After a partial match, however, this value is always the same
as <i>ovector[0]</i> because \K does not affect the result of a partial match. as <i>ovector[0]</i> because \K does not affect the result of a partial match.
<a name="errorlist"></a></P> <a name="errorlist"></a></P>
<br><b> <br><a name="SEC26" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
Error return values from <b>pcre2_match()</b>
</b><br>
<P> <P>
If <b>pcre2_match()</b> fails, it returns a negative number. This can be If <b>pcre2_match()</b> fails, it returns a negative number. This can be
converted to a text string by calling <b>pcre2_get_error_message()</b>. Negative converted to a text string by calling <b>pcre2_get_error_message()</b>. Negative
@ -2108,7 +2110,7 @@ passed to a 16-bit or 32-bit library function, or vice versa.
<pre> <pre>
PCRE2_ERROR_BADOFFSET PCRE2_ERROR_BADOFFSET
</pre> </pre>
The value of <i>startoffset</i> greater than the length of the subject. The value of <i>startoffset</i> was greater than the length of the subject.
<pre> <pre>
PCRE2_ERROR_BADOPTION PCRE2_ERROR_BADOPTION
</pre> </pre>
@ -2175,14 +2177,14 @@ the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same position subpattern has been called recursively for the second time at the same position
in the subject string. Some simple patterns that might do this are detected and in the subject string. Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until run recursions between two different subpatterns, cannot be detected until matching
time. is attempted.
<pre> <pre>
PCRE2_ERROR_RECURSIONLIMIT PCRE2_ERROR_RECURSIONLIMIT
</pre> </pre>
The internal recursion limit was reached. The internal recursion limit was reached.
<a name="extractbynumber"></a></P> <a name="extractbynumber"></a></P>
<br><a name="SEC25" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br> <br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
<P> <P>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b> <b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> unsigned int <i>number</i>, PCRE2_SIZE *<i>length</i>);</b> <b> unsigned int <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
@ -2228,8 +2230,8 @@ extract the captured substrings.
<P> <P>
The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
the buffer and a pointer to a variable that contains its length in code units. the buffer and a pointer to a variable that contains its length in code units.
This is updated to contain the actual number of code units used, excluding the This is updated to contain the actual number of code units used for the
terminating zero. extracted substring, excluding the terminating zero.
</P> </P>
<P> <P>
For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
@ -2254,7 +2256,7 @@ no capturing group of that number in the pattern, or because the group with
that number did not participate in the match, or because the ovector was too that number did not participate in the match, or because the ovector was too
small to capture that group. small to capture that group.
</P> </P>
<br><a name="SEC26" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br> <br><a name="SEC28" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
<P> <P>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b> <b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b> <b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
@ -2264,10 +2266,11 @@ small to capture that group.
</P> </P>
<P> <P>
The <b>pcre2_substring_list_get()</b> function extracts all available substrings The <b>pcre2_substring_list_get()</b> function extracts all available substrings
and builds a list of pointers to them, and a second list that contains their and builds a list of pointers to them. It also (optionally) builds a second
lengths (in code units), excluding a terminating zero that is added to each of list that contains their lengths (in code units), excluding a terminating zero
them. All this is done in a single block of memory that is obtained using the that is added to each of them. All this is done in a single block of memory
same memory allocation function that was used to get the match data block. that is obtained using the same memory allocation function that was used to get
the match data block.
</P> </P>
<P> <P>
The address of the memory block is returned via <i>listptr</i>, which is also The address of the memory block is returned via <i>listptr</i>, which is also
@ -2285,10 +2288,10 @@ If this function encounters a substring that is unset, which can happen when
capturing subpattern number <i>n+1</i> matches some part of the subject, but capturing subpattern number <i>n+1</i> matches some part of the subject, but
subpattern <i>n</i> has not been used at all, it returns an empty string. This subpattern <i>n</i> has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings. substrings.
<a name="extractbyname"></a></P> <a name="extractbyname"></a></P>
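<P>
A minimal sketch of that ovector check follows. It is a standalone illustration
that does not include pcre2.h, so the UNSET_OFFSET macro is a stand-in that
assumes PCRE2_UNSET is the all-ones PCRE2_SIZE value and that PCRE2_SIZE is
size_t. The enum and the function name are hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET (assumed to be the largest size_t value). */
#define UNSET_OFFSET SIZE_MAX

typedef enum { GROUP_UNSET, GROUP_EMPTY, GROUP_NONEMPTY } group_state;

/* Hypothetical helper: classifies capturing group `n` from its pair of
   offsets, ovector[2n] (start) and ovector[2n+1] (end). */
group_state classify_group(const size_t *ovector, unsigned int n)
{
    assert(ovector != NULL);
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == UNSET_OFFSET)
        return GROUP_UNSET;           /* group did not take part in the match */
    return (start == end) ? GROUP_EMPTY : GROUP_NONEMPTY;
}
```

<P>
In real code the same test would be applied to the vector returned by
<b>pcre2_get_ovector_pointer()</b>, using PCRE2_UNSET itself, before trusting
an empty string returned by <b>pcre2_substring_list_get()</b>.
</P>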
<br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br> <br><a name="SEC29" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P> <P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b> <b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>);</b> <b> PCRE2_SPTR <i>name</i>);</b>
@ -2324,11 +2327,10 @@ that name.
</P> </P>
<P> <P>
Given the number, you can extract the substring directly, or use one of the Given the number, you can extract the substring directly, or use one of the
functions described in the previous section. For convenience, there are also functions described above. For convenience, there are also "byname" functions
"byname" functions that correspond to the "bynumber" functions, the only that correspond to the "bynumber" functions, the only difference being that the
difference being that the second argument is a name instead of a number. second argument is a name instead of a number. However, if PCRE2_DUPNAMES is
However, if PCRE2_DUPNAMES is set and there are duplicate names, set and there are duplicate names, the behaviour may not be what you want.
the behaviour may not be what you want (see the next section).
</P> </P>
<P> <P>
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple <b>Warning:</b> If the pattern uses the (?| feature to set up multiple
@ -2341,7 +2343,7 @@ names are not included in the compiled code. The matching process uses only
numbers. For this reason, the use of different names for subpatterns of the numbers. For this reason, the use of different names for subpatterns of the
same number causes an error at compile time. same number causes an error at compile time.
</P> </P>
<br><a name="SEC28" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br> <br><a name="SEC30" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
<P> <P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b> <b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b> <b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
@ -2368,8 +2370,8 @@ recognized:
Either a group number or a group name can be given for &#60;n&#62;. Curly brackets are Either a group number or a group name can be given for &#60;n&#62;. Curly brackets are
required only if the following character would be interpreted as part of the required only if the following character would be interpreted as part of the
number or name. The number may be zero to include the entire matched string. number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "[abc]" and the replacement For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "[+babcb+]". Group insertion is done by string "+$1$0$1+", the result is "=+babcb+=". Group insertion is done by
calling <b>pcre2_substring_copy_byname()</b> or <b>pcre2_substring_copy_bynumber()</b> as calling <b>pcre2_substring_copy_byname()</b> or <b>pcre2_substring_copy_bynumber()</b> as
appropriate. appropriate.
</P> </P>
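<P>
The way group references in a replacement string expand can be sketched in
plain C. This is not the <b>pcre2_substitute()</b> implementation. The function
name is hypothetical, only numeric group references and "$$" are handled (group
names are not), and -1 stands in for the library's error codes. The
<i>ovector</i> argument is an array of (start, end) offset pairs, as left by a
successful match.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: expands `replacement` into `out`, copying group text
   from `subject` according to ovector-style offset pairs. Returns the length
   of the result (excluding the terminating zero), or -1 for an unrecognized
   sequence after '$', an unknown or unset group, or too small a buffer. */
long expand_replacement(const char *subject, const size_t *ovector,
                        unsigned int group_count, const char *replacement,
                        char *out, size_t out_size)
{
    size_t used = 0;
    assert(subject != NULL && ovector != NULL);
    assert(replacement != NULL && out != NULL);
    if (out_size == 0) return -1;

    for (const char *p = replacement; *p != '\0'; ) {
        const char *src = p;          /* default: copy one literal character */
        size_t n = 1;
        if (*p != '$') {
            p++;
        } else if (p[1] == '$') {     /* "$$" inserts a single dollar */
            p += 2;
        } else {
            int braced = (p[1] == '{');
            unsigned int group = 0;
            p += braced ? 2 : 1;
            if (*p < '0' || *p > '9') return -1;
            while (*p >= '0' && *p <= '9')
                group = group * 10 + (unsigned int)(*p++ - '0');
            if (braced && *p++ != '}') return -1;
            if (group >= group_count) return -1;
            if (ovector[2 * group] == SIZE_MAX) return -1;   /* unset group */
            src = subject + ovector[2 * group];
            n = ovector[2 * group + 1] - ovector[2 * group];
        }
        if (used + n >= out_size) return -1;   /* keep room for the zero */
        memcpy(out + used, src, n);
        used += n;
    }
    out[used] = '\0';
    return (long)used;
}
```

<P>
For the subject "=abc=" matched by a(b)c, the offset pairs are (1,4) for the
whole match and (2,3) for group 1, so the replacement "+$1$0$1+" expands to
"+babcb+". The surrounding "=" characters are not part of the match and are
copied unchanged by the real function.
</P>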
@ -2402,7 +2404,7 @@ straight back. PCRE2_ERROR_BADREPLACEMENT is returned for an invalid
replacement string (unrecognized sequence following a dollar sign), and replacement string (unrecognized sequence following a dollar sign), and
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough.
</P> </P>
<br><a name="SEC29" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br> <br><a name="SEC31" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
<P> <P>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b> <b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b> <b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
@ -2423,19 +2425,21 @@ documentation.
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to <b>pcre2_substring_get_byname()</b> return the first substring corresponding to
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
returned. The <b>pcre2_substring_number_from_name()</b> function returns one of returned. The <b>pcre2_substring_number_from_name()</b> function returns
the numbers that are associated with the name, but it is not defined which it the error PCRE2_ERROR_NOUNIQUESUBSTRING.
is.
</P> </P>
<P> <P>
If you want to get full details of all captured substrings for a given name, If you want to get full details of all captured substrings for a given name,
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
argument is the compiled pattern, and the second is the name. If the third and argument is the compiled pattern, and the second is the name. If the third and
fourth arguments are NULL, the function returns a group number (it is not fourth arguments are NULL, the function returns a group number for a unique
defined which). Otherwise, the third and fourth arguments must be pointers to name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
</P>
<P>
When the third and fourth arguments are not NULL, they must be pointers to
variables that are updated by the function. After it has run, they point to the variables that are updated by the function. After it has run, they point to the
first and last entries in the name-to-number table for the given name, and the first and last entries in the name-to-number table for the given name, and the
function returns the length of each entry. In both cases, function returns the length of each entry in code units. In both cases,
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name. PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
</P> </P>
<P> <P>
@ -2445,14 +2449,14 @@ The format of the name table is described above in the section entitled
Given all the relevant entries for the name, you can extract each of their Given all the relevant entries for the name, you can extract each of their
numbers, and hence the captured data. numbers, and hence the captured data.
</P> </P>
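<P>
Walking the entries between <i>first</i> and <i>last</i> might look like the
sketch below. It assumes the 8-bit library's table layout, in which each entry
starts with the group number in two bytes, most significant first, followed by
the zero-terminated name. The helper name and its output parameters are
hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reads the group number from every name-table entry
   from `first` to `last` inclusive. `entry_size` is the length of one entry
   in code units, as returned by pcre2_substring_nametable_scan(). Stores at
   most `max` numbers and returns how many entries were visited. */
size_t collect_group_numbers(const unsigned char *first,
                             const unsigned char *last, size_t entry_size,
                             unsigned int *numbers, size_t max)
{
    assert(first != NULL && last != NULL && numbers != NULL);
    assert(first <= last && entry_size >= 3);

    size_t entries = (size_t)(last - first) / entry_size + 1;
    for (size_t i = 0; i < entries && i < max; i++) {
        const unsigned char *entry = first + i * entry_size;
        numbers[i] = ((unsigned int)entry[0] << 8) | entry[1];
    }
    return entries;
}
```

<P>
Each number obtained this way can then be passed to one of the "bynumber"
functions to fetch the corresponding captured substring.
</P>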
<br><a name="SEC30" href="#TOC1">FINDING ALL POSSIBLE MATCHES</a><br> <br><a name="SEC32" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
<P> <P>
The traditional matching function uses a similar algorithm to Perl, which stops The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match, starting at a given point in the subject. If you when it finds the first match at a given point in the subject. If you want to
want to find all possible matches, or the longest possible match at a given find all possible matches, or the longest possible match at a given position,
position, consider using the alternative matching function (see below) instead. consider using the alternative matching function (see below) instead. If you
If you cannot use the alternative function, you can kludge it up by making use cannot use the alternative function, you can kludge it up by making use of the
of the callout facility, which is described in the callout facility, which is described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a> <a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation. documentation.
</P> </P>
@ -2463,7 +2467,7 @@ substring. Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
other alternatives. Ultimately, when it runs out of matches, other alternatives. Ultimately, when it runs out of matches,
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH. <b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
<a name="dfamatch"></a></P> <a name="dfamatch"></a></P>
<br><a name="SEC31" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br> <br><a name="SEC33" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
<P> <P>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b> <b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b> <b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
@ -2591,11 +2595,10 @@ the longest matches.
<P> <P>
NOTE: PCRE2's "auto-possessification" optimization usually applies to character NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++" because there is no point in pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
backtracking into the repeated digits. For DFA matching, this means that only means that only one possible match is found. If you really do want multiple
one possible match is found. If you really do want multiple matches in such matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
cases, either use an ungreedy repeat ("a\d+?") or set the the PCRE2_NO_AUTO_POSSESS option when compiling.
PCRE2_NO_AUTO_POSSESS option when compiling.
</P> </P>
<br><b> <br><b>
Error returns from <b>pcre2_dfa_match()</b> Error returns from <b>pcre2_dfa_match()</b>
@ -2633,29 +2636,29 @@ extremely rare, as a vector of size 1000 is used.
<pre> <pre>
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
</pre> </pre>
When <b>pcre2_dfa_match()</b> is called with the <b>pcre2_dfa_RESTART</b> option, When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
some plausibility checks are made on the contents of the workspace, which some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks should contain data about the previous partial match. If any of these checks
fail, this error is given. fail, this error is given.
</P> </P>
<br><a name="SEC32" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC34" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2build</b>(3), <b>pcre2libs</b>(3), <b>pcre2callout</b>(3), <b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo(3)</b>,
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3), <b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
<b>pcre2demo(3)</b>, <b>pcre2sample</b>(3), <b>pcre2stack</b>(3). <b>pcre2sample</b>(3), <b>pcre2stack</b>(3), <b>pcre2unicode</b>(3).
</P> </P>
<br><a name="SEC33" href="#TOC1">AUTHOR</a><br> <br><a name="SEC35" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC34" href="#TOC1">REVISION</a><br> <br><a name="SEC36" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 11 November 2014 Last updated: 21 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

@ -461,7 +461,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>

@ -256,7 +256,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br> <br><a name="SEC7" href="#TOC1">REVISION</a><br>

@ -207,7 +207,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

@ -745,7 +745,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br> <br><a name="SEC14" href="#TOC1">REVISION</a><br>

@ -413,7 +413,7 @@ Philip Hazel (FAQ by Zoltan Herczeg)
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br> <br><a name="SEC13" href="#TOC1">REVISION</a><br>

@ -73,7 +73,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

@ -227,7 +227,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br> <br><a name="SEC8" href="#TOC1">REVISION</a><br>

@ -450,7 +450,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br> <br><a name="SEC10" href="#TOC1">REVISION</a><br>

@ -3231,7 +3231,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>

@ -180,7 +180,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

@ -278,7 +278,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br> <br><a name="SEC9" href="#TOC1">REVISION</a><br>

@ -90,7 +90,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

@ -33,6 +33,13 @@ the recursive call would immediately be passed back as the result of the
current call (a "tail recursion"), the function is just restarted instead. current call (a "tail recursion"), the function is just restarted instead.
</P> </P>
<P> <P>
Each time the internal <b>match()</b> function is called recursively, it uses
memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
of the GCC compiler, the stack requirements are greatly increased.
</P>
<P>
The above comments apply when <b>pcre2_match()</b> is run in its normal The above comments apply when <b>pcre2_match()</b> is run in its normal
interpretive manner. If the compiled pattern was processed by interpretive manner. If the compiled pattern was processed by
<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the <b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
@ -61,10 +68,7 @@ relevant only for <b>pcre2_match()</b> without the JIT optimization.
Reducing <b>pcre2_match()</b>'s stack usage Reducing <b>pcre2_match()</b>'s stack usage
</b><br> </b><br>
<P> <P>
Each time that the internal <b>match()</b> function is called recursively, it You can often reduce the amount of recursion, and therefore the
uses memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". You can often reduce the amount of recursion, and therefore the
amount of stack used, by modifying the pattern that is being matched. Consider, amount of stack used, by modifying the pattern that is being matched. Consider,
for example, this pattern: for example, this pattern:
<pre> <pre>
@ -187,14 +191,14 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 20 October 2014 Last updated: 21 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

@ -548,7 +548,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>

@ -1301,7 +1301,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC20" href="#TOC1">REVISION</a><br> <br><a name="SEC20" href="#TOC1">REVISION</a><br>

@ -254,7 +254,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

@ -22,9 +22,9 @@ INTRODUCTION
pattern matching using the same syntax and semantics as Perl, with just pattern matching using the same syntax and semantics as Perl, with just
a few differences. Some features that appeared in Python and the origi- a few differences. Some features that appeared in Python and the origi-
nal PCRE before they appeared in Perl are also available using the nal PCRE before they appeared in Perl are also available using the
Python syntax, there is some support for one or two .NET and Oniguruma Python syntax. There is also some support for one or two .NET and Onig-
syntax items, and there are options for requesting some minor changes uruma syntax items, and there are options for requesting some minor
that give better ECMAScript (aka JavaScript) compatibility. changes that give better ECMAScript (aka JavaScript) compatibility.
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
32-bit code units, which means that up to three separate libraries may 32-bit code units, which means that up to three separate libraries may
@ -33,7 +33,7 @@ INTRODUCTION
tively. In all three cases, strings can be interpreted either as one tively. In all three cases, strings can be interpreted either as one
character per code unit, or as UTF-encoded Unicode, with support for character per code unit, or as UTF-encoded Unicode, with support for
Unicode general category properties. Unicode support is optional at Unicode general category properties. Unicode support is optional at
build time (but is the default); however, processing strings as UTF build time (but is the default). However, processing strings as UTF
code units must be enabled explicitly at run time. The version of Uni- code units must be enabled explicitly at run time. The version of Uni-
code in use can be discovered by running code in use can be discovered by running
@ -124,19 +124,18 @@ USER DOCUMENTATION
pcre2compat discussion of Perl compatibility pcre2compat discussion of Perl compatibility
pcre2demo a demonstration C program that uses PCRE2 pcre2demo a demonstration C program that uses PCRE2
pcre2grep description of the pcre2grep command (8-bit only) pcre2grep description of the pcre2grep command (8-bit only)
pcre2jit discussion of the just-in-time optimization sup- pcre2jit discussion of just-in-time optimization support
port
pcre2limits details of size and other limits pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility pcre2partial details of the partial matching facility
pcre2pattern syntax and semantics of supported pcre2pattern syntax and semantics of supported regular
regular expressions expression patterns
pcre2perform discussion of performance issues pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage pcre2stack discussion of stack usage
pcre2syntax quick syntax reference pcre2syntax quick syntax reference
pcre2test description of the pcre2test testing command pcre2test description of the pcre2test command
pcre2unicode discussion of Unicode and UTF support pcre2unicode discussion of Unicode and UTF support
In the "man" and HTML formats, there is also a short page for each C In the "man" and HTML formats, there is also a short page for each C
@ -147,7 +146,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
Putting an actual email address here is a spam magnet. If you want to Putting an actual email address here is a spam magnet. If you want to
email me, use my two initials, followed by the two digits 10, at the email me, use my two initials, followed by the two digits 10, at the
@ -156,7 +155,7 @@ AUTHOR
REVISION REVISION
Last updated: 03 November 2014 Last updated: 18 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -506,13 +505,10 @@ NEWLINES
Each of the first three conventions is used by at least one operating Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE2 is built, a default system as its standard newline sequence. When PCRE2 is built, a default
can be specified. The default default is LF, which is the Unix stan- can be specified. The default default is LF, which is the Unix stan-
dard. When PCRE2 is run, the default can be overridden, either when a dard. However, the newline convention can be changed by an application
pattern is compiled, or when it is matched. when calling pcre2_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
The newline convention can be changed when calling pcre2_compile(), or the pcre2pattern page for details of the special character sequences.
it can be specified by special text at the start of the pattern itself;
this overrides any other settings. See the pcre2pattern page for
details of the special character sequences.
In the PCRE2 documentation the word "newline" is used to mean "the In the PCRE2 documentation the word "newline" is used to mean "the
character or pair of characters that indicate a line break". The choice character or pair of characters that indicate a line break". The choice
@ -523,8 +519,8 @@ NEWLINES
section on pcre2_match() options below. section on pcre2_match() options below.
The choice of newline convention does not affect the interpretation of The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches, the \n or \r escape sequences, nor does it affect what \R matches; this
which has its own separate control. has its own separate convention.
MULTITHREADING MULTITHREADING
@ -537,7 +533,7 @@ MULTITHREADING
cations can use it. cations can use it.
There are several different blocks of data that are used to pass infor- There are several different blocks of data that are used to pass infor-
mation between the application and the PCRE libraries. mation between the application and the PCRE2 libraries.
(1) A pointer to the compiled form of a pattern is returned to the user (1) A pointer to the compiled form of a pattern is returned to the user
when pcre2_compile() is successful. The data in the compiled pattern is when pcre2_compile() is successful. The data in the compiled pattern is
@ -634,11 +630,11 @@ PCRE2 CONTEXTS
A compile context is required if you want to change the default values A compile context is required if you want to change the default values
of any of the following compile-time parameters: of any of the following compile-time parameters:
What \R matches (Unicode newlines or CR, LF, CRLF only); What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables; PCRE2's character tables
The newline character sequence; The newline character sequence
The compile time nested parentheses limit; The compile time nested parentheses limit
An external function for stack checking. An external function for stack checking
A compile context is also required if you are using custom memory man- A compile context is also required if you are using custom memory man-
agement. If none of these apply, just pass NULL as the context argu- agement. If none of these apply, just pass NULL as the context argu-
@ -664,10 +660,9 @@ PCRE2 CONTEXTS
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
Unicode line ending sequence. The value of this parameter does not Unicode line ending sequence. The value is used by the JIT compiler and
affect what is compiled; it is just saved with the compiled pattern. by the two interpreted matching functions, pcre2_match() and
The value is used by the JIT compiler and by the two interpreted match- pcre2_dfa_match().
ing functions, pcre2_match() and pcre2_dfa_match().
int pcre2_set_character_tables(pcre2_compile_context *ccontext, int pcre2_set_character_tables(pcre2_compile_context *ccontext,
const unsigned char *tables); const unsigned char *tables);
@ -763,8 +758,8 @@ PCRE2 CONTEXTS
from zero for each position in the subject string. This limit is not from zero for each position in the subject string. This limit is not
relevant to pcre2_dfa_match(), which ignores it. relevant to pcre2_dfa_match(), which ignores it.
When pcre2_match() is called with a pattern that was successfully stud- When pcre2_match() is called with a pattern that was successfully pro-
ied with pcre2_jit_compile(), the way that the matching is executed is cessed by pcre2_jit_compile(), the way in which matching is executed is
entirely different. However, there is still the possibility of runaway entirely different. However, there is still the possibility of runaway
matching that goes on for a very long time, and so the match_limit matching that goes on for a very long time, and so the match_limit
value is also used in this case (but in a different way) to limit how value is also used in this case (but in a different way) to limit how
@ -819,17 +814,18 @@ PCRE2 CONTEXTS
remembering backtracking data, instead of recursive function calls that remembering backtracking data, instead of recursive function calls that
use the system stack. There is a discussion about PCRE2's stack usage use the system stack. There is a discussion about PCRE2's stack usage
in the pcre2stack documentation. See the pcre2build documentation for in the pcre2stack documentation. See the pcre2build documentation for
details of how to build PCRE2. Using the heap for recursion is a non- details of how to build PCRE2.
standard way of building PCRE2, for use in environments that have lim-
ited stacks. Because of the greater use of memory management, Using the heap for recursion is a non-standard way of building PCRE2,
pcre2_match() runs more slowly. Functions that are different to the for use in environments that have limited stacks. Because of the
general custom memory functions are provided so that special-purpose greater use of memory management, pcre2_match() runs more slowly. Func-
external code can be used for this case, because the memory blocks are tions that are different to the general custom memory functions are
all the same size. The blocks are retained by pcre2_match() until it is provided so that special-purpose external code can be used for this
about to exit so that they can be re-used when possible during the case, because the memory blocks are all the same size. The blocks are
match. In the absence of these functions, the normal custom memory man- retained by pcre2_match() until it is about to exit so that they can be
agement functions are used, if supplied, otherwise the system func- re-used when possible during the match. In the absence of these func-
tions. tions, the normal custom memory management functions are used, if sup-
plied, otherwise the system functions.
CHECKING BUILD-TIME OPTIONS CHECKING BUILD-TIME OPTIONS
@ -858,10 +854,10 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_CONFIG_BSR PCRE2_CONFIG_BSR
The output is an integer whose value indicates what character sequences The output is an integer whose value indicates what character sequences
the \R escape sequence matches by default. A value of 0 means that \R the \R escape sequence matches by default. A value of PCRE2_BSR_UNICODE
matches any Unicode line ending sequence; a value of 1 means that \R means that \R matches any Unicode line ending sequence; a value of
matches only CR, LF, or CRLF. The default can be overridden when a pat- PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The
tern is compiled or matched. default can be overridden when a pattern is compiled.
PCRE2_CONFIG_JIT PCRE2_CONFIG_JIT
@ -871,13 +867,13 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_CONFIG_JITTARGET PCRE2_CONFIG_JITTARGET
The where argument should point to a buffer that is at least 48 code The where argument should point to a buffer that is at least 48 code
units long. (The exact length needed can be found by calling pcre2_con- units long. (The exact length required can be found by calling
fig() with where set to NULL.) The buffer is filled with a string that pcre2_config() with where set to NULL.) The buffer is filled with a
contains the name of the architecture for which the JIT compiler is string that contains the name of the architecture for which the JIT
configured, for example "x86 32bit (little endian + unaligned)". If JIT compiler is configured, for example "x86 32bit (little endian +
support is not available, PCRE2_ERROR_BADOPTION is returned, otherwise unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
the number of code units used is returned. This is the length of the returned, otherwise the number of code units used is returned. This is
string, plus one unit for the terminating zero. the length of the string, plus one unit for the terminating zero.
PCRE2_CONFIG_LINKSIZE PCRE2_CONFIG_LINKSIZE
@ -906,11 +902,11 @@ CHECKING BUILD-TIME OPTIONS
The output is an integer whose value specifies the default character The output is an integer whose value specifies the default character
sequence that is recognized as meaning "newline". The values are: sequence that is recognized as meaning "newline". The values are:
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
The default should normally correspond to the standard sequence for The default should normally correspond to the standard sequence for
your operating system. your operating system.
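
Reviewer note: the newline conventions in the table above can be illustrated with a small, library-free sketch. This is not PCRE2's implementation. The enum and the function name `newline_length` are hypothetical stand-ins invented for the illustration, and the "any Unicode line ending" convention is deliberately left out because it covers more characters than CR and LF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for four of the PCRE2_NEWLINE_xxx conventions.
   These are NOT the library's constants or values. */
enum newline_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length in bytes of the newline that starts at s[pos] under
   the given convention, or 0 if no newline starts there. */
size_t newline_length(const char *s, size_t len, size_t pos,
                      enum newline_convention nl)
{
  assert(s != NULL && pos <= len);
  if (pos == len) return 0;

  int is_cr = s[pos] == '\r';
  int is_lf = s[pos] == '\n';
  int is_crlf = is_cr && pos + 1 < len && s[pos + 1] == '\n';

  switch (nl)
    {
    case NL_CR:      return is_cr ? 1 : 0;
    case NL_LF:      return is_lf ? 1 : 0;
    case NL_CRLF:    return is_crlf ? 2 : 0;
    case NL_ANYCRLF: return is_crlf ? 2 : ((is_cr || is_lf) ? 1 : 0);
    }
  return 0;
}
```

With the subject "a\r\nb", the position after "a" is a two-byte newline under the CRLF and ANYCRLF conventions, a one-byte newline under CR, and no newline at all under LF.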
@ -943,12 +939,12 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_CONFIG_UNICODE_VERSION PCRE2_CONFIG_UNICODE_VERSION
The where argument should point to a buffer that is at least 24 code The where argument should point to a buffer that is at least 24 code
units long. (The exact length needed can be found by calling pcre2_con- units long. (The exact length required can be found by calling
fig() with where set to NULL.) If PCRE2 has been compiled without Uni- pcre2_config() with where set to NULL.) If PCRE2 has been compiled
code support, the buffer is filled with the text "Unicode not sup- without Unicode support, the buffer is filled with the text "Unicode
ported". Otherwise, the Unicode version string (for example, "7.0.0") not supported". Otherwise, the Unicode version string (for example,
is inserted. The number of code units used is returned. This is the "7.0.0") is inserted. The number of code units used is returned. This
length of the string plus one unit for the terminating zero. is the length of the string plus one unit for the terminating zero.
PCRE2_CONFIG_UNICODE PCRE2_CONFIG_UNICODE
@ -959,9 +955,9 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_CONFIG_VERSION PCRE2_CONFIG_VERSION
The where argument should point to a buffer that is at least 12 code The where argument should point to a buffer that is at least 12 code
units long. (The exact length needed can be found by calling pcre2_con- units long. (The exact length required can be found by calling
fig() with where set to NULL.) The buffer is filled with the PCRE2 ver- pcre2_config() with where set to NULL.) The buffer is filled with the
sion string, zero-terminated. The number of code units used is PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the termi- returned. This is the length of the string plus one unit for the termi-
nating zero. nating zero.
@ -974,16 +970,18 @@ COMPILING A PATTERN
pcre2_code_free(pcre2_code *code); pcre2_code_free(pcre2_code *code);
This function compiles a pattern, defined by a pointer to a string of The pcre2_compile() function compiles a pattern into an internal form.
code units and a length, into an internal form. If the pattern is zero- The pattern is defined by a pointer to a string of code units and a
terminated, the length should be specified as PCRE2_ZERO_TERMINATED. length, If the pattern is zero-terminated, the length can be specified
The function returns a pointer to a block of memory that contains the as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
compiled pattern and related data. The caller must free the memory by memory that contains the compiled pattern and related data. The caller
calling pcre2_code_free() when it is no longer needed. must free the memory by calling pcre2_code_free() when it is no longer
needed.
If the compile context argument ccontext is NULL, the memory is If the compile context argument ccontext is NULL, memory for the com-
obtained by calling malloc(). Otherwise, it is obtained from the same piled pattern is obtained by calling malloc(). Otherwise, it is
memory function that was used for the compile context. obtained from the same memory function that was used for the compile
context.
The options argument contains various bit settings that affect the com- The options argument contains various bit settings that affect the com-
pilation. It should be zero if no options are required. The available pilation. It should be zero if no options are required. The available
@ -1280,7 +1278,8 @@ COMPILING A PATTERN
are used instead to classify characters. More details are given in the are used instead to classify characters. More details are given in the
section on generic character types in the pcre2pattern page. If you set section on generic character types in the pcre2pattern page. If you set
PCRE2_UCP, matching one of the items it affects takes much longer. The PCRE2_UCP, matching one of the items it affects takes much longer. The
option is available only if PCRE2 has been compiled with UTF support. option is available only if PCRE2 has been compiled with Unicode sup-
port.
PCRE2_UNGREEDY PCRE2_UNGREEDY
@ -1293,10 +1292,11 @@ COMPILING A PATTERN
This option causes PCRE2 to regard both the pattern and the subject This option causes PCRE2 to regard both the pattern and the subject
strings that are subsequently processed as strings of UTF characters strings that are subsequently processed as strings of UTF characters
instead of single-code-unit strings. However, it is available only when instead of single-code-unit strings. It is available when PCRE2 is
PCRE2 is built to include UTF support. If not, the use of this option built to include Unicode support (which is the default). If Unicode
provokes an error. Details of how this option changes the behaviour of support is not available, the use of this option provokes an error.
PCRE2 are given in the pcre2unicode page. Details of how this option changes the behaviour of PCRE2 are given in
the pcre2unicode page.
COMPILATION ERROR CODES COMPILATION ERROR CODES
@ -1345,14 +1345,13 @@ LOCALE SUPPORT
PCRE2 handles caseless matching, and determines whether characters are PCRE2 handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables, indexed letters, digits, or whatever, by reference to a set of tables, indexed
by character code point. When running in UTF-8 mode, or using the by character code point. This applies only to characters whose code
16-bit or 32-bit libraries, this applies only to characters with code points are less than 256. By default, higher-valued code points never
points less than 256. By default, higher-valued code points never match match escapes such as \w or \d. However, if PCRE2 is built with UTF
escapes such as \w or \d. However, if PCRE2 is built with UTF support, support, all characters can be tested with \p and \P, or, alterna-
all characters can be tested with \p and \P, or, alternatively, the tively, the PCRE2_UCP option can be set when a pattern is compiled;
PCRE2_UCP option can be set when a pattern is compiled; this causes \w this causes \w and friends to use Unicode property support instead of
and friends to use Unicode property support instead of the built-in the built-in tables.
tables.
The use of locales with Unicode is discouraged. If you are handling The use of locales with Unicode is discouraged. If you are handling
characters with code points greater than 128, you should either use characters with code points greater than 128, you should either use
@ -1463,10 +1462,9 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_BSR PCRE2_INFO_BSR
The output is a uint32_t whose value indicates what character sequences The output is a uint32_t whose value indicates what character sequences
the \R escape sequence matches by default. A value of 0 means that \R the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
matches any Unicode line ending sequence; a value of 1 means that \R \R matches any Unicode line ending sequence; a value of PCRE2_BSR_ANY-
matches only CR, LF, or CRLF. The default can be overridden when a pat- CRLF means that \R matches only CR, LF, or CRLF.
tern is matched.
PCRE2_INFO_CAPTURECOUNT PCRE2_INFO_CAPTURECOUNT
@ -1607,15 +1605,16 @@ INFORMATION ABOUT A COMPILED PATTERN
The map consists of a number of fixed-size entries. PCRE2_INFO_NAME- The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
the size of each entry; both of these return a uint32_t value. The the size of each entry in code units; both of these return a uint32_t
entry size depends on the length of the longest name. value. The entry size depends on the length of the longest name.
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
library, the first two bytes of each entry are the number of the cap- library, the first two bytes of each entry are the number of the cap-
turing parenthesis, most significant byte first. In the 16-bit library, turing parenthesis, most significant byte first. In the 16-bit library,
the pointer points to 16-bit data units, the first of which contains the pointer points to 16-bit code units, the first of which contains
the parenthesis number. In the 32-bit library, the pointer points to the parenthesis number. In the 32-bit library, the pointer points to
32-bit data units, the first of which contains the parenthesis number. 32-bit code units, the first of which contains the parenthesis number.
The rest of the entry is the corresponding name, zero terminated. The rest of the entry is the corresponding name, zero terminated.
The names are in alphabetical order. If (?| is used to create multiple The names are in alphabetical order. If (?| is used to create multiple
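
Reviewer note: the 8-bit name table layout described in this hunk (a two-byte group number, most significant byte first, followed by a zero-terminated name in a fixed-size entry) can be sketched without the library. The table `sample_table` and its entry size of 8 are made-up data for illustration. In a real program the pointer, the entry count, and the entry size would come from PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical table with two entries of 8 code units each, in
   alphabetical order: "day" is group 2, "month" is group 1. */
const uint8_t sample_table[] = {
  0, 2, 'd', 'a', 'y', 0, 0, 0,
  0, 1, 'm', 'o', 'n', 't', 'h', 0
};

/* Decode the group number held in the first two bytes of an entry,
   most significant byte first. */
unsigned int entry_group_number(const uint8_t *entry)
{
  assert(entry != NULL);
  return ((unsigned int)entry[0] << 8) | entry[1];
}

/* Return the zero-terminated name that follows the group number. */
const char *entry_name(const uint8_t *entry)
{
  assert(entry != NULL);
  return (const char *)(entry + 2);
}

/* Walk a table of name_count entries, each entry_size code units long. */
void print_name_table(const uint8_t *table, uint32_t name_count,
                      uint32_t entry_size)
{
  for (uint32_t i = 0; i < name_count; i++)
    {
    const uint8_t *entry = table + (size_t)i * entry_size;
    printf("(%u) %s\n", entry_group_number(entry), entry_name(entry));
    }
}
```

Calling `print_name_table(sample_table, 2, 8)` walks both entries. In the 16-bit and 32-bit libraries the group number occupies a single code unit, so the two-byte decoding step would not apply.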
@ -1653,17 +1652,16 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_NEWLINE PCRE2_INFO_NEWLINE
The output is a uint32_t whose value specifies the default character The output is a uint32_t with one of the following values:
sequence that will be recognized as meaning "newline" while matching.
The values are:
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
The default can be overridden when a pattern is matched. This specifies the default character sequence that will be recognized
as meaning "newline" while matching.
PCRE2_INFO_RECURSIONLIMIT PCRE2_INFO_RECURSIONLIMIT
@ -1699,17 +1697,17 @@ THE MATCH DATA BLOCK
match data block, which is an opaque structure that is accessed by match data block, which is an opaque structure that is accessed by
function calls. In particular, the match data block contains a vector function calls. In particular, the match data block contains a vector
of offsets into the subject string that define the matched part of the of offsets into the subject string that define the matched part of the
subject and any substrings that were capured. This is know as the ovec- subject and any substrings that were captured. This is know as the
tor. ovector.
Before calling pcre2_match() or pcre2_dfa_match() you must create a Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
match data block by calling one of the creation functions above. For you must create a match data block by calling one of the creation func-
pcre2_match_data_create(), the first argument is the number of pairs of tions above. For pcre2_match_data_create(), the first argument is the
offsets in the ovector. One pair of offsets is required to identify the number of pairs of offsets in the ovector. One pair of offsets is
string that matched the whole pattern, with another pair for each cap- required to identify the string that matched the whole pattern, with
tured substring. For example, a value of 4 creates enough space to another pair for each captured substring. For example, a value of 4
record the matched portion of the subject plus three captured sub- creates enough space to record the matched portion of the subject plus
strings. A minimum of at least 1 pair is imposed by three captured substrings. A minimum of at least 1 pair is imposed by
pcre2_match_data_create(), so it is always possible to return the over- pcre2_match_data_create(), so it is always possible to return the over-
all matched string. all matched string.
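
Reviewer note: the "pairs of offsets" wording may be easier to follow with a concrete sketch. The code below assumes that pair n occupies elements 2n and 2n+1 of the vector, holding the start offset and the end offset of the substring, and that the pair is set with start not greater than end. The function `copy_pair` is a hypothetical helper, not a PCRE2 function, and it uses `size_t` where the library would use its own offset type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the substring described by pair n of an offset vector into out,
   adding a terminating zero. Returns the substring length, or -1 if the
   pair index is out of range or the buffer is too small. */
int copy_pair(const char *subject, const size_t *ovector, size_t pair_count,
              size_t n, char *out, size_t out_size)
{
  assert(subject != NULL && ovector != NULL && out != NULL);
  if (n >= pair_count) return -1;

  size_t start = ovector[2 * n];      /* offset of first character */
  size_t end = ovector[2 * n + 1];    /* offset just past the last one */
  assert(start <= end);

  size_t length = end - start;
  if (length + 1 > out_size) return -1;

  memcpy(out, subject + start, length);
  out[length] = '\0';
  return (int)length;
}
```

For a vector created with 4 pairs, pair 0 describes the whole match and pairs 1 to 3 describe the three captured substrings, which matches the "value of 4" example in the text.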
@ -1718,16 +1716,17 @@ THE MATCH DATA BLOCK
be exactly the right size to hold all the substrings a pattern might be exactly the right size to hold all the substrings a pattern might
capture. capture.
The second argument of both these functions ia a pointer to a general The second argument of both these functions is a pointer to a general
context, which can specify custom memory management for obtaining the context, which can specify custom memory management for obtaining the
memory for the match data block. If you are not using custom memory memory for the match data block. If you are not using custom memory
management, pass NULL. management, pass NULL.
A match data block can be used many times, with the same or different A match data block can be used many times, with the same or different
compiled patterns. When it is no longer needed, it should be freed by compiled patterns. When it is no longer needed, it should be freed by
calling pcre2_match_data_free(). How to extract information from a calling pcre2_match_data_free(). You can extract information from a
match data block after a match operation is described in the sections match data block after a match operation has finished, using functions
on matched strings and other match data below. that are described in the sections on matched strings and other match
data below.
MATCHING A PATTERN: THE TRADITIONAL FUNCTION MATCHING A PATTERN: THE TRADITIONAL FUNCTION
@ -1826,12 +1825,10 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their
action is described below. action is described below.
If the pattern was successfully processed by the just-in-time (JIT) Setting PCRE2_ANCHORED at match time is not supported by the just-in-
compiler, the only supported options for matching using the JIT code time (JIT) compiler. If it is set, JIT matching is disabled and the
are PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, normal interpretive code in pcre2_match() is run. The remaining options
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an are supported for JIT matching.
unsupported option is used, JIT matching is disabled and the normal
interpretive code in pcre2_match() is run.
PCRE2_ANCHORED PCRE2_ANCHORED
@ -1845,18 +1842,18 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
This option specifies that first character of the subject string is not This option specifies that first character of the subject string is not
the beginning of a line, so the circumflex metacharacter should not the beginning of a line, so the circumflex metacharacter should not
match before it. Setting this without PCRE2_MULTILINE (at compile time) match before it. Setting this without having set PCRE2_MULTILINE at
causes circumflex never to match. This option affects only the behav- compile time causes circumflex never to match. This option affects only
iour of the circumflex metacharacter. It does not affect \A. the behaviour of the circumflex metacharacter. It does not affect \A.
PCRE2_NOTEOL PCRE2_NOTEOL
This option specifies that the end of the subject string is not the end This option specifies that the end of the subject string is not the end
of a line, so the dollar metacharacter should not match it nor (except of a line, so the dollar metacharacter should not match it nor (except
in multiline mode) a newline immediately before it. Setting this with- in multiline mode) a newline immediately before it. Setting this with-
out PCRE2_MULTILINE (at compile time) causes dollar never to match. out having set PCRE2_MULTILINE at compile time causes dollar never to
This option affects only the behaviour of the dollar metacharacter. It match. This option affects only the behaviour of the dollar metacharac-
does not affect \Z or \z. ter. It does not affect \Z or \z.
PCRE2_NOTEMPTY PCRE2_NOTEMPTY
@ -1869,14 +1866,16 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
is applied to a string not beginning with "a" or "b", it matches an is applied to a string not beginning with "a" or "b", it matches an
empty string at the start of the subject. With PCRE2_NOTEMPTY set, this empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
match is not valid, so PCRE2 searches further into the string for match is not valid, so pcre2_match() searches further into the string
occurrences of "a" or "b". for occurrences of "a" or "b".
PCRE2_NOTEMPTY_ATSTART PCRE2_NOTEMPTY_ATSTART
This is like PCRE2_NOTEMPTY, except that an empty string match that is This is like PCRE2_NOTEMPTY, except that it locks out an empty string
not at the start of the subject is permitted. If the pattern is match only at the first matching position, that is, at the start of the
anchored, such a match can occur only if the pattern contains \K. subject plus the starting offset. An empty string match later in the
subject is permitted. If the pattern is anchored, such a match can
occur only if the pattern contains \K.
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
@ -1910,9 +1909,9 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set,
matching continues by testing any remaining alternatives. Only if no matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT says that the PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that
caller is prepared to handle a partial match, but only if no complete the caller is prepared to handle a partial match, but only if no com-
match can be found. plete match can be found.
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
case, if a partial match is found, pcre2_match() immediately returns case, if a partial match is found, pcre2_match() immediately returns
@ -1930,15 +1929,15 @@ NEWLINE HANDLING WHEN MATCHING
ally the standard convention for the operating system. The default can ally the standard convention for the operating system. The default can
be overridden in a compile context. During matching, the newline be overridden in a compile context. During matching, the newline
choice affects the behaviour of the dot, circumflex, and dollar choice affects the behaviour of the dot, circumflex, and dollar
metacharacters. It may also alter the way the match position is metacharacters. It may also alter the way the match starting position
advanced after a match failure for an unanchored pattern. is advanced after a match failure for an unanchored pattern.
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
set, and a match attempt for an unanchored pattern fails when the cur- set as the newline convention, and a match attempt for an unanchored
rent position is at a CRLF sequence, and the pattern contains no pattern fails when the current starting position is at a CRLF sequence,
explicit matches for CR or LF characters, the match position is and the pattern contains no explicit matches for CR or LF characters,
advanced by two characters instead of one, in other words, to after the the match position is advanced by two characters instead of one, in
CRLF. other words, to after the CRLF.
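
The advance rule described above can be sketched in a few lines of C. This is only an illustration, not PCRE2's internal code: the enum, the function name sketch_next_start(), and the pattern_has_explicit_crlf flag are hypothetical, and the sketch assumes one code unit per character (no UTF handling).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the newline conventions named in the text. */
enum sketch_newline {
    SKETCH_NL_CR, SKETCH_NL_LF, SKETCH_NL_CRLF,
    SKETCH_NL_ANYCRLF, SKETCH_NL_ANY
};

/* Return the next starting position after a failed match attempt at pos
   for an unanchored pattern. The step is two characters only when CRLF
   is a valid newline, the pattern has no explicit CR or LF match, and
   the current position is at a CRLF sequence. */
static size_t sketch_next_start(const char *subject, size_t length,
                                size_t pos, enum sketch_newline nl,
                                int pattern_has_explicit_crlf)
{
    int crlf_is_newline;

    assert(subject != NULL && pos < length);
    crlf_is_newline = (nl == SKETCH_NL_CRLF || nl == SKETCH_NL_ANYCRLF ||
                       nl == SKETCH_NL_ANY);
    if (crlf_is_newline && !pattern_has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* skip to after the CRLF */
    return pos + 1;       /* ordinary one-character advance */
}
```

With the subject "a\r\nb" and a failure at offset 1, the sketch moves to offset 3 under the CRLF convention, but only to offset 2 if the pattern contains an explicit \r or \n.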
The above rule is a compromise that makes the most common cases work as The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
@ -1950,8 +1949,8 @@ NEWLINE HANDLING WHEN MATCHING
An explicit match for CR or LF is either a literal appearance of one of An explicit match for CR or LF is either a literal appearance of one of
those characters in the pattern, or one of the \r or \n escape those characters in the pattern, or one of the \r or \n escape
sequences. Implicit matches such as [^X] do not count, nor does \s sequences. Implicit matches such as [^X] do not count, nor does \s,
(which includes CR and LF in the characters that it matches). even though it includes CR and LF in the characters that it matches.
Notwithstanding the above, anomalous effects may still occur when CRLF Notwithstanding the above, anomalous effects may still occur when CRLF
is a valid newline sequence and explicit \r or \n escapes appear in the is a valid newline sequence and explicit \r or \n escapes appear in the
@ -1968,19 +1967,20 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
addition, further substrings from the subject may be picked out by addition, further substrings from the subject may be picked out by
parenthesized parts of the pattern. Following the usage in Jeffrey parenthesized parts of the pattern. Following the usage in Jeffrey
Friedl's book, this is called "capturing" in what follows, and the Friedl's book, this is called "capturing" in what follows, and the
phrase "capturing subpattern" is used for a fragment of a pattern that phrase "capturing subpattern" or "capturing group" is used for a frag-
picks out a substring. PCRE2 supports several other kinds of parenthe- ment of a pattern that picks out a substring. PCRE2 supports several
sized subpattern that do not cause substrings to be captured. The other kinds of parenthesized subpattern that do not cause substrings to
pcre2_pattern_info() function can be used to find out how many captur- be captured. The pcre2_pattern_info() function can be used to find out
ing subpatterns there are in a compiled pattern. how many capturing subpatterns there are in a compiled pattern.
The overall matched string and any captured substrings are returned to The overall matched string and any captured substrings are returned to
the caller via a vector of PCRE2_SIZE values, called the ovector. This the caller via a vector of PCRE2_SIZE values. This is called the ovec-
is contained within the match data block. You can obtain direct access tor, and is contained within the match data block. You can obtain
to the ovector by calling pcre2_get_ovector_pointer() to find its direct access to the ovector by calling pcre2_get_ovector_pointer() to
address, and pcre2_get_ovector_count() to find the number of pairs of find its address, and pcre2_get_ovector_count() to find the number of
values it contains. Alternatively, you can use the auxiliary functions pairs of values it contains. Alternatively, you can use the auxiliary
for accessing captured substrings by number or by name (see below). functions for accessing captured substrings by number or by name (see
below).
Within the ovector, the first in each pair of values is set to the off- Within the ovector, the first in each pair of values is set to the off-
set of the first code unit of a substring, and the second is set to the set of the first code unit of a substring, and the second is set to the
@ -2033,15 +2033,16 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
pcre2_match(). The other elements retain whatever values they previ- pcre2_match(). The other elements retain whatever values they previ-
ously had. ously had.
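
As an illustration of how the pairs of offsets are used, the following sketch copies one substring out of a subject, given an ovector. It is written without pcre2.h so that it stands alone: plain size_t is used where real code would use PCRE2_SIZE, SKETCH_UNSET is a local stand-in for PCRE2_UNSET, and sketch_copy_pair() is a hypothetical helper, not a PCRE2 function. It assumes an 8-bit subject, and that the second value of a pair is the offset just past the end of the substring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for PCRE2_UNSET, so the sketch compiles on its own. */
#define SKETCH_UNSET (~(size_t)0)

/* Copy the substring described by pair n of the ovector into buf, adding
   a terminating zero. Returns the length in code units, or -1 if the
   pair does not exist, is unset, or does not fit in buf. */
static long sketch_copy_pair(const char *subject, const size_t *ovector,
                             size_t pair_count, size_t n,
                             char *buf, size_t bufsize)
{
    size_t start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (n >= pair_count) return -1;

    start = ovector[2 * n];       /* offset of first code unit */
    end = ovector[2 * n + 1];     /* offset just past the last one */
    if (start == SKETCH_UNSET || end == SKETCH_UNSET) return -1;
    if (start > end) return -1;   /* not handled in this sketch */

    len = end - start;
    if (len >= bufsize) return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

In a real program the vector and the number of pairs would come from pcre2_get_ovector_pointer() and pcre2_get_ovector_count(), or the substring extraction functions would be used instead.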
Other information about the match
OTHER INFORMATION ABOUT A MATCH
PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
In addition to the offsets in the ovector, other information about a As well as the offsets in the ovector, other information about a match
match is retained in the match data block and can be retrieved by the is retained in the match data block and can be retrieved by the above
above functions. functions.
When a (*MARK) name is to be passed back, pcre2_get_mark() returns a When a (*MARK) name is to be passed back, pcre2_get_mark() returns a
pointer to the zero-terminated name, which is within the compiled pat- pointer to the zero-terminated name, which is within the compiled pat-
@ -2056,7 +2057,8 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
value is always the same as ovector[0] because \K does not affect the value is always the same as ovector[0] because \K does not affect the
result of a partial match. result of a partial match.
Error return values from pcre2_match()
ERROR RETURNS FROM pcre2_match()
If pcre2_match() fails, it returns a negative number. This can be con- If pcre2_match() fails, it returns a negative number. This can be con-
verted to a text string by calling pcre2_get_error_message(). Negative verted to a text string by calling pcre2_get_error_message(). Negative
@ -2090,7 +2092,7 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
PCRE2_ERROR_BADOFFSET PCRE2_ERROR_BADOFFSET
The value of startoffset greater than the length of the subject. The value of startoffset was greater than the length of the subject.
PCRE2_ERROR_BADOPTION PCRE2_ERROR_BADOPTION
@ -2154,7 +2156,7 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
the same position in the subject string. Some simple patterns that the same position in the subject string. Some simple patterns that
might do this are detected and faulted at compile time, but more com- might do this are detected and faulted at compile time, but more com-
plicated cases, in particular mutual recursions between two different plicated cases, in particular mutual recursions between two different
subpatterns, cannot be detected until run time. subpatterns, cannot be detected until matching is attempted.
PCRE2_ERROR_RECURSIONLIMIT PCRE2_ERROR_RECURSIONLIMIT
@ -2201,8 +2203,8 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
The final arguments of pcre2_substring_copy_bynumber() are a pointer to The final arguments of pcre2_substring_copy_bynumber() are a pointer to
the buffer and a pointer to a variable that contains its length in code the buffer and a pointer to a variable that contains its length in code
units. This is updated to contain the actual number of code units units. This is updated to contain the actual number of code units used
used, excluding the terminating zero. for the extracted substring, excluding the terminating zero.
For pcre2_substring_get_bynumber() the third and fourth arguments point For pcre2_substring_get_bynumber() the third and fourth arguments point
to variables that are updated with a pointer to the new memory and the to variables that are updated with a pointer to the new memory and the
@ -2234,11 +2236,11 @@ EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
void pcre2_substring_list_free(PCRE2_SPTR *list); void pcre2_substring_list_free(PCRE2_SPTR *list);
The pcre2_substring_list_get() function extracts all available sub- The pcre2_substring_list_get() function extracts all available sub-
strings and builds a list of pointers to them, and a second list that strings and builds a list of pointers to them. It also (optionally)
contains their lengths (in code units), excluding a terminating zero builds a second list that contains their lengths (in code units),
that is added to each of them. All this is done in a single block of excluding a terminating zero that is added to each of them. All this is
memory that is obtained using the same memory allocation function that done in a single block of memory that is obtained using the same memory
was used to get the match data block. allocation function that was used to get the match data block.
The address of the memory block is returned via listptr, which is also The address of the memory block is returned via listptr, which is also
the start of the list of string pointers. The end of the list is marked the start of the list of string pointers. The end of the list is marked
@ -2254,7 +2256,7 @@ EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
when capturing subpattern number n+1 matches some part of the subject, when capturing subpattern number n+1 matches some part of the subject,
but subpattern n has not been used at all, it returns an empty string. but subpattern n has not been used at all, it returns an empty string.
This can be distinguished from a genuine zero-length substring by This can be distinguished from a genuine zero-length substring by
inspecting the appropriate offset in the ovector, which contains inspecting the appropriate offset in the ovector, which contains
PCRE2_UNSET for unset substrings. PCRE2_UNSET for unset substrings.
@ -2288,12 +2290,11 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
there is more than one subpattern of that name. there is more than one subpattern of that name.
Given the number, you can extract the substring directly, or use one of Given the number, you can extract the substring directly, or use one of
the functions described in the previous section. For convenience, there the functions described above. For convenience, there are also "byname"
are also "byname" functions that correspond to the "bynumber" func- functions that correspond to the "bynumber" functions, the only differ-
tions, the only difference being that the second argument is a name ence being that the second argument is a name instead of a number. How-
instead of a number. However, if PCRE2_DUPNAMES is set and there are ever, if PCRE2_DUPNAMES is set and there are duplicate names, the be-
duplicate names, the behaviour may not be what you want (see the next haviour may not be what you want.
section).
Warning: If the pattern uses the (?| feature to set up multiple subpat- Warning: If the pattern uses the (?| feature to set up multiple subpat-
terns with the same number, as described in the section on duplicate terns with the same number, as described in the section on duplicate
@ -2331,8 +2332,8 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
brackets are required only if the following character would be inter- brackets are required only if the following character would be inter-
preted as part of the number or name. The number may be zero to include preted as part of the number or name. The number may be zero to include
the entire matched string. For example, if the pattern a(b)c is the entire matched string. For example, if the pattern a(b)c is
matched with "[abc]" and the replacement string "+$1$0$1+", the result matched with "=abc=" and the replacement string "+$1$0$1+", the result
is "[+babcb+]". Group insertion is done by calling pcre2_substring_copy_byname() is "=+babcb+=". Group insertion is done by calling pcre2_substring_copy_byname()
or pcre2_substring_copy_bynumber() as appropriate. or pcre2_substring_copy_bynumber() as appropriate.
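
The "=abc=" example can be traced with a small stand-alone sketch. This is not how pcre2_substitute() is implemented; it is a hypothetical helper, sketch_substitute_once(), that takes an already computed ovector, handles a single match, and understands only numbered groups written as $n or ${n}. Group names, 16-bit and 32-bit code units, and repeated substitution are deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Append len bytes to out, always keeping room for a terminating zero. */
static int sketch_append(char *out, size_t outsize, size_t *used,
                         const char *src, size_t len)
{
    if (len >= outsize - *used) return 0;
    memcpy(out + *used, src, len);
    *used += len;
    return 1;
}

/* Build: text before the match + expanded replacement + text after it.
   Returns the length of the result, or -1 on any error. */
static long sketch_substitute_once(const char *subject, const size_t *ovector,
                                   size_t pair_count, const char *repl,
                                   char *out, size_t outsize)
{
    size_t used = 0, subject_len;

    assert(subject != NULL && ovector != NULL && repl != NULL && out != NULL);
    if (outsize == 0 || pair_count == 0) return -1;
    subject_len = strlen(subject);
    if (ovector[0] > ovector[1] || ovector[1] > subject_len) return -1;
    if (!sketch_append(out, outsize, &used, subject, ovector[0])) return -1;

    for (const char *p = repl; *p != '\0'; ) {
        size_t n = 0, start, end;
        int braced;

        if (*p != '$') {                      /* literal character */
            if (!sketch_append(out, outsize, &used, p, 1)) return -1;
            p++;
            continue;
        }
        p++;                                  /* skip the dollar */
        braced = (*p == '{');
        if (braced) p++;
        if (!isdigit((unsigned char)*p)) return -1;   /* names unsupported */
        while (isdigit((unsigned char)*p)) {
            n = n * 10 + (size_t)(*p - '0');
            if (n >= pair_count) return -1;
            p++;
        }
        if (braced) {
            if (*p != '}') return -1;
            p++;
        }
        start = ovector[2 * n];
        end = ovector[2 * n + 1];
        /* An unset pair holds the largest size_t value, so it fails here. */
        if (start > end || end > subject_len) return -1;
        if (!sketch_append(out, outsize, &used, subject + start, end - start))
            return -1;
    }
    if (!sketch_append(out, outsize, &used, subject + ovector[1],
                       subject_len - ovector[1]))
        return -1;
    out[used] = '\0';
    return (long)used;
}
```

For the pattern a(b)c matched against "=abc=", the ovector holds the pairs (1,4) and (2,3), and the replacement "+$1$0$1+" expands to "+babcb+" between the two unmatched "=" characters.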
The first seven arguments of pcre2_substitute() are the same as for The first seven arguments of pcre2_substitute() are the same as for
@ -2382,19 +2383,20 @@ DUPLICATE SUBPATTERN NAMES
pcre2_substring_get_byname() return the first substring corresponding pcre2_substring_get_byname() return the first substring corresponding
to the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING to the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING
is returned. The pcre2_substring_number_from_name() function returns is returned. The pcre2_substring_number_from_name() function returns
one of the numbers that are associated with the name, but it is not the error PCRE2_ERROR_NOUNIQUESUBSTRING.
defined which it is.
If you want to get full details of all captured substrings for a given If you want to get full details of all captured substrings for a given
name, you must use the pcre2_substring_nametable_scan() function. The name, you must use the pcre2_substring_nametable_scan() function. The
first argument is the compiled pattern, and the second is the name. If first argument is the compiled pattern, and the second is the name. If
the third and fourth arguments are NULL, the function returns a group the third and fourth arguments are NULL, the function returns a group
number (it is not defined which). Otherwise, the third and fourth argu- number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
ments must be pointers to variables that are updated by the function.
After it has run, they point to the first and last entries in the name- When the third and fourth arguments are not NULL, they must be pointers
to-number table for the given name, and the function returns the length to variables that are updated by the function. After it has run, they
of each entry. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if point to the first and last entries in the name-to-number table for the
there are no entries for the given name. given name, and the function returns the length of each entry in code
units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
no entries for the given name.
The format of the name table is described above in the section entitled The format of the name table is described above in the section entitled
Information about a pattern. Given all the relevant entries for Information about a pattern. Given all the relevant entries for
@ -2402,15 +2404,15 @@ DUPLICATE SUBPATTERN NAMES
data. data.
FINDING ALL POSSIBLE MATCHES FINDING ALL POSSIBLE MATCHES AT ONE POSITION
The traditional matching function uses a similar algorithm to Perl, The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in which stops when it finds the first match at a given point in the sub-
the subject. If you want to find all possible matches, or the longest ject. If you want to find all possible matches, or the longest possible
possible match at a given position, consider using the alternative match at a given position, consider using the alternative matching
matching function (see below) instead. If you cannot use the alterna- function (see below) instead. If you cannot use the alternative func-
tive function, you can kludge it up by making use of the callout facil- tion, you can kludge it up by making use of the callout facility, which
ity, which is described in the pcre2callout documentation. is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pat- What you have to do is to insert a callout right at the end of the pat-
tern. When your callout function is called, extract and save the cur- tern. When your callout function is called, extract and save the cur-
@ -2538,12 +2540,11 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
NOTE: PCRE2's "auto-possessification" optimization usually applies to NOTE: PCRE2's "auto-possessification" optimization usually applies to
character repeats at the end of a pattern (as well as internally). For character repeats at the end of a pattern (as well as internally). For
example, the pattern "a\d+" is compiled as if it were "a\d++" because example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
there is no point in backtracking into the repeated digits. For DFA
matching, this means that only one possible match is found. If you matching, this means that only one possible match is found. If you
really do want multiple matches in such cases, either use an ungreedy really do want multiple matches in such cases, either use an ungreedy
repeat ("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compil- repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
ing. compiling.
Error returns from pcre2_dfa_match() Error returns from pcre2_dfa_match()
@ -2578,7 +2579,7 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
When pcre2_dfa_match() is called with the pcre2_dfa_RESTART option, When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
some plausibility checks are made on the contents of the workspace, some plausibility checks are made on the contents of the workspace,
which should contain data about the previous partial match. If any of which should contain data about the previous partial match. If any of
these checks fail, this error is given. these checks fail, this error is given.
@ -2586,21 +2587,21 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
SEE ALSO SEE ALSO
pcre2build(3), pcre2libs(3), pcre2callout(3), pcre2matching(3), pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
pcre2partial(3), pcre2posix(3), pcre2demo(3), pcre2sample(3), pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2stack(3),
pcre2stack(3). pcre2unicode(3).
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
Last updated: 11 November 2014 Last updated: 21 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -3043,7 +3044,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
@ -3279,7 +3280,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
@ -3465,7 +3466,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
@ -3849,7 +3850,7 @@ AUTHOR
Philip Hazel (FAQ by Zoltan Herczeg) Philip Hazel (FAQ by Zoltan Herczeg)
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
@ -3917,7 +3918,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
@ -4136,7 +4137,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
@ -4576,7 +4577,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION
@ -4801,7 +4802,7 @@ AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge, England.
REVISION REVISION

View File

@ -1,4 +1,4 @@
.TH PCRE2API 3 "18 November 2014" "PCRE2 10.00" .TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -1569,15 +1569,17 @@ values.
.P .P
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
entry; both of these return a \fBuint32_t\fP value. The entry size depends on entry in code units; both of these return a \fBuint32_t\fP value. The entry
the length of the longest name. PCRE2_INFO_NAMETABLE returns a pointer to the size depends on the length of the longest name.
first entry of the table. This is a PCRE2_SPTR pointer to a block of code .P
units. In the 8-bit library, the first two bytes of each entry are the number PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
of the capturing parenthesis, most significant byte first. In the 16-bit a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
library, the pointer points to 16-bit data units, the first of which contains two bytes of each entry are the number of the capturing parenthesis, most
the parenthesis number. In the 32-bit library, the pointer points to 32-bit significant byte first. In the 16-bit library, the pointer points to 16-bit
data units, the first of which contains the parenthesis number. The rest of the code units, the first of which contains the parenthesis number. In the 32-bit
entry is the corresponding name, zero terminated. library, the pointer points to 32-bit code units, the first of which contains
the parenthesis number. The rest of the entry is the corresponding name, zero
terminated.
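
The 8-bit entry layout described above can be illustrated with a short C sketch that scans a name table for a given name. The function sketch_lookup_name8() is hypothetical and is shown only to make the layout concrete; in a real program the table pointer, entry count, and entry size would come from PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan an 8-bit name table. Each entry is entry_size bytes: two bytes of
   group number, most significant byte first, followed by the name and a
   terminating zero. Returns the group number, or -1 if not found. */
static int sketch_lookup_name8(const uint8_t *table, uint32_t name_count,
                               uint32_t entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size >= 3);
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

For three groups named "day", "month", and "year", the longest name has five characters, so each entry would occupy eight bytes: two for the number, five for the name, and one for the terminating zero.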
.P .P
The names are in alphabetical order. If (?| is used to create multiple groups The names are in alphabetical order. If (?| is used to create multiple groups
with the same number, as described in the with the same number, as described in the
@ -1835,17 +1837,18 @@ matching.
.sp .sp
This option specifies that the first character of the subject string is not the This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without PCRE2_MULTILINE (at compile time) causes circumflex it. Setting this without having set PCRE2_MULTILINE at compile time causes
never to match. This option affects only the behaviour of the circumflex circumflex never to match. This option affects only the behaviour of the
metacharacter. It does not affect \eA. circumflex metacharacter. It does not affect \eA.
.sp .sp
PCRE2_NOTEOL PCRE2_NOTEOL
.sp .sp
This option specifies that the end of the subject string is not the end of a This option specifies that the end of the subject string is not the end of a
line, so the dollar metacharacter should not match it nor (except in multiline line, so the dollar metacharacter should not match it nor (except in multiline
mode) a newline immediately before it. Setting this without PCRE2_MULTILINE (at mode) a newline immediately before it. Setting this without having set
compile time) causes dollar never to match. This option affects only the PCRE2_MULTILINE at compile time causes dollar never to match. This option
behaviour of the dollar metacharacter. It does not affect \eZ or \ez. affects only the behaviour of the dollar metacharacter. It does not affect \eZ
or \ez.
.sp .sp
PCRE2_NOTEMPTY PCRE2_NOTEMPTY
.sp .sp
@ -1857,13 +1860,16 @@ match the empty string, the entire match fails. For example, if the pattern
.sp .sp
is applied to a string not beginning with "a" or "b", it matches an empty is applied to a string not beginning with "a" or "b", it matches an empty
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
valid, so PCRE2 searches further into the string for occurrences of "a" or "b". valid, so \fBpcre2_match()\fP searches further into the string for occurrences
of "a" or "b".
.sp .sp
PCRE2_NOTEMPTY_ATSTART PCRE2_NOTEMPTY_ATSTART
.sp .sp
This is like PCRE2_NOTEMPTY, except that an empty string match that is not at This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
the start of the subject is permitted. If the pattern is anchored, such a match only at the first matching position, that is, at the start of the subject plus
can occur only if the pattern contains \eK. the starting offset. An empty string match later in the subject is permitted.
If the pattern is anchored, such a match can occur only if the pattern contains
\eK.
.sp .sp
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
.sp .sp
@ -1913,8 +1919,8 @@ subject characters to complete the match. If this happens when
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
testing any remaining alternatives. Only if no complete match can be found is testing any remaining alternatives. Only if no complete match can be found is
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
PCRE2_PARTIAL_SOFT says that the caller is prepared to handle a partial match, PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
but only if no complete match can be found. match, but only if no complete match can be found.
.P .P
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, \fBpcre2_match()\fP immediately returns a partial match is found, \fBpcre2_match()\fP immediately returns
@ -1943,13 +1949,13 @@ compile context.
.\" .\"
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. starting position is advanced after a match failure for an unanchored pattern.
.P .P
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set, When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
and a match attempt for an unanchored pattern fails when the current position the newline convention, and a match attempt for an unanchored pattern fails
is at a CRLF sequence, and the pattern contains no explicit matches for CR or when the current starting position is at a CRLF sequence, and the pattern
LF characters, the match position is advanced by two characters instead of one, contains no explicit matches for CR or LF characters, the match position is
in other words, to after the CRLF. advanced by two characters instead of one, in other words, to after the CRLF.
.P .P
The above rule is a compromise that makes the most common cases work as The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
@ -1960,8 +1966,8 @@ reference, and so advances only by one character after the first failure.
.P .P
An explicit match for CR or LF is either a literal appearance of one of those An explicit match for CR or LF is either a literal appearance of one of those
characters in the pattern, or one of the \er or \en escape sequences. Implicit characters in the pattern, or one of the \er or \en escape sequences. Implicit
matches such as [^X] do not count, nor does \es (which includes CR and LF in matches such as [^X] do not count, nor does \es, even though it includes CR and
the characters that it matches). LF in the characters that it matches.
.P .P
Notwithstanding the above, anomalous effects may still occur when CRLF is a Notwithstanding the above, anomalous effects may still occur when CRLF is a
valid newline sequence and explicit \er or \en escapes appear in the pattern. valid newline sequence and explicit \er or \en escapes appear in the pattern.
@ -1981,15 +1987,15 @@ In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by addition, further substrings from the subject may be picked out by
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
book, this is called "capturing" in what follows, and the phrase "capturing book, this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a substring. subpattern" or "capturing group" is used for a fragment of a pattern that picks
PCRE2 supports several other kinds of parenthesized subpattern that do not out a substring. PCRE2 supports several other kinds of parenthesized subpattern
cause substrings to be captured. The \fBpcre2_pattern_info()\fP function can be that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
used to find out how many capturing subpatterns there are in a compiled function can be used to find out how many capturing subpatterns there are in a
pattern. compiled pattern.
.P .P
The overall matched string and any captured substrings are returned to the The overall matched string and any captured substrings are returned to the
caller via a vector of PCRE2_SIZE values, called the \fBovector\fP. This is caller via a vector of PCRE2_SIZE values. This is called the \fBovector\fP, and
contained within the is contained within the
.\" HTML <a href="#matchdatablock"> .\" HTML <a href="#matchdatablock">
.\" </a> .\" </a>
match data block. match data block.
@ -2062,7 +2068,7 @@ had.
. .
. .
.\" HTML <a name="matchotherdata"></a> .\" HTML <a name="matchotherdata"></a>
.SS "Other information about the match" .SH "OTHER INFORMATION ABOUT A MATCH"
.rs .rs
.sp .sp
.nf .nf
@ -2071,7 +2077,7 @@ had.
.B PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *\fImatch_data\fP); .B PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *\fImatch_data\fP);
.fi .fi
.P .P
In addition to the offsets in the ovector, other information about a match is As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions. retained in the match data block and can be retrieved by the above functions.
.P .P
When a (*MARK) name is to be passed back, \fBpcre2_get_mark()\fP returns a When a (*MARK) name is to be passed back, \fBpcre2_get_mark()\fP returns a
@ -2087,7 +2093,7 @@ as \fIovector[0]\fP because \eK does not affect the result of a partial match.
. .
. .
.\" HTML <a name="errorlist"></a> .\" HTML <a name="errorlist"></a>
.SS "Error return values from \fBpcre2_match()\fP" .SH "ERROR RETURNS FROM \fBpcre2_match()\fP"
.rs .rs
.sp .sp
If \fBpcre2_match()\fP fails, it returns a negative number. This can be If \fBpcre2_match()\fP fails, it returns a negative number. This can be
@ -2127,7 +2133,7 @@ passed to a 16-bit or 32-bit library function, or vice versa.
.sp .sp
PCRE2_ERROR_BADOFFSET PCRE2_ERROR_BADOFFSET
.sp .sp
The value of \fIstartoffset\fP greater than the length of the subject. The value of \fIstartoffset\fP was greater than the length of the subject.
.sp .sp
PCRE2_ERROR_BADOPTION PCRE2_ERROR_BADOPTION
.sp .sp
@ -2200,8 +2206,8 @@ the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same position subpattern has been called recursively for the second time at the same position
in the subject string. Some simple patterns that might do this are detected and in the subject string. Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until run recursions between two different subpatterns, cannot be detected until matching
time. is attempted.
.sp .sp
PCRE2_ERROR_RECURSIONLIMIT PCRE2_ERROR_RECURSIONLIMIT
.sp .sp
@ -2254,8 +2260,8 @@ extract the captured substrings.
.P .P
The final arguments of \fBpcre2_substring_copy_bynumber()\fP are a pointer to The final arguments of \fBpcre2_substring_copy_bynumber()\fP are a pointer to
the buffer and a pointer to a variable that contains its length in code units. the buffer and a pointer to a variable that contains its length in code units.
This is updated to contain the actual number of code units used, excluding the This is updated to contain the actual number of code units used for the
terminating zero. extracted substring, excluding the terminating zero.
.P .P
For \fBpcre2_substring_get_bynumber()\fP the third and fourth arguments point For \fBpcre2_substring_get_bynumber()\fP the third and fourth arguments point
to variables that are updated with a pointer to the new memory and the number to variables that are updated with a pointer to the new memory and the number
@ -2290,10 +2296,11 @@ small to capture that group.
.fi .fi
.P .P
The \fBpcre2_substring_list_get()\fP function extracts all available substrings The \fBpcre2_substring_list_get()\fP function extracts all available substrings
and builds a list of pointers to them, and a second list that contains their and builds a list of pointers to them. It also (optionally) builds a second
lengths (in code units), excluding a terminating zero that is added to each of list that contains their lengths (in code units), excluding a terminating zero
them. All this is done in a single block of memory that is obtained using the that is added to each of them. All this is done in a single block of memory
same memory allocation function that was used to get the match data block. that is obtained using the same memory allocation function that was used to get
the match data block.
.P .P
The address of the memory block is returned via \fIlistptr\fP, which is also The address of the memory block is returned via \fIlistptr\fP, which is also
the start of the list of string pointers. The end of the list is marked by a the start of the list of string pointers. The end of the list is marked by a
@ -2309,7 +2316,7 @@ If this function encounters a substring that is unset, which can happen when
capturing subpattern number \fIn+1\fP matches some part of the subject, but capturing subpattern number \fIn+1\fP matches some part of the subject, but
subpattern \fIn\fP has not been used at all, it returns an empty string. This subpattern \fIn\fP has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings. substrings.
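The unset-versus-empty distinction described above can be sketched without linking against the library. The type and macro below are stand-ins (SKETCH_SIZE and SKETCH_UNSET are hypothetical names) for PCRE2_SIZE and PCRE2_UNSET, which, as far as I know, pcre2.h defines as size_t and ~(PCRE2_SIZE)0. The function classify_group() is also hypothetical. It assumes the ovector holds one start/end offset pair per group.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for PCRE2_SIZE and PCRE2_UNSET so that the sketch compiles
   without pcre2.h; the real definitions are assumed to be equivalent. */
typedef size_t SKETCH_SIZE;
#define SKETCH_UNSET (~(SKETCH_SIZE)0)

enum group_state { GROUP_UNSET, GROUP_EMPTY, GROUP_NONEMPTY };

/* Classify capture group n from an ovector of start/end offset pairs.
   Group n occupies ovector[2*n] (start) and ovector[2*n+1] (end). */
enum group_state classify_group(const SKETCH_SIZE *ovector, unsigned n)
{
  SKETCH_SIZE start = ovector[2*n];
  SKETCH_SIZE end = ovector[2*n + 1];
  if (start == SKETCH_UNSET) return GROUP_UNSET;   /* never took part */
  return (start == end) ? GROUP_EMPTY : GROUP_NONEMPTY;
}
```

With real match data, the same test would be applied to the vector obtained from the match data block rather than to a hand-built array.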
. .
. .
@ -2347,11 +2354,10 @@ name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
that name. that name.
.P .P
Given the number, you can extract the substring directly, or use one of the Given the number, you can extract the substring directly, or use one of the
functions described in the previous section. For convenience, there are also functions described above. For convenience, there are also "byname" functions
"byname" functions that correspond to the "bynumber" functions, the only that correspond to the "bynumber" functions, the only difference being that the
difference being that the second argument is a name instead of a number. second argument is a name instead of a number. However, if PCRE2_DUPNAMES is
However, if PCRE2_DUPNAMES is set and there are duplicate names, set and there are duplicate names, the behaviour may not be what you want.
the behaviour may not be what you want (see the next section).
.P .P
\fBWarning:\fP If the pattern uses the (?| feature to set up multiple \fBWarning:\fP If the pattern uses the (?| feature to set up multiple
subpatterns with the same number, as described in the subpatterns with the same number, as described in the
@ -2398,8 +2404,8 @@ recognized:
Either a group number or a group name can be given for <n>. Curly brackets are Either a group number or a group name can be given for <n>. Curly brackets are
required only if the following character would be interpreted as part of the required only if the following character would be interpreted as part of the
number or name. The number may be zero to include the entire matched string. number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "[abc]" and the replacement For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "[+babcb+]". Group insertion is done by string "+$1$0$1+", the result is "=+babcb+=". Group insertion is done by
calling \fBpcre2_substring_copy_byname()\fP or \fBpcre2_substring_copy_bynumber()\fP as calling \fBpcre2_substring_copy_byname()\fP or \fBpcre2_substring_copy_bynumber()\fP as
appropriate. appropriate.
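The group-insertion rule for the a(b)c example can be illustrated with a small stand-alone sketch. It is not the library's implementation: expand_replacement() is a hypothetical helper that handles numeric references only ($n and ${n}) and takes the offsets as a plain array of size_t pairs. For the subject "=abc=", the overall match is at offsets 1 to 4 and group 1 is at offsets 2 to 3.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Expand $n and ${n} (numeric groups only) in 'repl', using the offset pairs
   in 'ov', which index into 'subject'. A zero-terminated result is written
   to 'out'. Returns 0 on success, or -1 if the buffer is too small, the
   syntax is bad, or the group is out of range or unset. */
int expand_replacement(const char *subject, const size_t *ov, size_t pairs,
                       const char *repl, char *out, size_t outsize)
{
  size_t used = 0;
  if (outsize == 0) return -1;
  for (const char *p = repl; *p != '\0'; )
    {
    if (*p != '$')                        /* literal character */
      {
      if (used + 1 >= outsize) return -1;
      out[used++] = *p++;
      continue;
      }
    p++;                                  /* skip the dollar */
    int braced = (*p == '{');
    if (braced) p++;
    if (!isdigit((unsigned char)*p)) return -1;
    size_t n = 0;
    while (isdigit((unsigned char)*p)) n = n * 10 + (size_t)(*p++ - '0');
    if (braced && *p++ != '}') return -1;
    if (n >= pairs || ov[2*n] == (size_t)-1) return -1;
    size_t len = ov[2*n + 1] - ov[2*n];
    if (used + len >= outsize) return -1; /* keep room for the zero */
    memcpy(out + used, subject + ov[2*n], len);
    used += len;
    }
  out[used] = '\0';
  return 0;
}
```

Called with the offsets {1, 4, 2, 3} and the replacement "+$1$0$1+", the helper produces "+babcb+" for the matched part, which gives "=+babcb+=" once the unmatched "=" characters on either side are kept.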
.P .P
@ -2452,18 +2458,19 @@ documentation.
When duplicates are present, \fBpcre2_substring_copy_byname()\fP and When duplicates are present, \fBpcre2_substring_copy_byname()\fP and
\fBpcre2_substring_get_byname()\fP return the first substring corresponding to \fBpcre2_substring_get_byname()\fP return the first substring corresponding to
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
returned. The \fBpcre2_substring_number_from_name()\fP function returns one of returned. The \fBpcre2_substring_number_from_name()\fP function returns
the numbers that are associated with the name, but it is not defined which it the error PCRE2_ERROR_NOUNIQUESUBSTRING.
is.
.P .P
If you want to get full details of all captured substrings for a given name, If you want to get full details of all captured substrings for a given name,
you must use the \fBpcre2_substring_nametable_scan()\fP function. The first you must use the \fBpcre2_substring_nametable_scan()\fP function. The first
argument is the compiled pattern, and the second is the name. If the third and argument is the compiled pattern, and the second is the name. If the third and
fourth arguments are NULL, the function returns a group number (it is not fourth arguments are NULL, the function returns a group number for a unique
defined which). Otherwise, the third and fourth arguments must be pointers to name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
.P
When the third and fourth arguments are not NULL, they must be pointers to
variables that are updated by the function. After it has run, they point to the variables that are updated by the function. After it has run, they point to the
first and last entries in the name-to-number table for the given name, and the first and last entries in the name-to-number table for the given name, and the
function returns the length of each entry. In both cases, function returns the length of each entry in code units. In both cases,
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name. PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
.P .P
The format of the name table is described above in the section entitled The format of the name table is described above in the section entitled
@ -2476,15 +2483,15 @@ Given all the relevant entries for the name, you can extract each of their
numbers, and hence the captured data. numbers, and hence the captured data.
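As a rough sketch of that last step, the loop below walks the entries between the first and last pointers and decodes each group number. It assumes the 8-bit name table layout, in which the first two bytes of each entry hold the group number, most significant byte first, followed by the zero-terminated name. The function collect_group_numbers() is hypothetical, and the table in the usage note is hand-built rather than taken from a compiled pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Collect the group numbers from the name table entries in [first, last],
   where each entry is 'entry_size' code units long (8-bit layout assumed).
   Returns the number of entries stored in 'numbers'. */
size_t collect_group_numbers(const unsigned char *first,
                             const unsigned char *last, size_t entry_size,
                             unsigned *numbers, size_t max_numbers)
{
  size_t count = 0;
  if (entry_size == 0) return 0;          /* avoid looping forever */
  for (const unsigned char *e = first; e <= last; e += entry_size)
    {
    if (count == max_numbers) break;
    numbers[count++] = ((unsigned)e[0] << 8) | e[1];
    }
  return count;
}
```

For a table with three five-byte entries for the name "dn" with numbers 1, 3, and 258, passing pointers to the first and third entries yields those three numbers in table order. Each number can then be used to look up the corresponding pair of offsets in the ovector.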
. .
. .
.SH "FINDING ALL POSSIBLE MATCHES" .SH "FINDING ALL POSSIBLE MATCHES AT ONE POSITION"
.rs .rs
.sp .sp
The traditional matching function uses a similar algorithm to Perl, which stops The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match, starting at a given point in the subject. If you when it finds the first match at a given point in the subject. If you want to
want to find all possible matches, or the longest possible match at a given find all possible matches, or the longest possible match at a given position,
position, consider using the alternative matching function (see below) instead. consider using the alternative matching function (see below) instead. If you
If you cannot use the alternative function, you can kludge it up by making use cannot use the alternative function, you can kludge it up by making use of the
of the callout facility, which is described in the callout facility, which is described in the
.\" HREF .\" HREF
\fBpcre2callout\fP \fBpcre2callout\fP
.\" .\"
@ -2628,11 +2635,10 @@ the longest matches.
.P .P
NOTE: PCRE2's "auto-possessification" optimization usually applies to character NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the repeats at the end of a pattern (as well as internally). For example, the
pattern "a\ed+" is compiled as if it were "a\ed++" because there is no point in pattern "a\ed+" is compiled as if it were "a\ed++". For DFA matching, this
backtracking into the repeated digits. For DFA matching, this means that only means that only one possible match is found. If you really do want multiple
one possible match is found. If you really do want multiple matches in such matches in such cases, either use an ungreedy repeat such as "a\ed+?" or set
cases, either use an ungreedy repeat ("a\ed+?") or set the the PCRE2_NO_AUTO_POSSESS option when compiling.
PCRE2_NO_AUTO_POSSESS option when compiling.
. .
. .
.SS "Error returns from \fBpcre2_dfa_match()\fP" .SS "Error returns from \fBpcre2_dfa_match()\fP"
@ -2673,7 +2679,7 @@ extremely rare, as a vector of size 1000 is used.
.sp .sp
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
.sp .sp
When \fBpcre2_dfa_match()\fP is called with the \fBpcre2_dfa_RESTART\fP option, When \fBpcre2_dfa_match()\fP is called with the \fBPCRE2_DFA_RESTART\fP option,
some plausibility checks are made on the contents of the workspace, which some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks should contain data about the previous partial match. If any of these checks
fail, this error is given. fail, this error is given.
@ -2682,9 +2688,9 @@ fail, this error is given.
.SH "SEE ALSO" .SH "SEE ALSO"
.rs .rs
.sp .sp
\fBpcre2build\fP(3), \fBpcre2libs\fP(3), \fBpcre2callout\fP(3), \fBpcre2build\fP(3), \fBpcre2callout\fP(3), \fBpcre2demo\fP(3),
\fBpcre2matching\fP(3), \fBpcre2partial\fP(3), \fBpcre2posix\fP(3), \fBpcre2matching\fP(3), \fBpcre2partial\fP(3), \fBpcre2posix\fP(3),
\fBpcre2demo\fP(3), \fBpcre2sample\fP(3), \fBpcre2stack\fP(3). \fBpcre2sample\fP(3), \fBpcre2stack\fP(3), \fBpcre2unicode\fP(3).
. .
. .
.SH AUTHOR .SH AUTHOR
@ -2701,6 +2707,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 18 November 2014 Last updated: 21 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi