More documentation and file tidies.

Philip.Hazel 2014-11-21 16:45:06 +00:00
parent ba1e2e0cbb
commit eb4fffbbf4
25 changed files with 1002 additions and 987 deletions

@ -394,7 +394,7 @@ SET(PCRE2_SOURCES
src/pcre2_pattern_info.c src/pcre2_pattern_info.c
src/pcre2_string_utils.c src/pcre2_string_utils.c
src/pcre2_study.c src/pcre2_study.c
src/pcre2_substitute.c src/pcre2_substitute.c
src/pcre2_substring.c src/pcre2_substring.c
src/pcre2_tables.c src/pcre2_tables.c
src/pcre2_ucd.c src/pcre2_ucd.c

@ -286,7 +286,7 @@ if [ $? -eq 0 ] ; then
test2stack="-S 16" test2stack="-S 16"
else else
test2stack="" test2stack=""
fi fi
# All of 8-bit, 16-bit, and 32-bit character strings may be supported, but only # All of 8-bit, 16-bit, and 32-bit character strings may be supported, but only
# one need be. # one need be.

@ -25,9 +25,10 @@ PCRE2 is the name used for a revised API for the PCRE library, which is a set
of functions, written in C, that implement regular expression pattern matching of functions, written in C, that implement regular expression pattern matching
using the same syntax and semantics as Perl, with just a few differences. Some using the same syntax and semantics as Perl, with just a few differences. Some
features that appeared in Python and the original PCRE before they appeared in features that appeared in Python and the original PCRE before they appeared in
Perl are also available using the Python syntax, there is some support for one Perl are also available using the Python syntax. There is also some support for
or two .NET and Oniguruma syntax items, and there are options for requesting one or two .NET and Oniguruma syntax items, and there are options for
some minor changes that give better ECMAScript (aka JavaScript) compatibility. requesting some minor changes that give better ECMAScript (aka JavaScript)
compatibility.
</P> </P>
<P> <P>
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
@ -36,7 +37,7 @@ The original work to extend PCRE to 16-bit and 32-bit code units was done by
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
can be interpreted either as one character per code unit, or as UTF-encoded can be interpreted either as one character per code unit, or as UTF-encoded
Unicode, with support for Unicode general category properties. Unicode support Unicode, with support for Unicode general category properties. Unicode support
is optional at build time (but is the default); however, processing strings as is optional at build time (but is the default). However, processing strings as
UTF code units must be enabled explicitly at run time. The version of Unicode UTF code units must be enabled explicitly at run time. The version of Unicode
in use can be discovered by running in use can be discovered by running
<pre> <pre>
@ -143,17 +144,17 @@ listing), and the short pages for individual functions, are concatenated in
pcre2compat discussion of Perl compatibility pcre2compat discussion of Perl compatibility
pcre2demo a demonstration C program that uses PCRE2 pcre2demo a demonstration C program that uses PCRE2
pcre2grep description of the <b>pcre2grep</b> command (8-bit only) pcre2grep description of the <b>pcre2grep</b> command (8-bit only)
pcre2jit discussion of the just-in-time optimization support pcre2jit discussion of just-in-time optimization support
pcre2limits details of size and other limits pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility pcre2partial details of the partial matching facility
pcre2pattern syntax and semantics of supported regular expressions pcre2pattern syntax and semantics of supported regular expression patterns
pcre2perform discussion of performance issues pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage pcre2stack discussion of stack usage
pcre2syntax quick syntax reference pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> testing command pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support pcre2unicode discussion of Unicode and UTF support
</pre> </pre>
In the "man" and HTML formats, there is also a short page for each C library In the "man" and HTML formats, there is also a short page for each C library
@ -165,7 +166,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<P> <P>
@ -174,7 +175,7 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P> </P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br> <br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 November 2014 Last updated: 18 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

@ -37,16 +37,18 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC22" href="#SEC22">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a> <li><a name="TOC22" href="#SEC22">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
<li><a name="TOC23" href="#SEC23">NEWLINE HANDLING WHEN MATCHING</a> <li><a name="TOC23" href="#SEC23">NEWLINE HANDLING WHEN MATCHING</a>
<li><a name="TOC24" href="#SEC24">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a> <li><a name="TOC24" href="#SEC24">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
<li><a name="TOC25" href="#SEC25">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a> <li><a name="TOC25" href="#SEC25">OTHER INFORMATION ABOUT A MATCH</a>
<li><a name="TOC26" href="#SEC26">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a> <li><a name="TOC26" href="#SEC26">ERROR RETURNS FROM <b>pcre2_match()</b></a>
<li><a name="TOC27" href="#SEC27">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a> <li><a name="TOC27" href="#SEC27">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
<li><a name="TOC28" href="#SEC28">CREATING A NEW STRING WITH SUBSTITUTIONS</a> <li><a name="TOC28" href="#SEC28">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
<li><a name="TOC29" href="#SEC29">DUPLICATE SUBPATTERN NAMES</a> <li><a name="TOC29" href="#SEC29">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
<li><a name="TOC30" href="#SEC30">FINDING ALL POSSIBLE MATCHES</a> <li><a name="TOC30" href="#SEC30">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
<li><a name="TOC31" href="#SEC31">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a> <li><a name="TOC31" href="#SEC31">DUPLICATE SUBPATTERN NAMES</a>
<li><a name="TOC32" href="#SEC32">SEE ALSO</a> <li><a name="TOC32" href="#SEC32">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
<li><a name="TOC33" href="#SEC33">AUTHOR</a> <li><a name="TOC33" href="#SEC33">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
<li><a name="TOC34" href="#SEC34">REVISION</a> <li><a name="TOC34" href="#SEC34">SEE ALSO</a>
<li><a name="TOC35" href="#SEC35">AUTHOR</a>
<li><a name="TOC36" href="#SEC36">REVISION</a>
</ul> </ul>
<P> <P>
<b>#include &#60;pcre2.h&#62;</b> <b>#include &#60;pcre2.h&#62;</b>
@ -436,13 +438,9 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
<P> <P>
Each of the first three conventions is used by at least one operating system as Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified. its standard newline sequence. When PCRE2 is built, a default can be specified.
The default default is LF, which is the Unix standard. When PCRE2 is run, the The default default is LF, which is the Unix standard. However, the newline
default can be overridden, either when a pattern is compiled, or when it is convention can be changed by an application when calling <b>pcre2_compile()</b>,
matched. or it can be specified by special text at the start of the pattern itself; this
</P>
<P>
The newline convention can be changed when calling <b>pcre2_compile()</b>, or it
can be specified by special text at the start of the pattern itself; this
overrides any other settings. See the overrides any other settings. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page for details of the special character sequences. page for details of the special character sequences.
@ -459,8 +457,8 @@ below.
</P> </P>
<P> <P>
The choice of newline convention does not affect the interpretation of The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches, which has the \n or \r escape sequences, nor does it affect what \R matches; this has
its own separate control. its own separate convention.
</P> </P>
<br><a name="SEC13" href="#TOC1">MULTITHREADING</a><br> <br><a name="SEC13" href="#TOC1">MULTITHREADING</a><br>
<P> <P>
@ -472,7 +470,7 @@ time ensuring that multithreaded applications can use it.
</P> </P>
<P> <P>
There are several different blocks of data that are used to pass information There are several different blocks of data that are used to pass information
between the application and the PCRE libraries. between the application and the PCRE2 libraries.
</P> </P>
<P> <P>
(1) A pointer to the compiled form of a pattern is returned to the user when (1) A pointer to the compiled form of a pattern is returned to the user when
@ -572,11 +570,11 @@ The compile context
A compile context is required if you want to change the default values of any A compile context is required if you want to change the default values of any
of the following compile-time parameters: of the following compile-time parameters:
<pre> <pre>
What \R matches (Unicode newlines or CR, LF, CRLF only); What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables; PCRE2's character tables
The newline character sequence; The newline character sequence
The compile time nested parentheses limit; The compile time nested parentheses limit
An external function for stack checking. An external function for stack checking
</pre> </pre>
A compile context is also required if you are using custom memory management. A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of If none of these apply, just pass NULL as the context argument of
@ -604,9 +602,8 @@ PCRE2_ERROR_BADDATA if invalid data is detected.
<br> <br>
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF, The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
ending sequence. The value of this parameter does not affect what is compiled; ending sequence. The value is used by the JIT compiler and by the two
it is just saved with the compiled pattern. The value is used by the JIT interpreted matching functions, <i>pcre2_match()</i> and
compiler and by the two interpreted matching functions, <i>pcre2_match()</i> and
<i>pcre2_dfa_match()</i>. <i>pcre2_dfa_match()</i>.
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b> <b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b> const unsigned char *<i>tables</i>);</b> <b> const unsigned char *<i>tables</i>);</b>
@ -709,12 +706,12 @@ in the subject string. This limit is not relevant to <b>pcre2_dfa_match()</b>,
which ignores it. which ignores it.
</P> </P>
<P> <P>
When <b>pcre2_match()</b> is called with a pattern that was successfully studied When <b>pcre2_match()</b> is called with a pattern that was successfully
with <b>pcre2_jit_compile()</b>, the way that the matching is executed is processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed
entirely different. However, there is still the possibility of runaway matching is entirely different. However, there is still the possibility of runaway
that goes on for a very long time, and so the <i>match_limit</i> value is also matching that goes on for a very long time, and so the <i>match_limit</i> value
used in this case (but in a different way) to limit how long the matching can is also used in this case (but in a different way) to limit how long the
continue. matching can continue.
</P> </P>
<P> <P>
The default value for the limit can be set when PCRE2 is built; the default The default value for the limit can be set when PCRE2 is built; the default
@ -770,15 +767,17 @@ stack. There is a discussion about PCRE2's stack usage in the
<a href="pcre2stack.html"><b>pcre2stack</b></a> <a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation. See the documentation. See the
<a href="pcre2build.html"><b>pcre2build</b></a> <a href="pcre2build.html"><b>pcre2build</b></a>
documentation for details of how to build PCRE2. Using the heap for recursion documentation for details of how to build PCRE2.
is a non-standard way of building PCRE2, for use in environments that have </P>
limited stacks. Because of the greater use of memory management, <P>
<b>pcre2_match()</b> runs more slowly. Functions that are different to the Using the heap for recursion is a non-standard way of building PCRE2, for use
general custom memory functions are provided so that special-purpose external in environments that have limited stacks. Because of the greater use of memory
code can be used for this case, because the memory blocks are all the same management, <b>pcre2_match()</b> runs more slowly. Functions that are different
size. The blocks are retained by <b>pcre2_match()</b> until it is about to exit to the general custom memory functions are provided so that special-purpose
so that they can be re-used when possible during the match. In the absence of external code can be used for this case, because the memory blocks are all the
these functions, the normal custom memory management functions are used, if same size. The blocks are retained by <b>pcre2_match()</b> until it is about to
exit so that they can be re-used when possible during the match. In the absence
of these functions, the normal custom memory management functions are used, if
supplied, otherwise the system functions. supplied, otherwise the system functions.
</P> </P>
<br><a name="SEC15" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br> <br><a name="SEC15" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
@ -809,9 +808,10 @@ available:
PCRE2_CONFIG_BSR PCRE2_CONFIG_BSR
</pre> </pre>
The output is an integer whose value indicates what character sequences the \R The output is an integer whose value indicates what character sequences the \R
escape sequence matches by default. A value of 0 means that \R matches any escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \R
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF, matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
or CRLF. The default can be overridden when a pattern is compiled or matched. that \R matches only CR, LF, or CRLF. The default can be overridden when a
pattern is compiled.
<pre> <pre>
PCRE2_CONFIG_JIT PCRE2_CONFIG_JIT
</pre> </pre>
@ -821,7 +821,7 @@ compiling is available; otherwise it is set to zero.
PCRE2_CONFIG_JITTARGET PCRE2_CONFIG_JITTARGET
</pre> </pre>
The <i>where</i> argument should point to a buffer that is at least 48 code The <i>where</i> argument should point to a buffer that is at least 48 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a <b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a
string that contains the name of the architecture for which the JIT compiler is string that contains the name of the architecture for which the JIT compiler is
configured, for example "x86 32bit (little endian + unaligned)". If JIT support configured, for example "x86 32bit (little endian + unaligned)". If JIT support
@ -855,11 +855,11 @@ Further details are given with <b>pcre2_match()</b> below.
The output is an integer whose value specifies the default character sequence The output is an integer whose value specifies the default character sequence
that is recognized as meaning "newline". The values are: that is recognized as meaning "newline". The values are:
<pre> <pre>
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
</pre> </pre>
The default should normally correspond to the standard sequence for your The default should normally correspond to the standard sequence for your
operating system. operating system.
@ -891,7 +891,7 @@ heap instead of recursive function calls.
PCRE2_CONFIG_UNICODE_VERSION PCRE2_CONFIG_UNICODE_VERSION
</pre> </pre>
The <i>where</i> argument should point to a buffer that is at least 24 code The <i>where</i> argument should point to a buffer that is at least 24 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled <b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled
without Unicode support, the buffer is filled with the text "Unicode not without Unicode support, the buffer is filled with the text "Unicode not
supported". Otherwise, the Unicode version string (for example, "7.0.0") is supported". Otherwise, the Unicode version string (for example, "7.0.0") is
@ -906,7 +906,7 @@ otherwise it is set to zero. Unicode support implies UTF support.
PCRE2_CONFIG_VERSION PCRE2_CONFIG_VERSION
</pre> </pre>
The <i>where</i> argument should point to a buffer that is at least 12 code The <i>where</i> argument should point to a buffer that is at least 12 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with <b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with
the PCRE2 version string, zero-terminated. The number of code units used is the PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the terminating returned. This is the length of the string plus one unit for the terminating
@ -922,17 +922,17 @@ zero.
<b>pcre2_code_free(pcre2_code *<i>code</i>);</b> <b>pcre2_code_free(pcre2_code *<i>code</i>);</b>
</P> </P>
<P> <P>
This function compiles a pattern, defined by a pointer to a string of code The <b>pcre2_compile()</b> function compiles a pattern into an internal form.
units and a length, into an internal form. If the pattern is zero-terminated, The pattern is defined by a pointer to a string of code units and a length. If
the length should be specified as PCRE2_ZERO_TERMINATED. The function returns a the pattern is zero-terminated, the length can be specified as
pointer to a block of memory that contains the compiled pattern and related PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
data. The caller must free the memory by calling <b>pcre2_code_free()</b> when contains the compiled pattern and related data. The caller must free the memory
it is no longer needed. by calling <b>pcre2_code_free()</b> when it is no longer needed.
</P> </P>
<P> <P>
If the compile context argument <i>ccontext</i> is NULL, the memory is obtained If the compile context argument <i>ccontext</i> is NULL, memory for the compiled
by calling <b>malloc()</b>. Otherwise, it is obtained from the same memory pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from
function that was used for the compile context. the same memory function that was used for the compile context.
</P> </P>
<P> <P>
The <i>options</i> argument contains various bit settings that affect the The <i>options</i> argument contains various bit settings that affect the
@ -1247,7 +1247,7 @@ classify characters. More details are given in the section on
in the in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with UTF longer. The option is available only if PCRE2 has been compiled with Unicode
support. support.
<pre> <pre>
PCRE2_UNGREEDY PCRE2_UNGREEDY
@ -1260,9 +1260,10 @@ with Perl. It can also be set by a (?U) option setting within the pattern.
</pre> </pre>
This option causes PCRE2 to regard both the pattern and the subject strings This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. However, it is available only when PCRE2 is built to single-code-unit strings. It is available when PCRE2 is built to include
include UTF support. If not, the use of this option provokes an error. Details Unicode support (which is the default). If Unicode support is not available,
of how this option changes the behaviour of PCRE2 are given in the the use of this option provokes an error. Details of how this option changes
the behaviour of PCRE2 are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a> <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. page.
</P> </P>
@ -1318,13 +1319,12 @@ Most, but not all patterns can be optimized by the JIT compiler.
<P> <P>
PCRE2 handles caseless matching, and determines whether characters are letters, PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code digits, or whatever, by reference to a set of tables, indexed by character code
point. When running in UTF-8 mode, or using the 16-bit or 32-bit libraries, point. This applies only to characters whose code points are less than 256. By
this applies only to characters with code points less than 256. By default, default, higher-valued code points never match escapes such as \w or \d.
higher-valued code points never match escapes such as \w or \d. However, if However, if PCRE2 is built with UTF support, all characters can be tested with
PCRE2 is built with UTF support, all characters can be tested with \p and \P, \p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern
or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled; is compiled; this causes \w and friends to use Unicode property support
this causes \w and friends to use Unicode property support instead of the instead of the built-in tables.
built-in tables.
</P> </P>
<P> <P>
The use of locales with Unicode is discouraged. If you are handling characters The use of locales with Unicode is discouraged. If you are handling characters
@ -1437,9 +1437,9 @@ are no back references.
PCRE2_INFO_BSR PCRE2_INFO_BSR
</pre> </pre>
The output is a uint32_t whose value indicates what character sequences the \R The output is a uint32_t whose value indicates what character sequences the \R
escape sequence matches by default. A value of 0 means that \R matches any escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF, any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
or CRLF. The default can be overridden when a pattern is matched. matches only CR, LF, or CRLF.
<pre> <pre>
PCRE2_INFO_CAPTURECOUNT PCRE2_INFO_CAPTURECOUNT
</pre> </pre>
@ -1581,15 +1581,18 @@ values.
<P> <P>
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
entry; both of these return a <b>uint32_t</b> value. The entry size depends on entry in code units; both of these return a <b>uint32_t</b> value. The entry
the length of the longest name. PCRE2_INFO_NAMETABLE returns a pointer to the size depends on the length of the longest name.
first entry of the table. This is a PCRE2_SPTR pointer to a block of code </P>
units. In the 8-bit library, the first two bytes of each entry are the number <P>
of the capturing parenthesis, most significant byte first. In the 16-bit PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
library, the pointer points to 16-bit data units, the first of which contains a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
the parenthesis number. In the 32-bit library, the pointer points to 32-bit two bytes of each entry are the number of the capturing parenthesis, most
data units, the first of which contains the parenthesis number. The rest of the significant byte first. In the 16-bit library, the pointer points to 16-bit
entry is the corresponding name, zero terminated. code units, the first of which contains the parenthesis number. In the 32-bit
library, the pointer points to 32-bit code units, the first of which contains
the parenthesis number. The rest of the entry is the corresponding name, zero
terminated.
</P> </P>
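The 8-bit entry layout described above can be sketched in C as follows. This is a minimal illustration, not code from the PCRE2 distribution: the helper names (`entry_group_number`, `entry_name`, `find_group`) and the hand-built `sample_table` are assumptions made for the example, and the table stands in for what PCRE2_INFO_NAMETABLE might return for a hypothetical pattern with two named groups, "year" (group 1) and "month" (group 2). The entry size of 8 is likewise an assumption chosen to fit the longest name plus the two number bytes and the terminating zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hand-built stand-in for an 8-bit name table (an assumption for this
   sketch). Each entry is 8 code units: two bytes of group number, most
   significant byte first, then the zero-terminated name, then padding.
   Entries are in alphabetical order of name. */
static const uint8_t sample_table[] = {
  0x00, 0x02, 'm', 'o', 'n', 't', 'h', 0,   /* "month" is group 2 */
  0x00, 0x01, 'y', 'e', 'a', 'r', 0,   0    /* "year" is group 1  */
};

/* Decode the group number held in the first two bytes of an entry. */
static uint32_t entry_group_number(const uint8_t *entry)
{
  return ((uint32_t)entry[0] << 8) | (uint32_t)entry[1];
}

/* The name starts immediately after the two number bytes. */
static const char *entry_name(const uint8_t *entry)
{
  return (const char *)(entry + 2);
}

/* Linear search of the table. Returns the group number for the name,
   or 0 if the name is absent (capturing groups are numbered from 1). */
static uint32_t find_group(const uint8_t *table, uint32_t count,
                           uint32_t entry_size, const char *name)
{
  assert(entry_size >= 3);  /* two number bytes plus at least a terminator */
  for (uint32_t i = 0; i < count; i++) {
    const uint8_t *entry = table + (size_t)i * entry_size;
    if (strcmp(entry_name(entry), name) == 0)
      return entry_group_number(entry);
  }
  return 0;
}
```

In a real program the table pointer, entry count, and entry size would come from PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE respectively. For the 16-bit and 32-bit libraries the group number occupies a single code unit, so the two-byte decoding step does not apply.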
<P> <P>
The names are in alphabetical order. If (?| is used to create multiple groups The names are in alphabetical order. If (?| is used to create multiple groups
@ -1629,17 +1632,16 @@ different for each compiled pattern.
<pre> <pre>
PCRE2_INFO_NEWLINE PCRE2_INFO_NEWLINE
</pre> </pre>
The output is a <b>uint32_t</b> whose value specifies the default character The output is a <b>uint32_t</b> with one of the following values:
sequence that will be recognized as meaning "newline" while matching. The
values are:
<pre> <pre>
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
</pre> </pre>
The default can be overridden when a pattern is matched. This specifies the default character sequence that will be recognized as
meaning "newline" while matching.
<pre> <pre>
PCRE2_INFO_RECURSIONLIMIT PCRE2_INFO_RECURSIONLIMIT
</pre> </pre>
@ -1675,18 +1677,19 @@ Information about successful and unsuccessful matches is placed in a match
data block, which is an opaque structure that is accessed by function calls. In data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were string that define the matched part of the subject and any substrings that were
capured. This is know as the <i>ovector</i>. captured. This is known as the <i>ovector</i>.
</P> </P>
<P> <P>
Before calling <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> you must create a Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
match data block by calling one of the creation functions above. For <b>pcre2_jit_match()</b> you must create a match data block by calling one of
<b>pcre2_match_data_create()</b>, the first argument is the number of pairs of the creation functions above. For <b>pcre2_match_data_create()</b>, the first
offsets in the <i>ovector</i>. One pair of offsets is required to identify the argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
string that matched the whole pattern, with another pair for each captured offsets is required to identify the string that matched the whole pattern, with
substring. For example, a value of 4 creates enough space to record the matched another pair for each captured substring. For example, a value of 4 creates
portion of the subject plus three captured substrings. A minimum of at least 1 enough space to record the matched portion of the subject plus three captured
pair is imposed by <b>pcre2_match_data_create()</b>, so it is always possible to substrings. A minimum of at least 1 pair is imposed by
return the overall matched string. <b>pcre2_match_data_create()</b>, so it is always possible to return the overall
matched string.
</P> </P>
<P> <P>
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
@ -1694,15 +1697,16 @@ pointer to a compiled pattern. In this case the ovector is created to be
exactly the right size to hold all the substrings a pattern might capture. exactly the right size to hold all the substrings a pattern might capture.
</P> </P>
<P> <P>
The second argument of both these functions ia a pointer to a general context, The second argument of both these functions is a pointer to a general context,
which can specify custom memory management for obtaining the memory for the which can specify custom memory management for obtaining the memory for the
match data block. If you are not using custom memory management, pass NULL. match data block. If you are not using custom memory management, pass NULL.
</P> </P>
<P> <P>
A match data block can be used many times, with the same or different compiled A match data block can be used many times, with the same or different compiled
patterns. When it is no longer needed, it should be freed by calling patterns. When it is no longer needed, it should be freed by calling
<b>pcre2_match_data_free()</b>. How to extract information from a match data <b>pcre2_match_data_free()</b>. You can extract information from a match data
block after a match operation is described in the sections on block after a match operation has finished, using functions that are described
in the sections on
<a href="#matchedstrings">matched strings</a> <a href="#matchedstrings">matched strings</a>
and and
<a href="#matchotherdata">other match data</a> <a href="#matchotherdata">other match data</a>
@ -1816,12 +1820,10 @@ PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below. PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
</P> </P>
<P> <P>
If the pattern was successfully processed by the just-in-time (JIT) compiler, Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
the only supported options for matching using the JIT code are PCRE2_NOTBOL, compiler. If it is set, JIT matching is disabled and the normal interpretive
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, code in <b>pcre2_match()</b> is run. The remaining options are supported for JIT
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an unsupported option is used, matching.
JIT matching is disabled and the normal interpretive code in
<b>pcre2_match()</b> is run.
<pre> <pre>
PCRE2_ANCHORED PCRE2_ANCHORED
</pre> </pre>
@ -1835,17 +1837,18 @@ matching.
</pre> </pre>
This option specifies that the first character of the subject string is not the This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without PCRE2_MULTILINE (at compile time) causes circumflex it. Setting this without having set PCRE2_MULTILINE at compile time causes
never to match. This option affects only the behaviour of the circumflex circumflex never to match. This option affects only the behaviour of the
metacharacter. It does not affect \A. circumflex metacharacter. It does not affect \A.
<pre> <pre>
PCRE2_NOTEOL PCRE2_NOTEOL
</pre> </pre>
This option specifies that the end of the subject string is not the end of a This option specifies that the end of the subject string is not the end of a
line, so the dollar metacharacter should not match it nor (except in multiline line, so the dollar metacharacter should not match it nor (except in multiline
mode) a newline immediately before it. Setting this without PCRE2_MULTILINE (at mode) a newline immediately before it. Setting this without having set
compile time) causes dollar never to match. This option affects only the PCRE2_MULTILINE at compile time causes dollar never to match. This option
behaviour of the dollar metacharacter. It does not affect \Z or \z. affects only the behaviour of the dollar metacharacter. It does not affect \Z
or \z.
<pre> <pre>
PCRE2_NOTEMPTY PCRE2_NOTEMPTY
</pre> </pre>
@ -1857,13 +1860,16 @@ match the empty string, the entire match fails. For example, if the pattern
</pre> </pre>
is applied to a string not beginning with "a" or "b", it matches an empty is applied to a string not beginning with "a" or "b", it matches an empty
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
valid, so PCRE2 searches further into the string for occurrences of "a" or "b". valid, so <b>pcre2_match()</b> searches further into the string for occurrences
of "a" or "b".
<pre> <pre>
PCRE2_NOTEMPTY_ATSTART PCRE2_NOTEMPTY_ATSTART
</pre> </pre>
This is like PCRE2_NOTEMPTY, except that an empty string match that is not at This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
the start of the subject is permitted. If the pattern is anchored, such a match only at the first matching position, that is, at the start of the subject plus
can occur only if the pattern contains \K. the starting offset. An empty string match later in the subject is permitted.
If the pattern is anchored, such a match can occur only if the pattern contains
\K.
<pre> <pre>
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
</pre> </pre>
@ -1904,8 +1910,8 @@ subject characters to complete the match. If this happens when
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
testing any remaining alternatives. Only if no complete match can be found is testing any remaining alternatives. Only if no complete match can be found is
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
PCRE2_PARTIAL_SOFT says that the caller is prepared to handle a partial match, PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
but only if no complete match can be found. match, but only if no complete match can be found.
</P> </P>
<P> <P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
@ -1928,14 +1934,14 @@ a
<a href="#compilecontext">compile context.</a> <a href="#compilecontext">compile context.</a>
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. starting position is advanced after a match failure for an unanchored pattern.
</P> </P>
<P> <P>
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set, When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
and a match attempt for an unanchored pattern fails when the current position the newline convention, and a match attempt for an unanchored pattern fails
is at a CRLF sequence, and the pattern contains no explicit matches for CR or when the current starting position is at a CRLF sequence, and the pattern
LF characters, the match position is advanced by two characters instead of one, contains no explicit matches for CR or LF characters, the match position is
in other words, to after the CRLF. advanced by two characters instead of one, in other words, to after the CRLF.
</P> </P>
<P> <P>
The above rule is a compromise that makes the most common cases work as The above rule is a compromise that makes the most common cases work as
@ -1948,8 +1954,8 @@ reference, and so advances only by one character after the first failure.
<P> <P>
An explicit match for CR or LF is either a literal appearance of one of those An explicit match for CR or LF is either a literal appearance of one of those
characters in the pattern, or one of the \r or \n escape sequences. Implicit characters in the pattern, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s (which includes CR and LF in matches such as [^X] do not count, nor does \s, even though it includes CR and
the characters that it matches). LF in the characters that it matches.
</P> </P>
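The advance rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name and its two flag parameters are hypothetical and are not part of the PCRE2 API or its internals, and the sketch treats each code unit as one character, which ignores multi-unit characters in UTF mode.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. After a failed match attempt
   for an unanchored pattern at offset `start` (which must be less than
   `length`), return the next starting offset to try.

   newline_allows_crlf: nonzero if the newline convention is CRLF, ANYCRLF,
                        or ANY.
   explicit_cr_or_lf:   nonzero if the pattern contains an explicit match
                        for CR or LF (a literal character, \r, or \n). */
size_t next_start_offset(const char *subject, size_t length, size_t start,
                         int newline_allows_crlf, int explicit_cr_or_lf)
{
    if (newline_allows_crlf && !explicit_cr_or_lf &&
        start + 1 < length &&
        subject[start] == '\r' && subject[start + 1] == '\n') {
        return start + 2;   /* skip the whole CRLF sequence */
    }
    return start + 1;       /* the usual one-character advance */
}
```

With the subject "a\r\nb" and a failure at offset 1, the sketch returns 3 when the convention recognizes CRLF and the pattern has no explicit CR or LF, and 2 otherwise.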
<P> <P>
Notwithstanding the above, anomalous effects may still occur when CRLF is a Notwithstanding the above, anomalous effects may still occur when CRLF is a
@ -1967,16 +1973,16 @@ In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by addition, further substrings from the subject may be picked out by
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
book, this is called "capturing" in what follows, and the phrase "capturing book, this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a substring. subpattern" or "capturing group" is used for a fragment of a pattern that picks
PCRE2 supports several other kinds of parenthesized subpattern that do not out a substring. PCRE2 supports several other kinds of parenthesized subpattern
cause substrings to be captured. The <b>pcre2_pattern_info()</b> function can be that do not cause substrings to be captured. The <b>pcre2_pattern_info()</b>
used to find out how many capturing subpatterns there are in a compiled function can be used to find out how many capturing subpatterns there are in a
pattern. compiled pattern.
</P> </P>
<P> <P>
The overall matched string and any captured substrings are returned to the The overall matched string and any captured substrings are returned to the
caller via a vector of PCRE2_SIZE values, called the <b>ovector</b>. This is caller via a vector of PCRE2_SIZE values. This is called the <b>ovector</b>, and
contained within the is contained within the
<a href="#matchdatablock">match data block.</a> <a href="#matchdatablock">match data block.</a>
You can obtain direct access to the ovector by calling You can obtain direct access to the ovector by calling
<b>pcre2_get_ovector_pointer()</b> to find its address, and <b>pcre2_get_ovector_pointer()</b> to find its address, and
@ -2045,9 +2051,7 @@ parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
<b>pcre2_match()</b>. The other elements retain whatever values they previously <b>pcre2_match()</b>. The other elements retain whatever values they previously
had. had.
<a name="matchotherdata"></a></P> <a name="matchotherdata"></a></P>
<br><b> <br><a name="SEC25" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
Other information about the match
</b><br>
<P> <P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b> <b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br> <br>
@ -2055,7 +2059,7 @@ Other information about the match
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b> <b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P> </P>
<P> <P>
In addition to the offsets in the ovector, other information about a match is As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions. retained in the match data block and can be retrieved by the above functions.
</P> </P>
<P> <P>
@ -2071,9 +2075,7 @@ different to the value of <i>ovector[0]</i> if the pattern contains the \K
escape sequence. After a partial match, however, this value is always the same escape sequence. After a partial match, however, this value is always the same
as <i>ovector[0]</i> because \K does not affect the result of a partial match. as <i>ovector[0]</i> because \K does not affect the result of a partial match.
<a name="errorlist"></a></P> <a name="errorlist"></a></P>
<br><b> <br><a name="SEC26" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
Error return values from <b>pcre2_match()</b>
</b><br>
<P> <P>
If <b>pcre2_match()</b> fails, it returns a negative number. This can be If <b>pcre2_match()</b> fails, it returns a negative number. This can be
converted to a text string by calling <b>pcre2_get_error_message()</b>. Negative converted to a text string by calling <b>pcre2_get_error_message()</b>. Negative
@ -2108,7 +2110,7 @@ passed to a 16-bit or 32-bit library function, or vice versa.
<pre> <pre>
PCRE2_ERROR_BADOFFSET PCRE2_ERROR_BADOFFSET
</pre> </pre>
The value of <i>startoffset</i> greater than the length of the subject. The value of <i>startoffset</i> was greater than the length of the subject.
<pre> <pre>
PCRE2_ERROR_BADOPTION PCRE2_ERROR_BADOPTION
</pre> </pre>
@ -2175,14 +2177,14 @@ the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same position subpattern has been called recursively for the second time at the same position
in the subject string. Some simple patterns that might do this are detected and in the subject string. Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until run recursions between two different subpatterns, cannot be detected until matching
time. is attempted.
<pre> <pre>
PCRE2_ERROR_RECURSIONLIMIT PCRE2_ERROR_RECURSIONLIMIT
</pre> </pre>
The internal recursion limit was reached. The internal recursion limit was reached.
<a name="extractbynumber"></a></P> <a name="extractbynumber"></a></P>
<br><a name="SEC25" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br> <br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
<P> <P>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b> <b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> unsigned int <i>number</i>, PCRE2_SIZE *<i>length</i>);</b> <b> unsigned int <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
@ -2228,8 +2230,8 @@ extract the captured substrings.
<P> <P>
The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
the buffer and a pointer to a variable that contains its length in code units. the buffer and a pointer to a variable that contains its length in code units.
This is updated to contain the actual number of code units used, excluding the This is updated to contain the actual number of code units used for the
terminating zero. extracted substring, excluding the terminating zero.
</P> </P>
<P> <P>
For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
@ -2254,7 +2256,7 @@ no capturing group of that number in the pattern, or because the group with
that number did not participate in the match, or because the ovector was too that number did not participate in the match, or because the ovector was too
small to capture that group. small to capture that group.
</P> </P>
<br><a name="SEC26" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br> <br><a name="SEC28" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
<P> <P>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b> <b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b> <b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
@ -2264,10 +2266,11 @@ small to capture that group.
</P> </P>
<P> <P>
The <b>pcre2_substring_list_get()</b> function extracts all available substrings The <b>pcre2_substring_list_get()</b> function extracts all available substrings
and builds a list of pointers to them, and a second list that contains their and builds a list of pointers to them. It also (optionally) builds a second
lengths (in code units), excluding a terminating zero that is added to each of list that contains their lengths (in code units), excluding a terminating zero
them. All this is done in a single block of memory that is obtained using the that is added to each of them. All this is done in a single block of memory
same memory allocation function that was used to get the match data block. that is obtained using the same memory allocation function that was used to get
the match data block.
</P> </P>
<P> <P>
The address of the memory block is returned via <i>listptr</i>, which is also The address of the memory block is returned via <i>listptr</i>, which is also
@ -2285,10 +2288,10 @@ If this function encounters a substring that is unset, which can happen when
capturing subpattern number <i>n+1</i> matches some part of the subject, but capturing subpattern number <i>n+1</i> matches some part of the subject, but
subpattern <i>n</i> has not been used at all, it returns an empty string. This subpattern <i>n</i> has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings. substrings.
<a name="extractbyname"></a></P> <a name="extractbyname"></a></P>
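The check described above, telling an unset group apart from a genuine zero-length match, might look like the following sketch. To keep it self-contained it does not include pcre2.h: SKETCH_UNSET is a stand-in for PCRE2_UNSET (defined as (~(PCRE2_SIZE)0) in released versions of pcre2.h), size_t stands in for PCRE2_SIZE, and the helper name is hypothetical. In real code the ovector would come from pcre2_get_ovector_pointer().

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET; an assumption made so that the sketch
   compiles without pcre2.h. */
#define SKETCH_UNSET (~(size_t)0)

/* Hypothetical helper, not a PCRE2 function. Returns nonzero if capturing
   group n did not participate in the match. The caller must ensure that
   the ovector holds at least n+1 pairs of offsets. */
int group_is_unset(const size_t *ovector, unsigned int n)
{
    /* Each group owns a pair of offsets: ovector[2n] is the start and
       ovector[2n+1] is the end. Both are unset for an unused group, so
       testing the start offset is sufficient. */
    return ovector[2 * n] == SKETCH_UNSET;
}
```

A group that matched the empty string has equal start and end offsets, neither of which is the unset value, so the helper returns zero for it.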
<br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br> <br><a name="SEC29" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P> <P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b> <b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>);</b> <b> PCRE2_SPTR <i>name</i>);</b>
@ -2324,11 +2327,10 @@ that name.
</P> </P>
<P> <P>
Given the number, you can extract the substring directly, or use one of the Given the number, you can extract the substring directly, or use one of the
functions described in the previous section. For convenience, there are also functions described above. For convenience, there are also "byname" functions
"byname" functions that correspond to the "bynumber" functions, the only that correspond to the "bynumber" functions, the only difference being that the
difference being that the second argument is a name instead of a number. second argument is a name instead of a number. However, if PCRE2_DUPNAMES is
However, if PCRE2_DUPNAMES is set and there are duplicate names, set and there are duplicate names, the behaviour may not be what you want.
the behaviour may not be what you want (see the next section).
</P> </P>
<P> <P>
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple <b>Warning:</b> If the pattern uses the (?| feature to set up multiple
@ -2341,7 +2343,7 @@ names are not included in the compiled code. The matching process uses only
numbers. For this reason, the use of different names for subpatterns of the numbers. For this reason, the use of different names for subpatterns of the
same number causes an error at compile time. same number causes an error at compile time.
</P> </P>
<br><a name="SEC28" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br> <br><a name="SEC30" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
<P> <P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b> <b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b> <b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
@ -2368,8 +2370,8 @@ recognized:
Either a group number or a group name can be given for &#60;n&#62;. Curly brackets are Either a group number or a group name can be given for &#60;n&#62;. Curly brackets are
required only if the following character would be interpreted as part of the required only if the following character would be interpreted as part of the
number or name. The number may be zero to include the entire matched string. number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "[abc]" and the replacement For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "[+babcb+]". Group insertion is done by string "+$1$0$1+", the result is "=+babcb+=". Group insertion is done by
calling <b>pcre2_substring_copy_byname()</b> or <b>pcre2_substring_copy_bynumber()</b> as calling <b>pcre2_substring_copy_byname()</b> or <b>pcre2_substring_copy_bynumber()</b> as
appropriate. appropriate.
</P> </P>
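To make the worked example above concrete, here is a minimal sketch of how the replacement string is expanded for one match. It is not how pcre2_substitute() is implemented: the function name is hypothetical, only single-digit $n references are handled (no curly brackets, no group names), every referenced group is assumed to be set, and size_t stands in for PCRE2_SIZE.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function. Expands single-digit $n
   references in `repl` using the offset pairs in `ovector` (which holds
   `pairs` pairs), writing a zero-terminated result to `out`. Returns the
   length of the result, or -1 for a bad reference or a buffer that is too
   small. */
int expand_replacement(const char *subject, const size_t *ovector,
                       unsigned int pairs, const char *repl,
                       char *out, size_t outsize)
{
    size_t used = 0;
    if (outsize == 0) return -1;
    for (const char *p = repl; *p != '\0'; p++) {
        const char *src = p;   /* default: copy one literal character */
        size_t len = 1;
        if (*p == '$') {
            unsigned int n;
            p++;
            if (*p < '0' || *p > '9') return -1;   /* unrecognized sequence */
            n = (unsigned int)(*p - '0');
            if (n >= pairs) return -1;             /* no such group */
            src = subject + ovector[2 * n];
            len = ovector[2 * n + 1] - ovector[2 * n];
        }
        if (used + len >= outsize) return -1;      /* keep room for the zero */
        memcpy(out + used, src, len);
        used += len;
    }
    out[used] = '\0';
    return (int)used;
}
```

For the subject "=abc=" matched by a(b)c, the ovector holds the pairs (1,4) and (2,3), and the replacement "+$1$0$1+" expands to "+babcb+". Placing that between the unmatched prefix "=" and suffix "=" gives the final string "=+babcb+=".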
@ -2402,7 +2404,7 @@ straight back. PCRE2_ERROR_BADREPLACEMENT is returned for an invalid
replacement string (unrecognized sequence following a dollar sign), and replacement string (unrecognized sequence following a dollar sign), and
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough.
</P> </P>
<br><a name="SEC29" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br> <br><a name="SEC31" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
<P> <P>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b> <b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b> <b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
@ -2423,19 +2425,21 @@ documentation.
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to <b>pcre2_substring_get_byname()</b> return the first substring corresponding to
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
returned. The <b>pcre2_substring_number_from_name()</b> function returns one of returned. The <b>pcre2_substring_number_from_name()</b> function returns
the numbers that are associated with the name, but it is not defined which it the error PCRE2_ERROR_NOUNIQUESUBSTRING.
is.
</P> </P>
<P> <P>
If you want to get full details of all captured substrings for a given name, If you want to get full details of all captured substrings for a given name,
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
argument is the compiled pattern, and the second is the name. If the third and argument is the compiled pattern, and the second is the name. If the third and
fourth arguments are NULL, the function returns a group number (it is not fourth arguments are NULL, the function returns a group number for a unique
defined which). Otherwise, the third and fourth arguments must be pointers to name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
</P>
<P>
When the third and fourth arguments are not NULL, they must be pointers to
variables that are updated by the function. After it has run, they point to the variables that are updated by the function. After it has run, they point to the
first and last entries in the name-to-number table for the given name, and the first and last entries in the name-to-number table for the given name, and the
function returns the length of each entry. In both cases, function returns the length of each entry in code units. In both cases,
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name. PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
</P> </P>
<P> <P>
@ -2445,14 +2449,14 @@ The format of the name table is described above in the section entitled
Given all the relevant entries for the name, you can extract each of their Given all the relevant entries for the name, you can extract each of their
numbers, and hence the captured data. numbers, and hence the captured data.
</P> </P>
<br><a name="SEC30" href="#TOC1">FINDING ALL POSSIBLE MATCHES</a><br> <br><a name="SEC32" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
<P> <P>
The traditional matching function uses a similar algorithm to Perl, which stops The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match, starting at a given point in the subject. If you when it finds the first match at a given point in the subject. If you want to
want to find all possible matches, or the longest possible match at a given find all possible matches, or the longest possible match at a given position,
position, consider using the alternative matching function (see below) instead. consider using the alternative matching function (see below) instead. If you
If you cannot use the alternative function, you can kludge it up by making use cannot use the alternative function, you can kludge it up by making use of the
of the callout facility, which is described in the callout facility, which is described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a> <a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation. documentation.
</P> </P>
@ -2463,7 +2467,7 @@ substring. Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
other alternatives. Ultimately, when it runs out of matches, other alternatives. Ultimately, when it runs out of matches,
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH. <b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
<a name="dfamatch"></a></P> <a name="dfamatch"></a></P>
<br><a name="SEC31" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br> <br><a name="SEC33" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
<P> <P>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b> <b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b> <b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
@ -2591,11 +2595,10 @@ the longest matches.
<P> <P>
NOTE: PCRE2's "auto-possessification" optimization usually applies to character NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++" because there is no point in pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
backtracking into the repeated digits. For DFA matching, this means that only means that only one possible match is found. If you really do want multiple
one possible match is found. If you really do want multiple matches in such matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
cases, either use an ungreedy repeat ("a\d+?") or set the the PCRE2_NO_AUTO_POSSESS option when compiling.
PCRE2_NO_AUTO_POSSESS option when compiling.
</P> </P>
<br><b> <br><b>
Error returns from <b>pcre2_dfa_match()</b> Error returns from <b>pcre2_dfa_match()</b>
@ -2633,29 +2636,29 @@ extremely rare, as a vector of size 1000 is used.
<pre> <pre>
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
</pre> </pre>
When <b>pcre2_dfa_match()</b> is called with the <b>pcre2_dfa_RESTART</b> option, When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
some plausibility checks are made on the contents of the workspace, which some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks should contain data about the previous partial match. If any of these checks
fail, this error is given. fail, this error is given.
</P> </P>
<br><a name="SEC32" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC34" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2build</b>(3), <b>pcre2libs</b>(3), <b>pcre2callout</b>(3), <b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo(3)</b>,
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3), <b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
<b>pcre2demo(3)</b>, <b>pcre2sample</b>(3), <b>pcre2stack</b>(3). <b>pcre2sample</b>(3), <b>pcre2stack</b>(3), <b>pcre2unicode</b>(3).
</P> </P>
<br><a name="SEC33" href="#TOC1">AUTHOR</a><br> <br><a name="SEC35" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC34" href="#TOC1">REVISION</a><br> <br><a name="SEC36" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 11 November 2014 Last updated: 21 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

View File

@ -461,7 +461,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>

View File

@ -256,7 +256,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br> <br><a name="SEC7" href="#TOC1">REVISION</a><br>

View File

@ -207,7 +207,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

View File

@ -745,7 +745,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br> <br><a name="SEC14" href="#TOC1">REVISION</a><br>

View File

@ -413,7 +413,7 @@ Philip Hazel (FAQ by Zoltan Herczeg)
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br> <br><a name="SEC13" href="#TOC1">REVISION</a><br>

View File

@ -73,7 +73,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

View File

@ -227,7 +227,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br> <br><a name="SEC8" href="#TOC1">REVISION</a><br>

View File

@ -450,7 +450,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br> <br><a name="SEC10" href="#TOC1">REVISION</a><br>

View File

@ -3231,7 +3231,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>

View File

@ -180,7 +180,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

View File

@ -278,7 +278,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br> <br><a name="SEC9" href="#TOC1">REVISION</a><br>

View File

@ -90,7 +90,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

View File

@ -33,6 +33,13 @@ the recursive call would immediately be passed back as the result of the
current call (a "tail recursion"), the function is just restarted instead. current call (a "tail recursion"), the function is just restarted instead.
</P> </P>
<P> <P>
Each time the internal <b>match()</b> function is called recursively, it uses
memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
of the GCC compiler, the stack requirements are greatly increased.
</P>
<P>
The above comments apply when <b>pcre2_match()</b> is run in its normal The above comments apply when <b>pcre2_match()</b> is run in its normal
interpretive manner. If the compiled pattern was processed by interpretive manner. If the compiled pattern was processed by
<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the <b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
@ -61,10 +68,7 @@ relevant only for <b>pcre2_match()</b> without the JIT optimization.
Reducing <b>pcre2_match()</b>'s stack usage Reducing <b>pcre2_match()</b>'s stack usage
</b><br> </b><br>
<P> <P>
Each time that the internal <b>match()</b> function is called recursively, it You can often reduce the amount of recursion, and therefore the
uses memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". You can often reduce the amount of recursion, and therefore the
amount of stack used, by modifying the pattern that is being matched. Consider, amount of stack used, by modifying the pattern that is being matched. Consider,
for example, this pattern: for example, this pattern:
<pre> <pre>
@ -187,14 +191,14 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 20 October 2014 Last updated: 21 November 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>

View File

@ -548,7 +548,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>

View File

@ -1301,7 +1301,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC20" href="#TOC1">REVISION</a><br> <br><a name="SEC20" href="#TOC1">REVISION</a><br>

View File

@ -254,7 +254,7 @@ Philip Hazel
<br> <br>
University Computing Service University Computing Service
<br> <br>
Cambridge CB2 3QH, England. Cambridge, England.
<br> <br>
</P> </P>
<br><b> <br><b>

View File

@ -146,7 +146,7 @@ listing), and the short pages for individual functions, are concatenated in
pcre2matching discussion of the two matching algorithms pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility pcre2partial details of the partial matching facility
.\" JOIN .\" JOIN
pcre2pattern syntax and semantics of supported regular pcre2pattern syntax and semantics of supported regular
expression patterns expression patterns
pcre2perform discussion of performance issues pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library pcre2posix the POSIX-compatible C API for the 8-bit library

File diff suppressed because it is too large

View File

@ -1,4 +1,4 @@
.TH PCRE2API 3 "18 November 2014" "PCRE2 10.00" .TH PCRE2API 3 "21 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -674,7 +674,7 @@ patterns that are not anchored, the count restarts from zero for each position
in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP, in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
which ignores it. which ignores it.
.P .P
When \fBpcre2_match()\fP is called with a pattern that was successfully When \fBpcre2_match()\fP is called with a pattern that was successfully
processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
is entirely different. However, there is still the possibility of runaway is entirely different. However, there is still the possibility of runaway
matching that goes on for a very long time, and so the \fImatch_limit\fP value matching that goes on for a very long time, and so the \fImatch_limit\fP value
@ -740,7 +740,7 @@ documentation. See the
.\" HREF .\" HREF
\fBpcre2build\fP \fBpcre2build\fP
.\" .\"
documentation for details of how to build PCRE2. documentation for details of how to build PCRE2.
.P .P
Using the heap for recursion is a non-standard way of building PCRE2, for use Using the heap for recursion is a non-standard way of building PCRE2, for use
in environments that have limited stacks. Because of the greater use of memory in environments that have limited stacks. Because of the greater use of memory
@ -904,7 +904,7 @@ PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
contains the compiled pattern and related data. The caller must free the memory contains the compiled pattern and related data. The caller must free the memory
by calling \fBpcre2_code_free()\fP when it is no longer needed. by calling \fBpcre2_code_free()\fP when it is no longer needed.
.P .P
If the compile context argument \fIccontext\fP is NULL, memory for the compiled If the compile context argument \fIccontext\fP is NULL, memory for the compiled
pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from
the same memory function that was used for the compile context. the same memory function that was used for the compile context.
.P .P
@ -1569,15 +1569,17 @@ values.
.P .P
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
entry; both of these return a \fBuint32_t\fP value. The entry size depends on entry in code units; both of these return a \fBuint32_t\fP value. The entry
the length of the longest name. PCRE2_INFO_NAMETABLE returns a pointer to the size depends on the length of the longest name.
first entry of the table. This is a PCRE2_SPTR pointer to a block of code .P
units. In the 8-bit library, the first two bytes of each entry are the number PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
of the capturing parenthesis, most significant byte first. In the 16-bit a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
library, the pointer points to 16-bit data units, the first of which contains two bytes of each entry are the number of the capturing parenthesis, most
the parenthesis number. In the 32-bit library, the pointer points to 32-bit significant byte first. In the 16-bit library, the pointer points to 16-bit
data units, the first of which contains the parenthesis number. The rest of the code units, the first of which contains the parenthesis number. In the 32-bit
entry is the corresponding name, zero terminated. library, the pointer points to 32-bit code units, the first of which contains
the parenthesis number. The rest of the entry is the corresponding name, zero
terminated.
.P .P
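The 8-bit entry layout described above can be sketched in plain C. This is a minimal illustration, not PCRE2 code: `lookup_group_8bit()` and `sample_table` are hypothetical names, and the table is built by hand rather than obtained from PCRE2_INFO_NAMETABLE. In real use the count and entry size would come from PCRE2_INFO_NAMECOUNT and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: scan an 8-bit name table for `name`.
   Each entry is `entry_size` bytes: a two-byte group number, most
   significant byte first, followed by the zero-terminated name.
   Returns the group number, or 0 if the name is absent (named groups
   are never numbered 0). */
uint32_t lookup_group_8bit(const uint8_t *table, uint32_t name_count,
                           uint32_t entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size >= 3);
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        uint32_t number = ((uint32_t)entry[0] << 8) | entry[1];
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return 0;
}

/* Hand-built stand-in for a name table: three entries of 8 bytes each,
   in alphabetical order. The longest name ("month") fixes the entry
   size at 2 + 5 + 1 = 8. */
const uint8_t sample_table[] = {
    0x00, 0x02, 'd', 'a', 'y', 0,   0,   0,   /* group 2   */
    0x00, 0x01, 'm', 'o', 'n', 't', 'h', 0,   /* group 1   */
    0x01, 0x2C, 'y', 'e', 'a', 'r', 0,   0    /* group 300 */
};
```

Because the names are sorted, a binary search would also work; the linear scan is kept only for brevity.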
The names are in alphabetical order. If (?| is used to create multiple groups The names are in alphabetical order. If (?| is used to create multiple groups
with the same number, as described in the with the same number, as described in the
@ -1621,14 +1623,14 @@ different for each compiled pattern.
.sp .sp
PCRE2_INFO_NEWLINE PCRE2_INFO_NEWLINE
.sp .sp
The output is a \fBuint32_t\fP with one of the following values: The output is a \fBuint32_t\fP with one of the following values:
.sp .sp
PCRE2_NEWLINE_CR Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
.sp .sp
This specifies the default character sequence that will be recognized as This specifies the default character sequence that will be recognized as
meaning "newline" while matching. meaning "newline" while matching.
.sp .sp
@ -1670,7 +1672,7 @@ particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were string that define the matched part of the subject and any substrings that were
captured. This is known as the \fIovector\fP. captured. This is known as the \fIovector\fP.
.P .P
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
\fBpcre2_jit_match()\fP you must create a match data block by calling one of \fBpcre2_jit_match()\fP you must create a match data block by calling one of
the creation functions above. For \fBpcre2_match_data_create()\fP, the first the creation functions above. For \fBpcre2_match_data_create()\fP, the first
argument is the number of pairs of offsets in the \fIovector\fP. One pair of argument is the number of pairs of offsets in the \fIovector\fP. One pair of
@ -1820,7 +1822,7 @@ PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
.P .P
Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT) Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
compiler. If it is set, JIT matching is disabled and the normal interpretive compiler. If it is set, JIT matching is disabled and the normal interpretive
code in \fBpcre2_match()\fP is run. The remaining options are supported for JIT code in \fBpcre2_match()\fP is run. The remaining options are supported for JIT
matching. matching.
.sp .sp
PCRE2_ANCHORED PCRE2_ANCHORED
@ -1835,17 +1837,18 @@ matching.
.sp .sp
This option specifies that the first character of the subject string is not the This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without PCRE2_MULTILINE (at compile time) causes circumflex it. Setting this without having set PCRE2_MULTILINE at compile time causes
never to match. This option affects only the behaviour of the circumflex circumflex never to match. This option affects only the behaviour of the
metacharacter. It does not affect \eA. circumflex metacharacter. It does not affect \eA.
.sp .sp
PCRE2_NOTEOL PCRE2_NOTEOL
.sp .sp
This option specifies that the end of the subject string is not the end of a This option specifies that the end of the subject string is not the end of a
line, so the dollar metacharacter should not match it nor (except in multiline line, so the dollar metacharacter should not match it nor (except in multiline
mode) a newline immediately before it. Setting this without PCRE2_MULTILINE (at mode) a newline immediately before it. Setting this without having set
compile time) causes dollar never to match. This option affects only the PCRE2_MULTILINE at compile time causes dollar never to match. This option
behaviour of the dollar metacharacter. It does not affect \eZ or \ez. affects only the behaviour of the dollar metacharacter. It does not affect \eZ
or \ez.
.sp .sp
PCRE2_NOTEMPTY PCRE2_NOTEMPTY
.sp .sp
@ -1857,13 +1860,16 @@ match the empty string, the entire match fails. For example, if the pattern
.sp .sp
is applied to a string not beginning with "a" or "b", it matches an empty is applied to a string not beginning with "a" or "b", it matches an empty
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
valid, so PCRE2 searches further into the string for occurrences of "a" or "b". valid, so \fBpcre2_match()\fP searches further into the string for occurrences
of "a" or "b".
.sp .sp
PCRE2_NOTEMPTY_ATSTART PCRE2_NOTEMPTY_ATSTART
.sp .sp
This is like PCRE2_NOTEMPTY, except that an empty string match that is not at This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
the start of the subject is permitted. If the pattern is anchored, such a match only at the first matching position, that is, at the start of the subject plus
can occur only if the pattern contains \eK. the starting offset. An empty string match later in the subject is permitted.
If the pattern is anchored, such a match can occur only if the pattern contains
\eK.
.sp .sp
PCRE2_NO_UTF_CHECK PCRE2_NO_UTF_CHECK
.sp .sp
@ -1913,8 +1919,8 @@ subject characters to complete the match. If this happens when
PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
testing any remaining alternatives. Only if no complete match can be found is testing any remaining alternatives. Only if no complete match can be found is
PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
PCRE2_PARTIAL_SOFT says that the caller is prepared to handle a partial match, PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
but only if no complete match can be found. match, but only if no complete match can be found.
.P .P
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, \fBpcre2_match()\fP immediately returns a partial match is found, \fBpcre2_match()\fP immediately returns
@ -1943,13 +1949,13 @@ compile context.
.\" .\"
During matching, the newline choice affects the behaviour of the dot, During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match circumflex, and dollar metacharacters. It may also alter the way the match
position is advanced after a match failure for an unanchored pattern. starting position is advanced after a match failure for an unanchored pattern.
.P .P
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set, When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
and a match attempt for an unanchored pattern fails when the current position the newline convention, and a match attempt for an unanchored pattern fails
is at a CRLF sequence, and the pattern contains no explicit matches for CR or when the current starting position is at a CRLF sequence, and the pattern
LF characters, the match position is advanced by two characters instead of one, contains no explicit matches for CR or LF characters, the match position is
in other words, to after the CRLF. advanced by two characters instead of one, in other words, to after the CRLF.
.P .P
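The advance rule in the preceding paragraph can be written as a small decision function. This is only a sketch of the rule as stated, not the PCRE2 implementation: `next_start_offset()` and its two flag parameters are hypothetical, and it works in 8-bit code units without any UTF handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: where should the next match attempt start after
   a failure at `start` for an unanchored pattern?
   crlf_is_newline               - the newline convention is CRLF,
                                   ANYCRLF, or ANY
   pattern_has_explicit_cr_or_lf - the pattern contains a literal CR or
                                   LF, or a \r or \n escape
   Returns start + 2 when the failed position sits on a CRLF pair and
   the rule applies, otherwise start + 1. */
size_t next_start_offset(const char *subject, size_t length, size_t start,
                         bool crlf_is_newline,
                         bool pattern_has_explicit_cr_or_lf)
{
    assert(subject != NULL && start < length);
    if (crlf_is_newline && !pattern_has_explicit_cr_or_lf &&
        start + 1 < length &&
        subject[start] == '\r' && subject[start + 1] == '\n')
        return start + 2;   /* skip the whole CRLF */
    return start + 1;       /* ordinary one-unit advance */
}
```

A lone CR at the very end of the subject is not a CRLF pair, so it still advances by one.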
The above rule is a compromise that makes the most common cases work as The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
@ -1960,8 +1966,8 @@ reference, and so advances only by one character after the first failure.
.P .P
An explicit match for CR or LF is either a literal appearance of one of those An explicit match for CR or LF is either a literal appearance of one of those
characters in the pattern, or one of the \er or \en escape sequences. Implicit characters in the pattern, or one of the \er or \en escape sequences. Implicit
matches such as [^X] do not count, nor does \es (which includes CR and LF in matches such as [^X] do not count, nor does \es, even though it includes CR and
the characters that it matches). LF in the characters that it matches.
.P .P
Notwithstanding the above, anomalous effects may still occur when CRLF is a Notwithstanding the above, anomalous effects may still occur when CRLF is a
valid newline sequence and explicit \er or \en escapes appear in the pattern. valid newline sequence and explicit \er or \en escapes appear in the pattern.
@ -1981,15 +1987,15 @@ In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by addition, further substrings from the subject may be picked out by
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
book, this is called "capturing" in what follows, and the phrase "capturing book, this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a substring. subpattern" or "capturing group" is used for a fragment of a pattern that picks
PCRE2 supports several other kinds of parenthesized subpattern that do not out a substring. PCRE2 supports several other kinds of parenthesized subpattern
cause substrings to be captured. The \fBpcre2_pattern_info()\fP function can be that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
used to find out how many capturing subpatterns there are in a compiled function can be used to find out how many capturing subpatterns there are in a
pattern. compiled pattern.
.P .P
The overall matched string and any captured substrings are returned to the The overall matched string and any captured substrings are returned to the
caller via a vector of PCRE2_SIZE values, called the \fBovector\fP. This is caller via a vector of PCRE2_SIZE values. This is called the \fBovector\fP, and
contained within the is contained within the
.\" HTML <a href="#matchdatablock"> .\" HTML <a href="#matchdatablock">
.\" </a> .\" </a>
match data block. match data block.
@ -2062,7 +2068,7 @@ had.
. .
. .
.\" HTML <a name="matchotherdata"></a> .\" HTML <a name="matchotherdata"></a>
.SS "Other information about the match" .SH "OTHER INFORMATION ABOUT A MATCH"
.rs .rs
.sp .sp
.nf .nf
@ -2071,7 +2077,7 @@ had.
.B PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *\fImatch_data\fP); .B PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *\fImatch_data\fP);
.fi .fi
.P .P
In addition to the offsets in the ovector, other information about a match is As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions. retained in the match data block and can be retrieved by the above functions.
.P .P
When a (*MARK) name is to be passed back, \fBpcre2_get_mark()\fP returns a When a (*MARK) name is to be passed back, \fBpcre2_get_mark()\fP returns a
@ -2087,7 +2093,7 @@ as \fIovector[0]\fP because \eK does not affect the result of a partial match.
. .
. .
.\" HTML <a name="errorlist"></a> .\" HTML <a name="errorlist"></a>
.SS "Error return values from \fBpcre2_match()\fP" .SH "ERROR RETURNS FROM \fBpcre2_match()\fP"
.rs .rs
.sp .sp
If \fBpcre2_match()\fP fails, it returns a negative number. This can be If \fBpcre2_match()\fP fails, it returns a negative number. This can be
@ -2127,7 +2133,7 @@ passed to a 16-bit or 32-bit library function, or vice versa.
.sp .sp
PCRE2_ERROR_BADOFFSET PCRE2_ERROR_BADOFFSET
.sp .sp
The value of \fIstartoffset\fP greater than the length of the subject. The value of \fIstartoffset\fP was greater than the length of the subject.
.sp .sp
PCRE2_ERROR_BADOPTION PCRE2_ERROR_BADOPTION
.sp .sp
@ -2200,8 +2206,8 @@ the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same position subpattern has been called recursively for the second time at the same position
in the subject string. Some simple patterns that might do this are detected and in the subject string. Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until run recursions between two different subpatterns, cannot be detected until matching
time. is attempted.
.sp .sp
PCRE2_ERROR_RECURSIONLIMIT PCRE2_ERROR_RECURSIONLIMIT
.sp .sp
@ -2254,8 +2260,8 @@ extract the captured substrings.
.P .P
The final arguments of \fBpcre2_substring_copy_bynumber()\fP are a pointer to The final arguments of \fBpcre2_substring_copy_bynumber()\fP are a pointer to
the buffer and a pointer to a variable that contains its length in code units. the buffer and a pointer to a variable that contains its length in code units.
This is updated to contain the actual number of code units used, excluding the This is updated to contain the actual number of code units used for the
terminating zero. extracted substring, excluding the terminating zero.
.P .P
For \fBpcre2_substring_get_bynumber()\fP the third and fourth arguments point For \fBpcre2_substring_get_bynumber()\fP the third and fourth arguments point
to variables that are updated with a pointer to the new memory and the number to variables that are updated with a pointer to the new memory and the number
@ -2290,10 +2296,11 @@ small to capture that group.
.fi .fi
.P .P
The \fBpcre2_substring_list_get()\fP function extracts all available substrings The \fBpcre2_substring_list_get()\fP function extracts all available substrings
and builds a list of pointers to them, and a second list that contains their and builds a list of pointers to them. It also (optionally) builds a second
lengths (in code units), excluding a terminating zero that is added to each of list that contains their lengths (in code units), excluding a terminating zero
them. All this is done in a single block of memory that is obtained using the that is added to each of them. All this is done in a single block of memory
same memory allocation function that was used to get the match data block. that is obtained using the same memory allocation function that was used to get
the match data block.
.P .P
The address of the memory block is returned via \fIlistptr\fP, which is also The address of the memory block is returned via \fIlistptr\fP, which is also
the start of the list of string pointers. The end of the list is marked by a the start of the list of string pointers. The end of the list is marked by a
@ -2309,7 +2316,7 @@ If this function encounters a substring that is unset, which can happen when
capturing subpattern number \fIn+1\fP matches some part of the subject, but capturing subpattern number \fIn+1\fP matches some part of the subject, but
subpattern \fIn\fP has not been used at all, it returns an empty string. This subpattern \fIn\fP has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset appropriate offset in the ovector, which contain PCRE2_UNSET for unset
substrings. substrings.
. .
. .
@ -2347,11 +2354,10 @@ name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
that name. that name.
.P .P
Given the number, you can extract the substring directly, or use one of the Given the number, you can extract the substring directly, or use one of the
functions described in the previous section. For convenience, there are also functions described above. For convenience, there are also "byname" functions
"byname" functions that correspond to the "bynumber" functions, the only that correspond to the "bynumber" functions, the only difference being that the
difference being that the second argument is a name instead of a number. second argument is a name instead of a number. However, if PCRE2_DUPNAMES is
However, if PCRE2_DUPNAMES is set and there are duplicate names, set and there are duplicate names, the behaviour may not be what you want.
the behaviour may not be what you want (see the next section).
.P .P
\fBWarning:\fP If the pattern uses the (?| feature to set up multiple \fBWarning:\fP If the pattern uses the (?| feature to set up multiple
subpatterns with the same number, as described in the subpatterns with the same number, as described in the
@ -2398,8 +2404,8 @@ recognized:
Either a group number or a group name can be given for <n>. Curly brackets are Either a group number or a group name can be given for <n>. Curly brackets are
required only if the following character would be interpreted as part of the required only if the following character would be interpreted as part of the
number or name. The number may be zero to include the entire matched string. number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "[abc]" and the replacement For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "[+babcb+]". Group insertion is done by string "+$1$0$1+", the result is "=+babcb+=". Group insertion is done by
calling \fBpcre2_copy_byname()\fP or \fBpcre2_copy_bynumber()\fP as calling \fBpcre2_copy_byname()\fP or \fBpcre2_copy_bynumber()\fP as
appropriate. appropriate.
.P .P
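The group-insertion example above can be reproduced with a short stand-alone sketch. It is not how the library does it: `expand_replacement()` is a hypothetical helper that handles only numeric $n and ${n} references, assumes every referenced group is set, and takes a hand-written offset vector in place of a real match data block.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: expand $n and ${n} in `replacement`.
   `ovector` holds `pair_count` start/end offset pairs into `subject`,
   pair 0 being the whole match. The expansion is written to `out` as a
   zero-terminated string. Returns its length, or -1 on a malformed
   reference, an out-of-range group number, or a buffer that is too
   small. */
long expand_replacement(const char *subject, const size_t *ovector,
                        size_t pair_count, const char *replacement,
                        char *out, size_t out_size)
{
    size_t used = 0;
    assert(subject && ovector && replacement && out);
    for (const char *p = replacement; *p != '\0'; ) {
        const char *src = p;     /* default: copy one literal character */
        size_t len = 1;
        if (*p == '$') {
            const char *q = p + 1;
            int braced = (*q == '{');
            size_t n = 0;
            if (braced) q++;
            if (!isdigit((unsigned char)*q)) return -1;
            while (isdigit((unsigned char)*q)) {
                n = n * 10 + (size_t)(*q++ - '0');
                if (n >= pair_count) return -1;   /* no such group */
            }
            if (braced && *q++ != '}') return -1;
            src = subject + ovector[2 * n];
            len = ovector[2 * n + 1] - ovector[2 * n];
            p = q;
        } else {
            p++;
        }
        if (used + len >= out_size) return -1;    /* keep room for the zero */
        memcpy(out + used, src, len);
        used += len;
    }
    out[used] = '\0';
    return (long)used;
}
```

For the pattern a(b)c matched against "=abc=", the offset pairs are {1,4} for the whole match and {2,3} for group 1, so "+$1$0$1+" expands to "+babcb+". Splicing that in place of the matched "abc" gives "=+babcb+=".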
@ -2452,18 +2458,19 @@ documentation.
When duplicates are present, \fBpcre2_substring_copy_byname()\fP and When duplicates are present, \fBpcre2_substring_copy_byname()\fP and
\fBpcre2_substring_get_byname()\fP return the first substring corresponding to \fBpcre2_substring_get_byname()\fP return the first substring corresponding to
the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is the given name that is set. If none are set, PCRE2_ERROR_NOSUBSTRING is
returned. The \fBpcre2_substring_number_from_name()\fP function returns one of returned. The \fBpcre2_substring_number_from_name()\fP function returns
the numbers that are associated with the name, but it is not defined which it the error PCRE2_ERROR_NOUNIQUESUBSTRING.
is.
.P .P
If you want to get full details of all captured substrings for a given name, If you want to get full details of all captured substrings for a given name,
you must use the \fBpcre2_substring_nametable_scan()\fP function. The first you must use the \fBpcre2_substring_nametable_scan()\fP function. The first
argument is the compiled pattern, and the second is the name. If the third and argument is the compiled pattern, and the second is the name. If the third and
fourth arguments are NULL, the function returns a group number (it is not fourth arguments are NULL, the function returns a group number for a unique
defined which). Otherwise, the third and fourth arguments must be pointers to name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
.P
When the third and fourth arguments are not NULL, they must be pointers to
variables that are updated by the function. After it has run, they point to the variables that are updated by the function. After it has run, they point to the
first and last entries in the name-to-number table for the given name, and the first and last entries in the name-to-number table for the given name, and the
function returns the length of each entry. In both cases, function returns the length of each entry in code units. In both cases,
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name. PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
.P .P
The format of the name table is described above in the section entitled The format of the name table is described above in the section entitled
@ -2476,15 +2483,15 @@ Given all the relevant entries for the name, you can extract each of their
numbers, and hence the captured data. numbers, and hence the captured data.
. .
. .
.SH "FINDING ALL POSSIBLE MATCHES" .SH "FINDING ALL POSSIBLE MATCHES AT ONE POSITION"
.rs .rs
.sp .sp
The traditional matching function uses a similar algorithm to Perl, which stops The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match, starting at a given point in the subject. If you when it finds the first match at a given point in the subject. If you want to
want to find all possible matches, or the longest possible match at a given find all possible matches, or the longest possible match at a given position,
position, consider using the alternative matching function (see below) instead. consider using the alternative matching function (see below) instead. If you
If you cannot use the alternative function, you can kludge it up by making use cannot use the alternative function, you can kludge it up by making use of the
of the callout facility, which is described in the callout facility, which is described in the
.\" HREF .\" HREF
\fBpcre2callout\fP \fBpcre2callout\fP
.\" .\"
@ -2628,11 +2635,10 @@ the longest matches.
.P .P
NOTE: PCRE2's "auto-possessification" optimization usually applies to character NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the repeats at the end of a pattern (as well as internally). For example, the
pattern "a\ed+" is compiled as if it were "a\ed++" because there is no point in pattern "a\ed+" is compiled as if it were "a\ed++". For DFA matching, this
backtracking into the repeated digits. For DFA matching, this means that only means that only one possible match is found. If you really do want multiple
one possible match is found. If you really do want multiple matches in such matches in such cases, either use an ungreedy repeat such as "a\ed+?" or set
cases, either use an ungreedy repeat ("a\ed+?") or set the the PCRE2_NO_AUTO_POSSESS option when compiling.
PCRE2_NO_AUTO_POSSESS option when compiling.
. .
. .
.SS "Error returns from \fBpcre2_dfa_match()\fP" .SS "Error returns from \fBpcre2_dfa_match()\fP"
@ -2673,7 +2679,7 @@ extremely rare, as a vector of size 1000 is used.
.sp .sp
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
.sp .sp
When \fBpcre2_dfa_match()\fP is called with the \fBpcre2_dfa_RESTART\fP option, When \fBpcre2_dfa_match()\fP is called with the \fBPCRE2_DFA_RESTART\fP option,
some plausibility checks are made on the contents of the workspace, which some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks should contain data about the previous partial match. If any of these checks
fail, this error is given. fail, this error is given.
@ -2682,9 +2688,9 @@ fail, this error is given.
.SH "SEE ALSO" .SH "SEE ALSO"
.rs .rs
.sp .sp
\fBpcre2build\fP(3), \fBpcre2libs\fP(3), \fBpcre2callout\fP(3), \fBpcre2build\fP(3), \fBpcre2callout\fP(3), \fBpcre2demo\fP(3),
\fBpcre2matching\fP(3), \fBpcre2partial\fP(3), \fBpcre2posix\fP(3), \fBpcre2matching\fP(3), \fBpcre2partial\fP(3), \fBpcre2posix\fP(3),
\fBpcre2demo\fP(3), \fBpcre2sample\fP(3), \fBpcre2stack\fP(3). \fBpcre2sample\fP(3), \fBpcre2stack\fP(3), \fBpcre2unicode\fP(3).
. .
. .
.SH AUTHOR .SH AUTHOR
@ -2701,6 +2707,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 18 November 2014 Last updated: 21 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi

View File

@ -3438,10 +3438,10 @@ while (TRUE)
} }
} }
else else
{ {
caseless = FALSE; caseless = FALSE;
othercase[0] = 0; /* Stops compiler warning - PH */ othercase[0] = 0; /* Stops compiler warning - PH */
} }
len_save = len; len_save = len;
cc_save = cc; cc_save = cc;

View File

@ -1401,11 +1401,11 @@ for (;;)
condition = TRUE; condition = TRUE;
/* Advance ecode past the assertion to the start of the first branch, /* Advance ecode past the assertion to the start of the first branch,
but adjust it so that the general choosing code below works. If the but adjust it so that the general choosing code below works. If the
assertion has a quantifier that allows zero repeats we must skip over assertion has a quantifier that allows zero repeats we must skip over
the BRAZERO. This is a lunatic thing to do, but somebody did! */ the BRAZERO. This is a lunatic thing to do, but somebody did! */
if (*ecode == OP_BRAZERO) ecode++; if (*ecode == OP_BRAZERO) ecode++;
ecode += GET(ecode, 1); ecode += GET(ecode, 1);
while (*ecode == OP_ALT) ecode += GET(ecode, 1); while (*ecode == OP_ALT) ecode += GET(ecode, 1);
ecode += 1 + LINK_SIZE - PRIV(OP_lengths)[condcode]; ecode += 1 + LINK_SIZE - PRIV(OP_lengths)[condcode];