Documentation update.

Philip.Hazel 2017-03-24 16:53:38 +00:00
parent 32bab50c01
commit 3aeb812180
24 changed files with 1293 additions and 1357 deletions


@ -1,10 +1,6 @@
Building PCRE2 without using autotools Building PCRE2 without using autotools
-------------------------------------- --------------------------------------
This document has been converted from the PCRE1 document. I have removed a
number of sections about building in various environments, as they applied only
to PCRE1 and are probably out of date.
This document contains the following sections: This document contains the following sections:
General General
@ -183,21 +179,9 @@ can skip ahead to the CMake section.
STACK SIZE IN WINDOWS ENVIRONMENTS STACK SIZE IN WINDOWS ENVIRONMENTS
The default processor stack size of 1Mb in some Windows environments is too Prior to release 10.30 the default system stack size of 1Mb in some Windows
small for matching patterns that need much recursion. In particular, test 2 may environments caused issues with some tests. This should no longer be the case
fail because of this. Normally, running out of stack causes a crash, but there for 10.30 and later releases.
have been cases where the test program has just died silently. See your linker
documentation for how to increase stack size if you experience problems. If you
are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
compiler, you can increase the stack size for pcre2test and pcre2grep by
setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
example). The Linux default of 8Mb is a reasonable choice for the stack, though
even that can be too small for some pattern/subject combinations.
PCRE2 has a compile configuration option to disable the use of stack for
recursion so that heap is used instead. However, pattern matching is
significantly slower when this is done. There is more about stack usage in the
"pcre2stack" documentation.
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
recommended download site. recommended download site.
============================= =============================
Last Updated: 13 October 2016 Last Updated: 17 March 2017


@ -15,8 +15,8 @@ subscribe or manage your subscription here:
https://lists.exim.org/mailman/listinfo/pcre-dev https://lists.exim.org/mailman/listinfo/pcre-dev
Please read the NEWS file if you are upgrading from a previous release. Please read the NEWS file if you are upgrading from a previous release. The
The contents of this README file are: contents of this README file are:
The PCRE2 APIs The PCRE2 APIs
Documentation for PCRE2 Documentation for PCRE2
@ -44,8 +44,8 @@ wrappers.
The distribution does contain a set of C wrapper functions for the 8-bit The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix library that are based on the POSIX regular expression API (see the pcre2posix
man page). These can be found in a library called libpcre2-posix. Note that this man page). These can be found in a library called libpcre2-posix. Note that
just provides a POSIX calling interface to PCRE2; the regular expressions this just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted, themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities. and does not give full access to all of PCRE2's facilities.
@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
Building PCRE2 on non-Unix-like systems Building PCRE2 on non-Unix-like systems
--------------------------------------- ---------------------------------------
For a non-Unix-like system, please read the comments in the file For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and your system supports the use of "configure" and "make" you may be able to build
"make" you may be able to build PCRE2 using autotools in the same way as for PCRE2 using autotools in the same way as for many Unix-like systems.
many Unix-like systems.
PCRE2 can also be configured using CMake, which can be run in various ways PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file (command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
architectures. If you try to enable it on an unsupported architecture, there architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error. will be a compile time error.
. If you do not want to make use of the support for UTF-8 Unicode character . If you do not want to make use of the default support for UTF-8 Unicode
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit character strings in the 8-bit library, UTF-16 Unicode character strings in
library, or UTF-32 Unicode character strings in the 32-bit library, you can the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
add --disable-unicode to the "configure" command. This reduces the size of library, you can add --disable-unicode to the "configure" command. This
the libraries. It is not possible to configure one library with Unicode reduces the size of the libraries. It is not possible to configure one
support, and another without, in the same configuration. library with Unicode support, and another without, in the same configuration.
It is also not possible to use --enable-ebcdic (see below) with Unicode
support, so if this option is set, you must also use --disable-unicode.
When Unicode support is available, the use of a UTF encoding still has to be When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
not possible to use both --enable-unicode and --enable-ebcdic at the same
time.
As well as supporting UTF strings, Unicode support includes support for the As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties. \P, \p, and \X sequences that recognize Unicode character properties.
@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
--with-match-limit=500000 --with-match-limit=500000
on the "configure" command. This is just the default; individual calls to on the "configure" command. This is just the default; individual calls to
pcre2_match() can supply their own value. There is more discussion on the pcre2_match() can supply their own value. There is more discussion in the
pcre2api man page. pcre2api man page (search for pcre2_set_match_limit).
. There is a separate counter that limits the depth of recursive function calls . There is a separate counter that limits the depth of nested backtracking
during a matching process. This also has a default of ten million, which is during a matching process, which in turn limits the amount of memory that is
essentially "unlimited". You can change the default by setting, for example, used. This also has a default of ten million, which is essentially
"unlimited". You can change the default by setting, for example,
--with-match-limit-recursion=500000 --with-match-limit-depth=5000
Recursive function calls use up the runtime stack; running out of stack can There is more discussion in the pcre2api man page (search for
cause programs to crash in strange ways. There is a discussion about stack pcre2_set_depth_limit).
sizes in the pcre2stack man page.
. In the 8-bit library, the default maximum compiled pattern size is around . In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the 64K bytes. You can increase this by adding --with-link-size=3 to the
@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used. link size setting is ignored, as 4-byte offsets are always used.
. You can build PCRE2 so that its internal match() function that is called from
pcre2_match() does not call itself recursively. Instead, it uses memory
blocks obtained from the heap to save data that would otherwise be saved on
the stack. To build PCRE2 like this, use
--disable-stack-for-recursion
on the "configure" command. PCRE2 runs more slowly in this mode, but it may
be necessary in environments with limited stack sizes. This applies only to
the normal execution of the pcre2_match() function; if JIT support is being
successfully used, it is not relevant. Equally, it does not apply to
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
discussion about stack sizes in the pcre2stack man page.
. For speed, PCRE2 uses four tables for manipulating and identifying characters . For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of whose code point values are less than 256. By default, it uses a set of
tables for ASCII encoding that is part of the distribution. If you specify tables for ASCII encoding that is part of the distribution. If you specify
@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
string. Otherwise, it is assumed to be a file name, and the contents of the string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string. file are the test string.
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
which caused pcre2_match() to use individual blocks on the heap for
backtracking instead of recursive function calls (which use the stack). This
is now obsolete since pcre2_match() was refactored always to use the heap (in
a much more efficient way than before). This option is retained for backwards
compatibility, but has no effect other than to output a warning.
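
The match limit and the depth limit described in the options above can be illustrated with a toy backtracking matcher. This is only a sketch under stated assumptions: every name in it (toy_match, toy_step, toy_limits, the TOY_ codes) is hypothetical, it handles only literals, "." and "x*", and it uses plain recursion for brevity, whereas pcre2_match() from 10.30 keeps its backtracking data on the heap. It shows the general idea of one counter for total work and another for nesting, not PCRE2's actual implementation.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch (not PCRE2's error codes). */
enum {
    TOY_MATCH = 1,
    TOY_NOMATCH = 0,
    TOY_ERR_MATCHLIMIT = -1,
    TOY_ERR_DEPTHLIMIT = -2
};

typedef struct {
    uint32_t match_limit;  /* cap on the total number of toy_step() calls */
    uint32_t depth_limit;  /* cap on how deeply toy_step() calls may nest */
    uint32_t calls;        /* number of toy_step() calls made so far */
} toy_limits;

/* Match the whole of s against p. Supports literals, '.', and 'x*'. */
static int toy_step(const char *p, const char *s, toy_limits *lim,
                    uint32_t depth)
{
    /* The match limit bounds total work (time). */
    if (++lim->calls > lim->match_limit) return TOY_ERR_MATCHLIMIT;
    /* The depth limit bounds nesting, and so the memory in use at once. */
    if (depth > lim->depth_limit) return TOY_ERR_DEPTHLIMIT;

    if (*p == '\0') return (*s == '\0') ? TOY_MATCH : TOY_NOMATCH;

    if (p[1] == '*') {
        /* Greedy: try to consume one more character first. */
        if (*s != '\0' && (*p == '.' || *p == *s)) {
            int rc = toy_step(p, s + 1, lim, depth + 1);
            if (rc != TOY_NOMATCH) return rc;  /* a match, or a limit error */
        }
        /* Backtrack: let the star match nothing more and move on. */
        return toy_step(p + 2, s, lim, depth + 1);
    }

    if (*s != '\0' && (*p == '.' || *p == *s))
        return toy_step(p + 1, s + 1, lim, depth + 1);

    return TOY_NOMATCH;
}

int toy_match(const char *pattern, const char *subject,
              uint32_t match_limit, uint32_t depth_limit)
{
    toy_limits lim = { match_limit, depth_limit, 0 };
    return toy_step(pattern, subject, &lim, 1);
}
```

With a pattern such as "a*a*a*b" and a subject consisting only of the letter "a", the toy matcher explores many ways of dividing the subject between the three stars before it can report no match, so a small match limit stops it early. A small depth limit stops even a simple match such as "a*b" against "aaab" once the nesting exceeds the limit.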
The "configure" script builds the following files for the basic C library: The "configure" script builds the following files for the basic C library:
. Makefile the makefile that builds the library . Makefile the makefile that builds the library
@ -662,25 +654,32 @@ Unicode support is enabled.
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in 16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair is for general cases and Unicode support, respectively. 8-bit mode. Each pair is for general cases and Unicode support, respectively.
Test 13 checks the handling of non-UTF characters greater than 255 by Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes. pcre2_dfa_match() in 16-bit and 32-bit modes.
Test 14 contains a number of tests that must not be run with JIT. They check, Test 14 contains some special UTF and UCP tests that give different output for
the different widths.
Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive among other non-JIT things, the match-limiting features of the interpretive
matcher. matcher.
Test 15 is run only when JIT support is not available. It checks that an Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour. attempt to use JIT has the expected behaviour.
Test 16 is run only when JIT support is available. It checks JIT complete and Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features. partial modes, match-limiting under JIT, and other JIT-specific features.
Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively. the 8-bit library, without and with Unicode support, respectively.
Test 19 checks the serialization functions by writing a set of compiled Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them. patterns to a file, and then reloading and checking them.
Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.
Character tables Character tables
---------------- ----------------
@ -866,4 +865,4 @@ The distribution should contain the files listed below.
Philip Hazel Philip Hazel
Email local part: ph10 Email local part: ph10
Email domain: cam.ac.uk Email domain: cam.ac.uk
Last updated: 01 November 2016 Last updated: 17 March 2017


@ -109,7 +109,7 @@ lose performance.
One way of guarding against this possibility is to use the One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for <b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
<b>pcre2_compile()</b>. This causes an compile time error if a pattern contains <b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
a UTF-setting sequence. a UTF-setting sequence.
</P> </P>
<P> <P>
@ -137,7 +137,8 @@ large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>
page. page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
be used to restrict the amount of memory that is used.
</P> </P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br> <br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P> <P>
@ -166,7 +167,7 @@ listing), and the short pages for individual functions, are concatenated in
pcre2perform discussion of performance issues pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage pcre2stack discussion of stack and memory usage
pcre2syntax quick syntax reference pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> command pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support pcre2unicode discussion of Unicode and UTF support
@ -189,9 +190,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P> </P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br> <br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 16 October 2015 Last updated: 27 March 2017
<br> <br>
Copyright &copy; 1997-2015 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -36,20 +36,21 @@ for success and non-zero otherwise. The arguments are:
<i>callout_data</i> User data that is passed to the callback <i>callout_data</i> User data that is passed to the callback
</pre> </pre>
The <i>callback()</i> function is passed a pointer to a data block containing The <i>callback()</i> function is passed a pointer to a data block containing
the following fields: the following fields (not necessarily in this order):
<pre> <pre>
<i>version</i> Block version number uint32_t <i>version</i> Block version number
<i>pattern_position</i> Offset to next item in pattern uint32_t <i>callout_number</i> Number for numbered callouts
<i>next_item_length</i> Length of next item in pattern PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
<i>callout_number</i> Number for numbered callouts PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
<i>callout_string_offset</i> Offset to string within pattern PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
<i>callout_string_length</i> Length of callout string PCRE2_SIZE <i>callout_string_length</i> Length of callout string
<i>callout_string</i> Points to callout string or is NULL PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
</pre> </pre>
The second argument is the callout data that was passed to The second argument passed to the <b>callback()</b> function is the callout data
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
for success. Any other value causes the pattern scan to stop, with the value function must return zero for success. Any other value causes the pattern scan
being passed back as the result of <b>pcre2_callout_enumerate()</b>. to stop, with the value being passed back as the result of
<b>pcre2_callout_enumerate()</b>.
</P> </P>
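
The return-value convention described above (zero to continue, any other value stops the scan and is passed back to the caller) can be sketched without the library. The names below (toy_callout_block, toy_callback, toy_callout_enumerate, toy_count_until) are hypothetical stand-ins, and the block carries only two of the documented fields. Note that the callback reads fields by name, which is what the "not necessarily in this order" remark calls for.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the enumerate block. */
typedef struct {
    size_t   pattern_position;  /* offset to next item in pattern */
    unsigned callout_number;    /* number for numbered callouts */
} toy_callout_block;

typedef int (*toy_callback)(const toy_callout_block *block, void *user_data);

/* Call the callback for each block in turn. A non-zero return stops the
   scan at once and becomes the result; otherwise the result is zero. */
int toy_callout_enumerate(const toy_callout_block *blocks, size_t count,
                          toy_callback callback, void *user_data)
{
    for (size_t i = 0; i < count; i++) {
        int rc = callback(&blocks[i], user_data);
        if (rc != 0) return rc;
    }
    return 0;
}

/* Example callback: count callouts, but stop with the value 7 when
   callout number 99 is seen. The user data is a pointer to the counter. */
int toy_count_until(const toy_callout_block *block, void *user_data)
{
    unsigned *seen = user_data;
    if (block->callout_number == 99) return 7;
    (*seen)++;
    return 0;
}
```

Passing the counter through the user data pointer keeps the callback free of global state, which is the usual reason for having a second argument of this kind.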
<P> <P>
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the


@ -26,7 +26,9 @@ DESCRIPTION
</b><br> </b><br>
<P> <P>
This function frees the memory used for a compiled pattern, including any This function frees the memory used for a compiled pattern, including any
memory used by the JIT compiler. memory used by the JIT compiler. If the compiled pattern was created by a call
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
also freed.
</P> </P>
<P> <P>
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the


@ -37,19 +37,24 @@ arguments are:
<i>erroffset</i> Where to put an error offset <i>erroffset</i> Where to put an error offset
<i>ccontext</i> Pointer to a compile context or NULL <i>ccontext</i> Pointer to a compile context or NULL
</pre> </pre>
The length of the string and any error offset that is returned are in code The length of the pattern and any error offset that is returned are in code
units, not characters. A compile context is needed only if you want to change units, not characters. A compile context is needed only if you want to provide
custom memory allocation functions, or to provide an external function for
system stack size checking, or to change one or more of these parameters:
<pre> <pre>
What \R matches (Unicode newlines or CR, LF, CRLF only) What \R matches (Unicode newlines, or CR, LF, CRLF only);
PCRE2's character tables PCRE2's character tables;
The newline character sequence The newline character sequence;
The compile time nested parentheses limit The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed.
</pre> </pre>
or provide an external function for stack size checking. The option bits are: The option bits are:
<pre> <pre>
PCRE2_ANCHORED Force pattern anchoring PCRE2_ANCHORED Force pattern anchoring
PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
PCRE2_ALT_VERBNAMES Process backslashes in verb names
PCRE2_AUTO_CALLOUT Compile automatic callouts PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end PCRE2_DOLLAR_ENDONLY $ not to match newline at end
@ -71,19 +76,21 @@ or provide an external function for stack size checking. The option bits are:
(only relevant if PCRE2_UTF is set) (only relevant if PCRE2_UTF is set)
PCRE2_UCP Use Unicode properties for \d, \w, etc. PCRE2_UCP Use Unicode properties for \d, \w, etc.
PCRE2_UNGREEDY Invert greediness of quantifiers PCRE2_UNGREEDY Invert greediness of quantifiers
PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
PCRE2_UTF Treat pattern and subjects as UTF strings PCRE2_UTF Treat pattern and subjects as UTF strings
</pre> </pre>
PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP PCRE2 must be built with Unicode support (the default) in order to use
and related options. PCRE2_UTF, PCRE2_UCP and related options.
</P> </P>
<P> <P>
The yield of the function is a pointer to a private data structure that The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected. contains the compiled pattern, or NULL if an error was detected.
</P> </P>
<P> <P>
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API, with more detail on
each option, in the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the page, and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a> <a href="pcre2posix.html"><b>pcre2posix</b></a>
page. page.
<p> <p>


@ -45,10 +45,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \R matches by default: PCRE2_CONFIG_BSR Indicates what \R matches by default:
PCRE2_BSR_UNICODE PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF PCRE2_BSR_ANYCRLF
PCRE2_CONFIG_JIT Availability of just-in-time compiler PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
support (1=yes 0=no) PCRE2_CONFIG_JIT Availability of just-in-time compiler support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information about the target archi- PCRE2_CONFIG_JITTARGET Information (a string) about the target architecture for the JIT compiler
tecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4) PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
PCRE2_CONFIG_NEWLINE Code for the default newline sequence: PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
@ -58,11 +57,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_NEWLINE_ANY PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF PCRE2_NEWLINE_ANYCRLF
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
0=heap) PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes 0=no)
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string) PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
PCRE2_CONFIG_VERSION The PCRE2 version (a string) PCRE2_CONFIG_VERSION The PCRE2 version (a string)
</pre> </pre>


@ -31,8 +31,9 @@ DESCRIPTION
<P> <P>
This function matches a compiled regular expression against a given subject This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string string, using an alternative matching algorithm that scans the subject string
just once (<i>not</i> Perl-compatible). (The Perl-compatible matching function just once (except when processing lookaround assertions). This function is
is <b>pcre2_match()</b>.) The arguments for this function are: <i>not</i> Perl-compatible (the Perl-compatible matching function is
<b>pcre2_match()</b>). The arguments for this function are:
<pre> <pre>
<i>code</i> Points to the compiled pattern <i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string <i>subject</i> Points to the subject string
@ -45,22 +46,18 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
<i>wscount</i> Number of elements in the vector <i>wscount</i> Number of elements in the vector
</pre> </pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
up a callout function or specify the recursion limit. The <i>length</i> and up a callout function or specify the recursion depth limit. The <i>length</i>
<i>startoffset</i> values are code units, not characters. The options are: and <i>startoffset</i> values are code units, not characters. The options are:
<pre> <pre>
PCRE2_ANCHORED Match only at the first position PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
is not a valid match PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
was set at compile time) was set at compile time)
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
match if no full matches are found PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
even if there is a full match as well
PCRE2_DFA_RESTART Restart after a partial match PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match PCRE2_DFA_SHORTEST Return only the shortest match
</pre> </pre>


@ -34,11 +34,11 @@ errors are negative numbers. The arguments are:
<i>buffer</i> where to put the message <i>buffer</i> where to put the message
<i>bufflen</i> the length of the buffer (code units) <i>bufflen</i> the length of the buffer (code units)
</pre> </pre>
The function returns the length of the message, excluding the trailing zero, or The function returns the length of the message in code units, excluding the
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
this case, the returned message is truncated (but still with a trailing zero). too small. In this case, the returned message is truncated (but still with a
If <i>errorcode</i> does not contain a recognized error code number, the trailing zero). If <i>errorcode</i> does not contain a recognized error code
negative value PCRE2_ERROR_BADDATA is returned. number, the negative value PCRE2_ERROR_BADDATA is returned.
</P> </P>
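
The return convention described above can be sketched with a small stand-in function. This is an illustration of the contract only, not the library's code: toy_get_error_message, the TOY_ codes, their numeric values, and the two messages are all hypothetical, and the sketch uses 8-bit code units (plain char).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical codes for this sketch; the values are arbitrary. */
enum { TOY_ERROR_NOMEMORY = -100, TOY_ERROR_BADDATA = -101 };

static const struct { int code; const char *text; } toy_messages[] = {
    { -1, "no match" },
    { -2, "partial match" },
};

/* Returns the message length (excluding the trailing zero) on success,
   TOY_ERROR_NOMEMORY if the message had to be truncated, or
   TOY_ERROR_BADDATA if the error code is not recognized. */
int toy_get_error_message(int errorcode, char *buffer, size_t bufflen)
{
    const char *text = NULL;
    for (size_t i = 0; i < sizeof toy_messages / sizeof toy_messages[0]; i++)
        if (toy_messages[i].code == errorcode) text = toy_messages[i].text;

    if (text == NULL) return TOY_ERROR_BADDATA;
    if (bufflen == 0) return TOY_ERROR_NOMEMORY;  /* no room for the zero */

    size_t len = strlen(text);
    if (len >= bufflen) {
        /* Too small: copy what fits and still terminate with a zero. */
        memcpy(buffer, text, bufflen - 1);
        buffer[bufflen - 1] = '\0';
        return TOY_ERROR_NOMEMORY;
    }
    memcpy(buffer, text, len + 1);  /* the message plus its trailing zero */
    return (int)len;
}
```

A caller can therefore treat any non-negative result as a length, and on the "too small" code still print the truncated text safely because the buffer is always zero-terminated.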
<P> <P>
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the


@ -32,10 +32,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling allocation. The result can be passed to the JIT run-time code by calling
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern, <b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
which can then be processed by <b>pcre2_match()</b>. If the "fast path" JIT which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
matcher, <b>pcre2_jit_match()</b> is used, the stack can be passed directly as A maximum stack size of 512K to 1M should be more than enough for any pattern.
an argument. A maximum stack size of 512K to 1M should be more than enough for For more details, see the
any pattern. For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a> <a href="pcre2jit.html"><b>pcre2jit</b></a>
page. page.
</P> </P>


@ -25,10 +25,10 @@ SYNOPSIS
DESCRIPTION DESCRIPTION
</b><br> </b><br>
<P> <P>
This function builds a set of character tables for character values less than This function builds a set of character tables for character code points that
256. These can be passed to <b>pcre2_compile()</b> in a compile context in order are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
to override the internal, built-in tables (which were either defaulted or made context in order to override the internal, built-in tables (which were either
by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a> <a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
page. You might want to do this if you are using a non-standard locale. page. You might want to do this if you are using a non-standard locale.
</P> </P>


@ -2575,8 +2575,8 @@ The internal recursion limit was reached.
A text message for an error code from any PCRE2 function (compile, match, or A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
is passed as the first argument, with the remaining two arguments specifying a is passed as the first argument, with the remaining two arguments specifying a
code unit buffer and its length, into which the text message is placed. Note code unit buffer and its length in code units, into which the text message is
that the message is returned in code units of the appropriate width for the placed. The message is returned in code units of the appropriate width for the
library that is being used. library that is being used.
</P> </P>
<P> <P>
@ -3265,9 +3265,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC41" href="#TOC1">REVISION</a><br> <br><a name="SEC41" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 23 December 2016 Last updated: 21 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -280,6 +280,10 @@ operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error. end-of-file; in others it may provoke an error.
</P> </P>
<P> <P>
<b>--depth-limit</b>=<i>number</i>
See <b>--match-limit</b> below.
</P>
<P>
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i> <b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
Specify a pattern to be matched. This option can be used multiple times in Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a order to specify several patterns. It can also be used as a way of specifying a
@ -498,29 +502,22 @@ used. There is no short form for this option.
</P> </P>
<P> <P>
<b>--match-limit</b>=<i>number</i> <b>--match-limit</b>=<i>number</i>
Processing some regular expression patterns can require a very large amount of Processing some regular expression patterns may take a very long time to search
memory, leading in some cases to a program crash if not enough is available. for all possible matching strings. Others may require a very large amount of
Other patterns may take a very long time to search for all possible matching memory. There are two options that set resource limits for matching.
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
do the matching has two parameters that can limit the resources that it uses.
<br> <br>
<br> <br>
The <b>--match-limit</b> option provides a means of limiting resource usage The <b>--match-limit</b> option provides a means of limiting computing resource
when processing patterns that are not going to match, but which have a very usage when processing patterns that are not going to match, but which have a
large number of possibilities in their search trees. The classic example is a very large number of possibilities in their search trees. The classic example
pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
called <b>match()</b> which it calls repeatedly (sometimes recursively). The counter that is incremented each time around its main processing loop. If the
limit set by <b>--match-limit</b> is imposed on the number of times this value set by <b>--match-limit</b> is reached, an error occurs.
function is called during a match, which has the effect of limiting the amount
of backtracking that can take place.
<br> <br>
<br> <br>
The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but The <b>--depth-limit</b> option limits the depth of nested backtracking points,
instead of limiting the total number of times that <b>match()</b> is called, it which in turn limits the amount of memory that is used. This limit is of use
limits the depth of recursive calls, which in turn limits the amount of memory only if it is set smaller than <b>--match-limit</b>.
that can be used. The recursion depth is a smaller number than the total number
of calls, because not all calls to <b>match()</b> are recursive. This limit is
of use only if it is set smaller than <b>--match-limit</b>.
<br> <br>
<br> <br>
There are no short forms for these options. The default settings are specified There are no short forms for these options. The default settings are specified
@ -843,9 +840,9 @@ there are more than 20 such errors, <b>pcre2grep</b> gives up.
</P> </P>
<P> <P>
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
overall resource limit; there is a second option called <b>--recursion-limit</b> overall resource limit; there is a second option called <b>--depth-limit</b>
that sets a limit on the amount of memory (usually stack) that is used (see the that sets a limit on the amount of memory that is used (see the discussion of
discussion of these options above). these options above).
</P> </P>
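As a rough illustration of the match limit discussed above, the following Python sketch is a toy backtracking matcher for the single pattern ^(a+)+$, which uses nested unlimited repeats. It is not PCRE2 code: the names match_nested_repeat and MatchLimitError, and the choice to count one step per attempt at the outer repeat, are assumptions made for this example. PCRE2's real counter is internal to the library and counts differently.

```python
class MatchLimitError(Exception):
    """Raised by this toy matcher when the step counter passes the limit."""


def match_nested_repeat(subject, match_limit):
    """Toy backtracking matcher for ^(a+)+$ with a step counter.

    Hypothetical helper for illustration only; it mimics the idea of a
    match limit, not the way PCRE2 implements it.
    """
    steps = 0

    def group_repeat(start):
        nonlocal steps
        steps += 1
        if steps > match_limit:
            raise MatchLimitError("match limit exceeded")
        # One iteration of (a+): find the longest run of "a" from start.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Try the longest run first, then backtrack to shorter runs.
        for stop in range(end, start, -1):
            if stop == len(subject):
                return True          # reached $ after a complete iteration
            if group_repeat(stop):   # try another iteration of the group
                return True
        return False

    return group_repeat(0)
```

A subject such as 25 "a" characters followed by "b" cannot match, and the toy matcher explores an exponential number of ways to split the run of "a" characters before it could report failure. With a limit of 10000 it raises MatchLimitError instead, which is the kind of runaway search that a match limit is meant to stop.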
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br> <br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
<P> <P>
@ -870,9 +867,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br> <br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 31 December 2016 Last updated: 21 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -170,20 +170,24 @@ the application to apply the JIT optimization by calling
<b>pcre2_jit_compile()</b> is ignored. <b>pcre2_jit_compile()</b> is ignored.
</P> </P>
<br><b> <br><b>
Setting match and recursion limits Setting match and backtracking depth limits
</b><br> </b><br>
<P> <P>
The caller of <b>pcre2_match()</b> can set a limit on the number of times the The pcre2_match() function contains a counter that is incremented every time it
internal <b>match()</b> function is called and on the maximum depth of goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
recursive calls. These facilities are provided to catch runaway matches that this counter, which therefore limits the amount of computing resource used for
are provoked by patterns with huge matching trees (a typical example is a a match. The maximum depth of nested backtracking can also be limited, and this
pattern with nested unlimited repeats) and to avoid running out of system stack restricts the amount of heap memory that is used.
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b> </P>
gives an error return. The limits can also be set by items at the start of the <P>
pattern of the form These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees (a typical example is a pattern with nested
unlimited repeats applied to a long string that does not match). When one of
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
can also be set by items at the start of the pattern of the form
<pre> <pre>
(*LIMIT_MATCH=d) (*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d) (*LIMIT_DEPTH=d)
</pre> </pre>
where d is any number of decimal digits. However, the value of the setting must where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b> be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
@ -192,10 +196,15 @@ limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. setting of one of these limits, the lower value is used.
</P> </P>
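The rule described above (a setting in the pattern can lower a limit but not raise it, and the lower value wins when a limit is set more than once) can be sketched in Python as follows. This is a simplified model, not PCRE2's parser: the function name effective_limits is hypothetical, and the sketch assumes that only limit items appear at the very start of the pattern, whereas real patterns may mix in other start-of-pattern items.

```python
import re

# Leading items of the form (*LIMIT_MATCH=d) or (*LIMIT_DEPTH=d).
# LIMIT_RECURSION is accepted as the older name for LIMIT_DEPTH.
_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(MATCH|DEPTH|RECURSION)=(\d+)\)")


def effective_limits(pattern, match_limit, depth_limit):
    """Return (match_limit, depth_limit) after applying leading pattern items.

    match_limit and depth_limit stand for the values set (or defaulted)
    by the caller; this helper is an illustrative assumption.
    """
    limits = {"MATCH": match_limit, "DEPTH": depth_limit}
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        name = "DEPTH" if item.group(1) == "RECURSION" else item.group(1)
        # A pattern setting can only lower the limit, never raise it.
        limits[name] = min(limits[name], int(item.group(2)))
        pos = item.end()
    return limits["MATCH"], limits["DEPTH"]
```

For example, with caller limits of 1000 for both values, a pattern starting (*LIMIT_MATCH=500)(*LIMIT_MATCH=200)(*LIMIT_DEPTH=99999) gives an effective match limit of 200 and leaves the depth limit at 1000.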
<P> <P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
</P>
<P>
The match limit is used (but in a different way) when JIT is being used, but it The match limit is used (but in a different way) when JIT is being used, but it
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>. is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
However, the recursion limit is relevant for DFA matching, which does use some However, the depth limit is relevant for DFA matching, which uses function
function recursion, in particular, for recursions within the pattern. recursion for recursions within the pattern. In this case, the depth limit
controls the amount of system stack that is used.
<a name="newlines"></a></P> <a name="newlines"></a></P>
<br><b> <br><b>
Newline conventions Newline conventions
@ -235,8 +244,8 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
what the \R escape sequence matches. By default, this is any Unicode newline what the \R escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the sequence, for Perl compatibility. However, this can be changed; see the next
description of \R in the section entitled section and the description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a> <a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline below. A change of \R setting can be combined with a change of newline
convention. convention.
@ -254,7 +263,7 @@ corresponding to PCRE2_BSR_UNICODE.
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br> <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P> <P>
PCRE2 can be compiled to run in an environment that uses EBCDIC as its PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code rather than ASCII or Unicode (typically a mainframe system). In character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no environment these characters may have different code values, and there are no
code points greater than 255. code points greater than 255.
@ -318,11 +327,11 @@ that character may have. This use of backslash as an escape character applies
both inside and outside character classes. both inside and outside character classes.
</P> </P>
<P> <P>
For example, if you want to match a * character, you write \* in the pattern. For example, if you want to match a * character, you must write \* in the
This escaping action applies whether or not the following character would pattern. This escaping action applies whether or not the following character
otherwise be interpreted as a metacharacter, so it is always safe to precede a would otherwise be interpreted as a metacharacter, so it is always safe to
non-alphanumeric with backslash to specify that it stands for itself. In precede a non-alphanumeric with backslash to specify that it stands for itself.
particular, if you want to match a backslash, you write \\. In particular, if you want to match a backslash, you write \\.
</P> </P>
<P> <P>
In a UTF mode, only ASCII numbers and letters have any special meaning after a In a UTF mode, only ASCII numbers and letters have any special meaning after a
@ -353,7 +362,7 @@ An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is not a character class, this causes an error, because the character class is not
terminated. terminated by a closing square bracket.
<a name="digitsafterbackslash"></a></P> <a name="digitsafterbackslash"></a></P>
<br><b> <br><b>
Non-printing characters Non-printing characters
@ -476,9 +485,9 @@ a hexadecimal digit appears between \x{ and }, or if there is no terminating
<P> <P>
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
described only when it is followed by two hexadecimal digits. Otherwise, it described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode mode, support for code points matches a literal "x" character. In this mode, support for code points greater
greater than 256 is provided by \u, which must be followed by four hexadecimal than 256 is provided by \u, which must be followed by four hexadecimal digits;
digits; otherwise it matches a literal "u" character. otherwise it matches a literal "u" character.
</P> </P>
<P> <P>
Characters whose value is less than 256 can be defined by either of the two Characters whose value is less than 256 can be defined by either of the two
@ -493,12 +502,10 @@ Constraints on character values
Characters that are specified using octal or hexadecimal numbers are Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows: limited to certain values, as follows:
<pre> <pre>
8-bit non-UTF mode less than 0x100 8-bit non-UTF mode no greater than 0xff
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint 16-bit non-UTF mode no greater than 0xffff
16-bit non-UTF mode less than 0x10000 32-bit non-UTF mode no greater than 0xffffffff
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint All UTF modes no greater than 0x10ffff and a valid codepoint
32-bit non-UTF mode less than 0x100000000
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
</pre> </pre>
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
"surrogate" codepoints), and 0xffef. "surrogate" codepoints), and 0xffef.
@ -525,7 +532,7 @@ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2 handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \U matches a "U" character, and \u can be used to define a character is set, \U matches a "U" character, and \u can be used to define a character
by code point, as described in the previous section. by code point, as described above.
</P> </P>
<br><b> <br><b>
Absolute and relative back references Absolute and relative back references
@ -714,7 +721,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing 8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode. characters whose codepoints are less than 256, but they do work in this mode.
The extra escape sequences are: In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
with an unassigned type. The extra escape sequences are:
<pre> <pre>
\p{<i>xx</i>} a character with the <i>xx</i> property \p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property \P{<i>xx</i>} a character without the <i>xx</i> property
@ -2214,16 +2223,8 @@ except that it does not cause the current matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an assertion Assertion subpatterns are not capturing subpatterns. If such an assertion
contains capturing subpatterns within it, these are counted for the purposes of contains capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring numbering the capturing subpatterns in the whole pattern. However, substring
capturing is carried out only for positive assertions. (Perl sometimes, but not capturing is normally carried out only for positive assertions (but see the
always, does do capturing in negative assertions.) discussion of conditional subpatterns below).
</P>
<P>
WARNING: If a positive assertion containing one or more capturing subpatterns
succeeds, but failure to match later in the pattern causes backtracking over
this assertion, the captures within the assertion are reset only if no higher
numbered captures are already set. This is, unfortunately, a fundamental
limitation of the current implementation; it may get removed in a future
reworking.
</P> </P>
<P> <P>
For compatibility with Perl, most assertion subpatterns may be repeated; though For compatibility with Perl, most assertion subpatterns may be repeated; though
@ -2601,6 +2602,12 @@ presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
</P>
<P>
For Perl compatibility, if an assertion that is a condition contains capturing
subpatterns, any capturing that occurs is retained afterwards, for both
positive and negative assertions. (Compare non-conditional assertions, when
captures are retained only for positive assertions.)
<a name="comments"></a></P> <a name="comments"></a></P>
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br> <br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
<P> <P>
@ -2773,93 +2780,57 @@ is the actual recursive call.
Differences in recursion processing between PCRE2 and Perl Differences in recursion processing between PCRE2 and Perl
</b><br> </b><br>
<P> <P>
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2 Some former differences between PCRE2 and Perl no longer exist.
(like Python, but unlike Perl), a recursive subpattern call is always treated
as an atomic group. That is, once it has matched some of the subject string, it
is never re-entered, even if it contains untried alternatives and there is a
subsequent matching failure. This can be illustrated by the following pattern,
which purports to match a palindromic string that contains an odd number of
characters (for example, "a", "aba", "abcba", "abcdcba"):
<pre>
^(.|(.)(?1)\2)$
</pre>
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
it does not if the pattern is longer than three characters. Consider the
subject string "abcba":
</P> </P>
<P> <P>
At the top level, the first character is matched, but as it is not at the end Before release 10.30, recursion processing in PCRE2 differed from Perl in that
of the string, the first alternative fails; the second alternative is taken a recursive subpattern call was always treated as an atomic group. That is,
and the recursion kicks in. The recursive call to subpattern 1 successfully once it had matched some of the subject string, it was never re-entered, even
matches the next character ("b"). (Note that the beginning and end of line if it contained untried alternatives and there was a subsequent matching
tests are not part of the recursion). failure. (Historical note: PCRE implemented recursion before Perl did.)
</P> </P>
<P> <P>
Back at the top level, the next character ("c") is compared with what Starting with release 10.30, recursive subroutine calls are no longer treated
subpattern 2 matched, which was "a". This fails. Because the recursion is as atomic. That is, they can be re-entered to try unused alternatives if there
treated as an atomic group, there are now no backtracking points, and so the is a matching failure later in the pattern. This is now compatible with the way
entire match fails. (Perl is able, at this point, to re-enter the recursion and Perl works. If you want a subroutine call to be atomic, you must explicitly
try the second alternative.) However, if the pattern is written with the enclose it in an atomic group.
alternatives in the other order, things are different:
<pre>
^((.)(?1)\2|.)$
</pre>
This time, the recursing alternative is tried first, and continues to recurse
until it runs out of characters, at which point the recursion fails. But this
time we do have another alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at a deeper
recursion level, which PCRE2 cannot use.
</P> </P>
<P> <P>
To change the pattern so that it matches all palindromic strings, not just Supporting backtracking into recursions simplifies certain types of recursive
those with an odd number of characters, it is tempting to change the pattern to pattern. For example, this pattern matches palindromic strings:
this:
<pre> <pre>
^((.)(?1)\2|.?)$ ^((.)(?1)\2|.?)$
</pre> </pre>
Again, this works in Perl, but not in PCRE2, and for the same reason. When a The second branch in the group matches a single central character in the
deeper recursion has matched a single character, it cannot be entered again in palindrome when there are an odd number of characters, or nothing when there
order to match an empty string. The solution is to separate the two cases, and are an even number of characters, but in order to work it has to be able to try
write out the odd and even cases as alternatives at the higher level: the second case when the rest of the pattern match fails. If you want to match
typical palindromic phrases, the pattern has to ignore all non-word characters,
which can be done like this:
<pre> <pre>
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
</pre>
If you want to match typical palindromic phrases, the pattern has to ignore all
non-word characters, which can be done like this:
<pre>
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
</pre> </pre>
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
use of the possessive quantifier *+ to avoid backtracking into sequences of avoid backtracking into sequences of non-word characters. Without this, PCRE2
non-word characters. Without this, PCRE2 takes a great deal longer (ten times takes a great deal longer (ten times or more) to match typical phrases, and
or more) to match typical phrases, and Perl takes so long that you think it has Perl takes so long that you think it has gone into a loop.
gone into a loop.
</P> </P>
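To see why backtracking into the recursion matters for ^((.)(?1)\2|.?)$, here is a small Python emulation of that one pattern. It is a sketch under stated assumptions, not how PCRE2 works internally: the names group1 and is_palindrome are hypothetical, the dot is allowed to match any character (including newline), and the atomic flag models the pre-10.30 behaviour by keeping only the first way in which each recursive call can match.

```python
from itertools import islice


def group1(subject, start, atomic=False):
    """Yield each offset at which group 1 can end when it begins at start.

    Offsets are yielded in the order a backtracking matcher would try them.
    """
    if start < len(subject):
        # First branch: (.)(?1)\2
        first = subject[start]
        inner = group1(subject, start + 1, atomic)
        if atomic:
            # Atomic recursion: only the first inner match is ever used.
            inner = islice(inner, 1)
        for end in inner:
            if end < len(subject) and subject[end] == first:
                yield end + 1
        # Second branch: .? matching one character (greedy choice first)
        yield start + 1
    # Second branch: .? matching nothing
    yield start


def is_palindrome(subject, atomic=False):
    """Emulate ^((.)(?1)\\2|.?)$ by requiring group 1 to end at the end."""
    return any(end == len(subject) for end in group1(subject, 0, atomic))
```

With the default setting, "abcba", "abba" and "ababa" are all accepted. With atomic=True, "abba" is rejected: the innermost calls settle on matching a single character, and the matcher is not allowed to go back into them to try the empty alternative.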
<P> <P>
<b>WARNING</b>: The palindrome-matching patterns above work only if the subject Another way in which PCRE2 and Perl used to differ in their recursion
string does not start with a palindrome that is shorter than the entire string. processing is in the handling of captured values. Formerly in Perl, when a
For example, although "abcba" is correctly matched, if the subject is "ababa", subpattern was called recursively or as a subpattern (see the next section), it
PCRE2 finds the palindrome "aba" at the start, then fails at top level because had no access to any values that were captured outside the recursion, whereas
the end of the string does not follow. Once again, it cannot jump back into the in PCRE2 these values can be referenced. Consider this pattern:
recursion to try other alternatives, so the entire match fails.
</P>
<P>
The second way in which PCRE2 and Perl differ in their recursion processing is
in the handling of captured values. In Perl, when a subpattern is called
recursively or as a subpattern (see the next section), it has no access to any
values that were captured outside the recursion, whereas in PCRE2 these values
can be referenced. Consider this pattern:
<pre> <pre>
^(.)(\1|a(?2)) ^(.)(\1|a(?2))
</pre> </pre>
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b", This pattern matches "bab". The first capturing parentheses match "b", then in
then in the second group, when the back reference \1 fails to match "b", the the second group, when the back reference \1 fails to match "b", the second
second alternative matches "a" and then recurses. In the recursion, \1 does alternative matches "a" and then recurses. In the recursion, \1 does now match
now match "b" and so the whole match succeeds. In Perl, the pattern fails to "b" and so the whole match succeeds. This match used to fail in Perl, but in
match because inside the recursive call \1 cannot access the externally set later versions (I tried 5.024) it now works.
value.
<a name="subpatternsassubroutines"></a></P> <a name="subpatternsassubroutines"></a></P>
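The walk-through of ^(.)(\1|a(?2)) above can be traced with a short Python emulation. This is only an illustrative sketch: the function names are hypothetical, the dot is assumed to match any character, and the value captured by group 1 is passed explicitly into the recursive call to model the fact that it remains visible inside the recursion.

```python
def group2(subject, start, captured1):
    """Emulate group 2, (\\1|a(?2)); return the end offset or None."""
    # First alternative: back reference to what group 1 captured.
    if subject.startswith(captured1, start):
        return start + len(captured1)
    # Second alternative: literal "a" followed by the recursive call (?2),
    # which still sees the value captured outside the recursion.
    if subject.startswith("a", start):
        return group2(subject, start + 1, captured1)
    return None


def match_pattern(subject):
    """Emulate ^(.)(\\1|a(?2)); return the end offset of the match or None."""
    if not subject:
        return None
    # Group 1 captures the first character of the subject.
    return group2(subject, 1, subject[0])
```

Calling match_pattern("bab") returns 3: the back reference fails at offset 1, the second alternative consumes "a", and inside the recursion the back reference matches the final "b". A subject such as "bac" gives None.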
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> <br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P> <P>
@ -2886,11 +2857,10 @@ is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above. strings. Another example is given in the discussion of DEFINE above.
</P> </P>
<P> <P>
All subroutine calls, whether recursive or not, are always treated as atomic Like recursions, subroutine calls used to be treated as atomic, but this
groups. That is, once a subroutine has matched some of the subject string, it changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
is never re-entered, even if it contains untried alternatives and there is a occur. However, any capturing parentheses that are set during the subroutine
subsequent matching failure. Any capturing parentheses that are set during the call revert to their previous values afterwards.
subroutine call revert to their previous values afterwards.
</P> </P>
<P> <P>
Processing options such as case-independence are fixed when a subpattern is Processing options such as case-independence are fixed when a subpattern is
@ -2998,17 +2968,10 @@ The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P> <a name="backtrackcontrol"></a></P>
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br> <br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P> <P>
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which There are a number of special "Backtracking Control Verbs" (to use Perl's
are still described in the Perl documentation as "experimental and subject to terminology) that modify the behaviour of backtracking during matching. They
change or removal in a future version of Perl". It goes on to say: "Their usage are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
in production code should be noted to avoid problems during upgrades." The same possibly behaving differently depending on whether or not a name is present.
remarks apply to the PCRE2 features described in this section.
</P>
<P>
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
(*VERB:NAME). Some verbs take either form, possibly behaving differently
depending on whether or not a name is present.
</P> </P>
<P> <P>
By default, for compatibility with Perl, a name is any sequence of characters By default, for compatibility with Perl, a name is any sequence of characters
@ -3040,7 +3003,7 @@ not there. Any number of these verbs may occur in a pattern.
<P> <P>
Since these verbs are specifically related to backtracking, most of them can be Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching used only when the pattern is to be matched using the traditional matching
function, because these use a backtracking algorithm. With the exception of function, because that uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, the backtracking (*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function. control verbs cause an error if encountered by the DFA matching function.
</P> </P>
@ -3178,11 +3141,11 @@ Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching continues The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is no subsequent match, causing a backtrack to with what follows, but if there is no subsequent match, causing a backtrack to
the verb, a failure is forced. That is, backtracking cannot pass to the left of the verb, a failure is forced. That is, backtracking cannot pass to the left of
the verb. However, when one of these verbs appears inside an atomic group the verb. However, when one of these verbs appears inside an atomic group or in
(which includes any group that is called as a subroutine) or in an assertion an assertion that is true, its effect is confined to that group, because once
that is true, its effect is confined to that group, because once the group has the group has been matched, there is never any backtracking into it. In this
been matched, there is never any backtracking into it. In this situation, situation, backtracking has to jump to the left of the entire atomic group or
backtracking has to jump to the left of the entire atomic group or assertion. assertion.
</P> </P>
<P> <P>
These verbs differ in exactly what kind of failure occurs when backtracking These verbs differ in exactly what kind of failure occurs when backtracking
@ -3246,8 +3209,8 @@ expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT). as (*COMMIT).
</P> </P>
<P> <P>
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE). The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
It is like (*MARK:NAME) in that the name is remembered for passing back to the like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK), caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by (*PRUNE) or (*THEN). ignoring those set by (*PRUNE) or (*THEN).
<pre> <pre>
@ -3452,9 +3415,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 27 December 2016 Last updated: 18 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -55,7 +55,10 @@ The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to within individual applications. As such, the data supplied to
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from <b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not arbitrary external sources. There is only some simple consistency checking, not
complete validation of what is being re-loaded. complete validation of what is being re-loaded. Corrupted data may cause
undefined results. For example, if the length field of a pattern in the
serialized data is corrupted, the deserializing code may read beyond the end of
the byte stream that is passed to it.
</P> </P>
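The over-read risk described above is easiest to see with a toy format. The sketch below does not use PCRE2's serialization layout. It assumes an invented stream of records, each a 4-byte little-endian length followed by that many bytes, and `read_records` is a hypothetical helper. It shows the kind of bounds check that a corrupted length field would otherwise defeat.

```python
import struct

def read_records(stream):
    """Split a byte stream of length-prefixed records (invented layout)."""
    records = []
    offset = 0
    while offset < len(stream):
        if len(stream) - offset < 4:
            raise ValueError("truncated length field")
        # Hypothetical layout: 4-byte little-endian length, then the payload.
        (length,) = struct.unpack_from("<I", stream, offset)
        offset += 4
        # Without this check, a corrupted length would make a C decoder
        # read past the end of the buffer it was given.
        if length > len(stream) - offset:
            raise ValueError("length field points past the end of the stream")
        records.append(stream[offset:offset + length])
        offset += length
    return records
```

A stream whose first length field claims 300 bytes but carries only 3 is rejected with `ValueError` instead of being read blindly. Since the text says the real decoder does only simple consistency checking, the safer course is to pass it only data the application wrote itself.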
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br> <br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<P> <P>
@ -190,9 +193,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br> <br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 24 May 2016 Last updated: 21 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -126,12 +126,13 @@ character values up to 0x7fffffff. Each character is placed in one 16-bit or
to occur). to occur).
</P> </P>
<P> <P>
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such UTF-8 (in its original definition) is not capable of encoding values greater
values can be handled by the 32-bit library. When testing this library in than 0x7fffffff, but such values can be handled by the 32-bit library. When
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the testing this library in non-UTF mode with <b>utf8_input</b> set, if any
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
character's value. This is the only way of passing such code points in a 0x80000000 is added to the character's value. This is the only way of passing
pattern string. For subject strings, using an escape sequence is preferable. such code points in a pattern string. For subject strings, using an escape
sequence is preferable.
</P> </P>
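The 0xff prefix rule can be modelled in a few lines. This is a sketch under stated assumptions, not pcre2test's own code: `decode_utf8_input` is a hypothetical helper that accepts the original UTF-8 definition (sequences of up to six bytes, values up to 0x7fffffff) and treats a 0xff byte as a request to add 0x80000000 to the character that follows. It does not reject overlong encodings.

```python
def decode_utf8_input(data):
    """Decode bytes into code point values, honouring the 0xff prefix rule."""
    values = []
    i = 0
    while i < len(data):
        extra = 0
        if data[i] == 0xFF:  # marker byte, never valid in UTF-8 itself
            extra = 0x80000000
            i += 1
            if i == len(data):
                raise ValueError("0xff marker with no following character")
        lead = data[i]
        if lead < 0x80:
            count, value = 0, lead
        elif lead < 0xC0 or lead >= 0xFE:
            raise ValueError("invalid lead byte 0x%02x" % lead)
        elif lead < 0xE0:
            count, value = 1, lead & 0x1F
        elif lead < 0xF0:
            count, value = 2, lead & 0x0F
        elif lead < 0xF8:
            count, value = 3, lead & 0x07
        elif lead < 0xFC:
            count, value = 4, lead & 0x03
        else:
            count, value = 5, lead & 0x01
        tail = data[i + 1:i + 1 + count]
        if len(tail) != count or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)  # each continuation adds 6 bits
        values.append(value + extra)
        i += 1 + count
    return values
```

Under this model the longest six-byte sequence decodes to 0x7fffffff, so the prefix is the only way to reach values from 0x80000000 to 0xffffffff.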
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br> <br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P> <P>
@ -602,6 +603,7 @@ about the pattern:
/B bincode show binary code without lengths /B bincode show binary code without lengths
callout_info show callout information callout_info show callout information
debug same as info,fullbincode debug same as info,fullbincode
framesize show matching frame size
fullbincode show binary code with lengths fullbincode show binary code with lengths
/I info show info about compiled pattern /I info show info about compiled pattern
hex unquoted characters are hexadecimal hex unquoted characters are hexadecimal
@ -689,6 +691,11 @@ not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded. ending code units are recorded.
</P> </P>
<P> <P>
The <b>framesize</b> modifier shows the size, in bytes, of the storage frames
used by <b>pcre2_match()</b> for handling backtracking. The size depends on the
number of capturing parentheses in the pattern.
</P>
<P>
The <b>callout_info</b> modifier requests information about all the callouts in The <b>callout_info</b> modifier requests information about all the callouts in
the pattern. A list of them is output at the end of any other information that the pattern. A list of them is output at the end of any other information that
is requested. For each callout, either its number or string is given, followed is requested. For each callout, either its number or string is given, followed
@ -1073,6 +1080,7 @@ pattern.
callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure
callout_none do not supply a callout function callout_none do not supply a callout function
copy=&#60;number or name&#62; copy captured substring copy=&#60;number or name&#62; copy captured substring
depth_limit=&#60;n&#62; set a depth limit
dfa use <b>pcre2_dfa_match()</b> dfa use <b>pcre2_dfa_match()</b>
find_limits find match and recursion limits find_limits find match and recursion limits
get=&#60;number or name&#62; extract captured substring get=&#60;number or name&#62; extract captured substring
@ -1086,7 +1094,7 @@ pattern.
offset=&#60;n&#62; set starting offset offset=&#60;n&#62; set starting offset
offset_limit=&#60;n&#62; set offset limit offset_limit=&#60;n&#62; set offset limit
ovector=&#60;n&#62; set size of output vector ovector=&#60;n&#62; set size of output vector
recursion_limit=&#60;n&#62; set a recursion limit recursion_limit=&#60;n&#62; obsolete synonym for depth_limit
replace=&#60;string&#62; specify a replacement string replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
startoffset=&#60;n&#62; same as offset=&#60;n&#62; startoffset=&#60;n&#62; same as offset=&#60;n&#62;
@ -1320,10 +1328,10 @@ stack that is larger than the default 32K is necessary only for very
complicated patterns. complicated patterns.
</P> </P>
<br><b> <br><b>
Setting match and recursion limits Setting match and depth limits
</b><br> </b><br>
<P> <P>
The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate The <b>match_limit</b> and <b>depth_limit</b> modifiers set the appropriate
limits in the match context. These values are ignored when the limits in the match context. These values are ignored when the
<b>find_limits</b> modifier is specified. <b>find_limits</b> modifier is specified.
</P> </P>
@ -1333,23 +1341,23 @@ Finding minimum limits
<P> <P>
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
<b>pcre2_match()</b> several times, setting different values in the match <b>pcre2_match()</b> several times, setting different values in the match
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b> context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_depth_limit()</b>
until it finds the minimum values for each parameter that allow until it finds the minimum values for each parameter that allow
<b>pcre2_match()</b> to complete without error. <b>pcre2_match()</b> to complete without error.
</P> </P>
<P> <P>
If JIT is being used, only the match limit is relevant. If DFA matching is If JIT is being used, only the match limit is relevant. If DFA matching is
being used, neither limit is relevant, and this modifier is ignored (with a being used, only the depth limit is relevant, but at present this modifier is
warning message). ignored (with a warning message).
</P> </P>
<P> <P>
The <i>match_limit</i> number is a measure of the amount of backtracking The <i>match_limit</i> number is a measure of the amount of backtracking
that takes place, and learning the minimum value can be instructive. For most that takes place, and learning the minimum value can be instructive. For most
simple matches, the number is quite small, but for patterns with very large simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with numbers of matching possibilities, it can become large very quickly with
increasing length of subject string. The <i>match_limit_recursion</i> number is increasing length of subject string. The <i>depth_limit</i> number is
a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much a measure of how much memory for recording backtracking points is needed to
heap) memory is needed to complete the match attempt. complete the match attempt.
</P> </P>
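One plausible way to search for a minimum limit is sketched below. The text says only that pcre2_match() is called several times with different values, so the strategy here (doubling, then bisection) is an assumption, and `completes` is a hypothetical callback standing in for "run the match with this limit and report whether it finished without a limit error". The sketch also assumes that a larger limit never turns a success into a failure.

```python
def find_minimum_limit(completes, ceiling=1 << 32):
    """Return the smallest limit for which completes(limit) is true."""
    low, high = 0, 1
    # Phase 1: double the limit until the match completes.
    while not completes(high):
        low = high
        high *= 2
        if high > ceiling:
            raise ValueError("no limit up to the ceiling lets the match complete")
    # Phase 2: bisect between the last failure and the first success.
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high
```

The same search would be run once for the match limit and once for the depth limit, each time holding the other limit at a value high enough not to interfere.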
<br><b> <br><b>
Showing MARK names Showing MARK names
@ -1466,7 +1474,7 @@ code unit offset of the start of the failing character is also output. Here is
an example of an interactive <b>pcre2test</b> run. an example of an interactive <b>pcre2test</b> run.
<pre> <pre>
$ pcre2test $ pcre2test
PCRE2 version 9.00 2014-05-10 PCRE2 version 10.22 2016-07-29
re&#62; /^abc(\d+)/ re&#62; /^abc(\d+)/
data&#62; abc123 data&#62; abc123
@ -1779,9 +1787,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 28 December 2016 Last updated: 21 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -89,8 +89,8 @@ SECURITY CONSIDERATIONS
One way of guarding against this possibility is to use the pcre2_pat- One way of guarding against this possibility is to use the pcre2_pat-
tern_info() function to check the compiled pattern's options for tern_info() function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
calling pcre2_compile(). This causes an compile time error if a pattern calling pcre2_compile(). This causes a compile time error if the pat-
contains a UTF-setting sequence. tern contains a UTF-setting sequence.
The use of Unicode properties for character types such as \d can also The use of Unicode properties for character types such as \d can also
be enabled from within the pattern, by specifying "(*UCP)". This fea- be enabled from within the pattern, by specifying "(*UCP)". This fea-
@ -112,7 +112,9 @@ SECURITY CONSIDERATIONS
has a very large search tree against a string that will never match. has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE2 pro- Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
vides some protection against this: see the pcre2_set_match_limit() vides some protection against this: see the pcre2_set_match_limit()
function in the pcre2api page. function in the pcre2api page. There is a similar function called
pcre2_set_depth_limit() that can be used to restrict the amount of mem-
ory that is used.
USER DOCUMENTATION USER DOCUMENTATION
@ -144,7 +146,7 @@ USER DOCUMENTATION
pcre2perform discussion of performance issues pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage pcre2stack discussion of stack and memory usage
pcre2syntax quick syntax reference pcre2syntax quick syntax reference
pcre2test description of the pcre2test command pcre2test description of the pcre2test command
pcre2unicode discussion of Unicode and UTF support pcre2unicode discussion of Unicode and UTF support
@ -166,8 +168,8 @@ AUTHOR
REVISION REVISION
Last updated: 16 October 2015 Last updated: 27 March 2017
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -2533,9 +2535,10 @@ OBTAINING A TEXTUAL ERROR MESSAGE
A text message for an error code from any PCRE2 function (compile, A text message for an error code from any PCRE2 function (compile,
match, or auxiliary) can be obtained by calling pcre2_get_error_mes- match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
sage(). The code is passed as the first argument, with the remaining sage(). The code is passed as the first argument, with the remaining
two arguments specifying a code unit buffer and its length, into which two arguments specifying a code unit buffer and its length in code
the text message is placed. Note that the message is returned in code units, into which the text message is placed. The message is returned
units of the appropriate width for the library that is being used. in code units of the appropriate width for the library that is being
used.
The returned message is terminated with a trailing zero, and the func- The returned message is terminated with a trailing zero, and the func-
tion returns the number of code units used, excluding the trailing tion returns the number of code units used, excluding the trailing
@ -3178,8 +3181,8 @@ AUTHOR
REVISION REVISION
Last updated: 23 December 2016 Last updated: 21 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -5519,19 +5522,24 @@ SPECIAL START-OF-PATTERN ITEMS
attempt by the application to apply the JIT optimization by calling attempt by the application to apply the JIT optimization by calling
pcre2_jit_compile() is ignored. pcre2_jit_compile() is ignored.
Setting match and recursion limits Setting match and backtracking depth limits
The caller of pcre2_match() can set a limit on the number of times the The pcre2_match() function contains a counter that is incremented every
internal match() function is called and on the maximum depth of recur- time it goes round its main loop. The caller of pcre2_match() can set a
sive calls. These facilities are provided to catch runaway matches that limit on this counter, which therefore limits the amount of computing
are provoked by patterns with huge matching trees (a typical example is resource used for a match. The maximum depth of nested backtracking can
a pattern with nested unlimited repeats) and to avoid running out of also be limited, and this restricts the amount of heap memory that is
system stack by too much recursion. When one of these limits is used.
reached, pcre2_match() gives an error return. The limits can also be
set by items at the start of the pattern of the form These facilities are provided to catch runaway matches that are pro-
voked by patterns with huge matching trees (a typical example is a pat-
tern with nested unlimited repeats applied to a long string that does
not match). When one of these limits is reached, pcre2_match() gives an
error return. The limits can also be set by items at the start of the
pattern of the form
(*LIMIT_MATCH=d) (*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d) (*LIMIT_DEPTH=d)
where d is any number of decimal digits. However, the value of the set- where d is any number of decimal digits. However, the value of the set-
ting must be less than the value set (or defaulted) by the caller of ting must be less than the value set (or defaulted) by the caller of
@ -5540,11 +5548,15 @@ SPECIAL START-OF-PATTERN ITEMS
If there is more than one setting of one of these limits, the lower If there is more than one setting of one of these limits, the lower
value is used. value is used.
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
name is still recognized for backwards compatibility.
The match limit is used (but in a different way) when JIT is being The match limit is used (but in a different way) when JIT is being
used, but it is not relevant, and is ignored, when matching with used, but it is not relevant, and is ignored, when matching with
pcre2_dfa_match(). However, the recursion limit is relevant for DFA pcre2_dfa_match(). However, the depth limit is relevant for DFA match-
matching, which does use some function recursion, in particular, for ing, which uses function recursion for recursions within the pattern.
recursions within the pattern. In this case, the depth limit controls the amount of system stack that
is used.
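The interaction between the caller's limits and the start-of-pattern items can be modelled as taking a minimum. The sketch below is an illustration, not PCRE2's parser: `effective_limits` is a hypothetical helper that scans only consecutive limit items at the very start of the pattern and ignores the other start-of-pattern items a real pattern may contain.

```python
import re

# Items of the form (*LIMIT_MATCH=d) or (*LIMIT_DEPTH=d). LIMIT_RECURSION
# is the pre-10.30 name for LIMIT_DEPTH, as the text notes.
_LIMIT_ITEM = re.compile(r"\(\*(LIMIT_MATCH|LIMIT_DEPTH|LIMIT_RECURSION)=(\d+)\)")

def effective_limits(pattern, caller_match_limit, caller_depth_limit):
    """Return the limits in force after applying leading limit items."""
    limits = {"match": caller_match_limit, "depth": caller_depth_limit}
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        key = "match" if item.group(1) == "LIMIT_MATCH" else "depth"
        # A pattern can lower a limit but never raise it, and with several
        # settings the lowest one wins, so a running minimum covers both rules.
        limits[key] = min(limits[key], int(item.group(2)))
        pos = item.end()
    return limits
```

A pattern that asks for a higher limit than the caller set leaves the caller's value in force.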
Newline conventions Newline conventions
@ -5579,9 +5591,9 @@ SPECIAL START-OF-PATTERN ITEMS
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However, acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
it does not affect what the \R escape sequence matches. By default, it does not affect what the \R escape sequence matches. By default,
this is any Unicode newline sequence, for Perl compatibility. However, this is any Unicode newline sequence, for Perl compatibility. However,
this can be changed; see the description of \R in the section entitled this can be changed; see the next section and the description of \R in
"Newline sequences" below. A change of \R setting can be combined with the section entitled "Newline sequences" below. A change of \R setting
a change of newline convention. can be combined with a change of newline convention.
Specifying what \R matches Specifying what \R matches
@ -5595,7 +5607,7 @@ SPECIAL START-OF-PATTERN ITEMS
EBCDIC CHARACTER CODES EBCDIC CHARACTER CODES
PCRE2 can be compiled to run in an environment that uses EBCDIC as its PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code rather than ASCII or Unicode (typically a mainframe sys- character code instead of ASCII or Unicode (typically a mainframe sys-
tem). In the sections below, character code values are ASCII or Uni- tem). In the sections below, character code values are ASCII or Uni-
code; in an EBCDIC environment these characters may have different code code; in an EBCDIC environment these characters may have different code
values, and there are no code points greater than 255. values, and there are no code points greater than 255.
@ -5660,8 +5672,8 @@ BACKSLASH
meaning that character may have. This use of backslash as an escape meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes. character applies both inside and outside character classes.
For example, if you want to match a * character, you write \* in the For example, if you want to match a * character, you must write \* in
pattern. This escaping action applies whether or not the following the pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a back- that it stands for itself. In particular, if you want to match a back-
@ -5695,7 +5707,8 @@ BACKSLASH
is not followed by \E later in the pattern, the literal interpretation is not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an end). If the isolated \Q is inside a character class, this causes an
error, because the character class is not terminated. error, because the character class is not terminated by a closing
square bracket.
Non-printing characters Non-printing characters
@ -5810,10 +5823,10 @@ BACKSLASH
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
just described only when it is followed by two hexadecimal digits. Oth- just described only when it is followed by two hexadecimal digits. Oth-
erwise, it matches a literal "x" character. In this mode mode, support erwise, it matches a literal "x" character. In this mode, support for
for code points greater than 256 is provided by \u, which must be fol- code points greater than 256 is provided by \u, which must be followed
lowed by four hexadecimal digits; otherwise it matches a literal "u" by four hexadecimal digits; otherwise it matches a literal "u" charac-
character. ter.
Characters whose value is less than 256 can be defined by either of the Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif- two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
@ -5825,12 +5838,10 @@ BACKSLASH
Characters that are specified using octal or hexadecimal numbers are Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows: limited to certain values, as follows:
8-bit non-UTF mode less than 0x100 8-bit non-UTF mode no greater than 0xff
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint 16-bit non-UTF mode no greater than 0xffff
16-bit non-UTF mode less than 0x10000 32-bit non-UTF mode no greater than 0xffffffff
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint All UTF modes no greater than 0x10ffff and a valid codepoint
32-bit non-UTF mode less than 0x100000000
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so- Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
called "surrogate" codepoints), and 0xffef. called "surrogate" codepoints), and 0xffef.
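The revised table reduces to two rules, which the sketch below encodes. The function names are hypothetical, and the validity check covers only the surrogate range named in the text, not the additional single value that the text lists.

```python
SURROGATES = range(0xD800, 0xE000)  # 0xd800 to 0xdfff inclusive

def max_escape_value(width, utf):
    """Largest value an octal or hexadecimal escape may specify."""
    if utf:
        return 0x10FFFF  # the same ceiling in every UTF mode
    return {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[width]

def escape_value_allowed(value, width, utf):
    """True if value is within range for the given library width and mode."""
    if value < 0 or value > max_escape_value(width, utf):
        return False
    # Surrogates are rejected only in UTF modes.
    return not (utf and value in SURROGATES)
```

For example, 0xd800 is accepted by this model in 16-bit non-UTF mode but rejected in any UTF mode.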
@ -5852,8 +5863,7 @@ BACKSLASH
handler and used to modify the case of following characters. By handler and used to modify the case of following characters. By
default, PCRE2 does not support these escape sequences. However, if the default, PCRE2 does not support these escape sequences. However, if the
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
used to define a character by code point, as described in the previous used to define a character by code point, as described above.
section.
Absolute and relative back references Absolute and relative back references
@ -6022,7 +6032,10 @@ BACKSLASH
tional escape sequences that match characters with specific properties tional escape sequences that match characters with specific properties
are available. In 8-bit non-UTF-8 mode, these sequences are of course are available. In 8-bit non-UTF-8 mode, these sequences are of course
limited to testing characters whose codepoints are less than 256, but limited to testing characters whose codepoints are less than 256, but
they do work in this mode. The extra escape sequences are: they do work in this mode. In 32-bit non-UTF mode, codepoints greater
than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Common script and with an unassigned type. The
extra escape sequences are:
\p{xx} a character with the xx property \p{xx} a character with the xx property
\P{xx} a character without the xx property \P{xx} a character without the xx property
@ -7328,16 +7341,9 @@ ASSERTIONS
Assertion subpatterns are not capturing subpatterns. If such an asser- Assertion subpatterns are not capturing subpatterns. If such an asser-
tion contains capturing subpatterns within it, these are counted for tion contains capturing subpatterns within it, these are counted for
the purposes of numbering the capturing subpatterns in the whole pat- the purposes of numbering the capturing subpatterns in the whole pat-
tern. However, substring capturing is carried out only for positive tern. However, substring capturing is normally carried out only for
assertions. (Perl sometimes, but not always, does do capturing in nega- positive assertions (but see the discussion of conditional subpatterns
tive assertions.) below).
WARNING: If a positive assertion containing one or more capturing sub-
patterns succeeds, but failure to match later in the pattern causes
backtracking over this assertion, the captures within the assertion are
reset only if no higher numbered captures are already set. This is,
unfortunately, a fundamental limitation of the current implementation;
it may get removed in a future reworking.
For compatibility with Perl, most assertion subpatterns may be For compatibility with Perl, most assertion subpatterns may be
repeated; though it makes no sense to assert the same thing several repeated; though it makes no sense to assert the same thing several
@ -7686,6 +7692,12 @@ CONDITIONAL SUBPATTERNS
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits. letters and dd are digits.
For Perl compatibility, if an assertion that is a condition contains
capturing subpatterns, any capturing that occurs is retained after-
wards, for both positive and negative assertions. (Compare non-condi-
tional assertions, when captures are retained only for positive asser-
tions.)
COMMENTS COMMENTS
@ -7849,94 +7861,59 @@ RECURSIVE PATTERNS
Differences in recursion processing between PCRE2 and Perl Differences in recursion processing between PCRE2 and Perl
Recursion processing in PCRE2 differs from Perl in two important ways. Some former differences between PCRE2 and Perl no longer exist.
In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. This can be
illustrated by the following pattern, which purports to match a palin-
dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$ Before release 10.30, recursion processing in PCRE2 differed from Perl
in that a recursive subpattern call was always treated as an atomic
group. That is, once it had matched some of the subject string, it was
never re-entered, even if it contained untried alternatives and there
was a subsequent matching failure. (Historical note: PCRE implemented
recursion before Perl did.)
The idea is that it either matches a single character, or two identical Starting with release 10.30, recursive subroutine calls are no longer
characters surrounding a sub-palindrome. In Perl, this pattern works; treated as atomic. That is, they can be re-entered to try unused alter-
in PCRE2 it does not if the pattern is longer than three characters. natives if there is a matching failure later in the pattern. This is
Consider the subject string "abcba": now compatible with the way Perl works. If you want a subroutine call
to be atomic, you must explicitly enclose it in an atomic group.
At the top level, the first character is matched, but as it is not at Supporting backtracking into recursions simplifies certain types of
the end of the string, the first alternative fails; the second alterna- recursive pattern. For example, this pattern matches palindromic
tive is taken and the recursion kicks in. The recursive call to subpat- strings:
tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion).
Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to re-
enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE2 can-
not use.
To change the pattern so that it matches all palindromic strings, not
just those with an odd number of characters, it is tempting to change
the pattern to this:
^((.)(?1)\2|.?)$ ^((.)(?1)\2|.?)$
Again, this works in Perl, but not in PCRE2, and for the same reason. The second branch in the group matches a single central character in
When a deeper recursion has matched a single character, it cannot be the palindrome when there are an odd number of characters, or nothing
entered again in order to match an empty string. The solution is to when there are an even number of characters, but in order to work it
separate the two cases, and write out the odd and even cases as alter- has to be able to try the second case when the rest of the pattern
natives at the higher level: match fails. If you want to match typical palindromic phrases, the pat-
tern has to ignore all non-word characters, which can be done like
this:
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE2_CASELESS option, this pattern matches phrases If run with the PCRE2_CASELESS option, this pattern matches phrases
such as "A man, a plan, a canal: Panama!" and it works in both PCRE2 such as "A man, a plan, a canal: Panama!". Note the use of the posses-
and Perl. Note the use of the possessive quantifier *+ to avoid back- sive quantifier *+ to avoid backtracking into sequences of non-word
tracking into sequences of non-word characters. Without this, PCRE2 characters. Without this, PCRE2 takes a great deal longer (ten times or
takes a great deal longer (ten times or more) to match typical phrases, more) to match typical phrases, and Perl takes so long that you think
and Perl takes so long that you think it has gone into a loop. it has gone into a loop.
WARNING: The palindrome-matching patterns above work only if the sub- Another way in which PCRE2 and Perl used to differ in their recursion
ject string does not start with a palindrome that is shorter than the processing is in the handling of captured values. Formerly in Perl,
entire string. For example, although "abcba" is correctly matched, if when a subpattern was called recursively or as a subpattern (see the
the subject is "ababa", PCRE2 finds the palindrome "aba" at the start, next section), it had no access to any values that were captured out-
then fails at top level because the end of the string does not follow. side the recursion, whereas in PCRE2 these values can be referenced.
Once again, it cannot jump back into the recursion to try other alter- Consider this pattern:
natives, so the entire match fails.
The second way in which PCRE2 and Perl differ in their recursion pro-
cessing is in the handling of captured values. In Perl, when a subpat-
tern is called recursively or as a subpattern (see the next section),
it has no access to any values that were captured outside the recur-
sion, whereas in PCRE2 these values can be referenced. Consider this
pattern:
^(.)(\1|a(?2)) ^(.)(\1|a(?2))
In PCRE2, this pattern matches "bab". The first capturing parentheses This pattern matches "bab". The first capturing parentheses match "b",
match "b", then in the second group, when the back reference \1 fails then in the second group, when the back reference \1 fails to match
to match "b", the second alternative matches "a" and then recurses. In "b", the second alternative matches "a" and then recurses. In the
the recursion, \1 does now match "b" and so the whole match succeeds. recursion, \1 does now match "b" and so the whole match succeeds. This
In Perl, the pattern fails to match because inside the recursive call match used to fail in Perl, but in later versions (I tried 5.024) it
\1 cannot access the externally set value. now works.
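For comparison, the strings accepted by the palindrome pattern ^((.)(?1)\2|.?)$ can be described by a short recursive function. This is a sketch of what the pattern accepts, not of how PCRE2 matches it. The function names are hypothetical, and the phrase version approximates \W and PCRE2_CASELESS with Python's own character tests and case folding.

```python
def is_palindrome(s):
    """Mirror the two branches of ^((.)(?1)\\2|.?)$ on a string."""
    # First branch, (.)(?1)\2: the same character at both ends,
    # wrapped around a shorter palindrome.
    if len(s) >= 2 and s[0] == s[-1] and is_palindrome(s[1:-1]):
        return True
    # Second branch, .?: a single central character or nothing.
    return len(s) <= 1

def is_palindromic_phrase(text):
    """Ignore non-word characters and case, as the phrase pattern does."""
    word_chars = [c.lower() for c in text if c.isalnum() or c == "_"]
    return is_palindrome("".join(word_chars))
```

Called on "A man, a plan, a canal: Panama!", `is_palindromic_phrase` returns True. The function never has to choose where the recursion ends, whereas the pattern finds that point by backtracking into the recursive call, which is what release 10.30 makes possible.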
SUBPATTERNS AS SUBROUTINES SUBPATTERNS AS SUBROUTINES
@ -7964,12 +7941,10 @@ SUBPATTERNS AS SUBROUTINES
two strings. Another example is given in the discussion of DEFINE two strings. Another example is given in the discussion of DEFINE
above. above.
All subroutine calls, whether recursive or not, are always treated as Like recursions, subroutine calls used to be treated as atomic, but
atomic groups. That is, once a subroutine has matched some of the sub- this changed at PCRE2 release 10.30, so backtracking into subroutine
ject string, it is never re-entered, even if it contains untried alter- calls can now occur. However, any capturing parentheses that are set
natives and there is a subsequent matching failure. Any capturing during the subroutine call revert to their previous values afterwards.
parentheses that are set during the subroutine call revert to their
previous values afterwards.
Processing options such as case-independence are fixed when a subpat- Processing options such as case-independence are fixed when a subpat-
tern is defined, so if it is used as a subroutine, such options cannot tern is defined, so if it is used as a subroutine, such options cannot
@ -8076,17 +8051,11 @@ CALLOUTS
BACKTRACKING CONTROL BACKTRACKING CONTROL
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", There are a number of special "Backtracking Control Verbs" (to use
which are still described in the Perl documentation as "experimental Perl's terminology) that modify the behaviour of backtracking during
and subject to change or removal in a future version of Perl". It goes matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
on to say: "Their usage in production code should be noted to avoid verbs take either form, possibly behaving differently depending on
problems during upgrades." The same remarks apply to the PCRE2 features whether or not a name is present.
described in this section.
The new verbs make use of what was previously invalid syntax: an open-
ing parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
differently depending on whether or not a name is present.
By default, for compatibility with Perl, a name is any sequence of By default, for compatibility with Perl, a name is any sequence of
characters that does not include a closing parenthesis. The name is not characters that does not include a closing parenthesis. The name is not
@ -8116,7 +8085,7 @@ BACKTRACKING CONTROL
Since these verbs are specifically related to backtracking, most of Since these verbs are specifically related to backtracking, most of
them can be used only when the pattern is to be matched using the tra- them can be used only when the pattern is to be matched using the tra-
ditional matching function, because these use a backtracking algorithm. ditional matching function, because that uses a backtracking algorithm.
With the exception of (*FAIL), which behaves like a failing negative With the exception of (*FAIL), which behaves like a failing negative
assertion, the backtracking control verbs cause an error if encountered assertion, the backtracking control verbs cause an error if encountered
by the DFA matching function. by the DFA matching function.
@ -8236,11 +8205,11 @@ BACKTRACKING CONTROL
tinues with what follows, but if there is no subsequent match, causing tinues with what follows, but if there is no subsequent match, causing
a backtrack to the verb, a failure is forced. That is, backtracking a backtrack to the verb, a failure is forced. That is, backtracking
cannot pass to the left of the verb. However, when one of these verbs cannot pass to the left of the verb. However, when one of these verbs
appears inside an atomic group (which includes any group that is called appears inside an atomic group or in an assertion that is true, its
as a subroutine) or in an assertion that is true, its effect is con- effect is confined to that group, because once the group has been
fined to that group, because once the group has been matched, there is matched, there is never any backtracking into it. In this situation,
never any backtracking into it. In this situation, backtracking has to backtracking has to jump to the left of the entire atomic group or
jump to the left of the entire atomic group or assertion. assertion.
These verbs differ in exactly what kind of failure occurs when back- These verbs differ in exactly what kind of failure occurs when back-
tracking reaches them. The behaviour described below is what happens tracking reaches them. The behaviour described below is what happens
@ -8303,11 +8272,10 @@ BACKTRACKING CONTROL
any other way. In an anchored pattern (*PRUNE) has the same effect as any other way. In an anchored pattern (*PRUNE) has the same effect as
(*COMMIT). (*COMMIT).
The behaviour of (*PRUNE:NAME) is the not the same as The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
(*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is It is like (*MARK:NAME) in that the name is remembered for passing back
remembered for passing back to the caller. However, (*SKIP:NAME) to the caller. However, (*SKIP:NAME) searches only for names set with
searches only for names set with (*MARK), ignoring those set by (*MARK), ignoring those set by (*PRUNE) or (*THEN).
(*PRUNE) or (*THEN).
(*SKIP) (*SKIP)
@ -8496,8 +8464,8 @@ AUTHOR
REVISION REVISION
Last updated: 27 December 2016 Last updated: 18 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -9078,7 +9046,10 @@ SECURITY CONCERNS
use within individual applications. As such, the data supplied to use within individual applications. As such, the data supplied to
pcre2_serialize_decode() is expected to be trusted data, not data from pcre2_serialize_decode() is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency arbitrary external sources. There is only some simple consistency
checking, not complete validation of what is being re-loaded. checking, not complete validation of what is being re-loaded. Corrupted
data may cause undefined results. For example, if the length field of a
pattern in the serialized data is corrupted, the deserializing code may
read beyond the end of the byte stream that is passed to it.
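
The failure mode described in the added text (a corrupted length field leading to a read past the end of the supplied byte stream) can be illustrated with a minimal sketch. The record layout, the function name read_record() and the RECORD_* codes below are assumptions made up for illustration; this is not the PCRE2 serialization format and not code from the library. It only shows the kind of bounds check that "complete validation" would need for each length field.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout, NOT the PCRE2 serialization format:
   a 4-byte little-endian length followed by that many payload bytes. */
enum { RECORD_OK = 0, RECORD_TRUNCATED = -1, RECORD_TOO_BIG = -2 };

/* Copy the payload of the record at the start of stream[0..stream_len) into
   out[0..out_cap). The length field is checked against the number of bytes
   actually available before any payload byte is read, so a corrupted length
   is reported as an error instead of causing a read beyond the stream. */
int read_record(const uint8_t *stream, size_t stream_len,
                uint8_t *out, size_t out_cap, size_t *payload_len)
{
  uint32_t len;

  if (stream_len < 4) return RECORD_TRUNCATED;        /* no room for header */

  len = (uint32_t)stream[0] |
        ((uint32_t)stream[1] << 8) |
        ((uint32_t)stream[2] << 16) |
        ((uint32_t)stream[3] << 24);

  if (len > stream_len - 4) return RECORD_TRUNCATED;  /* length claims more
                                                         than was supplied */
  if (len > out_cap) return RECORD_TOO_BIG;           /* caller buffer small */

  memcpy(out, stream + 4, len);
  *payload_len = len;
  return RECORD_OK;
}
```

Without the second check, a length of 200 in a 7-byte stream would make memcpy() read 196 bytes that do not belong to the stream, which is the undefined behaviour the warning is about.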
SAVING COMPILED PATTERNS SAVING COMPILED PATTERNS
@ -9211,8 +9182,8 @@ AUTHOR
REVISION REVISION
Last updated: 24 May 2016 Last updated: 21 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2_CONFIG 3 "20 April 2014" "PCRE2 10.0" .TH PCRE2_CONFIG 3 "24 March 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -31,10 +31,13 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \eR matches by default: PCRE2_CONFIG_BSR Indicates what \eR matches by default:
PCRE2_BSR_UNICODE PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF PCRE2_BSR_ANYCRLF
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
.\" JOIN
PCRE2_CONFIG_JIT Availability of just-in-time compiler PCRE2_CONFIG_JIT Availability of just-in-time compiler
support (1=yes 0=no) support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information about the target archi- .\" JOIN
tecture for the JIT compiler PCRE2_CONFIG_JITTARGET Information (a string) about the target
architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4) PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
PCRE2_CONFIG_NEWLINE Code for the default newline sequence: PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
@ -44,9 +47,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_NEWLINE_ANY PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF PCRE2_NEWLINE_ANYCRLF
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
0=heap) .\" JOIN
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no) 0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string) PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)

View File

@ -1,4 +1,4 @@
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23" .TH PCRE2_DFA_MATCH 3 "24 March 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -19,8 +19,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
This function matches a compiled regular expression against a given subject This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string string, using an alternative matching algorithm that scans the subject string
just once (\fInot\fP Perl-compatible). (The Perl-compatible matching function just once (except when processing lookaround assertions). This function is
is \fBpcre2_match()\fP.) The arguments for this function are: \fInot\fP Perl-compatible (the Perl-compatible matching function is
\fBpcre2_match()\fP). The arguments for this function are:
.sp .sp
\fIcode\fP Points to the compiled pattern \fIcode\fP Points to the compiled pattern
\fIsubject\fP Points to the subject string \fIsubject\fP Points to the subject string
@ -33,22 +34,26 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
\fIwscount\fP Number of elements in the vector \fIwscount\fP Number of elements in the vector
.sp .sp
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
up a callout function or specify the recursion limit. The \fIlength\fP and up a callout function or specify the recursion depth limit. The \fIlength\fP
\fIstartoffset\fP values are code units, not characters. The options are: and \fIstartoffset\fP values are code units, not characters. The options are:
.sp .sp
PCRE2_ANCHORED Match only at the first position PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match PCRE2_NOTEMPTY An empty string is not a valid match
.\" JOIN
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match is not a valid match
.\" JOIN
PCRE2_NO_UTF_CHECK Do not check the subject for UTF PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF validity (only relevant if PCRE2_UTF
was set at compile time) was set at compile time)
.\" JOIN
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
match even if there is a full match
.\" JOIN
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found match if no full matches are found
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
even if there is a full match as well
PCRE2_DFA_RESTART Restart after a partial match PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match PCRE2_DFA_SHORTEST Return only the shortest match
.sp .sp

View File

@ -1,4 +1,4 @@
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22" .TH PCRE2_GET_ERROR_MESSAGE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -22,11 +22,11 @@ errors are negative numbers. The arguments are:
\fIbuffer\fP where to put the message \fIbuffer\fP where to put the message
\fIbufflen\fP the length of the buffer (code units) \fIbufflen\fP the length of the buffer (code units)
.sp .sp
The function returns the length of the message, excluding the trailing zero, or The function returns the length of the message in code units, excluding the
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
this case, the returned message is truncated (but still with a trailing zero). too small. In this case, the returned message is truncated (but still with a
If \fIerrorcode\fP does not contain a recognized error code number, the trailing zero). If \fIerrorcode\fP does not contain a recognized error code
negative value PCRE2_ERROR_BADDATA is returned. number, the negative value PCRE2_ERROR_BADDATA is returned.
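
The return-value contract in this paragraph (message length on success, a negative code with a truncated but still zero-terminated message when the buffer is too small, a different negative code for an unknown error number) is easy to get wrong on the calling side. The following is a toy model of that contract only. The function toy_get_error_message(), its three-entry message table and the TOY_* values are hypothetical stand-ins, not the PCRE2 function or its real error numbers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical error values, chosen only for this sketch. */
enum { TOY_ERROR_BADDATA = -1, TOY_ERROR_NOMEMORY = -2 };

/* Toy model of the documented contract:
     - unknown code      -> TOY_ERROR_BADDATA, buffer untouched
     - buffer too small  -> TOY_ERROR_NOMEMORY, message truncated but
                            still terminated by a zero
     - otherwise         -> length of the message, excluding the zero */
int toy_get_error_message(int errorcode, char *buffer, size_t bufflen)
{
  static const char *const messages[] = {
    "no error", "no match", "partial match"
  };
  const char *msg;
  size_t len;

  if (errorcode < 0 || errorcode > 2) return TOY_ERROR_BADDATA;
  if (bufflen == 0) return TOY_ERROR_NOMEMORY;   /* no room even for zero */

  msg = messages[errorcode];
  len = strlen(msg);

  if (len >= bufflen) {
    memcpy(buffer, msg, bufflen - 1);            /* keep what fits */
    buffer[bufflen - 1] = '\0';
    return TOY_ERROR_NOMEMORY;
  }

  memcpy(buffer, msg, len + 1);                  /* include trailing zero */
  return (int)len;
}
```

A caller following this contract should test for a negative result before treating the return value as a length; in the too-small case the buffer can still be printed, because it is always terminated.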
.P .P
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the
.\" HREF .\" HREF

View File

@ -1,4 +1,4 @@
.TH PCRE2_JIT_STACK_CREATE 3 "03 November 2014" "PCRE2 10.00" .TH PCRE2_JIT_STACK_CREATE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -20,10 +20,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling allocation. The result can be passed to the JIT run-time code by calling
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern, \fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
which can then be processed by \fBpcre2_match()\fP. If the "fast path" JIT which can then be processed by \fBpcre2_match()\fP or \fBpcre2_jit_match()\fP.
matcher, \fBpcre2_jit_match()\fP is used, the stack can be passed directly as A maximum stack size of 512K to 1M should be more than enough for any pattern.
an argument. A maximum stack size of 512K to 1M should be more than enough for For more details, see the
any pattern. For more details, see the
.\" HREF .\" HREF
\fBpcre2jit\fP \fBpcre2jit\fP
.\" .\"

View File

@ -1,4 +1,4 @@
.TH PCRE2_MAKETABLES 3 "21 October 2014" "PCRE2 10.00" .TH PCRE2_MAKETABLES 3 "24 March 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -12,10 +12,10 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.SH DESCRIPTION .SH DESCRIPTION
.rs .rs
.sp .sp
This function builds a set of character tables for character values less than This function builds a set of character tables for character code points that
256. These can be passed to \fBpcre2_compile()\fP in a compile context in order are less than 256. These can be passed to \fBpcre2_compile()\fP in a compile
to override the internal, built-in tables (which were either defaulted or made context in order to override the internal, built-in tables (which were either
by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the defaulted or made by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
.\" HREF .\" HREF
\fBpcre2_set_character_tables()\fP \fBpcre2_set_character_tables()\fP
.\" .\"

View File

@ -255,6 +255,9 @@ OPTIONS
directory like this is an immediate end-of-file; in others it directory like this is an immediate end-of-file; in others it
may provoke an error. may provoke an error.
--depth-limit=number
See --match-limit below.
-e pattern, --regex=pattern, --regexp=pattern -e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul- Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also tiple times in order to specify several patterns. It can also
@ -477,32 +480,24 @@ OPTIONS
no short form for this option. no short form for this option.
--match-limit=number --match-limit=number
Processing some regular expression patterns can require a Processing some regular expression patterns may take a very
very large amount of memory, leading in some cases to a pro- long time to search for all possible matching strings. Others
gram crash if not enough is available. Other patterns may may require a very large amount of memory. There are two
take a very long time to search for all possible matching options that set resource limits for matching.
strings. The pcre2_match() function that is called by
pcre2grep to do the matching has two parameters that can
limit the resources that it uses.
The --match-limit option provides a means of limiting The --match-limit option provides a means of limiting comput-
resource usage when processing patterns that are not going to ing resource usage when processing patterns that are not
match, but which have a very large number of possibilities in going to match, but which have a very large number of possi-
their search trees. The classic example is a pattern that bilities in their search trees. The classic example is a pat-
uses nested unlimited repeats. Internally, PCRE2 uses a func- tern that uses nested unlimited repeats. Internally, PCRE2
tion called match() which it calls repeatedly (sometimes has a counter that is incremented each time around its main
recursively). The limit set by --match-limit is imposed on processing loop. If the value set by --match-limit is
the number of times this function is called during a match, reached, an error occurs.
which has the effect of limiting the amount of backtracking
that can take place.
The --recursion-limit option is similar to --match-limit, but The --depth-limit option limits the depth of nested back-
instead of limiting the total number of times that match() is tracking points, which in turn limits the amount of memory
called, it limits the depth of recursive calls, which in turn that is used. This limit is of use only if it is set smaller
limits the amount of memory that can be used. The recursion than --match-limit.
depth is a smaller number than the total number of calls,
because not all calls to match() are recursive. This limit is
of use only if it is set smaller than --match-limit.
There are no short forms for these options. The default set- There are no short forms for these options. The default set-
tings are specified when the PCRE2 library is compiled, with tings are specified when the PCRE2 library is compiled, with
@ -834,9 +829,9 @@ MATCHING ERRORS
such errors, pcre2grep gives up. such errors, pcre2grep gives up.
The --match-limit option of pcre2grep can be used to set the overall The --match-limit option of pcre2grep can be used to set the overall
resource limit; there is a second option called --recursion-limit that resource limit; there is a second option called --depth-limit that sets
sets a limit on the amount of memory (usually stack) that is used (see a limit on the amount of memory that is used (see the discussion of
the discussion of these options above). these options above).
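
The difference between the two limits can be seen in a toy backtracking matcher. The sketch below is a glob-style matcher ('?' and '*' only), not PCRE2's matching engine, and every name in it (toy_match, toy_limits, the TOY_* codes) is an assumption made for illustration. The step counter plays the role of the match limit (total work), and the nesting of pending '*' choices plays the role of the depth limit (memory held for backtracking points).

```c
#include <stddef.h>

enum { TOY_NOMATCH = 0, TOY_MATCH = 1,
       TOY_ERROR_MATCHLIMIT = -1, TOY_ERROR_DEPTHLIMIT = -2 };

typedef struct {
  unsigned long match_limit;   /* maximum number of matcher steps */
  unsigned long depth_limit;   /* maximum nesting of backtracking points */
  unsigned long steps;         /* steps used so far; set to 0 before a match */
} toy_limits;

/* '?' matches any one character, '*' matches any run of characters.
   Call with depth 0. Errors propagate straight out of the recursion. */
int toy_match(const char *pat, const char *sub, unsigned long depth,
              toy_limits *lim)
{
  for (;;) {
    /* One step per trip around the main loop: the "match limit". */
    if (++lim->steps > lim->match_limit) return TOY_ERROR_MATCHLIMIT;

    if (*pat == '*') {
      /* A new backtracking point nests inside those still pending:
         the "depth limit". */
      if (depth >= lim->depth_limit) return TOY_ERROR_DEPTHLIMIT;
      for (;;) {
        int rc = toy_match(pat + 1, sub, depth + 1, lim);
        if (rc != TOY_NOMATCH) return rc;       /* match, or a limit error */
        if (*sub == '\0') return TOY_NOMATCH;   /* nothing left to absorb */
        sub++;                                  /* backtrack: absorb one more */
      }
    }

    if (*pat == '\0') return (*sub == '\0') ? TOY_MATCH : TOY_NOMATCH;
    if (*sub == '\0') return TOY_NOMATCH;
    if (*pat != '?' && *pat != *sub) return TOY_NOMATCH;
    pat++;
    sub++;
  }
}
```

In this model a pattern such as "*a*a*a*a*b" against a subject consisting only of the letter "a" cannot match, but it tries thousands of combinations before giving up, so a small match_limit stops it early. The depth can never exceed the number of steps, which mirrors the remark that the depth limit is useful only when it is set smaller than the match limit.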
DIAGNOSTICS DIAGNOSTICS
@ -862,5 +857,5 @@ AUTHOR
REVISION REVISION
Last updated: 31 December 2016 Last updated: 21 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.

View File

@ -91,13 +91,13 @@ INPUT ENCODING
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case, ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
values greater than 0xffff cause an error to occur). values greater than 0xffff cause an error to occur).
UTF-8 is not capable of encoding values greater than 0x7fffffff, but UTF-8 (in its original definition) is not capable of encoding values
such values can be handled by the 32-bit library. When testing this greater than 0x7fffffff, but such values can be handled by the 32-bit
library in non-UTF mode with utf8_input set, if any character is pre- library. When testing this library in non-UTF mode with utf8_input set,
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000 if any character is preceded by the byte 0xff (which is an illegal byte
is added to the character's value. This is the only way of passing such in UTF-8) 0x80000000 is added to the character's value. This is the
code points in a pattern string. For subject strings, using an escape only way of passing such code points in a pattern string. For subject
sequence is preferable. strings, using an escape sequence is preferable.
COMMAND LINE OPTIONS COMMAND LINE OPTIONS
@ -544,6 +544,7 @@ PATTERN MODIFIERS
/B bincode show binary code without lengths /B bincode show binary code without lengths
callout_info show callout information callout_info show callout information
debug same as info,fullbincode debug same as info,fullbincode
framesize show matching frame size
fullbincode show binary code with lengths fullbincode show binary code with lengths
/I info show info about compiled pattern /I info show info about compiled pattern
hex unquoted characters are hexadecimal hex unquoted characters are hexadecimal
@ -624,6 +625,10 @@ PATTERN MODIFIERS
last character. These lines are omitted if no starting or ending code last character. These lines are omitted if no starting or ending code
units are recorded. units are recorded.
The framesize modifier shows the size, in bytes, of the storage frames
used by pcre2_match() for handling backtracking. The size depends on
the number of capturing parentheses in the pattern.
The callout_info modifier requests information about all the callouts The callout_info modifier requests information about all the callouts
in the pattern. A list of them is output at the end of any other infor- in the pattern. A list of them is output at the end of any other infor-
mation that is requested. For each callout, either its number or string mation that is requested. For each callout, either its number or string
@ -959,6 +964,7 @@ SUBJECT MODIFIERS
callout_fail=<n>[:<m>] control callout failure callout_fail=<n>[:<m>] control callout failure
callout_none do not supply a callout function callout_none do not supply a callout function
copy=<number or name> copy captured substring copy=<number or name> copy captured substring
depth_limit=<n> set a depth limit
dfa use pcre2_dfa_match() dfa use pcre2_dfa_match()
find_limits find match and recursion limits find_limits find match and recursion limits
get=<number or name> extract captured substring get=<number or name> extract captured substring
@ -972,7 +978,7 @@ SUBJECT MODIFIERS
offset=<n> set starting offset offset=<n> set starting offset
offset_limit=<n> set offset limit offset_limit=<n> set offset limit
ovector=<n> set size of output vector ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit recursion_limit=<n> obsolete synonym for depth_limit
replace=<string> specify a replacement string replace=<string> specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
startoffset=<n> same as offset=<n> startoffset=<n> same as offset=<n>
@ -1188,32 +1194,31 @@ SUBJECT MODIFIERS
Providing a stack that is larger than the default 32K is necessary only Providing a stack that is larger than the default 32K is necessary only
for very complicated patterns. for very complicated patterns.
Setting match and recursion limits Setting match and depth limits
The match_limit and recursion_limit modifiers set the appropriate lim- The match_limit and depth_limit modifiers set the appropriate limits in
its in the match context. These values are ignored when the find_limits the match context. These values are ignored when the find_limits modi-
modifier is specified. fier is specified.
Finding minimum limits Finding minimum limits
If the find_limits modifier is present, pcre2test calls pcre2_match() If the find_limits modifier is present, pcre2test calls pcre2_match()
several times, setting different values in the match context via several times, setting different values in the match context via
pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds pcre2_set_match_limit() and pcre2_set_depth_limit() until it finds the
the minimum values for each parameter that allow pcre2_match() to com- minimum values for each parameter that allow pcre2_match() to complete
plete without error. without error.
If JIT is being used, only the match limit is relevant. If DFA matching If JIT is being used, only the match limit is relevant. If DFA matching
is being used, neither limit is relevant, and this modifier is ignored is being used, only the depth limit is relevant, but at present this
(with a warning message). modifier is ignored (with a warning message).
The match_limit number is a measure of the amount of backtracking that The match_limit number is a measure of the amount of backtracking that
takes place, and learning the minimum value can be instructive. For takes place, and learning the minimum value can be instructive. For
most simple matches, the number is quite small, but for patterns with most simple matches, the number is quite small, but for patterns with
very large numbers of matching possibilities, it can become large very very large numbers of matching possibilities, it can become large very
quickly with increasing length of subject string. The quickly with increasing length of subject string. The depth_limit num-
match_limit_recursion number is a measure of how much stack (or, if ber is a measure of how much memory for recording backtracking points
PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to is needed to complete the match attempt.
complete the match attempt.
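
The text says only that the matcher is called several times with different limit values until the minimum that works is found; it does not say how the values are chosen. One plausible strategy, shown here purely as an assumption, is to double the limit until a call succeeds and then bisect between the last failing and first succeeding values. The names find_min_limit(), try_fn and needs_at_least() are hypothetical, and the callback stands in for "run the match with this limit and report whether it completed without a limit error".

```c
#include <stdint.h>

/* Hypothetical callback: nonzero if the match attempt completed without
   reaching the limit, zero if the limit was hit. */
typedef int (*try_fn)(uint32_t limit, void *ctx);

/* Return the smallest limit (at least 1) for which try_with_limit()
   succeeds, or 0 if even UINT32_MAX fails. Assumes success is monotonic:
   if some limit works, every larger limit works as well. */
uint32_t find_min_limit(try_fn try_with_limit, void *ctx)
{
  uint32_t lo = 0;   /* treated as failing; never passed to the callback */
  uint32_t hi = 1;   /* candidate being tested */

  /* Phase 1: double until a limit succeeds. */
  while (!try_with_limit(hi, ctx)) {
    if (hi == UINT32_MAX) return 0;
    lo = hi;
    hi = (hi > UINT32_MAX / 2) ? UINT32_MAX : hi * 2;
  }

  /* Phase 2: bisect; lo always fails, hi always succeeds. */
  while (hi - lo > 1) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (try_with_limit(mid, ctx)) hi = mid; else lo = mid;
  }
  return hi;
}

/* Stand-in for a real match attempt: succeeds once the limit reaches the
   threshold that ctx points to. */
int needs_at_least(uint32_t limit, void *ctx)
{
  return limit >= *(const uint32_t *)ctx;
}
```

With this scheme the number of match attempts grows with the logarithm of the answer, so even a minimum in the millions is found in a few dozen calls. The same routine would be run once per limit, holding the other limit at a value large enough not to interfere.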
Showing MARK names Showing MARK names
@ -1314,7 +1319,7 @@ DEFAULT OUTPUT FROM pcre2test
also output. Here is an example of an interactive pcre2test run. also output. Here is an example of an interactive pcre2test run.
$ pcre2test $ pcre2test
PCRE2 version 9.00 2014-05-10 PCRE2 version 10.22 2016-07-29
re> /^abc(\d+)/ re> /^abc(\d+)/
data> abc123 data> abc123
@ -1614,5 +1619,5 @@ AUTHOR
REVISION REVISION
Last updated: 28 December 2016 Last updated: 21 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.