Documentation update.

Philip.Hazel 2017-03-24 16:53:38 +00:00
parent 32bab50c01
commit 3aeb812180
24 changed files with 1293 additions and 1357 deletions


@ -1,10 +1,6 @@
Building PCRE2 without using autotools
--------------------------------------
This document has been converted from the PCRE1 document. I have removed a
number of sections about building in various environments, as they applied only
to PCRE1 and are probably out of date.
This document contains the following sections:
General
@ -183,21 +179,9 @@ can skip ahead to the CMake section.
STACK SIZE IN WINDOWS ENVIRONMENTS
The default processor stack size of 1Mb in some Windows environments is too
small for matching patterns that need much recursion. In particular, test 2 may
fail because of this. Normally, running out of stack causes a crash, but there
have been cases where the test program has just died silently. See your linker
documentation for how to increase stack size if you experience problems. If you
are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
compiler, you can increase the stack size for pcre2test and pcre2grep by
setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
example). The Linux default of 8Mb is a reasonable choice for the stack, though
even that can be too small for some pattern/subject combinations.
PCRE2 has a compile configuration option to disable the use of stack for
recursion so that heap is used instead. However, pattern matching is
significantly slower when this is done. There is more about stack usage in the
"pcre2stack" documentation.
Prior to release 10.30 the default system stack size of 1Mb in some Windows
environments caused issues with some tests. This should no longer be the case
for 10.30 and later releases.
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
recommended download site.
=============================
Last Updated: 13 October 2016
Last Updated: 17 March 2017


@ -15,8 +15,8 @@ subscribe or manage your subscription here:
https://lists.exim.org/mailman/listinfo/pcre-dev
Please read the NEWS file if you are upgrading from a previous release.
The contents of this README file are:
Please read the NEWS file if you are upgrading from a previous release. The
contents of this README file are:
The PCRE2 APIs
Documentation for PCRE2
@ -44,8 +44,8 @@ wrappers.
The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
man page). These can be found in a library called libpcre2-posix. Note that this
just provides a POSIX calling interface to PCRE2; the regular expressions
man page). These can be found in a library called libpcre2-posix. Note that
this just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.
@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
Building PCRE2 on non-Unix-like systems
---------------------------------------
For a non-Unix-like system, please read the comments in the file
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
"make" you may be able to build PCRE2 using autotools in the same way as for
many Unix-like systems.
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
your system supports the use of "configure" and "make" you may be able to build
PCRE2 using autotools in the same way as for many Unix-like systems.
PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error.
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
. If you do not want to make use of the default support for UTF-8 Unicode
character strings in the 8-bit library, UTF-16 Unicode character strings in
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
library, you can add --disable-unicode to the "configure" command. This
reduces the size of the libraries. It is not possible to configure one
library with Unicode support, and another without, in the same configuration.
It is also not possible to use --enable-ebcdic (see below) with Unicode
support, so if this option is set, you must also use --disable-unicode.
When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
not possible to use both --enable-unicode and --enable-ebcdic at the same
time.
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
--with-match-limit=500000
on the "configure" command. This is just the default; individual calls to
pcre2_match() can supply their own value. There is more discussion on the
pcre2api man page.
pcre2_match() can supply their own value. There is more discussion in the
pcre2api man page (search for pcre2_set_match_limit).
. There is a separate counter that limits the depth of recursive function calls
during a matching process. This also has a default of ten million, which is
essentially "unlimited". You can change the default by setting, for example,
. There is a separate counter that limits the depth of nested backtracking
during a matching process, which in turn limits the amount of memory that is
used. This also has a default of ten million, which is essentially
"unlimited". You can change the default by setting, for example,
--with-match-limit-recursion=500000
--with-match-limit-depth=5000
Recursive function calls use up the runtime stack; running out of stack can
cause programs to crash in strange ways. There is a discussion about stack
sizes in the pcre2stack man page.
There is more discussion in the pcre2api man page (search for
pcre2_set_depth_limit).
. In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the
@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used.
. You can build PCRE2 so that its internal match() function that is called from
pcre2_match() does not call itself recursively. Instead, it uses memory
blocks obtained from the heap to save data that would otherwise be saved on
the stack. To build PCRE2 like this, use
--disable-stack-for-recursion
on the "configure" command. PCRE2 runs more slowly in this mode, but it may
be necessary in environments with limited stack sizes. This applies only to
the normal execution of the pcre2_match() function; if JIT support is being
successfully used, it is not relevant. Equally, it does not apply to
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
discussion about stack sizes in the pcre2stack man page.
. For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of
tables for ASCII encoding that is part of the distribution. If you specify
@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
which caused pcre2_match() to use individual blocks on the heap for
backtracking instead of recursive function calls (which use the stack). This
is now obsolete since pcre2_match() was refactored always to use the heap (in
a much more efficient way than before). This option is retained for backwards
compatibility, but has no effect other than to output a warning.
The "configure" script builds the following files for the basic C library:
. Makefile the makefile that builds the library
@ -662,25 +654,32 @@ Unicode support is enabled.
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair are for general cases and Unicode support, respectively.
Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.
Test 14 contains a number of tests that must not be run with JIT. They check,
Test 14 contains some special UTF and UCP tests that give different output for
the different widths.
Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.
Test 15 is run only when JIT support is not available. It checks that an
Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.
Test 16 is run only when JIT support is available. It checks JIT complete and
Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.
Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.
Test 19 checks the serialization functions by writing a set of compiled
Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.
Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.
Character tables
----------------
@ -866,4 +865,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 01 November 2016
Last updated: 17 March 2017


@ -109,7 +109,7 @@ lose performance.
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
<b>pcre2_compile()</b>. This causes an compile time error if a pattern contains
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
a UTF-setting sequence.
</P>
<P>
@ -137,7 +137,8 @@ large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page.
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
be used to restrict the amount of memory that is used.
</P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
@ -166,7 +167,7 @@ listing), and the short pages for individual functions, are concatenated in
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage
pcre2stack discussion of stack and memory usage
pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support
@ -189,9 +190,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 16 October 2015
Last updated: 27 March 2017
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -36,20 +36,21 @@ for success and non-zero otherwise. The arguments are:
<i>callout_data</i> User data that is passed to the callback
</pre>
The <i>callback()</i> function is passed a pointer to a data block containing
the following fields:
the following fields (not necessarily in this order):
<pre>
<i>version</i> Block version number
<i>pattern_position</i> Offset to next item in pattern
<i>next_item_length</i> Length of next item in pattern
<i>callout_number</i> Number for numbered callouts
<i>callout_string_offset</i> Offset to string within pattern
<i>callout_string_length</i> Length of callout string
<i>callout_string</i> Points to callout string or is NULL
uint32_t <i>version</i> Block version number
uint32_t <i>callout_number</i> Number for numbered callouts
PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
PCRE2_SIZE <i>callout_string_length</i> Length of callout string
PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
</pre>
The second argument is the callout data that was passed to
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
for success. Any other value causes the pattern scan to stop, with the value
being passed back as the result of <b>pcre2_callout_enumerate()</b>.
The second argument passed to the <b>callback()</b> function is the callout data
that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
function must return zero for success. Any other value causes the pattern scan
to stop, with the value being passed back as the result of
<b>pcre2_callout_enumerate()</b>.
</P>
<P>
There is a complete description of the PCRE2 native API in the


@ -26,7 +26,9 @@ DESCRIPTION
</b><br>
<P>
This function frees the memory used for a compiled pattern, including any
memory used by the JIT compiler.
memory used by the JIT compiler. If the compiled pattern was created by a call
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
also freed.
</P>
<P>
There is a complete description of the PCRE2 native API in the


@ -37,19 +37,24 @@ arguments are:
<i>erroffset</i> Where to put an error offset
<i>ccontext</i> Pointer to a compile context or NULL
</pre>
The length of the string and any error offset that is returned are in code
units, not characters. A compile context is needed only if you want to change
The length of the pattern and any error offset that is returned are in code
units, not characters. A compile context is needed only if you want to provide
custom memory allocation functions, or to provide an external function for
system stack size checking, or to change one or more of these parameters:
<pre>
What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
What \R matches (Unicode newlines, or CR, LF, CRLF only);
PCRE2's character tables;
The newline character sequence;
The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed.
</pre>
or provide an external function for stack size checking. The option bits are:
The option bits are:
<pre>
PCRE2_ANCHORED Force pattern anchoring
PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
PCRE2_ALT_VERBNAMES Process backslashes in verb names
PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
@ -71,19 +76,21 @@ or provide an external function for stack size checking. The option bits are:
(only relevant if PCRE2_UTF is set)
PCRE2_UCP Use Unicode properties for \d, \w, etc.
PCRE2_UNGREEDY Invert greediness of quantifiers
PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
PCRE2_UTF Treat pattern and subjects as UTF strings
</pre>
PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP
and related options.
PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options.
</P>
<P>
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
</P>
<P>
There is a complete description of the PCRE2 native API in the
There is a complete description of the PCRE2 native API, with more detail on
each option, in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
page, and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>


@ -45,10 +45,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \R matches by default:
PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF
PCRE2_CONFIG_JIT Availability of just-in-time compiler
support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information about the target archi-
tecture for the JIT compiler
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
PCRE2_CONFIG_JIT Availability of just-in-time compiler support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information (a string) about the target architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
@ -58,11 +57,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
0=heap)
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no)
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes 0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
PCRE2_CONFIG_VERSION The PCRE2 version (a string)
</pre>


@ -31,8 +31,9 @@ DESCRIPTION
<P>
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
just once (<i>not</i> Perl-compatible). (The Perl-compatible matching function
is <b>pcre2_match()</b>.) The arguments for this function are:
just once (except when processing lookaround assertions). This function is
<i>not</i> Perl-compatible (the Perl-compatible matching function is
<b>pcre2_match()</b>). The arguments for this function are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
@ -45,22 +46,18 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
<i>wscount</i> Number of elements in the vector
</pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
up a callout function or specify the recursion limit. The <i>length</i> and
<i>startoffset</i> values are code units, not characters. The options are:
up a callout function or specify the recursion depth limit. The <i>length</i>
and <i>startoffset</i> values are code units, not characters. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
was set at compile time)
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
even if there is a full match as well
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match
</pre>


@ -34,11 +34,11 @@ errors are negative numbers. The arguments are:
<i>buffer</i> where to put the message
<i>bufflen</i> the length of the buffer (code units)
</pre>
The function returns the length of the message, excluding the trailing zero, or
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
this case, the returned message is truncated (but still with a trailing zero).
If <i>errorcode</i> does not contain a recognized error code number, the
negative value PCRE2_ERROR_BADDATA is returned.
The function returns the length of the message in code units, excluding the
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
too small. In this case, the returned message is truncated (but still with a
trailing zero). If <i>errorcode</i> does not contain a recognized error code
number, the negative value PCRE2_ERROR_BADDATA is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the


@ -32,10 +32,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
which can then be processed by <b>pcre2_match()</b>. If the "fast path" JIT
matcher, <b>pcre2_jit_match()</b> is used, the stack can be passed directly as
an argument. A maximum stack size of 512K to 1M should be more than enough for
any pattern. For more details, see the
which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
A maximum stack size of 512K to 1M should be more than enough for any pattern.
For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
</P>


@ -25,10 +25,10 @@ SYNOPSIS
DESCRIPTION
</b><br>
<P>
This function builds a set of character tables for character values less than
256. These can be passed to <b>pcre2_compile()</b> in a compile context in order
to override the internal, built-in tables (which were either defaulted or made
by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
This function builds a set of character tables for character code points that
are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
context in order to override the internal, built-in tables (which were either
defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
page. You might want to do this if you are using a non-standard locale.
</P>


@ -2575,8 +2575,8 @@ The internal recursion limit was reached.
A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
is passed as the first argument, with the remaining two arguments specifying a
code unit buffer and its length, into which the text message is placed. Note
that the message is returned in code units of the appropriate width for the
code unit buffer and its length in code units, into which the text message is
placed. The message is returned in code units of the appropriate width for the
library that is being used.
</P>
<P>
@ -3265,9 +3265,9 @@ Cambridge, England.
</P>
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 December 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -280,6 +280,10 @@ operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error.
</P>
<P>
<b>--depth-limit</b>=<i>number</i>
See <b>--match-limit</b> below.
</P>
<P>
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a
@ -498,29 +502,22 @@ used. There is no short form for this option.
</P>
<P>
<b>--match-limit</b>=<i>number</i>
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
do the matching has two parameters that can limit the resources that it uses.
Processing some regular expression patterns may take a very long time to search
for all possible matching strings. Others may require a very large amount of
memory. There are two options that set resource limits for matching.
<br>
<br>
The <b>--match-limit</b> option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a very
large number of possibilities in their search trees. The classic example is a
pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function
called <b>match()</b> which it calls repeatedly (sometimes recursively). The
limit set by <b>--match-limit</b> is imposed on the number of times this
function is called during a match, which has the effect of limiting the amount
of backtracking that can take place.
The <b>--match-limit</b> option provides a means of limiting computing resource
usage when processing patterns that are not going to match, but which have a
very large number of possibilities in their search trees. The classic example
is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
counter that is incremented each time around its main processing loop. If the
value set by <b>--match-limit</b> is reached, an error occurs.
<br>
<br>
The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but
instead of limiting the total number of times that <b>match()</b> is called, it
limits the depth of recursive calls, which in turn limits the amount of memory
that can be used. The recursion depth is a smaller number than the total number
of calls, because not all calls to <b>match()</b> are recursive. This limit is
of use only if it is set smaller than <b>--match-limit</b>.
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
which in turn limits the amount of memory that is used. This limit is of use
only if it is set smaller than <b>--match-limit</b>.
<br>
<br>
There are no short forms for these options. The default settings are specified
@ -843,9 +840,9 @@ there are more than 20 such errors, <b>pcre2grep</b> gives up.
</P>
<P>
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
overall resource limit; there is a second option called <b>--recursion-limit</b>
that sets a limit on the amount of memory (usually stack) that is used (see the
discussion of these options above).
overall resource limit; there is a second option called <b>--depth-limit</b>
that sets a limit on the amount of memory that is used (see the discussion of
these options above).
</P>
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
<P>
@ -870,9 +867,9 @@ Cambridge, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
Last updated: 31 December 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -170,20 +170,24 @@ the application to apply the JIT optimization by calling
<b>pcre2_jit_compile()</b> is ignored.
</P>
<br><b>
Setting match and recursion limits
Setting match and backtracking depth limits
</b><br>
<P>
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
internal <b>match()</b> function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
gives an error return. The limits can also be set by items at the start of the
pattern of the form
The <b>pcre2_match()</b> function contains a counter that is incremented every time it
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
this counter, which therefore limits the amount of computing resource used for
a match. The maximum depth of nested backtracking can also be limited, and this
restricts the amount of heap memory that is used.
</P>
<P>
These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees (a typical example is a pattern with nested
unlimited repeats applied to a long string that does not match). When one of
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
can also be set by items at the start of the pattern of the form
<pre>
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
(*LIMIT_DEPTH=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
@ -192,10 +196,15 @@ limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
</P>
<P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
</P>
<P>
The match limit is used (but in a different way) when JIT is being used, but it
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
However, the recursion limit is relevant for DFA matching, which does use some
function recursion, in particular, for recursions within the pattern.
However, the depth limit is relevant for DFA matching, which uses function
recursion for recursions within the pattern. In this case, the depth limit
controls the amount of system stack that is used.
<a name="newlines"></a></P>
<br><b>
Newline conventions
@ -235,8 +244,8 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
what the \R escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the
description of \R in the section entitled
sequence, for Perl compatibility. However, this can be changed; see the next
section and the description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
convention.
@ -254,7 +263,7 @@ corresponding to PCRE2_BSR_UNICODE.
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P>
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code rather than ASCII or Unicode (typically a mainframe system). In
character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
@ -318,11 +327,11 @@ that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
</P>
<P>
For example, if you want to match a * character, you write \* in the pattern.
This escaping action applies whether or not the following character would
otherwise be interpreted as a metacharacter, so it is always safe to precede a
non-alphanumeric with backslash to specify that it stands for itself. In
particular, if you want to match a backslash, you write \\.
For example, if you want to match a * character, you must write \* in the
pattern. This escaping action applies whether or not the following character
would otherwise be interpreted as a metacharacter, so it is always safe to
precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \\.
</P>
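<P>
The rule that a backslash may safely precede any non-alphanumeric character
suggests a simple way of quoting literal text. The sketch below is an
assumption-laden illustration rather than a PCRE2 facility: quote_literal is a
hypothetical helper, it escapes only ASCII non-alphanumerics, and Python's own
re module is used merely as a convenient stand-in to check the result.
</P>

```python
import re


def quote_literal(text):
    """Put a backslash before every ASCII non-alphanumeric character."""
    out = []
    for ch in text:
        # ASCII letters and digits are left alone, because a backslash
        # in front of them may have a special meaning.
        if ch.isascii() and not ch.isalnum():
            out.append("\\")
        out.append(ch)
    return "".join(out)


sample = "1+1=2? (yes) [a\\b] x_y"
# The quoted form matches exactly the original text and nothing else.
assert re.fullmatch(quote_literal(sample), sample) is not None
```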
<P>
In a UTF mode, only ASCII numbers and letters have any special meaning after a
@ -353,7 +362,7 @@ An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is not
terminated.
terminated by a closing square bracket.
<a name="digitsafterbackslash"></a></P>
<br><b>
Non-printing characters
@ -476,9 +485,9 @@ a hexadecimal digit appears between \x{ and }, or if there is no terminating
<P>
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode mode, support for code points
greater than 256 is provided by \u, which must be followed by four hexadecimal
digits; otherwise it matches a literal "u" character.
matches a literal "x" character. In this mode, support for code points greater
than 255 is provided by \u, which must be followed by four hexadecimal digits;
otherwise it matches a literal "u" character.
</P>
<P>
Characters whose value is less than 256 can be defined by either of the two
@ -493,12 +502,10 @@ Constraints on character values
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
<pre>
8-bit non-UTF mode less than 0x100
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
16-bit non-UTF mode less than 0x10000
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
32-bit non-UTF mode less than 0x100000000
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
8-bit non-UTF mode no greater than 0xff
16-bit non-UTF mode no greater than 0xffff
32-bit non-UTF mode no greater than 0xffffffff
All UTF modes no greater than 0x10ffff and a valid codepoint
</pre>
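The limits in the table can be expressed as a small check. This Python sketch
is illustrative only: max_code_point and is_allowed are hypothetical names, and
the only validity test modelled for the UTF modes is the surrogate range 0xd800
to 0xdfff.

```python
def max_code_point(code_unit_width, utf):
    """Largest value allowed for a given code unit width and mode."""
    if code_unit_width not in (8, 16, 32):
        raise ValueError("code unit width must be 8, 16, or 32")
    if utf:
        return 0x10FFFF  # the same ceiling applies in every UTF mode
    return (1 << code_unit_width) - 1  # 0xff, 0xffff, or 0xffffffff


def is_allowed(value, code_unit_width, utf):
    """Check a numerically specified character against the table."""
    if value < 0 or value > max_code_point(code_unit_width, utf):
        return False
    # Surrogates are rejected only in the UTF modes.
    if utf and 0xD800 <= value <= 0xDFFF:
        return False
    return True
```

Note that in this model a value such as 0xd800 is accepted in 16-bit non-UTF
mode, because the "valid codepoint" condition applies only to the UTF modes.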
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
"surrogate" codepoints), and 0xffef.
@ -525,7 +532,7 @@ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \U matches a "U" character, and \u can be used to define a character
by code point, as described in the previous section.
by code point, as described above.
</P>
<br><b>
Absolute and relative back references
@ -714,7 +721,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
The extra escape sequences are:
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
with an unassigned type. The extra escape sequences are:
<pre>
\p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property
@ -2214,16 +2223,8 @@ except that it does not cause the current matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an assertion
contains capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
capturing is carried out only for positive assertions. (Perl sometimes, but not
always, does do capturing in negative assertions.)
</P>
<P>
WARNING: If a positive assertion containing one or more capturing subpatterns
succeeds, but failure to match later in the pattern causes backtracking over
this assertion, the captures within the assertion are reset only if no higher
numbered captures are already set. This is, unfortunately, a fundamental
limitation of the current implementation; it may get removed in a future
reworking.
capturing is normally carried out only for positive assertions (but see the
discussion of conditional subpatterns below).
</P>
<P>
For compatibility with Perl, most assertion subpatterns may be repeated; though
@ -2601,6 +2602,12 @@ presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
</P>
<P>
For Perl compatibility, if an assertion that is a condition contains capturing
subpatterns, any capturing that occurs is retained afterwards, for both
positive and negative assertions. (Compare non-conditional assertions, when
captures are retained only for positive assertions.)
<a name="comments"></a></P>
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
<P>
@ -2773,93 +2780,57 @@ is the actual recursive call.
Differences in recursion processing between PCRE2 and Perl
</b><br>
<P>
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
(like Python, but unlike Perl), a recursive subpattern call is always treated
as an atomic group. That is, once it has matched some of the subject string, it
is never re-entered, even if it contains untried alternatives and there is a
subsequent matching failure. This can be illustrated by the following pattern,
which purports to match a palindromic string that contains an odd number of
characters (for example, "a", "aba", "abcba", "abcdcba"):
<pre>
^(.|(.)(?1)\2)$
</pre>
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
it does not if the pattern is longer than three characters. Consider the
subject string "abcba":
Some former differences between PCRE2 and Perl no longer exist.
</P>
<P>
At the top level, the first character is matched, but as it is not at the end
of the string, the first alternative fails; the second alternative is taken
and the recursion kicks in. The recursive call to subpattern 1 successfully
matches the next character ("b"). (Note that the beginning and end of line
tests are not part of the recursion).
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
a recursive subpattern call was always treated as an atomic group. That is,
once it had matched some of the subject string, it was never re-entered, even
if it contained untried alternatives and there was a subsequent matching
failure. (Historical note: PCRE implemented recursion before Perl did.)
</P>
<P>
Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion is
treated as an atomic group, there are now no backtracking points, and so the
entire match fails. (Perl is able, at this point, to re-enter the recursion and
try the second alternative.) However, if the pattern is written with the
alternatives in the other order, things are different:
<pre>
^((.)(?1)\2|.)$
</pre>
This time, the recursing alternative is tried first, and continues to recurse
until it runs out of characters, at which point the recursion fails. But this
time we do have another alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at a deeper
recursion level, which PCRE2 cannot use.
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Perl works. If you want a subroutine call to be atomic, you must explicitly
enclose it in an atomic group.
</P>
<P>
To change the pattern so that it matches all palindromic strings, not just
those with an odd number of characters, it is tempting to change the pattern to
this:
Supporting backtracking into recursions simplifies certain types of recursive
pattern. For example, this pattern matches palindromic strings:
<pre>
^((.)(?1)\2|.?)$
</pre>
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
deeper recursion has matched a single character, it cannot be entered again in
order to match an empty string. The solution is to separate the two cases, and
write out the odd and even cases as alternatives at the higher level:
The second branch in the group matches a single central character in the
palindrome when there are an odd number of characters, or nothing when there
are an even number of characters, but in order to work it has to be able to try
the second case when the rest of the pattern match fails. If you want to match
typical palindromic phrases, the pattern has to ignore all non-word characters,
which can be done like this:
<pre>
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
</pre>
If you want to match typical palindromic phrases, the pattern has to ignore all
non-word characters, which can be done like this:
<pre>
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
</pre>
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
use of the possessive quantifier *+ to avoid backtracking into sequences of
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
or more) to match typical phrases, and Perl takes so long that you think it has
gone into a loop.
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
avoid backtracking into sequences of non-word characters. Without this, PCRE2
takes a great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
</P>
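<P>
The difference between atomic and re-enterable recursion can be modelled with
generators. The following Python sketch hand-codes the palindrome pattern with
the branches (.)(?1)\2 and .? and is an illustration under assumptions, not
PCRE2's implementation: group1_ends and is_palindrome are hypothetical names,
the subject is assumed to contain no newlines, and the atomic flag imitates the
older behaviour by keeping only the first result of each recursive call.
</P>

```python
from itertools import islice


def group1_ends(subject, start, atomic):
    """Yield every offset at which group 1 can end when entered at start."""
    n = len(subject)
    # First branch: one character, a recursive call, then the same character.
    if start < n:
        first = subject[start]
        inner = group1_ends(subject, start + 1, atomic)
        if atomic:
            # An atomic call commits to its first match and is never re-entered.
            inner = islice(inner, 1)
        for end in inner:
            if end < n and subject[end] == first:
                yield end + 1
    # Second branch: an optional single character (tried greedily).
    if start < n:
        yield start + 1
    yield start


def is_palindrome(subject, atomic=False):
    """Anchored match: group 1 must end exactly at the end of the subject."""
    return any(end == len(subject) for end in group1_ends(subject, 0, atomic))
```

<P>
With atomic=False the sketch accepts "ababa"; with atomic=True it rejects that
string, because the recursion commits to the shorter palindrome "aba" and
cannot be re-entered.
</P>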
<P>
<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
string does not start with a palindrome that is shorter than the entire string.
For example, although "abcba" is correctly matched, if the subject is "ababa",
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
the end of the string does not follow. Once again, it cannot jump back into the
recursion to try other alternatives, so the entire match fails.
</P>
<P>
The second way in which PCRE2 and Perl differ in their recursion processing is
in the handling of captured values. In Perl, when a subpattern is called
recursively or as a subpattern (see the next section), it has no access to any
values that were captured outside the recursion, whereas in PCRE2 these values
can be referenced. Consider this pattern:
Another way in which PCRE2 and Perl used to differ in their recursion
processing is in the handling of captured values. Formerly in Perl, when a
subpattern was called recursively or as a subroutine (see the next section), it
had no access to any values that were captured outside the recursion, whereas
in PCRE2 these values can be referenced. Consider this pattern:
<pre>
^(.)(\1|a(?2))
</pre>
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
then in the second group, when the back reference \1 fails to match "b", the
second alternative matches "a" and then recurses. In the recursion, \1 does
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
match because inside the recursive call \1 cannot access the externally set
value.
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the back reference \1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
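
The role of the outer capture can be traced with a hand-coded model of the
pattern above. This Python sketch is only an illustration: group2_ends and
matches are hypothetical names, the subject is assumed to contain no newlines,
and the captures_visible flag models the former Perl behaviour by treating \1
as unset inside the recursive call.

```python
def group2_ends(subject, start, group1_text, captures_visible, in_recursion=False):
    """Yield every offset at which group 2 can end when entered at start."""
    # Inside a recursion, \1 is usable only if outer captures are visible.
    backref = group1_text if (captures_visible or not in_recursion) else None
    # First alternative: the back reference \1.
    if backref is not None and subject.startswith(backref, start):
        yield start + len(backref)
    # Second alternative: a literal "a" followed by a recursive call of group 2.
    if subject.startswith("a", start):
        yield from group2_ends(
            subject, start + 1, group1_text, captures_visible, in_recursion=True
        )


def matches(subject, captures_visible=True):
    """Model ^(.)(\\1|a(?2)): group 1 takes one character, then group 2 runs."""
    if not subject:
        return False
    group1_text = subject[0]
    ends = group2_ends(subject, 1, group1_text, captures_visible)
    return any(True for _ in ends)
```

In this model "bab" matches when the captured value is visible inside the
recursion and fails when it is not.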
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
@ -2886,11 +2857,10 @@ is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above.
</P>
<P>
All subroutine calls, whether recursive or not, are always treated as atomic
groups. That is, once a subroutine has matched some of the subject string, it
is never re-entered, even if it contains untried alternatives and there is a
subsequent matching failure. Any capturing parentheses that are set during the
subroutine call revert to their previous values afterwards.
Like recursions, subroutine calls used to be treated as atomic, but this
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
occur. However, any capturing parentheses that are set during the subroutine
call revert to their previous values afterwards.
</P>
<P>
Processing options such as case-independence are fixed when a subpattern is
@ -2998,17 +2968,10 @@ The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
are still described in the Perl documentation as "experimental and subject to
change or removal in a future version of Perl". It goes on to say: "Their usage
in production code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE2 features described in this section.
</P>
<P>
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
(*VERB:NAME). Some verbs take either form, possibly behaving differently
depending on whether or not a name is present.
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
possibly behaving differently depending on whether or not a name is present.
</P>
<P>
By default, for compatibility with Perl, a name is any sequence of characters
@ -3040,7 +3003,7 @@ not there. Any number of these verbs may occur in a pattern.
<P>
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
function, because these use a backtracking algorithm. With the exception of
function, because that uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function.
</P>
@ -3178,11 +3141,11 @@ Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is no subsequent match, causing a backtrack to
the verb, a failure is forced. That is, backtracking cannot pass to the left of
the verb. However, when one of these verbs appears inside an atomic group
(which includes any group that is called as a subroutine) or in an assertion
that is true, its effect is confined to that group, because once the group has
been matched, there is never any backtracking into it. In this situation,
backtracking has to jump to the left of the entire atomic group or assertion.
the verb. However, when one of these verbs appears inside an atomic group or in
an assertion that is true, its effect is confined to that group, because once
the group has been matched, there is never any backtracking into it. In this
situation, backtracking has to jump to the left of the entire atomic group or
assertion.
</P>
<P>
These verbs differ in exactly what kind of failure occurs when backtracking
@ -3246,8 +3209,8 @@ expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
</P>
<P>
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
It is like (*MARK:NAME) in that the name is remembered for passing back to the
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by (*PRUNE) or (*THEN).
<pre>
@ -3452,9 +3415,9 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 December 2016
Last updated: 18 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -55,7 +55,10 @@ The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
complete validation of what is being re-loaded.
complete validation of what is being re-loaded. Corrupted data may cause
undefined results. For example, if the length field of a pattern in the
serialized data is corrupted, the deserializing code may read beyond the end of
the byte stream that is passed to it.
</P>
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<P>
@ -190,9 +193,9 @@ Cambridge, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 May 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -126,12 +126,13 @@ character values up to 0x7fffffff. Each character is placed in one 16-bit or
to occur).
</P>
<P>
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
values can be handled by the 32-bit library. When testing this library in
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
character's value. This is the only way of passing such code points in a
pattern string. For subject strings, using an escape sequence is preferable.
UTF-8 (in its original definition) is not capable of encoding values greater
than 0x7fffffff, but such values can be handled by the 32-bit library. When
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
0x80000000 is added to the character's value. This is the only way of passing
such code points in a pattern string. For subject strings, using an escape
sequence is preferable.
</P>
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
@ -602,6 +603,7 @@ about the pattern:
/B bincode show binary code without lengths
callout_info show callout information
debug same as info,fullbincode
framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@ -689,6 +691,11 @@ not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
</P>
<P>
The <b>framesize</b> modifier shows the size, in bytes, of the storage frames
used by <b>pcre2_match()</b> for handling backtracking. The size depends on the
number of capturing parentheses in the pattern.
</P>
<P>
The <b>callout_info</b> modifier requests information about all the callouts in
the pattern. A list of them is output at the end of any other information that
is requested. For each callout, either its number or string is given, followed
@ -1073,6 +1080,7 @@ pattern.
callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure
callout_none do not supply a callout function
copy=&#60;number or name&#62; copy captured substring
depth_limit=&#60;n&#62; set a depth limit
dfa use <b>pcre2_dfa_match()</b>
find_limits find match and recursion limits
get=&#60;number or name&#62; extract captured substring
@ -1086,7 +1094,7 @@ pattern.
offset=&#60;n&#62; set starting offset
offset_limit=&#60;n&#62; set offset limit
ovector=&#60;n&#62; set size of output vector
recursion_limit=&#60;n&#62; set a recursion limit
recursion_limit=&#60;n&#62; obsolete synonym for depth_limit
replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant
startoffset=&#60;n&#62; same as offset=&#60;n&#62;
@ -1320,10 +1328,10 @@ stack that is larger than the default 32K is necessary only for very
complicated patterns.
</P>
<br><b>
Setting match and recursion limits
Setting match and depth limits
</b><br>
<P>
The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
The <b>match_limit</b> and <b>depth_limit</b> modifiers set the appropriate
limits in the match context. These values are ignored when the
<b>find_limits</b> modifier is specified.
</P>
@ -1333,23 +1341,23 @@ Finding minimum limits
<P>
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
<b>pcre2_match()</b> several times, setting different values in the match
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_depth_limit()</b>
until it finds the minimum values for each parameter that allow
<b>pcre2_match()</b> to complete without error.
</P>
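<P>
One way of searching for such a minimum is to double a trial limit until the
match completes and then bisect. The Python sketch below shows that idea only;
it is not a description of how <b>pcre2test</b> performs the search. The name
find_minimum_limit, the completes callback, and the default upper bound are
assumptions, and the callback is assumed to be monotonic (once a limit is large
enough, every larger limit also works).
</P>

```python
def find_minimum_limit(completes, upper=0xFFFFFFFF):
    """Return the smallest limit in [1, upper] for which completes(limit)
    is true, or None if even the upper bound is not enough."""
    failed = 0  # largest limit known to be too small
    high = 1
    # Doubling phase: grow the trial limit until the match completes.
    while not completes(high):
        if high >= upper:
            return None
        failed = high
        high = min(high * 2, upper)
    # Bisection phase: the answer lies in the range failed+1 .. high.
    low = failed + 1
    while low < high:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid + 1
    return high
```

<P>
For example, if a match needs a limit of at least 1000, the doubling phase
stops at 1024 and the bisection then narrows the range 513 to 1024 down to
1000.
</P>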
<P>
If JIT is being used, only the match limit is relevant. If DFA matching is
being used, neither limit is relevant, and this modifier is ignored (with a
warning message).
being used, only the depth limit is relevant, but at present this modifier is
ignored (with a warning message).
</P>
<P>
The <i>match_limit</i> number is a measure of the amount of backtracking
that takes place, and learning the minimum value can be instructive. For most
simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with
increasing length of subject string. The <i>match_limit_recursion</i> number is
a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
heap) memory is needed to complete the match attempt.
increasing length of subject string. The <i>depth_limit</i> number is
a measure of how much memory for recording backtracking points is needed to
complete the match attempt.
</P>
<br><b>
Showing MARK names
@ -1466,7 +1474,7 @@ code unit offset of the start of the failing character is also output. Here is
an example of an interactive <b>pcre2test</b> run.
<pre>
$ pcre2test
PCRE2 version 9.00 2014-05-10
PCRE2 version 10.22 2016-07-29
re&#62; /^abc(\d+)/
data&#62; abc123
@ -1779,9 +1787,9 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 December 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -89,8 +89,8 @@ SECURITY CONSIDERATIONS
One way of guarding against this possibility is to use the pcre2_pat-
tern_info() function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
calling pcre2_compile(). This causes an compile time error if a pattern
contains a UTF-setting sequence.
calling pcre2_compile(). This causes a compile time error if the pat-
tern contains a UTF-setting sequence.
The use of Unicode properties for character types such as \d can also
be enabled from within the pattern, by specifying "(*UCP)". This fea-
@ -112,7 +112,9 @@ SECURITY CONSIDERATIONS
has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
vides some protection against this: see the pcre2_set_match_limit()
function in the pcre2api page.
function in the pcre2api page. There is a similar function called
pcre2_set_depth_limit() that can be used to restrict the amount of mem-
ory that is used.
USER DOCUMENTATION
@ -144,7 +146,7 @@ USER DOCUMENTATION
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage
pcre2stack discussion of stack and memory usage
pcre2syntax quick syntax reference
pcre2test description of the pcre2test command
pcre2unicode discussion of Unicode and UTF support
@ -166,8 +168,8 @@ AUTHOR
REVISION
Last updated: 16 October 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 27 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@ -2533,9 +2535,10 @@ OBTAINING A TEXTUAL ERROR MESSAGE
A text message for an error code from any PCRE2 function (compile,
match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
sage(). The code is passed as the first argument, with the remaining
two arguments specifying a code unit buffer and its length, into which
the text message is placed. Note that the message is returned in code
units of the appropriate width for the library that is being used.
two arguments specifying a code unit buffer and its length in code
units, into which the text message is placed. The message is returned
in code units of the appropriate width for the library that is being
used.
The returned message is terminated with a trailing zero, and the func-
tion returns the number of code units used, excluding the trailing
@ -3178,8 +3181,8 @@ AUTHOR
REVISION
Last updated: 23 December 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 21 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@ -5519,19 +5522,24 @@ SPECIAL START-OF-PATTERN ITEMS
attempt by the application to apply the JIT optimization by calling
pcre2_jit_compile() is ignored.
Setting match and recursion limits
Setting match and backtracking depth limits
The caller of pcre2_match() can set a limit on the number of times the
internal match() function is called and on the maximum depth of recur-
sive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is
a pattern with nested unlimited repeats) and to avoid running out of
system stack by too much recursion. When one of these limits is
reached, pcre2_match() gives an error return. The limits can also be
set by items at the start of the pattern of the form
The pcre2_match() function contains a counter that is incremented every
time it goes round its main loop. The caller of pcre2_match() can set a
limit on this counter, which therefore limits the amount of computing
resource used for a match. The maximum depth of nested backtracking can
also be limited, and this restricts the amount of heap memory that is
used.
These facilities are provided to catch runaway matches that are pro-
voked by patterns with huge matching trees (a typical example is a pat-
tern with nested unlimited repeats applied to a long string that does
not match). When one of these limits is reached, pcre2_match() gives an
error return. The limits can also be set by items at the start of the
pattern of the form
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
(*LIMIT_DEPTH=d)
where d is any number of decimal digits. However, the value of the set-
ting must be less than the value set (or defaulted) by the caller of
@ -5540,11 +5548,15 @@ SPECIAL START-OF-PATTERN ITEMS
If there is more than one setting of one of these limits, the lower
value is used.
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
name is still recognized for backwards compatibility.
The match limit is used (but in a different way) when JIT is being
used, but it is not relevant, and is ignored, when matching with
pcre2_dfa_match(). However, the recursion limit is relevant for DFA
matching, which does use some function recursion, in particular, for
recursions within the pattern.
pcre2_dfa_match(). However, the depth limit is relevant for DFA match-
ing, which uses function recursion for recursions within the pattern.
In this case, the depth limit controls the amount of system stack that
is used.
Newline conventions
@ -5579,9 +5591,9 @@ SPECIAL START-OF-PATTERN ITEMS
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
it does not affect what the \R escape sequence matches. By default,
this is any Unicode newline sequence, for Perl compatibility. However,
this can be changed; see the description of \R in the section entitled
"Newline sequences" below. A change of \R setting can be combined with
a change of newline convention.
this can be changed; see the next section and the description of \R in
the section entitled "Newline sequences" below. A change of \R setting
can be combined with a change of newline convention.
Specifying what \R matches
@ -5595,7 +5607,7 @@ SPECIAL START-OF-PATTERN ITEMS
EBCDIC CHARACTER CODES
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code rather than ASCII or Unicode (typically a mainframe sys-
character code instead of ASCII or Unicode (typically a mainframe sys-
tem). In the sections below, character code values are ASCII or Uni-
code; in an EBCDIC environment these characters may have different code
values, and there are no code points greater than 255.
@ -5660,8 +5672,8 @@ BACKSLASH
meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes.
For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
For example, if you want to match a * character, you must write \* in
the pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a back-
@ -5695,7 +5707,8 @@ BACKSLASH
is not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an
error, because the character class is not terminated.
error, because the character class is not terminated by a closing
square bracket.
Non-printing characters
@ -5810,10 +5823,10 @@ BACKSLASH
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
just described only when it is followed by two hexadecimal digits. Oth-
erwise, it matches a literal "x" character. In this mode mode, support
for code points greater than 256 is provided by \u, which must be fol-
lowed by four hexadecimal digits; otherwise it matches a literal "u"
character.
erwise, it matches a literal "x" character. In this mode, support for
code points greater than 255 is provided by \u, which must be followed
by four hexadecimal digits; otherwise it matches a literal "u" charac-
ter.
Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
@ -5825,12 +5838,10 @@ BACKSLASH
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
8-bit non-UTF mode less than 0x100
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
16-bit non-UTF mode less than 0x10000
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
32-bit non-UTF mode less than 0x100000000
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
8-bit non-UTF mode no greater than 0xff
16-bit non-UTF mode no greater than 0xffff
32-bit non-UTF mode no greater than 0xffffffff
All UTF modes no greater than 0x10ffff and a valid codepoint
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
called "surrogate" codepoints), and 0xffef.
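The limits in the table above can be written as a small check. This is
a minimal sketch for illustration, not PCRE2 code: the function name
escape_value_ok and its parameters are assumptions, and the list of
invalid code points is copied from the sentence above rather than
verified against the library.

```python
# Illustrative only: mirrors the documented limits table, not PCRE2's
# internal checks.
MAX_BY_WIDTH = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}
UNICODE_MAX = 0x10FFFF


def escape_value_ok(value, width, utf):
    """Return True if `value` is acceptable for an octal or hex escape.

    width is the code unit width (8, 16 or 32); utf says whether a UTF
    mode is in force.
    """
    if width not in MAX_BY_WIDTH:
        raise ValueError("width must be 8, 16 or 32")
    if value < 0:
        return False
    if not utf:
        # Non-UTF modes are limited only by the code unit width.
        return value <= MAX_BY_WIDTH[width]
    # All UTF modes share the Unicode limit, whatever the width.
    if value > UNICODE_MAX:
        return False
    # Code points that the text lists as invalid in UTF modes.
    if 0xD800 <= value <= 0xDFFF or value == 0xFFEF:
        return False
    return True
```

For instance, escape_value_ok(0x100, 8, False) is false, whereas the
same value is accepted in 8-bit UTF mode.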
@ -5852,8 +5863,7 @@ BACKSLASH
handler and used to modify the case of following characters. By
default, PCRE2 does not support these escape sequences. However, if the
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
used to define a character by code point, as described in the previous
section.
used to define a character by code point, as described above.
Absolute and relative back references
@ -6022,7 +6032,10 @@ BACKSLASH
tional escape sequences that match characters with specific properties
are available. In 8-bit non-UTF-8 mode, these sequences are of course
limited to testing characters whose codepoints are less than 256, but
they do work in this mode. The extra escape sequences are:
they do work in this mode. In 32-bit non-UTF mode, codepoints greater
than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Common script and with an unassigned type. The
extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
@ -7328,16 +7341,9 @@ ASSERTIONS
Assertion subpatterns are not capturing subpatterns. If such an asser-
tion contains capturing subpatterns within it, these are counted for
the purposes of numbering the capturing subpatterns in the whole pat-
tern. However, substring capturing is carried out only for positive
assertions. (Perl sometimes, but not always, does do capturing in nega-
tive assertions.)
WARNING: If a positive assertion containing one or more capturing sub-
patterns succeeds, but failure to match later in the pattern causes
backtracking over this assertion, the captures within the assertion are
reset only if no higher numbered captures are already set. This is,
unfortunately, a fundamental limitation of the current implementation;
it may get removed in a future reworking.
tern. However, substring capturing is normally carried out only for
positive assertions (but see the discussion of conditional subpatterns
below).
For compatibility with Perl, most assertion subpatterns may be
repeated; though it makes no sense to assert the same thing several
@ -7686,6 +7692,12 @@ CONDITIONAL SUBPATTERNS
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
For Perl compatibility, if an assertion that is a condition contains
capturing subpatterns, any capturing that occurs is retained after-
wards, for both positive and negative assertions. (Compare non-condi-
tional assertions, when captures are retained only for positive asser-
tions.)
COMMENTS
@ -7849,94 +7861,59 @@ RECURSIVE PATTERNS
Differences in recursion processing between PCRE2 and Perl
Recursion processing in PCRE2 differs from Perl in two important ways.
In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. This can be
illustrated by the following pattern, which purports to match a palin-
dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):
Some former differences between PCRE2 and Perl no longer exist.
^(.|(.)(?1)\2)$
Before release 10.30, recursion processing in PCRE2 differed from Perl
in that a recursive subpattern call was always treated as an atomic
group. That is, once it had matched some of the subject string, it was
never re-entered, even if it contained untried alternatives and there
was a subsequent matching failure. (Historical note: PCRE implemented
recursion before Perl did.)
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE2 it does not if the pattern is longer than three characters.
Consider the subject string "abcba":
Starting with release 10.30, recursive subroutine calls are no longer
treated as atomic. That is, they can be re-entered to try unused alter-
natives if there is a matching failure later in the pattern. This is
now compatible with the way Perl works. If you want a subroutine call
to be atomic, you must explicitly enclose it in an atomic group.
At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat-
tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion).
Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to re-
enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE2 can-
not use.
To change the pattern so that it matches all palindromic strings, not
just those with an odd number of characters, it is tempting to change
the pattern to this:
Supporting backtracking into recursions simplifies certain types of
recursive pattern. For example, this pattern matches palindromic
strings:
^((.)(?1)\2|.?)$
Again, this works in Perl, but not in PCRE2, and for the same reason.
When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as alter-
natives at the higher level:
The second branch in the group matches a single central character in
the palindrome when there are an odd number of characters, or nothing
when there are an even number of characters, but in order to work it
has to be able to try the second case when the rest of the pattern
match fails. If you want to match typical palindromic phrases, the pat-
tern has to ignore all non-word characters, which can be done like
this:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
If run with the PCRE2_CASELESS option, this pattern matches phrases
such as "A man, a plan, a canal: Panama!" and it works in both PCRE2
and Perl. Note the use of the possessive quantifier *+ to avoid back-
tracking into sequences of non-word characters. Without this, PCRE2
takes a great deal longer (ten times or more) to match typical phrases,
and Perl takes so long that you think it has gone into a loop.
such as "A man, a plan, a canal: Panama!". Note the use of the posses-
sive quantifier *+ to avoid backtracking into sequences of non-word
characters. Without this, PCRE2 takes a great deal longer (ten times or
more) to match typical phrases, and Perl takes so long that you think
it has gone into a loop.
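The effect of backtracking into a recursion can be modelled outside
PCRE2. The sketch below is a hand-written Python imitation of the
pattern ^((.)(?1)\2|.?)$ and does not call the library; the names
group1_ends and matches, and the atomic flag, are assumptions made for
illustration. With atomic set, each recursive call keeps only its first
match, which imitates the behaviour before release 10.30. Subjects are
assumed to contain no newlines.

```python
from itertools import islice


def group1_ends(subject, pos, atomic):
    """Yield, in backtracking order, each offset at which group 1 of
    ^((.)(?1)\\2|.?)$ can end when it starts matching at pos."""
    if pos < len(subject):
        first = subject[pos]                      # (.) sets group 2
        inner = group1_ends(subject, pos + 1, atomic)
        if atomic:
            # An atomic call keeps its first match and is never re-entered.
            inner = islice(inner, 1)
        for inner_end in inner:                   # (?1)
            if inner_end < len(subject) and subject[inner_end] == first:
                yield inner_end + 1               # \2 matched
    # Second branch: .? is greedy, so try one character before none.
    if pos < len(subject):
        yield pos + 1
    yield pos


def matches(subject, atomic=False):
    """True if the whole subject is matched, as required by ^ and $."""
    return any(end == len(subject)
               for end in group1_ends(subject, 0, atomic))
```

In this model matches("abba") and matches("ababa") are both true, but
both become false when atomic=True, because the recursion cannot then
be re-entered to try its other alternatives.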
WARNING: The palindrome-matching patterns above work only if the sub-
ject string does not start with a palindrome that is shorter than the
entire string. For example, although "abcba" is correctly matched, if
the subject is "ababa", PCRE2 finds the palindrome "aba" at the start,
then fails at top level because the end of the string does not follow.
Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails.
The second way in which PCRE2 and Perl differ in their recursion pro-
cessing is in the handling of captured values. In Perl, when a subpat-
tern is called recursively or as a subpattern (see the next section),
it has no access to any values that were captured outside the recur-
sion, whereas in PCRE2 these values can be referenced. Consider this
pattern:
Another way in which PCRE2 and Perl used to differ in their recursion
processing is in the handling of captured values. Formerly in Perl,
when a subpattern was called recursively or as a subroutine (see the
next section), it had no access to any values that were captured out-
side the recursion, whereas in PCRE2 these values can be referenced.
Consider this pattern:
^(.)(\1|a(?2))
In PCRE2, this pattern matches "bab". The first capturing parentheses
match "b", then in the second group, when the back reference \1 fails
to match "b", the second alternative matches "a" and then recurses. In
the recursion, \1 does now match "b" and so the whole match succeeds.
In Perl, the pattern fails to match because inside the recursive call
\1 cannot access the externally set value.
This pattern matches "bab". The first capturing parentheses match "b",
then in the second group, when the back reference \1 fails to match
"b", the second alternative matches "a" and then recurses. In the
recursion, \1 does now match "b" and so the whole match succeeds. This
match used to fail in Perl, but in later versions (I tried 5.024) it
now works.
SUBPATTERNS AS SUBROUTINES
@ -7964,12 +7941,10 @@ SUBPATTERNS AS SUBROUTINES
two strings. Another example is given in the discussion of DEFINE
above.
All subroutine calls, whether recursive or not, are always treated as
atomic groups. That is, once a subroutine has matched some of the sub-
ject string, it is never re-entered, even if it contains untried alter-
natives and there is a subsequent matching failure. Any capturing
parentheses that are set during the subroutine call revert to their
previous values afterwards.
Like recursions, subroutine calls used to be treated as atomic, but
this changed at PCRE2 release 10.30, so backtracking into subroutine
calls can now occur. However, any capturing parentheses that are set
during the subroutine call revert to their previous values afterwards.
Processing options such as case-independence are fixed when a subpat-
tern is defined, so if it is used as a subroutine, such options cannot
@ -8076,17 +8051,11 @@ CALLOUTS
BACKTRACKING CONTROL
Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are still described in the Perl documentation as "experimental
and subject to change or removal in a future version of Perl". It goes
on to say: "Their usage in production code should be noted to avoid
problems during upgrades." The same remarks apply to the PCRE2 features
described in this section.
The new verbs make use of what was previously invalid syntax: an open-
ing parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
differently depending on whether or not a name is present.
There are a number of special "Backtracking Control Verbs" (to use
Perl's terminology) that modify the behaviour of backtracking during
matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
verbs take either form, possibly behaving differently depending on
whether or not a name is present.
By default, for compatibility with Perl, a name is any sequence of
characters that does not include a closing parenthesis. The name is not
@ -8116,7 +8085,7 @@ BACKTRACKING CONTROL
Since these verbs are specifically related to backtracking, most of
them can be used only when the pattern is to be matched using the tra-
ditional matching function, because these use a backtracking algorithm.
ditional matching function, because that uses a backtracking algorithm.
With the exception of (*FAIL), which behaves like a failing negative
assertion, the backtracking control verbs cause an error if encountered
by the DFA matching function.
@ -8236,11 +8205,11 @@ BACKTRACKING CONTROL
tinues with what follows, but if there is no subsequent match, causing
a backtrack to the verb, a failure is forced. That is, backtracking
cannot pass to the left of the verb. However, when one of these verbs
appears inside an atomic group (which includes any group that is called
as a subroutine) or in an assertion that is true, its effect is con-
fined to that group, because once the group has been matched, there is
never any backtracking into it. In this situation, backtracking has to
jump to the left of the entire atomic group or assertion.
appears inside an atomic group or in an assertion that is true, its
effect is confined to that group, because once the group has been
matched, there is never any backtracking into it. In this situation,
backtracking has to jump to the left of the entire atomic group or
assertion.
These verbs differ in exactly what kind of failure occurs when back-
tracking reaches them. The behaviour described below is what happens
@ -8303,11 +8272,10 @@ BACKTRACKING CONTROL
any other way. In an anchored pattern (*PRUNE) has the same effect as
(*COMMIT).
The behaviour of (*PRUNE:NAME) is the not the same as
(*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
remembered for passing back to the caller. However, (*SKIP:NAME)
searches only for names set with (*MARK), ignoring those set by
(*PRUNE) or (*THEN).
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
It is like (*MARK:NAME) in that the name is remembered for passing back
to the caller. However, (*SKIP:NAME) searches only for names set with
(*MARK), ignoring those set by (*PRUNE) or (*THEN).
(*SKIP)
@ -8496,8 +8464,8 @@ AUTHOR
REVISION
Last updated: 27 December 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 18 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@ -9078,7 +9046,10 @@ SECURITY CONCERNS
use within individual applications. As such, the data supplied to
pcre2_serialize_decode() is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency
checking, not complete validation of what is being re-loaded.
checking, not complete validation of what is being re-loaded. Corrupted
data may cause undefined results. For example, if the length field of a
pattern in the serialized data is corrupted, the deserializing code may
read beyond the end of the byte stream that is passed to it.
SAVING COMPILED PATTERNS
@ -9211,8 +9182,8 @@ AUTHOR
REVISION
Last updated: 24 May 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 21 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2_CONFIG 3 "20 April 2014" "PCRE2 10.0"
.TH PCRE2_CONFIG 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -31,10 +31,13 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
.\" JOIN
PCRE2_CONFIG_JIT Availability of just-in-time compiler
support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information about the target archi-
tecture for the JIT compiler
.\" JOIN
PCRE2_CONFIG_JITTARGET Information (a string) about the target
architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
@ -44,9 +47,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
0=heap)
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
.\" JOIN
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)

@ -1,4 +1,4 @@
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23"
.TH PCRE2_DFA_MATCH 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -19,8 +19,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
just once (\fInot\fP Perl-compatible). (The Perl-compatible matching function
is \fBpcre2_match()\fP.) The arguments for this function are:
just once (except when processing lookaround assertions). This function is
\fInot\fP Perl-compatible (the Perl-compatible matching function is
\fBpcre2_match()\fP). The arguments for this function are:
.sp
\fIcode\fP Points to the compiled pattern
\fIsubject\fP Points to the subject string
@ -33,22 +34,26 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
\fIwscount\fP Number of elements in the vector
.sp
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
up a callout function or specify the recursion limit. The \fIlength\fP and
\fIstartoffset\fP values are code units, not characters. The options are:
up a callout function or specify the recursion depth limit. The \fIlength\fP
and \fIstartoffset\fP values are code units, not characters. The options are:
.sp
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
.\" JOIN
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
.\" JOIN
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
was set at compile time)
.\" JOIN
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
match even if there is a full match
.\" JOIN
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
even if there is a full match as well
PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match
.sp

@ -1,4 +1,4 @@
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
.TH PCRE2_GET_ERROR_MESSAGE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -22,11 +22,11 @@ errors are negative numbers. The arguments are:
\fIbuffer\fP where to put the message
\fIbufflen\fP the length of the buffer (code units)
.sp
The function returns the length of the message, excluding the trailing zero, or
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
this case, the returned message is truncated (but still with a trailing zero).
If \fIerrorcode\fP does not contain a recognized error code number, the
negative value PCRE2_ERROR_BADDATA is returned.
The function returns the length of the message in code units, excluding the
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
too small. In this case, the returned message is truncated (but still with a
trailing zero). If \fIerrorcode\fP does not contain a recognized error code
number, the negative value PCRE2_ERROR_BADDATA is returned.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF

@ -1,4 +1,4 @@
.TH PCRE2_JIT_STACK_CREATE 3 "03 November 2014" "PCRE2 10.00"
.TH PCRE2_JIT_STACK_CREATE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -20,10 +20,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
which can then be processed by \fBpcre2_match()\fP. If the "fast path" JIT
matcher, \fBpcre2_jit_match()\fP is used, the stack can be passed directly as
an argument. A maximum stack size of 512K to 1M should be more than enough for
any pattern. For more details, see the
which can then be processed by \fBpcre2_match()\fP or \fBpcre2_jit_match()\fP.
A maximum stack size of 512K to 1M should be more than enough for any pattern.
For more details, see the
.\" HREF
\fBpcre2jit\fP
.\"

@ -1,4 +1,4 @@
.TH PCRE2_MAKETABLES 3 "21 October 2014" "PCRE2 10.00"
.TH PCRE2_MAKETABLES 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -12,10 +12,10 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.SH DESCRIPTION
.rs
.sp
This function builds a set of character tables for character values less than
256. These can be passed to \fBpcre2_compile()\fP in a compile context in order
to override the internal, built-in tables (which were either defaulted or made
by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
This function builds a set of character tables for character code points that
are less than 256. These can be passed to \fBpcre2_compile()\fP in a compile
context in order to override the internal, built-in tables (which were either
defaulted or made by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
.\" HREF
\fBpcre2_set_character_tables()\fP
.\"

@ -255,6 +255,9 @@ OPTIONS
directory like this is an immediate end-of-file; in others it
may provoke an error.
--depth-limit=number
See --match-limit below.
-e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also
@ -477,32 +480,24 @@ OPTIONS
no short form for this option.
--match-limit=number
Processing some regular expression patterns can require a
very large amount of memory, leading in some cases to a pro-
gram crash if not enough is available. Other patterns may
take a very long time to search for all possible matching
strings. The pcre2_match() function that is called by
pcre2grep to do the matching has two parameters that can
limit the resources that it uses.
Processing some regular expression patterns may take a very
long time to search for all possible matching strings. Others
may require a very large amount of memory. There are two
options that set resource limits for matching.
The --match-limit option provides a means of limiting
resource usage when processing patterns that are not going to
match, but which have a very large number of possibilities in
their search trees. The classic example is a pattern that
uses nested unlimited repeats. Internally, PCRE2 uses a func-
tion called match() which it calls repeatedly (sometimes
recursively). The limit set by --match-limit is imposed on
the number of times this function is called during a match,
which has the effect of limiting the amount of backtracking
that can take place.
The --match-limit option provides a means of limiting comput-
ing resource usage when processing patterns that are not
going to match, but which have a very large number of possi-
bilities in their search trees. The classic example is a pat-
tern that uses nested unlimited repeats. Internally, PCRE2
has a counter that is incremented each time around its main
processing loop. If the value set by --match-limit is
reached, an error occurs.
The --recursion-limit option is similar to --match-limit, but
instead of limiting the total number of times that match() is
called, it limits the depth of recursive calls, which in turn
limits the amount of memory that can be used. The recursion
depth is a smaller number than the total number of calls,
because not all calls to match() are recursive. This limit is
of use only if it is set smaller than --match-limit.
The --depth-limit option limits the depth of nested back-
tracking points, which in turn limits the amount of memory
that is used. This limit is of use only if it is set smaller
than --match-limit.
There are no short forms for these options. The default set-
tings are specified when the PCRE2 library is compiled, with
@ -834,9 +829,9 @@ MATCHING ERRORS
such errors, pcre2grep gives up.
The --match-limit option of pcre2grep can be used to set the overall
resource limit; there is a second option called --recursion-limit that
sets a limit on the amount of memory (usually stack) that is used (see
the discussion of these options above).
resource limit; there is a second option called --depth-limit that sets
a limit on the amount of memory that is used (see the discussion of
these options above).
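The difference between the two limits can be pictured with a toy
backtracking matcher. This is a sketch under stated assumptions, not
PCRE2's algorithm: it handles only the fixed pattern ^(a+)+$, and the
names toy_match, MatchLimitError and DepthLimitError are invented for
illustration. The step counter plays the part of the match limit and
the size of the list of saved backtracking points plays the part of the
depth limit.

```python
class MatchLimitError(Exception):
    """Raised when the toy step counter exceeds match_limit."""


class DepthLimitError(Exception):
    """Raised when too many backtracking points are nested."""


def toy_match(subject, match_limit, depth_limit):
    """Match the fixed pattern ^(a+)+$ against subject by backtracking."""
    n = len(subject)

    def run_of_a(pos):
        # Length of the run of 'a' characters starting at pos.
        end = pos
        while end < n and subject[end] == "a":
            end += 1
        return end - pos

    stack = []              # saved backtracking points: (pos, taken)
    steps = 0
    pos = 0
    take = run_of_a(0)      # a+ is greedy: try the longest run first
    while True:
        steps += 1          # one trip around the main loop
        if steps > match_limit:
            raise MatchLimitError(steps)
        if take >= 1:
            stack.append((pos, take))
            if len(stack) > depth_limit:
                raise DepthLimitError(len(stack))
            pos += take
            if pos == n:
                return True
            take = run_of_a(pos)
        else:
            if not stack:
                return False
            # Backtrack: retry the previous a+ with one character fewer.
            pos, previous = stack.pop()
            take = previous - 1
```

A subject of twenty "a" characters followed by "b" cannot match, and
this toy needs about two million steps to prove it, so a match limit of
10000 stops it early. The nesting never exceeds twenty, which is why a
depth limit is useful only when it is the smaller of the two values.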
DIAGNOSTICS
@ -862,5 +857,5 @@ AUTHOR
REVISION
Last updated: 31 December 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 21 March 2017
Copyright (c) 1997-2017 University of Cambridge.

@ -91,13 +91,13 @@ INPUT ENCODING
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
values greater than 0xffff cause an error to occur).
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
such values can be handled by the 32-bit library. When testing this
library in non-UTF mode with utf8_input set, if any character is pre-
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
is added to the character's value. This is the only way of passing such
code points in a pattern string. For subject strings, using an escape
sequence is preferable.
UTF-8 (in its original definition) is not capable of encoding values
greater than 0x7fffffff, but such values can be handled by the 32-bit
library. When testing this library in non-UTF mode with utf8_input set,
if any character is preceded by the byte 0xff (which is an illegal byte
in UTF-8) 0x80000000 is added to the character's value. This is the
only way of passing such code points in a pattern string. For subject
strings, using an escape sequence is preferable.
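The 0xff convention described above can be sketched as a decoder. This
is an illustration written from the description only, not the code in
pcre2test; the function name decode_utf8_input is an assumption. It
accepts the original UTF-8 forms of up to six bytes and, for brevity,
does not reject overlong encodings.

```python
def decode_utf8_input(data):
    """Decode bytes to a list of code points, adding 0x80000000 to any
    character that is preceded by the byte 0xff."""
    values = []
    i = 0
    while i < len(data):
        offset = 0
        if data[i] == 0xFF:
            offset = 0x80000000
            i += 1
            if i == len(data):
                raise ValueError("0xff prefix with no character after it")
        lead = data[i]
        # Original UTF-8 allows lead bytes for sequences of 1 to 6 bytes.
        if lead < 0x80:
            extra, value = 0, lead
        elif (lead & 0xE0) == 0xC0:
            extra, value = 1, lead & 0x1F
        elif (lead & 0xF0) == 0xE0:
            extra, value = 2, lead & 0x0F
        elif (lead & 0xF8) == 0xF0:
            extra, value = 3, lead & 0x07
        elif (lead & 0xFC) == 0xF8:
            extra, value = 4, lead & 0x03
        elif (lead & 0xFE) == 0xFC:
            extra, value = 5, lead & 0x01
        else:
            raise ValueError("invalid lead byte 0x%02x" % lead)
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        values.append(value + offset)
        i += 1 + extra
    return values
```

In this sketch the largest six-byte sequence decodes to 0x7fffffff, and
the same sequence preceded by 0xff gives 0xffffffff, the largest value
that fits in a 32-bit code unit.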
COMMAND LINE OPTIONS
@ -544,6 +544,7 @@ PATTERN MODIFIERS
/B bincode show binary code without lengths
callout_info show callout information
debug same as info,fullbincode
framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@ -624,6 +625,10 @@ PATTERN MODIFIERS
last character. These lines are omitted if no starting or ending code
units are recorded.
The framesize modifier shows the size, in bytes, of the storage frames
used by pcre2_match() for handling backtracking. The size depends on
the number of capturing parentheses in the pattern.
The callout_info modifier requests information about all the callouts
in the pattern. A list of them is output at the end of any other infor-
mation that is requested. For each callout, either its number or string
@ -959,6 +964,7 @@ SUBJECT MODIFIERS
callout_fail=<n>[:<m>] control callout failure
callout_none do not supply a callout function
copy=<number or name> copy captured substring
depth_limit=<n> set a depth limit
dfa use pcre2_dfa_match()
find_limits find match and recursion limits
get=<number or name> extract captured substring
@ -972,7 +978,7 @@ SUBJECT MODIFIERS
offset=<n> set starting offset
offset_limit=<n> set offset limit
ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit
recursion_limit=<n> obsolete synonym for depth_limit
replace=<string> specify a replacement string
startchar show startchar when relevant
startoffset=<n> same as offset=<n>
@ -1188,32 +1194,31 @@ SUBJECT MODIFIERS
Providing a stack that is larger than the default 32K is necessary only
for very complicated patterns.
Setting match and recursion limits
Setting match and depth limits
The match_limit and recursion_limit modifiers set the appropriate lim-
its in the match context. These values are ignored when the find_limits
modifier is specified.
The match_limit and depth_limit modifiers set the appropriate limits in
the match context. These values are ignored when the find_limits modi-
fier is specified.
Finding minimum limits
If the find_limits modifier is present, pcre2test calls pcre2_match()
several times, setting different values in the match context via
pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
the minimum values for each parameter that allow pcre2_match() to com-
plete without error.
pcre2_set_match_limit() and pcre2_set_depth_limit() until it finds the
minimum values for each parameter that allow pcre2_match() to complete
without error.
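A search of this kind can be organized as doubling followed by
bisection. The sketch below shows one possible strategy and is not
taken from the pcre2test source; completes is a hypothetical stand-in
for running pcre2_match() with a given limit and reporting whether it
finished without a limit error. It assumes that a limit of zero is
never sufficient and that any limit above a sufficient one is also
sufficient.

```python
def minimum_limit(completes, start=64, ceiling=2 ** 32 - 1):
    """Return the smallest limit for which completes(limit) is true."""
    low = 0             # largest value known (or assumed) to be too small
    high = start
    # Grow the upper bound until the match completes.
    while not completes(high):
        if high >= ceiling:
            raise ValueError("match does not complete even at the ceiling")
        low = high
        high = min(high * 2, ceiling)
    # Bisect between the last failure and the first success.
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high
```

The same routine would be called once for each parameter, with the
other limit left at a value large enough not to interfere.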
If JIT is being used, only the match limit is relevant. If DFA matching
is being used, neither limit is relevant, and this modifier is ignored
(with a warning message).
is being used, only the depth limit is relevant, but at present this
modifier is ignored (with a warning message).
The match_limit number is a measure of the amount of backtracking that
takes place, and learning the minimum value can be instructive. For
most simple matches, the number is quite small, but for patterns with
very large numbers of matching possibilities, it can become large very
quickly with increasing length of subject string. The
match_limit_recursion number is a measure of how much stack (or, if
PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to
complete the match attempt.
quickly with increasing length of subject string. The depth_limit num-
ber is a measure of how much memory for recording backtracking points
is needed to complete the match attempt.
Showing MARK names
@ -1314,7 +1319,7 @@ DEFAULT OUTPUT FROM pcre2test
also output. Here is an example of an interactive pcre2test run.
$ pcre2test
PCRE2 version 9.00 2014-05-10
PCRE2 version 10.22 2016-07-29
re> /^abc(\d+)/
data> abc123
@ -1614,5 +1619,5 @@ AUTHOR
REVISION
Last updated: 28 December 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 21 March 2017
Copyright (c) 1997-2017 University of Cambridge.