Documentation update.

Philip.Hazel 2017-03-24 16:53:38 +00:00
parent 32bab50c01
commit 3aeb812180
24 changed files with 1293 additions and 1357 deletions


@ -1,10 +1,6 @@
Building PCRE2 without using autotools
--------------------------------------
This document has been converted from the PCRE1 document. I have removed a
number of sections about building in various environments, as they applied only
to PCRE1 and are probably out of date.
This document contains the following sections:
General
@ -183,21 +179,9 @@ can skip ahead to the CMake section.
STACK SIZE IN WINDOWS ENVIRONMENTS
The default processor stack size of 1Mb in some Windows environments is too
small for matching patterns that need much recursion. In particular, test 2 may
fail because of this. Normally, running out of stack causes a crash, but there
have been cases where the test program has just died silently. See your linker
documentation for how to increase stack size if you experience problems. If you
are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
compiler, you can increase the stack size for pcre2test and pcre2grep by
setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
example). The Linux default of 8Mb is a reasonable choice for the stack, though
even that can be too small for some pattern/subject combinations.
PCRE2 has a compile configuration option to disable the use of stack for
recursion so that heap is used instead. However, pattern matching is
significantly slower when this is done. There is more about stack usage in the
"pcre2stack" documentation.
Prior to release 10.30 the default system stack size of 1Mb in some Windows
environments caused issues with some tests. This should no longer be the case
for 10.30 and later releases.
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
recommended download site.
=============================
Last Updated: 13 October 2016
Last Updated: 17 March 2017


@ -15,8 +15,8 @@ subscribe or manage your subscription here:
https://lists.exim.org/mailman/listinfo/pcre-dev
Please read the NEWS file if you are upgrading from a previous release.
The contents of this README file are:
Please read the NEWS file if you are upgrading from a previous release. The
contents of this README file are:
The PCRE2 APIs
Documentation for PCRE2
@ -44,8 +44,8 @@ wrappers.
The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
man page). These can be found in a library called libpcre2-posix. Note that this
just provides a POSIX calling interface to PCRE2; the regular expressions
man page). These can be found in a library called libpcre2-posix. Note that
this just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.
@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
Building PCRE2 on non-Unix-like systems
---------------------------------------
For a non-Unix-like system, please read the comments in the file
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
"make" you may be able to build PCRE2 using autotools in the same way as for
many Unix-like systems.
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
your system supports the use of "configure" and "make" you may be able to build
PCRE2 using autotools in the same way as for many Unix-like systems.
PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error.
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
. If you do not want to make use of the default support for UTF-8 Unicode
character strings in the 8-bit library, UTF-16 Unicode character strings in
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
library, you can add --disable-unicode to the "configure" command. This
reduces the size of the libraries. It is not possible to configure one
library with Unicode support, and another without, in the same configuration.
It is also not possible to use --enable-ebcdic (see below) with Unicode
support, so if this option is set, you must also use --disable-unicode.
When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
not possible to use both --enable-unicode and --enable-ebcdic at the same
time.
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
--with-match-limit=500000
on the "configure" command. This is just the default; individual calls to
pcre2_match() can supply their own value. There is more discussion on the
pcre2api man page.
pcre2_match() can supply their own value. There is more discussion in the
pcre2api man page (search for pcre2_set_match_limit).
. There is a separate counter that limits the depth of recursive function calls
during a matching process. This also has a default of ten million, which is
essentially "unlimited". You can change the default by setting, for example,
. There is a separate counter that limits the depth of nested backtracking
during a matching process, which in turn limits the amount of memory that is
used. This also has a default of ten million, which is essentially
"unlimited". You can change the default by setting, for example,
--with-match-limit-recursion=500000
--with-match-limit-depth=5000
Recursive function calls use up the runtime stack; running out of stack can
cause programs to crash in strange ways. There is a discussion about stack
sizes in the pcre2stack man page.
There is more discussion in the pcre2api man page (search for
pcre2_set_depth_limit).
. In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the
@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used.
. You can build PCRE2 so that its internal match() function that is called from
pcre2_match() does not call itself recursively. Instead, it uses memory
blocks obtained from the heap to save data that would otherwise be saved on
the stack. To build PCRE2 like this, use
--disable-stack-for-recursion
on the "configure" command. PCRE2 runs more slowly in this mode, but it may
be necessary in environments with limited stack sizes. This applies only to
the normal execution of the pcre2_match() function; if JIT support is being
successfully used, it is not relevant. Equally, it does not apply to
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
discussion about stack sizes in the pcre2stack man page.
. For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of
tables for ASCII encoding that is part of the distribution. If you specify
@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
which caused pcre2_match() to use individual blocks on the heap for
backtracking instead of recursive function calls (which use the stack). This
is now obsolete since pcre2_match() was refactored always to use the heap (in
a much more efficient way than before). This option is retained for backwards
compatibility, but has no effect other than to output a warning.
The "configure" script builds the following files for the basic C library:
. Makefile the makefile that builds the library
@ -662,25 +654,32 @@ Unicode support is enabled.
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair are for general cases and Unicode support, respectively.
Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.
Test 14 contains a number of tests that must not be run with JIT. They check,
Test 14 contains some special UTF and UCP tests that give different output for
the different widths.
Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.
Test 15 is run only when JIT support is not available. It checks that an
Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.
Test 16 is run only when JIT support is available. It checks JIT complete and
Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.
Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.
Test 19 checks the serialization functions by writing a set of compiled
Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.
Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.
Character tables
----------------
@ -866,4 +865,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 01 November 2016
Last updated: 17 March 2017


@ -109,7 +109,7 @@ lose performance.
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
<b>pcre2_compile()</b>. This causes an compile time error if a pattern contains
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
a UTF-setting sequence.
</P>
<P>
@ -137,7 +137,8 @@ large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page.
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
be used to restrict the amount of memory that is used.
</P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
@ -166,7 +167,7 @@ listing), and the short pages for individual functions, are concatenated in
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage
pcre2stack discussion of stack and memory usage
pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support
@ -189,9 +190,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 16 October 2015
Last updated: 27 March 2017
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -36,20 +36,21 @@ for success and non-zero otherwise. The arguments are:
<i>callout_data</i> User data that is passed to the callback
</pre>
The <i>callback()</i> function is passed a pointer to a data block containing
the following fields:
the following fields (not necessarily in this order):
<pre>
<i>version</i> Block version number
<i>pattern_position</i> Offset to next item in pattern
<i>next_item_length</i> Length of next item in pattern
<i>callout_number</i> Number for numbered callouts
<i>callout_string_offset</i> Offset to string within pattern
<i>callout_string_length</i> Length of callout string
<i>callout_string</i> Points to callout string or is NULL
uint32_t <i>version</i> Block version number
uint32_t <i>callout_number</i> Number for numbered callouts
PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
PCRE2_SIZE <i>callout_string_length</i> Length of callout string
PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
</pre>
The second argument is the callout data that was passed to
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
for success. Any other value causes the pattern scan to stop, with the value
being passed back as the result of <b>pcre2_callout_enumerate()</b>.
The second argument passed to the <b>callback()</b> function is the callout data
that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
function must return zero for success. Any other value causes the pattern scan
to stop, with the value being passed back as the result of
<b>pcre2_callout_enumerate()</b>.
</P>
<P>
There is a complete description of the PCRE2 native API in the


@ -26,7 +26,9 @@ DESCRIPTION
</b><br>
<P>
This function frees the memory used for a compiled pattern, including any
memory used by the JIT compiler.
memory used by the JIT compiler. If the compiled pattern was created by a call
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
also freed.
</P>
<P>
There is a complete description of the PCRE2 native API in the


@ -37,19 +37,24 @@ arguments are:
<i>erroffset</i> Where to put an error offset
<i>ccontext</i> Pointer to a compile context or NULL
</pre>
The length of the string and any error offset that is returned are in code
units, not characters. A compile context is needed only if you want to change
The length of the pattern and any error offset that is returned are in code
units, not characters. A compile context is needed only if you want to provide
custom memory allocation functions, or to provide an external function for
system stack size checking, or to change one or more of these parameters:
<pre>
What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
What \R matches (Unicode newlines, or CR, LF, CRLF only);
PCRE2's character tables;
The newline character sequence;
The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed.
</pre>
or provide an external function for stack size checking. The option bits are:
The option bits are:
<pre>
PCRE2_ANCHORED Force pattern anchoring
PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
PCRE2_ALT_VERBNAMES Process backslashes in verb names
PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
@ -71,19 +76,21 @@ or provide an external function for stack size checking. The option bits are:
(only relevant if PCRE2_UTF is set)
PCRE2_UCP Use Unicode properties for \d, \w, etc.
PCRE2_UNGREEDY Invert greediness of quantifiers
PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
PCRE2_UTF Treat pattern and subjects as UTF strings
</pre>
PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP
and related options.
PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options.
</P>
<P>
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
</P>
<P>
There is a complete description of the PCRE2 native API in the
There is a complete description of the PCRE2 native API, with more detail on
each option, in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
page, and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>


@ -45,10 +45,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \R matches by default:
PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF
PCRE2_CONFIG_JIT Availability of just-in-time compiler
support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information about the target archi-
tecture for the JIT compiler
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
PCRE2_CONFIG_JIT Availability of just-in-time compiler support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information (a string) about the target architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
@ -58,11 +57,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
0=heap)
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no)
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes 0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
PCRE2_CONFIG_VERSION The PCRE2 version (a string)
</pre>


@ -31,8 +31,9 @@ DESCRIPTION
<P>
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
just once (<i>not</i> Perl-compatible). (The Perl-compatible matching function
is <b>pcre2_match()</b>.) The arguments for this function are:
just once (except when processing lookaround assertions). This function is
<i>not</i> Perl-compatible (the Perl-compatible matching function is
<b>pcre2_match()</b>). The arguments for this function are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
@ -45,22 +46,18 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
<i>wscount</i> Number of elements in the vector
</pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
up a callout function or specify the recursion limit. The <i>length</i> and
<i>startoffset</i> values are code units, not characters. The options are:
up a callout function or specify the recursion depth limit. The <i>length</i>
and <i>startoffset</i> values are code units, not characters. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
was set at compile time)
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
even if there is a full match as well
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match
</pre>


@ -34,11 +34,11 @@ errors are negative numbers. The arguments are:
<i>buffer</i> where to put the message
<i>bufflen</i> the length of the buffer (code units)
</pre>
The function returns the length of the message, excluding the trailing zero, or
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
this case, the returned message is truncated (but still with a trailing zero).
If <i>errorcode</i> does not contain a recognized error code number, the
negative value PCRE2_ERROR_BADDATA is returned.
The function returns the length of the message in code units, excluding the
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
too small. In this case, the returned message is truncated (but still with a
trailing zero). If <i>errorcode</i> does not contain a recognized error code
number, the negative value PCRE2_ERROR_BADDATA is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the


@ -32,10 +32,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
which can then be processed by <b>pcre2_match()</b>. If the "fast path" JIT
matcher, <b>pcre2_jit_match()</b> is used, the stack can be passed directly as
an argument. A maximum stack size of 512K to 1M should be more than enough for
any pattern. For more details, see the
which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
A maximum stack size of 512K to 1M should be more than enough for any pattern.
For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
</P>


@ -25,10 +25,10 @@ SYNOPSIS
DESCRIPTION
</b><br>
<P>
This function builds a set of character tables for character values less than
256. These can be passed to <b>pcre2_compile()</b> in a compile context in order
to override the internal, built-in tables (which were either defaulted or made
by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
This function builds a set of character tables for character code points that
are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
context in order to override the internal, built-in tables (which were either
defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
page. You might want to do this if you are using a non-standard locale.
</P>


@ -2575,8 +2575,8 @@ The internal recursion limit was reached.
A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
is passed as the first argument, with the remaining two arguments specifying a
code unit buffer and its length, into which the text message is placed. Note
that the message is returned in code units of the appropriate width for the
code unit buffer and its length in code units, into which the text message is
placed. The message is returned in code units of the appropriate width for the
library that is being used.
</P>
<P>
@ -3265,9 +3265,9 @@ Cambridge, England.
</P>
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 December 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -280,6 +280,10 @@ operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error.
</P>
<P>
<b>--depth-limit</b>=<i>number</i>
See <b>--match-limit</b> below.
</P>
<P>
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a
@ -498,29 +502,22 @@ used. There is no short form for this option.
</P>
<P>
<b>--match-limit</b>=<i>number</i>
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
do the matching has two parameters that can limit the resources that it uses.
Processing some regular expression patterns may take a very long time to search
for all possible matching strings. Others may require a very large amount of
memory. There are two options that set resource limits for matching.
<br>
<br>
The <b>--match-limit</b> option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a very
large number of possibilities in their search trees. The classic example is a
pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function
called <b>match()</b> which it calls repeatedly (sometimes recursively). The
limit set by <b>--match-limit</b> is imposed on the number of times this
function is called during a match, which has the effect of limiting the amount
of backtracking that can take place.
The <b>--match-limit</b> option provides a means of limiting computing resource
usage when processing patterns that are not going to match, but which have a
very large number of possibilities in their search trees. The classic example
is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
counter that is incremented each time around its main processing loop. If the
value set by <b>--match-limit</b> is reached, an error occurs.
<br>
<br>
The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but
instead of limiting the total number of times that <b>match()</b> is called, it
limits the depth of recursive calls, which in turn limits the amount of memory
that can be used. The recursion depth is a smaller number than the total number
of calls, because not all calls to <b>match()</b> are recursive. This limit is
of use only if it is set smaller than <b>--match-limit</b>.
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
which in turn limits the amount of memory that is used. This limit is of use
only if it is set smaller than <b>--match-limit</b>.
<br>
<br>
There are no short forms for these options. The default settings are specified
@ -843,9 +840,9 @@ there are more than 20 such errors, <b>pcre2grep</b> gives up.
</P>
<P>
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
overall resource limit; there is a second option called <b>--recursion-limit</b>
that sets a limit on the amount of memory (usually stack) that is used (see the
discussion of these options above).
overall resource limit; there is a second option called <b>--depth-limit</b>
that sets a limit on the amount of memory that is used (see the discussion of
these options above).
</P>
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
<P>
@ -870,9 +867,9 @@ Cambridge, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
Last updated: 31 December 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -170,20 +170,24 @@ the application to apply the JIT optimization by calling
<b>pcre2_jit_compile()</b> is ignored.
</P>
<br><b>
Setting match and recursion limits
Setting match and backtracking depth limits
</b><br>
<P>
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
internal <b>match()</b> function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
gives an error return. The limits can also be set by items at the start of the
pattern of the form
The <b>pcre2_match()</b> function contains a counter that is incremented every time it
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
this counter, which therefore limits the amount of computing resource used for
a match. The maximum depth of nested backtracking can also be limited, and this
restricts the amount of heap memory that is used.
</P>
<P>
These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees (a typical example is a pattern with nested
unlimited repeats applied to a long string that does not match). When one of
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
can also be set by items at the start of the pattern of the form
<pre>
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
(*LIMIT_DEPTH=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
@ -192,10 +196,15 @@ limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
</P>
<P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
</P>
<P>
The match limit is used (but in a different way) when JIT is being used, but it
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
However, the recursion limit is relevant for DFA matching, which does use some
function recursion, in particular, for recursions within the pattern.
However, the depth limit is relevant for DFA matching, which uses function
recursion for recursions within the pattern. In this case, the depth limit
controls the amount of system stack that is used.
<a name="newlines"></a></P>
<br><b>
Newline conventions
@ -235,8 +244,8 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
what the \R escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the
description of \R in the section entitled
sequence, for Perl compatibility. However, this can be changed; see the next
section and the description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
convention.
@ -254,7 +263,7 @@ corresponding to PCRE2_BSR_UNICODE.
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P>
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code rather than ASCII or Unicode (typically a mainframe system). In
character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
@ -318,11 +327,11 @@ that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
</P>
<P>
For example, if you want to match a * character, you write \* in the pattern.
This escaping action applies whether or not the following character would
otherwise be interpreted as a metacharacter, so it is always safe to precede a
non-alphanumeric with backslash to specify that it stands for itself. In
particular, if you want to match a backslash, you write \\.
For example, if you want to match a * character, you must write \* in the
pattern. This escaping action applies whether or not the following character
would otherwise be interpreted as a metacharacter, so it is always safe to
precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \\.
</P>
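<P>
As an illustration of the rule above, the following C sketch turns an arbitrary
string into a pattern fragment that matches it literally, by putting a
backslash before every non-alphanumeric ASCII character. The function name
<b>escape_literal()</b> and its interface are hypothetical and are not part of
the PCRE2 API. The code assumes an ASCII environment; bytes with values of 128
or more are copied unchanged.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Copies src to dst, inserting a
   backslash before each ASCII character that is not a letter or a digit.
   Returns 0 on success, or -1 if dstsize is too small to hold the result
   and its terminating zero. */
int escape_literal(const char *src, char *dst, size_t dstsize)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int needs_escape = c < 0x80 && !isalnum(c);

        /* Keep one byte in reserve for the terminating zero. */
        if (out + (needs_escape ? 2u : 1u) >= dstsize) return -1;
        if (needs_escape) dst[out++] = '\\';
        dst[out++] = (char)c;
    }
    if (out >= dstsize) return -1;   /* only possible when dstsize is 0 */
    dst[out] = '\0';
    return 0;
}
```

<P>
A destination buffer of twice the source length plus one byte is always large
enough for the result.
</P>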
<P>
In a UTF mode, only ASCII numbers and letters have any special meaning after a
@ -353,7 +362,7 @@ An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is not
terminated.
terminated by a closing square bracket.
<a name="digitsafterbackslash"></a></P>
<br><b>
Non-printing characters
@ -476,9 +485,9 @@ a hexadecimal digit appears between \x{ and }, or if there is no terminating
<P>
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
matches a literal "x" character. In this mode mode, support for code points
greater than 256 is provided by \u, which must be followed by four hexadecimal
digits; otherwise it matches a literal "u" character.
matches a literal "x" character. In this mode, support for code points greater
than 256 is provided by \u, which must be followed by four hexadecimal digits;
otherwise it matches a literal "u" character.
</P>
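<P>
The PCRE2_ALT_BSUX rules just described can be modelled in a few lines of C.
This is only a sketch of the documented behaviour, not the code that PCRE2
itself uses; the function name <b>alt_bsux_escape()</b> and its interface are
invented for the illustration.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hexadecimal digit; the caller has already used isxdigit(). */
static uint32_t hex_value(char c)
{
    if (c >= '0' && c <= '9') return (uint32_t)(c - '0');
    return (uint32_t)(tolower((unsigned char)c) - 'a') + 10;
}

/* Hypothetical model of the rules in the text. p points at the character
   that follows a backslash. If it is 'x' followed by two hexadecimal digits,
   or 'u' followed by four, *value is set to the code point and the number of
   characters consumed is returned. If the digits are missing, *value is set
   to the literal 'x' or 'u' and 1 is returned. Returns 0 for anything else. */
size_t alt_bsux_escape(const char *p, uint32_t *value)
{
    size_t ndigits, i;
    uint32_t v = 0;

    assert(p != NULL && value != NULL);
    if (*p == 'x') ndigits = 2;
    else if (*p == 'u') ndigits = 4;
    else return 0;

    for (i = 1; i <= ndigits; i++) {
        /* The terminating zero is not a hex digit, so this never reads
           beyond the end of the string. */
        if (!isxdigit((unsigned char)p[i])) {
            *value = (uint32_t)(unsigned char)*p;   /* literal "x" or "u" */
            return 1;
        }
        v = v * 16 + hex_value(p[i]);
    }
    *value = v;
    return 1 + ndigits;
}
```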
<P>
Characters whose value is less than 256 can be defined by either of the two
@ -493,12 +502,10 @@ Constraints on character values
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
<pre>
8-bit non-UTF mode less than 0x100
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
16-bit non-UTF mode less than 0x10000
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
32-bit non-UTF mode less than 0x100000000
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
8-bit non-UTF mode no greater than 0xff
16-bit non-UTF mode no greater than 0xffff
32-bit non-UTF mode no greater than 0xffffffff
All UTF modes no greater than 0x10ffff and a valid codepoint
</pre>
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
"surrogate" codepoints), and 0xffef.
@ -525,7 +532,7 @@ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \U matches a "U" character, and \u can be used to define a character
by code point, as described in the previous section.
by code point, as described above.
</P>
<br><b>
Absolute and relative back references
@ -714,7 +721,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
The extra escape sequences are:
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
with an unassigned type. The extra escape sequences are:
<pre>
\p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property
@ -2214,16 +2223,8 @@ except that it does not cause the current matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an assertion
contains capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
capturing is carried out only for positive assertions. (Perl sometimes, but not
always, does do capturing in negative assertions.)
</P>
<P>
WARNING: If a positive assertion containing one or more capturing subpatterns
succeeds, but failure to match later in the pattern causes backtracking over
this assertion, the captures within the assertion are reset only if no higher
numbered captures are already set. This is, unfortunately, a fundamental
limitation of the current implementation; it may get removed in a future
reworking.
capturing is normally carried out only for positive assertions (but see the
discussion of conditional subpatterns below).
</P>
<P>
For compatibility with Perl, most assertion subpatterns may be repeated; though
@ -2601,6 +2602,12 @@ presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
</P>
<P>
For Perl compatibility, if an assertion that is a condition contains capturing
subpatterns, any capturing that occurs is retained afterwards, for both
positive and negative assertions. (Compare non-conditional assertions, when
captures are retained only for positive assertions.)
<a name="comments"></a></P>
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
<P>
@ -2773,93 +2780,57 @@ is the actual recursive call.
Differences in recursion processing between PCRE2 and Perl
</b><br>
<P>
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
(like Python, but unlike Perl), a recursive subpattern call is always treated
as an atomic group. That is, once it has matched some of the subject string, it
is never re-entered, even if it contains untried alternatives and there is a
subsequent matching failure. This can be illustrated by the following pattern,
which purports to match a palindromic string that contains an odd number of
characters (for example, "a", "aba", "abcba", "abcdcba"):
<pre>
^(.|(.)(?1)\2)$
</pre>
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
it does not if the pattern is longer than three characters. Consider the
subject string "abcba":
Some former differences between PCRE2 and Perl no longer exist.
</P>
<P>
At the top level, the first character is matched, but as it is not at the end
of the string, the first alternative fails; the second alternative is taken
and the recursion kicks in. The recursive call to subpattern 1 successfully
matches the next character ("b"). (Note that the beginning and end of line
tests are not part of the recursion).
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
a recursive subpattern call was always treated as an atomic group. That is,
once it had matched some of the subject string, it was never re-entered, even
if it contained untried alternatives and there was a subsequent matching
failure. (Historical note: PCRE implemented recursion before Perl did.)
</P>
<P>
Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion is
treated as an atomic group, there are now no backtracking points, and so the
entire match fails. (Perl is able, at this point, to re-enter the recursion and
try the second alternative.) However, if the pattern is written with the
alternatives in the other order, things are different:
<pre>
^((.)(?1)\2|.)$
</pre>
This time, the recursing alternative is tried first, and continues to recurse
until it runs out of characters, at which point the recursion fails. But this
time we do have another alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at a deeper
recursion level, which PCRE2 cannot use.
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Perl works. If you want a subroutine call to be atomic, you must explicitly
enclose it in an atomic group.
</P>
<P>
To change the pattern so that it matches all palindromic strings, not just
those with an odd number of characters, it is tempting to change the pattern to
this:
Supporting backtracking into recursions simplifies certain types of recursive
pattern. For example, this pattern matches palindromic strings:
<pre>
^((.)(?1)\2|.?)$
</pre>
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
deeper recursion has matched a single character, it cannot be entered again in
order to match an empty string. The solution is to separate the two cases, and
write out the odd and even cases as alternatives at the higher level:
The second branch in the group matches a single central character in the
palindrome when there are an odd number of characters, or nothing when there
are an even number of characters, but in order to work it has to be able to try
the second case when the rest of the pattern match fails. If you want to match
typical palindromic phrases, the pattern has to ignore all non-word characters,
which can be done like this:
<pre>
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
</pre>
If you want to match typical palindromic phrases, the pattern has to ignore all
non-word characters, which can be done like this:
<pre>
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
</pre>
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
use of the possessive quantifier *+ to avoid backtracking into sequences of
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
or more) to match typical phrases, and Perl takes so long that you think it has
gone into a loop.
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
avoid backtracking into sequences of non-word characters. Without this, PCRE2
takes a great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
</P>
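<P>
When testing a pattern like the one above, it can help to have an independent
check of which subjects ought to match. The following C sketch is such a
reference check, written without any use of PCRE2: it ignores non-word
characters and letter case, which is what the pattern is intended to do when
run with PCRE2_CASELESS. The function name <b>is_palindromic_phrase()</b> is
hypothetical, and the code assumes single-byte ASCII subjects.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A word character in the sense of \w for ASCII: letter, digit, underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical reference check, not PCRE2 code. Returns 1 if the word
   characters of s read the same in both directions, ignoring case, and 0
   otherwise. A string with no word characters counts as a palindrome. */
int is_palindromic_phrase(const char *s)
{
    size_t i, j;

    assert(s != NULL);
    i = 0;
    j = strlen(s);               /* j is one past the last candidate */
    for (;;) {
        /* Skip non-word characters at both ends. */
        while (i < j && !is_word_char((unsigned char)s[i])) i++;
        while (i < j && !is_word_char((unsigned char)s[j - 1])) j--;
        if (j - i < 2) return 1; /* nothing, or one central character, left */
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)s[j - 1]))
            return 0;
        i++;
        j--;
    }
}
```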
<P>
<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
string does not start with a palindrome that is shorter than the entire string.
For example, although "abcba" is correctly matched, if the subject is "ababa",
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
the end of the string does not follow. Once again, it cannot jump back into the
recursion to try other alternatives, so the entire match fails.
</P>
<P>
The second way in which PCRE2 and Perl differ in their recursion processing is
in the handling of captured values. In Perl, when a subpattern is called
recursively or as a subpattern (see the next section), it has no access to any
values that were captured outside the recursion, whereas in PCRE2 these values
can be referenced. Consider this pattern:
Another way in which PCRE2 and Perl used to differ in their recursion
processing is in the handling of captured values. Formerly in Perl, when a
subpattern was called recursively or as a subroutine (see the next section), it
had no access to any values that were captured outside the recursion, whereas
in PCRE2 these values can be referenced. Consider this pattern:
<pre>
^(.)(\1|a(?2))
</pre>
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
then in the second group, when the back reference \1 fails to match "b", the
second alternative matches "a" and then recurses. In the recursion, \1 does
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
match because inside the recursive call \1 cannot access the externally set
value.
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the back reference \1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
@ -2886,11 +2857,10 @@ is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above.
</P>
<P>
All subroutine calls, whether recursive or not, are always treated as atomic
groups. That is, once a subroutine has matched some of the subject string, it
is never re-entered, even if it contains untried alternatives and there is a
subsequent matching failure. Any capturing parentheses that are set during the
subroutine call revert to their previous values afterwards.
Like recursions, subroutine calls used to be treated as atomic, but this
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
occur. However, any capturing parentheses that are set during the subroutine
call revert to their previous values afterwards.
</P>
<P>
Processing options such as case-independence are fixed when a subpattern is
@ -2998,17 +2968,10 @@ The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
are still described in the Perl documentation as "experimental and subject to
change or removal in a future version of Perl". It goes on to say: "Their usage
in production code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE2 features described in this section.
</P>
<P>
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
(*VERB:NAME). Some verbs take either form, possibly behaving differently
depending on whether or not a name is present.
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
possibly behaving differently depending on whether or not a name is present.
</P>
<P>
By default, for compatibility with Perl, a name is any sequence of characters
@ -3040,7 +3003,7 @@ not there. Any number of these verbs may occur in a pattern.
<P>
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
function, because these use a backtracking algorithm. With the exception of
function, because that uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function.
</P>
@ -3178,11 +3141,11 @@ Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is no subsequent match, causing a backtrack to
the verb, a failure is forced. That is, backtracking cannot pass to the left of
the verb. However, when one of these verbs appears inside an atomic group
(which includes any group that is called as a subroutine) or in an assertion
that is true, its effect is confined to that group, because once the group has
been matched, there is never any backtracking into it. In this situation,
backtracking has to jump to the left of the entire atomic group or assertion.
the verb. However, when one of these verbs appears inside an atomic group or in
an assertion that is true, its effect is confined to that group, because once
the group has been matched, there is never any backtracking into it. In this
situation, backtracking has to jump to the left of the entire atomic group or
assertion.
</P>
<P>
These verbs differ in exactly what kind of failure occurs when backtracking
@ -3246,8 +3209,8 @@ expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
</P>
<P>
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
It is like (*MARK:NAME) in that the name is remembered for passing back to the
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by (*PRUNE) or (*THEN).
<pre>
@ -3452,9 +3415,9 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 December 2016
Last updated: 18 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -55,7 +55,10 @@ The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
complete validation of what is being re-loaded.
complete validation of what is being re-loaded. Corrupted data may cause
undefined results. For example, if the length field of a pattern in the
serialized data is corrupted, the deserializing code may read beyond the end of
the byte stream that is passed to it.
</P>
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<P>
@ -190,9 +193,9 @@ Cambridge, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 May 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -126,12 +126,13 @@ character values up to 0x7fffffff. Each character is placed in one 16-bit or
to occur).
</P>
<P>
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
values can be handled by the 32-bit library. When testing this library in
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
character's value. This is the only way of passing such code points in a
pattern string. For subject strings, using an escape sequence is preferable.
UTF-8 (in its original definition) is not capable of encoding values greater
than 0x7fffffff, but such values can be handled by the 32-bit library. When
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
0x80000000 is added to the character's value. This is the only way of passing
such code points in a pattern string. For subject strings, using an escape
sequence is preferable.
</P>
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
@ -602,6 +603,7 @@ about the pattern:
/B bincode show binary code without lengths
callout_info show callout information
debug same as info,fullbincode
framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@ -689,6 +691,11 @@ not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
</P>
<P>
The <b>framesize</b> modifier shows the size, in bytes, of the storage frames
used by <b>pcre2_match()</b> for handling backtracking. The size depends on the
number of capturing parentheses in the pattern.
</P>
<P>
The <b>callout_info</b> modifier requests information about all the callouts in
the pattern. A list of them is output at the end of any other information that
is requested. For each callout, either its number or string is given, followed
@ -1073,6 +1080,7 @@ pattern.
callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure
callout_none do not supply a callout function
copy=&#60;number or name&#62; copy captured substring
depth_limit=&#60;n&#62; set a depth limit
dfa use <b>pcre2_dfa_match()</b>
find_limits find match and recursion limits
get=&#60;number or name&#62; extract captured substring
@ -1086,7 +1094,7 @@ pattern.
offset=&#60;n&#62; set starting offset
offset_limit=&#60;n&#62; set offset limit
ovector=&#60;n&#62; set size of output vector
recursion_limit=&#60;n&#62; set a recursion limit
recursion_limit=&#60;n&#62; obsolete synonym for depth_limit
replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant
startoffset=&#60;n&#62; same as offset=&#60;n&#62;
@ -1320,10 +1328,10 @@ stack that is larger than the default 32K is necessary only for very
complicated patterns.
</P>
<br><b>
Setting match and recursion limits
Setting match and depth limits
</b><br>
<P>
The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
The <b>match_limit</b> and <b>depth_limit</b> modifiers set the appropriate
limits in the match context. These values are ignored when the
<b>find_limits</b> modifier is specified.
</P>
@ -1333,23 +1341,23 @@ Finding minimum limits
<P>
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
<b>pcre2_match()</b> several times, setting different values in the match
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_depth_limit()</b>
until it finds the minimum values for each parameter that allow
<b>pcre2_match()</b> to complete without error.
</P>
<P>
If JIT is being used, only the match limit is relevant. If DFA matching is
being used, neither limit is relevant, and this modifier is ignored (with a
warning message).
being used, only the depth limit is relevant, but at present this modifier is
ignored (with a warning message).
</P>
<P>
The <i>match_limit</i> number is a measure of the amount of backtracking
that takes place, and learning the minimum value can be instructive. For most
simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with
increasing length of subject string. The <i>match_limit_recursion</i> number is
a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
heap) memory is needed to complete the match attempt.
increasing length of subject string. The <i>depth_limit</i> number is
a measure of how much memory for recording backtracking points is needed to
complete the match attempt.
</P>
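<P>
An application can look for a minimum limit in a similar way. The C sketch
below is one possible approach and is not a description of how <b>pcre2test</b>
does it. It assumes that a match attempt that succeeds with some limit also
succeeds with any larger limit, and it uses a binary search. The names
<b>find_min_limit()</b>, <b>limit_probe</b> and <b>needs_at_least()</b> are
hypothetical. In real use, the probe function would set the limit in a match
context (for example with <b>pcre2_set_match_limit()</b> or
<b>pcre2_set_depth_limit()</b>), call <b>pcre2_match()</b>, and report whether
the call completed without hitting the limit.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A probe runs one match attempt with the given limit and returns non-zero
   if the attempt completed without a limit error. */
typedef int (*limit_probe)(uint32_t limit, void *data);

/* Hypothetical helper. Returns the smallest limit in the range 1..max for
   which probe() succeeds, or 0 if it fails even with max. */
uint32_t find_min_limit(limit_probe probe, void *data, uint32_t max)
{
    uint32_t lo = 1, hi = max;

    assert(probe != NULL && max > 0);
    if (!probe(max, data)) return 0;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (probe(mid, data)) hi = mid;   /* mid is enough: look lower */
        else lo = mid + 1;                /* mid is too small: look higher */
    }
    return lo;
}

/* Stand-in probe for demonstration only: it pretends that the match needs a
   limit of at least *data. A real probe would call pcre2_match() instead. */
int needs_at_least(uint32_t limit, void *data)
{
    return limit >= *(const uint32_t *)data;
}
```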
<br><b>
Showing MARK names
@ -1466,7 +1474,7 @@ code unit offset of the start of the failing character is also output. Here is
an example of an interactive <b>pcre2test</b> run.
<pre>
$ pcre2test
PCRE2 version 9.00 2014-05-10
PCRE2 version 10.22 2016-07-29
re&#62; /^abc(\d+)/
data&#62; abc123
@ -1779,9 +1787,9 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 December 2016
Last updated: 21 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

File diff suppressed because it is too large

View File

@ -1,4 +1,4 @@
.TH PCRE2_CONFIG 3 "20 April 2014" "PCRE2 10.0"
.TH PCRE2_CONFIG 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -31,10 +31,13 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
.\" JOIN
PCRE2_CONFIG_JIT Availability of just-in-time compiler
support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information about the target archi-
tecture for the JIT compiler
.\" JOIN
PCRE2_CONFIG_JITTARGET Information (a string) about the target
architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
@ -44,9 +47,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
0=heap)
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
.\" JOIN
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)

View File

@ -1,4 +1,4 @@
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23"
.TH PCRE2_DFA_MATCH 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -19,8 +19,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
just once (\fInot\fP Perl-compatible). (The Perl-compatible matching function
is \fBpcre2_match()\fP.) The arguments for this function are:
just once (except when processing lookaround assertions). This function is
\fInot\fP Perl-compatible (the Perl-compatible matching function is
\fBpcre2_match()\fP). The arguments for this function are:
.sp
\fIcode\fP Points to the compiled pattern
\fIsubject\fP Points to the subject string
@ -33,22 +34,26 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
\fIwscount\fP Number of elements in the vector
.sp
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
up a callout function or specify the recursion limit. The \fIlength\fP and
\fIstartoffset\fP values are code units, not characters. The options are:
up a callout function or specify the recursion depth limit. The \fIlength\fP
and \fIstartoffset\fP values are code units, not characters. The options are:
.sp
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
.\" JOIN
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
.\" JOIN
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
was set at compile time)
.\" JOIN
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
match even if there is a full match
.\" JOIN
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
even if there is a full match as well
match if no full matches are found
PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match
.sp

View File

@ -1,4 +1,4 @@
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
.TH PCRE2_GET_ERROR_MESSAGE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -22,11 +22,11 @@ errors are negative numbers. The arguments are:
\fIbuffer\fP where to put the message
\fIbufflen\fP the length of the buffer (code units)
.sp
The function returns the length of the message, excluding the trailing zero, or
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
this case, the returned message is truncated (but still with a trailing zero).
If \fIerrorcode\fP does not contain a recognized error code number, the
negative value PCRE2_ERROR_BADDATA is returned.
The function returns the length of the message in code units, excluding the
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
too small. In this case, the returned message is truncated (but still with a
trailing zero). If \fIerrorcode\fP does not contain a recognized error code
number, the negative value PCRE2_ERROR_BADDATA is returned.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF

View File

@ -1,4 +1,4 @@
.TH PCRE2_JIT_STACK_CREATE 3 "03 November 2014" "PCRE2 10.00"
.TH PCRE2_JIT_STACK_CREATE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -20,10 +20,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
which can then be processed by \fBpcre2_match()\fP. If the "fast path" JIT
matcher, \fBpcre2_jit_match()\fP is used, the stack can be passed directly as
an argument. A maximum stack size of 512K to 1M should be more than enough for
any pattern. For more details, see the
which can then be processed by \fBpcre2_match()\fP or \fBpcre2_jit_match()\fP.
A maximum stack size of 512K to 1M should be more than enough for any pattern.
For more details, see the
.\" HREF
\fBpcre2jit\fP
.\"

View File

@ -1,4 +1,4 @@
.TH PCRE2_MAKETABLES 3 "21 October 2014" "PCRE2 10.00"
.TH PCRE2_MAKETABLES 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -12,10 +12,10 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.SH DESCRIPTION
.rs
.sp
This function builds a set of character tables for character values less than
256. These can be passed to \fBpcre2_compile()\fP in a compile context in order
to override the internal, built-in tables (which were either defaulted or made
by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
This function builds a set of character tables for character code points that
are less than 256. These can be passed to \fBpcre2_compile()\fP in a compile
context in order to override the internal, built-in tables (which were either
defaulted or made by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
.\" HREF
\fBpcre2_set_character_tables()\fP
.\"

View File

@ -255,6 +255,9 @@ OPTIONS
directory like this is an immediate end-of-file; in others it
may provoke an error.
--depth-limit=number
See --match-limit below.
-e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also
@ -477,32 +480,24 @@ OPTIONS
no short form for this option.
--match-limit=number
Processing some regular expression patterns can require a
very large amount of memory, leading in some cases to a pro-
gram crash if not enough is available. Other patterns may
take a very long time to search for all possible matching
strings. The pcre2_match() function that is called by
pcre2grep to do the matching has two parameters that can
limit the resources that it uses.
Processing some regular expression patterns may take a very
long time to search for all possible matching strings. Others
may require a very large amount of memory. There are two
options that set resource limits for matching.
The --match-limit option provides a means of limiting
resource usage when processing patterns that are not going to
match, but which have a very large number of possibilities in
their search trees. The classic example is a pattern that
uses nested unlimited repeats. Internally, PCRE2 uses a func-
tion called match() which it calls repeatedly (sometimes
recursively). The limit set by --match-limit is imposed on
the number of times this function is called during a match,
which has the effect of limiting the amount of backtracking
that can take place.
The --match-limit option provides a means of limiting comput-
ing resource usage when processing patterns that are not
going to match, but which have a very large number of possi-
bilities in their search trees. The classic example is a pat-
tern that uses nested unlimited repeats. Internally, PCRE2
has a counter that is incremented each time around its main
processing loop. If the value set by --match-limit is
reached, an error occurs.
The --recursion-limit option is similar to --match-limit, but
instead of limiting the total number of times that match() is
called, it limits the depth of recursive calls, which in turn
limits the amount of memory that can be used. The recursion
depth is a smaller number than the total number of calls,
because not all calls to match() are recursive. This limit is
of use only if it is set smaller than --match-limit.
The --depth-limit option limits the depth of nested back-
tracking points, which in turn limits the amount of memory
that is used. This limit is of use only if it is set smaller
than --match-limit.
There are no short forms for these options. The default set-
tings are specified when the PCRE2 library is compiled, with
@ -834,9 +829,9 @@ MATCHING ERRORS
such errors, pcre2grep gives up.
The --match-limit option of pcre2grep can be used to set the overall
resource limit; there is a second option called --recursion-limit that
sets a limit on the amount of memory (usually stack) that is used (see
the discussion of these options above).
resource limit; there is a second option called --depth-limit that sets
a limit on the amount of memory that is used (see the discussion of
these options above).
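The two limits discussed above can be illustrated with a toy backtracking
matcher. This is only a sketch under stated assumptions: it is not PCRE2's
code, and every name in it (toy_limits, toy_match, the TOY_* codes) is
hypothetical. The toy pattern syntax supports literals, "." and "x*" only. A
step counter plays the role of the match limit, and the nesting of "*"
backtracking points plays the role of the depth limit.

```c
/* Toy illustration only; none of these names exist in PCRE2. */
enum {
    TOY_MATCH = 1,
    TOY_NOMATCH = 0,
    TOY_ERR_MATCHLIMIT = -1,
    TOY_ERR_DEPTHLIMIT = -2
};

typedef struct {
    unsigned long match_limit; /* maximum number of processing steps */
    unsigned long depth_limit; /* maximum nesting of backtracking points */
    unsigned long steps;       /* steps used by the last match attempt */
    unsigned long max_depth;   /* deepest nesting reached by the last attempt */
} toy_limits;

/* Match pattern "re" at the start of "s". Supports literals, '.' and 'x*'. */
static int toy_here(const char *re, const char *s, unsigned long depth,
                    toy_limits *lim)
{
    for (;;) {
        /* The counter is incremented each time around the main loop. */
        if (++lim->steps > lim->match_limit) return TOY_ERR_MATCHLIMIT;
        if (re[0] == '\0') return TOY_MATCH;
        if (re[1] == '*') {
            /* Each '*' opens a backtracking point one level deeper. */
            if (depth + 1 > lim->depth_limit) return TOY_ERR_DEPTHLIMIT;
            if (depth + 1 > lim->max_depth) lim->max_depth = depth + 1;
            for (;;) {
                int rc = toy_here(re + 2, s, depth + 1, lim);
                if (rc != TOY_NOMATCH) return rc; /* match or limit error */
                if (*s == '\0' || (re[0] != '.' && re[0] != *s))
                    return TOY_NOMATCH;
                s++; /* consume one more repeat and try again */
            }
        }
        if (*s != '\0' && (re[0] == '.' || re[0] == *s)) {
            re++;
            s++;
            continue;
        }
        return TOY_NOMATCH;
    }
}

/* Reset the counters and run one anchored match attempt. */
int toy_match(const char *re, const char *s, toy_limits *lim)
{
    lim->steps = 0;
    lim->max_depth = 0;
    return toy_here(re, s, 0, lim);
}
```

With a small match limit, a pattern such as "a*a*a*a*b" applied to a long run
of "a" characters gives up with TOY_ERR_MATCHLIMIT instead of exploring every
combination. With a depth limit of 2, "a*a*a*b" stops with TOY_ERR_DEPTHLIMIT
because it needs three nested backtracking points.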
DIAGNOSTICS
@ -862,5 +857,5 @@ AUTHOR
REVISION
Last updated: 31 December 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 21 March 2017
Copyright (c) 1997-2017 University of Cambridge.


@ -91,13 +91,13 @@ INPUT ENCODING
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
values greater than 0xffff cause an error to occur).
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
such values can be handled by the 32-bit library. When testing this
library in non-UTF mode with utf8_input set, if any character is pre-
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
is added to the character's value. This is the only way of passing such
code points in a pattern string. For subject strings, using an escape
sequence is preferable.
UTF-8 (in its original definition) is not capable of encoding values
greater than 0x7fffffff, but such values can be handled by the 32-bit
library. When testing this library in non-UTF mode with utf8_input set,
if any character is preceded by the byte 0xff (which is an illegal byte
in UTF-8) 0x80000000 is added to the character's value. This is the
only way of passing such code points in a pattern string. For subject
strings, using an escape sequence is preferable.
COMMAND LINE OPTIONS
@ -544,6 +544,7 @@ PATTERN MODIFIERS
/B bincode show binary code without lengths
callout_info show callout information
debug same as info,fullbincode
framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@ -624,6 +625,10 @@ PATTERN MODIFIERS
last character. These lines are omitted if no starting or ending code
units are recorded.
The framesize modifier shows the size, in bytes, of the storage frames
used by pcre2_match() for handling backtracking. The size depends on
the number of capturing parentheses in the pattern.
The callout_info modifier requests information about all the callouts
in the pattern. A list of them is output at the end of any other infor-
mation that is requested. For each callout, either its number or string
@ -959,6 +964,7 @@ SUBJECT MODIFIERS
callout_fail=<n>[:<m>] control callout failure
callout_none do not supply a callout function
copy=<number or name> copy captured substring
depth_limit=<n> set a depth limit
dfa use pcre2_dfa_match()
find_limits find match and recursion limits
get=<number or name> extract captured substring
@ -972,7 +978,7 @@ SUBJECT MODIFIERS
offset=<n> set starting offset
offset_limit=<n> set offset limit
ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit
recursion_limit=<n> obsolete synonym for depth_limit
replace=<string> specify a replacement string
startchar show startchar when relevant
startoffset=<n> same as offset=<n>
@ -1188,133 +1194,132 @@ SUBJECT MODIFIERS
Providing a stack that is larger than the default 32K is necessary only
for very complicated patterns.
Setting match and recursion limits
Setting match and depth limits
The match_limit and recursion_limit modifiers set the appropriate lim-
its in the match context. These values are ignored when the find_limits
modifier is specified.
The match_limit and depth_limit modifiers set the appropriate limits in
the match context. These values are ignored when the find_limits modi-
fier is specified.
Finding minimum limits
If the find_limits modifier is present, pcre2test calls pcre2_match()
several times, setting different values in the match context via
pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
the minimum values for each parameter that allow pcre2_match() to com-
plete without error.
pcre2_set_match_limit() and pcre2_set_depth_limit() until it finds the
minimum values for each parameter that allow pcre2_match() to complete
without error.
If JIT is being used, only the match limit is relevant. If DFA matching
is being used, neither limit is relevant, and this modifier is ignored
(with a warning message).
is being used, only the depth limit is relevant, but at present this
modifier is ignored (with a warning message).
The match_limit number is a measure of the amount of backtracking that
takes place, and learning the minimum value can be instructive. For
most simple matches, the number is quite small, but for patterns with
very large numbers of matching possibilities, it can become large very
quickly with increasing length of subject string. The
match_limit_recursion number is a measure of how much stack (or, if
PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to
complete the match attempt.
quickly with increasing length of subject string. The depth_limit num-
ber is a measure of how much memory for recording backtracking points
is needed to complete the match attempt.
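One plausible way to search for such a minimum is sketched below. It is not
taken from the pcre2test source: find_min_limit, try_fn and demo_completes are
hypothetical names, and the sketch assumes that a match which completes with
some limit also completes with every larger limit. It doubles the limit until
the attempt completes, then bisects between the last failing and the first
succeeding value.

```c
/* Hypothetical callback: returns nonzero if a match attempt run with the
   given limit completes without a limit error. In a real program this would
   set the limit in a match context and call the matching function. */
typedef int (*try_fn)(unsigned long limit, void *ctx);

/* Returns the smallest limit in 1..max for which completes() succeeds,
   or 0 if even max is not enough. */
unsigned long find_min_limit(try_fn completes, void *ctx, unsigned long max)
{
    unsigned long lo = 0; /* largest value known (or assumed) to fail */
    unsigned long hi = 1; /* candidate value */

    /* Phase 1: double the limit until the attempt completes. */
    while (!completes(hi, ctx)) {
        if (hi >= max) return 0;
        lo = hi;
        hi = (hi > max / 2) ? max : hi * 2;
    }

    /* Phase 2: bisect; the attempt fails at lo and completes at hi. */
    while (hi - lo > 1) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (completes(mid, ctx)) hi = mid; else lo = mid;
    }
    return hi;
}

/* Stand-in for a real match attempt: pretends that the match needs at
   least *ctx units of the limited resource. */
int demo_completes(unsigned long limit, void *ctx)
{
    return limit >= *(unsigned long *)ctx;
}
```

The number of trial matches grows only logarithmically with the final value,
which matters because each trial repeats the whole match attempt.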
Showing MARK names
The mark modifier causes the names from backtracking control verbs that
are returned from calls to pcre2_match() to be displayed. If a mark is
returned for a match, non-match, or partial match, pcre2test shows it.
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
are returned from calls to pcre2_match() to be displayed. If a mark is
returned for a match, non-match, or partial match, pcre2test shows it.
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
it is added to the non-match message.
Showing memory usage
The memory modifier causes pcre2test to log all memory allocation and
The memory modifier causes pcre2test to log all memory allocation and
freeing calls that occur during a match operation.
Setting a starting offset
The offset modifier sets an offset in the subject string at which
The offset modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters.
Setting an offset limit
The offset_limit modifier sets a limit for unanchored matches. If a
The offset_limit modifier sets a limit for unanchored matches. If a
match cannot be found starting at or before this offset in the subject,
a "no match" return is given. The data value is a number of code units,
not characters. When this modifier is used, the use_offset_limit modi-
not characters. When this modifier is used, the use_offset_limit modi-
fier must have been set for the pattern; if not, an error is generated.
Setting the size of the output vector
The ovector modifier applies only to the subject line in which it
appears, though of course it can also be used to set a default in a
#subject command. It specifies the number of pairs of offsets that are
The ovector modifier applies only to the subject line in which it
appears, though of course it can also be used to set a default in a
#subject command. It specifies the number of pairs of offsets that are
available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes
A value of zero is useful when testing the POSIX API because it causes
regexec() to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
ate_from_pattern() to be called, in order to create a match block of
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
ate_from_pattern() to be called, in order to create a match block of
exactly the right size for the pattern. (It is not possible to create a
match block with a zero-length ovector; there is always at least one
match block with a zero-length ovector; there is always at least one
pair of offsets.)
Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching func-
tion with its correct length. In order to test the facility for passing
a zero-terminated string, the zero_terminate modifier is provided. It
a zero-terminated string, the zero_terminate modifier is provided. It
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
via the POSIX interface, this modifier has no effect, as there is no
via the POSIX interface, this modifier has no effect, as there is no
facility for passing a length.)
When testing pcre2_substitute(), this modifier also has the effect of
When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated.
Passing a NULL context
Normally, pcre2test passes a context block to pcre2_match(),
Normally, pcre2test passes a context block to pcre2_match(),
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
set, however, NULL is passed. This is for testing that the matching
set, however, NULL is passed. This is for testing that the matching
functions behave correctly in this case (they use default values). This
modifier cannot be used with the find_limits modifier or when testing
modifier cannot be used with the find_limits modifier or when testing
the substitution function.
THE ALTERNATIVE MATCHING FUNCTION
By default, pcre2test uses the standard PCRE2 matching function,
By default, pcre2test uses the standard PCRE2 matching function,
pcre2_match() to match each subject line. PCRE2 also supports an alter-
native matching function, pcre2_dfa_match(), which operates in a dif-
ferent way, and has some restrictions. The differences between the two
native matching function, pcre2_dfa_match(), which operates in a dif-
ferent way, and has some restrictions. The differences between the two
functions are described in the pcre2matching documentation.
If the dfa modifier is set, the alternative matching function is used.
This function finds all possible matches at a given point in the sub-
ject. If, however, the dfa_shortest modifier is set, processing stops
after the first match is found. This is always the shortest possible
If the dfa modifier is set, the alternative matching function is used.
This function finds all possible matches at a given point in the sub-
ject. If, however, the dfa_shortest modifier is set, processing stops
after the first match is found. This is always the shortest possible
match.
DEFAULT OUTPUT FROM pcre2test
This section describes the output when the normal matching function,
This section describes the output when the normal matching function,
pcre2_match(), is being used.
When a match succeeds, pcre2test outputs the list of captured sub-
strings, starting with number 0 for the string that matched the whole
pattern. Otherwise, it outputs "No match" when the return is
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
this is the entire substring that was inspected during the partial
match; it may include characters before the actual match start if a
When a match succeeds, pcre2test outputs the list of captured sub-
strings, starting with number 0 for the string that matched the whole
pattern. Otherwise, it outputs "No match" when the return is
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
this is the entire substring that was inspected during the partial
match; it may include characters before the actual match start if a
lookbehind assertion, \K, \b, or \B was involved.)
For any other return, pcre2test outputs the PCRE2 negative error number
and a short descriptive phrase. If the error is a failed UTF string
check, the code unit offset of the start of the failing character is
and a short descriptive phrase. If the error is a failed UTF string
check, the code unit offset of the start of the failing character is
also output. Here is an example of an interactive pcre2test run.
$ pcre2test
PCRE2 version 9.00 2014-05-10
PCRE2 version 10.22 2016-07-29
re> /^abc(\d+)/
data> abc123
@ -1326,8 +1331,8 @@ DEFAULT OUTPUT FROM pcre2test
Unset capturing substrings that are not followed by one that is set are
not shown by pcre2test unless the allcaptures modifier is specified. In
the following example, there are two capturing substrings, but when the
first data line is matched, the second, unset substring is not shown.
An "internal" unset substring is shown as "<unset>", as for the second
first data line is matched, the second, unset substring is not shown.
An "internal" unset substring is shown as "<unset>", as for the second
data line.
re> /(a)|(b)/
@ -1339,11 +1344,11 @@ DEFAULT OUTPUT FROM pcre2test
1: <unset>
2: b
If the strings contain any non-printing characters, they are output as
\xhh escapes if the value is less than 256 and UTF mode is not set.
If the strings contain any non-printing characters, they are output as
\xhh escapes if the value is less than 256 and UTF mode is not set.
Otherwise they are output as \x{hh...} escapes. See below for the defi-
nition of non-printing characters. If the aftertext modifier is set,
the output for substring 0 is followed by the rest of the subject
nition of non-printing characters. If the aftertext modifier is set,
the output for substring 0 is followed by the rest of the subject
string, identified by "0+" like this:
re> /cat/aftertext
@ -1351,7 +1356,7 @@ DEFAULT OUTPUT FROM pcre2test
0: cat
0+ aract
If global matching is requested, the results of successive matching
If global matching is requested, the results of successive matching
attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@ -1363,8 +1368,8 @@ DEFAULT OUTPUT FROM pcre2test
0: ipp
1: pp
"No match" is output only if the first match attempt fails. Here is an
example of a failure message (the offset 4 that is specified by the
"No match" is output only if the first match attempt fails. Here is an
example of a failure message (the offset 4 that is specified by the
offset modifier is past the end of the subject string):
re> /xyz/
@ -1372,7 +1377,7 @@ DEFAULT OUTPUT FROM pcre2test
Error -24 (bad offset value)
Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), subject lines may not. However,
">" prompt is used for continuations), subject lines may not. However,
newlines can be included in a subject by means of the \n escape (or \r,
\r\n, etc., depending on the newline sequence setting).
@ -1380,7 +1385,7 @@ DEFAULT OUTPUT FROM pcre2test
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
When the alternative matching function, pcre2_dfa_match(), is used, the
output consists of a list of all the matches that start at the first
output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/
@ -1389,11 +1394,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero).
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
followed by the partially matching substring. Note that this is the
entire substring that was inspected during the partial match; it may
Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero).
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
followed by the partially matching substring. Note that this is the
entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser-
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
@ -1409,16 +1414,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
The alternative matching function does not support substring capture,
so the modifiers that are concerned with captured substrings are not
The alternative matching function does not support substring capture,
so the modifiers that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE2_ERROR_PAR-
When the alternative matching function has given the PCRE2_ERROR_PAR-
TIAL return, indicating that the subject partially matched the pattern,
you can restart the match with additional subject data by means of the
you can restart the match with additional subject data by means of the
dfa_restart modifier. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@ -1427,45 +1432,45 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\=dfa,dfa_restart
0: n05
For further information about partial matching, see the pcre2partial
For further information about partial matching, see the pcre2partial
documentation.
CALLOUTS
If the pattern contains any callout requests, pcre2test's callout func-
tion is called during matching unless callout_none is specified. This
tion is called during matching unless callout_none is specified. This
works with both matching functions.
The callout function in pcre2test returns zero (carry on matching) by
default, but you can use a callout_fail modifier in a subject line (as
The callout function in pcre2test returns zero (carry on matching) by
default, but you can use a callout_fail modifier in a subject line (as
described above) to change this and other parameters of the callout.
Inserting callouts can be helpful when using pcre2test to check compli-
cated regular expressions. For further information about callouts, see
cated regular expressions. For further information about callouts, see
the pcre2callout documentation.
The output for callouts with numerical arguments and those with string
The output for callouts with numerical arguments and those with string
arguments is slightly different.
Callouts with numerical arguments
By default, the callout function displays the callout number, the start
and current positions in the subject text at the callout time, and the
and current positions in the subject text at the callout time, and the
next pattern item to be tested. For example:
--->pqrabcdef
0 ^ ^ \d
This output indicates that callout number 0 occurred for a match
attempt starting at the fourth character of the subject string, when
the pointer was at the seventh character, and when the next pattern
item was \d. Just one circumflex is output if the start and current
positions are the same, or if the current position precedes the start
This output indicates that callout number 0 occurred for a match
attempt starting at the fourth character of the subject string, when
the pointer was at the seventh character, and when the next pattern
item was \d. Just one circumflex is output if the start and current
positions are the same, or if the current position precedes the start
position, which can happen if the callout is in a lookbehind assertion.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the /auto_callout pattern modifier. In this case, instead
a result of the /auto_callout pattern modifier. In this case, instead
of showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example:
@ -1479,7 +1484,7 @@ CALLOUTS
0: E*
If a pattern contains (*MARK) items, an additional line is output when-
ever a change of latest mark is passed to the callout function. For
ever a change of latest mark is passed to the callout function. For
example:
re> /a(*MARK:X)bc/auto_callout
@ -1493,17 +1498,17 @@ CALLOUTS
+12 ^ ^
0: abc
The mark changes between matching "a" and "b", but stays the same for
the rest of the match, so nothing more is output. If, as a result of
backtracking, the mark reverts to being unset, the text "<unset>" is
The mark changes between matching "a" and "b", but stays the same for
the rest of the match, so nothing more is output. If, as a result of
backtracking, the mark reverts to being unset, the text "<unset>" is
output.
Callouts with string arguments
The output for a callout with a string argument is similar, except that
instead of outputting a callout number before the position indicators,
the callout string and its offset in the pattern string are output
before the reflection of the subject string, and the subject string is
instead of outputting a callout number before the position indicators,
the callout string and its offset in the pattern string are output
before the reflection of the subject string, and the subject string is
reflected for each callout. For example:
re> /^ab(?C'first')cd(?C"second")ef/
@ -1520,43 +1525,43 @@ CALLOUTS
NON-PRINTING CHARACTERS
When pcre2test is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters
bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.
When pcre2test is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been
set for the pattern (using the locale modifier). In this case, the
isprint() function is used to distinguish printing and non-printing
When pcre2test is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been
set for the pattern (using the locale modifier). In this case, the
isprint() function is used to distinguish printing and non-printing
characters.
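The default rule (no locale set) combines with the \xhh and \x{hh...} escape
forms described under DEFAULT OUTPUT FROM pcre2test. The following is a
minimal sketch, not the pcre2test implementation: toy_escape_char is a
hypothetical name, and the locale-dependent isprint() case is deliberately
left out.

```c
#include <stdio.h>
#include <stddef.h>

/* Write the display form of one character value into buf.
   Values 32-126 are printed as themselves. Other values below 256 become
   \xhh when UTF mode is off; everything else becomes \x{hh...}.
   Returns what snprintf() returns. */
int toy_escape_char(unsigned long c, int utf, char *buf, size_t size)
{
    if (c >= 32 && c <= 126)
        return snprintf(buf, size, "%c", (int)c);
    if (c < 256 && !utf)
        return snprintf(buf, size, "\\x%02lx", c);
    return snprintf(buf, size, "\\x{%02lx}", c);
}
```

For example, a newline (value 10) in non-UTF mode is shown as \x0a, while the
value 0x100 is shown as \x{100} in either mode.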
SAVING AND RESTORING COMPILED PATTERNS
It is possible to save compiled patterns on disc or elsewhere, and
It is possible to save compiled patterns on disc or elsewhere, and
reload them later, subject to a number of restrictions. JIT data cannot
be saved. The host on which the patterns are reloaded must be running
be saved. The host on which the patterns are reloaded must be running
the same version of PCRE2, with the same code unit width, and must also
have the same endianness, pointer width and PCRE2_SIZE type. Before
compiled patterns can be saved they must be serialized, that is, con-
verted to a stream of bytes. A single byte stream may contain any num-
ber of compiled patterns, but they must all use the same character
have the same endianness, pointer width and PCRE2_SIZE type. Before
compiled patterns can be saved they must be serialized, that is, con-
verted to a stream of bytes. A single byte stream may contain any num-
ber of compiled patterns, but they must all use the same character
tables. A single copy of the tables is included in the byte stream (its
size is 1088 bytes).
The functions whose names begin with pcre2_serialize_ are used for
serializing and de-serializing. They are described in the pcre2serial-
The functions whose names begin with pcre2_serialize_ are used for
serializing and de-serializing. They are described in the pcre2serial-
ize documentation. In this section we describe the features of
pcre2test that can be used to test these functions.
When a pattern with the push modifier is successfully compiled, it is
pushed onto a stack of compiled patterns, and pcre2test expects the
next line to contain a new pattern (or command) instead of a subject
line. By contrast, the pushcopy modifier causes a copy of the compiled
pattern to be stacked, leaving the original available for immediate
matching. By using push and/or pushcopy, a number of patterns can be
When a pattern with the push modifier is successfully compiled, it is
pushed onto a stack of compiled patterns, and pcre2test expects the
next line to contain a new pattern (or command) instead of a subject
line. By contrast, the pushcopy modifier causes a copy of the compiled
pattern to be stacked, leaving the original available for immediate
matching. By using push and/or pushcopy, a number of patterns can be
compiled and retained. These modifiers are incompatible with posix, and
control modifiers that act at match time are ignored (with a message)
for the stacked patterns. The jitverify modifier applies only at com-
control modifiers that act at match time are ignored (with a message)
for the stacked patterns. The jitverify modifier applies only at com-
pile time.
The command
@ -1564,21 +1569,21 @@ SAVING AND RESTORING COMPILED PATTERNS
#save <filename>
causes all the stacked patterns to be serialized and the result written
to the named file. Afterwards, all the stacked patterns are freed. The
to the named file. Afterwards, all the stacked patterns are freed. The
command
#load <filename>
reads the data in the file, and then arranges for it to be de-serial-
ized, with the resulting compiled patterns added to the pattern stack.
The pattern on the top of the stack can be retrieved by the #pop com-
mand, which must be followed by lines of subjects that are to be
matched with the pattern, terminated as usual by an empty line or end
of file. This command may be followed by a modifier list containing
only control modifiers that act after a pattern has been compiled. In
reads the data in the file, and then arranges for it to be de-serial-
ized, with the resulting compiled patterns added to the pattern stack.
The pattern on the top of the stack can be retrieved by the #pop com-
mand, which must be followed by lines of subjects that are to be
matched with the pattern, terminated as usual by an empty line or end
of file. This command may be followed by a modifier list containing
only control modifiers that act after a pattern has been compiled. In
particular, hex, posix, posix_nosub, push, and pushcopy are not
allowed, nor are any option-setting modifiers. The JIT modifiers are,
however, permitted. Here is an example that saves and reloads two pat-
allowed, nor are any option-setting modifiers. The JIT modifiers are,
however, permitted. Here is an example that saves and reloads two pat-
terns.
/abc/push
@ -1591,10 +1596,10 @@ SAVING AND RESTORING COMPILED PATTERNS
#pop jit,bincode
abc
If jitverify is used with #pop, it does not automatically imply jit,
If jitverify is used with #pop, it does not automatically imply jit,
which is different behaviour from when it is used on a pattern.
The #popcopy command is analogous to the pushcopy modifier in that it
The #popcopy command is analogous to the pushcopy modifier in that it
makes current a copy of the topmost stack pattern, leaving the original
still on the stack.
@ -1614,5 +1619,5 @@ AUTHOR
REVISION
Last updated: 28 December 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 21 March 2017
Copyright (c) 1997-2017 University of Cambridge.