Documentation update.
parent 32bab50c01
commit 3aeb812180
|
@@ -1,10 +1,6 @@
|
||||||
Building PCRE2 without using autotools
|
Building PCRE2 without using autotools
|
||||||
--------------------------------------
|
--------------------------------------
|
||||||
|
|
||||||
This document has been converted from the PCRE1 document. I have removed a
|
|
||||||
number of sections about building in various environments, as they applied only
|
|
||||||
to PCRE1 and are probably out of date.
|
|
||||||
|
|
||||||
This document contains the following sections:
|
This document contains the following sections:
|
||||||
|
|
||||||
General
|
General
|
||||||
|
@@ -183,21 +179,9 @@ can skip ahead to the CMake section.

 STACK SIZE IN WINDOWS ENVIRONMENTS

-The default processor stack size of 1Mb in some Windows environments is too
-small for matching patterns that need much recursion. In particular, test 2 may
-fail because of this. Normally, running out of stack causes a crash, but there
-have been cases where the test program has just died silently. See your linker
-documentation for how to increase stack size if you experience problems. If you
-are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
-compiler, you can increase the stack size for pcre2test and pcre2grep by
-setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
-example). The Linux default of 8Mb is a reasonable choice for the stack, though
-even that can be too small for some pattern/subject combinations.
-
-PCRE2 has a compile configuration option to disable the use of stack for
-recursion so that heap is used instead. However, pattern matching is
-significantly slower when this is done. There is more about stack usage in the
-"pcre2stack" documentation.
+Prior to release 10.30 the default system stack size of 1Mb in some Windows
+environments caused issues with some tests. This should no longer be the case
+for 10.30 and later releases.


 LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
|
@@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
|
||||||
recommended download site.
|
recommended download site.
|
||||||
|
|
||||||
=============================
|
=============================
|
||||||
Last Updated: 13 October 2016
|
Last Updated: 17 March 2017
|
||||||
|
|
|
@@ -15,8 +15,8 @@ subscribe or manage your subscription here:
|
||||||
|
|
||||||
https://lists.exim.org/mailman/listinfo/pcre-dev
|
https://lists.exim.org/mailman/listinfo/pcre-dev
|
||||||
|
|
||||||
Please read the NEWS file if you are upgrading from a previous release.
|
Please read the NEWS file if you are upgrading from a previous release. The
|
||||||
The contents of this README file are:
|
contents of this README file are:
|
||||||
|
|
||||||
The PCRE2 APIs
|
The PCRE2 APIs
|
||||||
Documentation for PCRE2
|
Documentation for PCRE2
|
||||||
|
@@ -44,8 +44,8 @@ wrappers.
|
||||||
|
|
||||||
The distribution does contain a set of C wrapper functions for the 8-bit
|
The distribution does contain a set of C wrapper functions for the 8-bit
|
||||||
library that are based on the POSIX regular expression API (see the pcre2posix
|
library that are based on the POSIX regular expression API (see the pcre2posix
|
||||||
man page). These can be found in a library called libpcre2-posix. Note that this
|
man page). These can be found in a library called libpcre2-posix. Note that
|
||||||
just provides a POSIX calling interface to PCRE2; the regular expressions
|
this just provides a POSIX calling interface to PCRE2; the regular expressions
|
||||||
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
|
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
|
||||||
and does not give full access to all of PCRE2's facilities.
|
and does not give full access to all of PCRE2's facilities.
|
||||||
|
|
||||||
|
@@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
|
||||||
Building PCRE2 on non-Unix-like systems
|
Building PCRE2 on non-Unix-like systems
|
||||||
---------------------------------------
|
---------------------------------------
|
||||||
|
|
||||||
For a non-Unix-like system, please read the comments in the file
|
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
|
||||||
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
|
your system supports the use of "configure" and "make" you may be able to build
|
||||||
"make" you may be able to build PCRE2 using autotools in the same way as for
|
PCRE2 using autotools in the same way as for many Unix-like systems.
|
||||||
many Unix-like systems.
|
|
||||||
|
|
||||||
PCRE2 can also be configured using CMake, which can be run in various ways
|
PCRE2 can also be configured using CMake, which can be run in various ways
|
||||||
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
|
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
|
||||||
|
@@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
|
||||||
architectures. If you try to enable it on an unsupported architecture, there
|
architectures. If you try to enable it on an unsupported architecture, there
|
||||||
will be a compile time error.
|
will be a compile time error.
|
||||||
|
|
||||||
. If you do not want to make use of the support for UTF-8 Unicode character
|
. If you do not want to make use of the default support for UTF-8 Unicode
|
||||||
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
|
character strings in the 8-bit library, UTF-16 Unicode character strings in
|
||||||
library, or UTF-32 Unicode character strings in the 32-bit library, you can
|
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
|
||||||
add --disable-unicode to the "configure" command. This reduces the size of
|
library, you can add --disable-unicode to the "configure" command. This
|
||||||
the libraries. It is not possible to configure one library with Unicode
|
reduces the size of the libraries. It is not possible to configure one
|
||||||
support, and another without, in the same configuration.
|
library with Unicode support, and another without, in the same configuration.
|
||||||
|
It is also not possible to use --enable-ebcdic (see below) with Unicode
|
||||||
|
support, so if this option is set, you must also use --disable-unicode.
|
||||||
|
|
||||||
When Unicode support is available, the use of a UTF encoding still has to be
|
When Unicode support is available, the use of a UTF encoding still has to be
|
||||||
enabled by setting the PCRE2_UTF option at run time or starting a pattern
|
enabled by setting the PCRE2_UTF option at run time or starting a pattern
|
||||||
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
|
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
|
||||||
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
|
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
|
||||||
not possible to use both --enable-unicode and --enable-ebcdic at the same
|
|
||||||
time.
|
|
||||||
|
|
||||||
As well as supporting UTF strings, Unicode support includes support for the
|
As well as supporting UTF strings, Unicode support includes support for the
|
||||||
\P, \p, and \X sequences that recognize Unicode character properties.
|
\P, \p, and \X sequences that recognize Unicode character properties.
|
||||||
|
@@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
|
||||||
--with-match-limit=500000
|
--with-match-limit=500000
|
||||||
|
|
||||||
on the "configure" command. This is just the default; individual calls to
|
on the "configure" command. This is just the default; individual calls to
|
||||||
pcre2_match() can supply their own value. There is more discussion on the
|
pcre2_match() can supply their own value. There is more discussion in the
|
||||||
pcre2api man page.
|
pcre2api man page (search for pcre2_set_match_limit).
|
||||||
|
|
||||||
. There is a separate counter that limits the depth of recursive function calls
|
. There is a separate counter that limits the depth of nested backtracking
|
||||||
during a matching process. This also has a default of ten million, which is
|
during a matching process, which in turn limits the amount of memory that is
|
||||||
essentially "unlimited". You can change the default by setting, for example,
|
used. This also has a default of ten million, which is essentially
|
||||||
|
"unlimited". You can change the default by setting, for example,
|
||||||
|
|
||||||
--with-match-limit-recursion=500000
|
--with-match-limit-depth=5000
|
||||||
|
|
||||||
Recursive function calls use up the runtime stack; running out of stack can
|
There is more discussion in the pcre2api man page (search for
|
||||||
cause programs to crash in strange ways. There is a discussion about stack
|
pcre2_set_depth_limit).
|
||||||
sizes in the pcre2stack man page.
|
|
||||||
|
|
||||||
. In the 8-bit library, the default maximum compiled pattern size is around
|
. In the 8-bit library, the default maximum compiled pattern size is around
|
||||||
64K bytes. You can increase this by adding --with-link-size=3 to the
|
64K bytes. You can increase this by adding --with-link-size=3 to the
|
||||||
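As a rough sketch, the two default limits discussed in this hunk could be set together on the "configure" command as shown below. The option names and values are the ones quoted in the new README text; running the command from the top of an unpacked PCRE2 source tree (release 10.30 or later) is an assumption.

```shell
# Assumption: the current directory is an unpacked PCRE2 source tree,
# release 10.30 or later (where --with-match-limit-depth is the option
# name used by the README text in this commit).
./configure --with-match-limit=500000 --with-match-limit-depth=5000
```

These settings only change the built-in defaults; per the README text, individual calls to pcre2_match() can still supply their own values (see pcre2_set_match_limit and pcre2_set_depth_limit in the pcre2api man page).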
|
@@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
|
||||||
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
|
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
|
||||||
link size setting is ignored, as 4-byte offsets are always used.
|
link size setting is ignored, as 4-byte offsets are always used.
|
||||||
|
|
||||||
. You can build PCRE2 so that its internal match() function that is called from
|
|
||||||
pcre2_match() does not call itself recursively. Instead, it uses memory
|
|
||||||
blocks obtained from the heap to save data that would otherwise be saved on
|
|
||||||
the stack. To build PCRE2 like this, use
|
|
||||||
|
|
||||||
--disable-stack-for-recursion
|
|
||||||
|
|
||||||
on the "configure" command. PCRE2 runs more slowly in this mode, but it may
|
|
||||||
be necessary in environments with limited stack sizes. This applies only to
|
|
||||||
the normal execution of the pcre2_match() function; if JIT support is being
|
|
||||||
successfully used, it is not relevant. Equally, it does not apply to
|
|
||||||
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
|
|
||||||
discussion about stack sizes in the pcre2stack man page.
|
|
||||||
|
|
||||||
. For speed, PCRE2 uses four tables for manipulating and identifying characters
|
. For speed, PCRE2 uses four tables for manipulating and identifying characters
|
||||||
whose code point values are less than 256. By default, it uses a set of
|
whose code point values are less than 256. By default, it uses a set of
|
||||||
tables for ASCII encoding that is part of the distribution. If you specify
|
tables for ASCII encoding that is part of the distribution. If you specify
|
||||||
|
@@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
|
||||||
string. Otherwise, it is assumed to be a file name, and the contents of the
|
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||||
file are the test string.
|
file are the test string.
|
||||||
|
|
||||||
|
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
|
||||||
|
which caused pcre2_match() to use individual blocks on the heap for
|
||||||
|
backtracking instead of recursive function calls (which use the stack). This
|
||||||
|
is now obsolete since pcre2_match() was refactored always to use the heap (in
|
||||||
|
a much more efficient way than before). This option is retained for backwards
|
||||||
|
compatibility, but has no effect other than to output a warning.
|
||||||
|
|
||||||
The "configure" script builds the following files for the basic C library:
|
The "configure" script builds the following files for the basic C library:
|
||||||
|
|
||||||
. Makefile the makefile that builds the library
|
. Makefile the makefile that builds the library
|
||||||
|
@@ -662,25 +654,32 @@ Unicode support is enabled.
|
||||||
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
|
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
|
||||||
16-bit and 32-bit modes. These are tests that generate different output in
|
16-bit and 32-bit modes. These are tests that generate different output in
|
||||||
8-bit mode. Each pair are for general cases and Unicode support, respectively.
|
8-bit mode. Each pair are for general cases and Unicode support, respectively.
|
||||||
|
|
||||||
Test 13 checks the handling of non-UTF characters greater than 255 by
|
Test 13 checks the handling of non-UTF characters greater than 255 by
|
||||||
pcre2_dfa_match() in 16-bit and 32-bit modes.
|
pcre2_dfa_match() in 16-bit and 32-bit modes.
|
||||||
|
|
||||||
Test 14 contains a number of tests that must not be run with JIT. They check,
|
Test 14 contains some special UTF and UCP tests that give different output for
|
||||||
|
the different widths.
|
||||||
|
|
||||||
|
Test 15 contains a number of tests that must not be run with JIT. They check,
|
||||||
among other non-JIT things, the match-limiting features of the intepretive
|
among other non-JIT things, the match-limiting features of the intepretive
|
||||||
matcher.
|
matcher.
|
||||||
|
|
||||||
Test 15 is run only when JIT support is not available. It checks that an
|
Test 16 is run only when JIT support is not available. It checks that an
|
||||||
attempt to use JIT has the expected behaviour.
|
attempt to use JIT has the expected behaviour.
|
||||||
|
|
||||||
Test 16 is run only when JIT support is available. It checks JIT complete and
|
Test 17 is run only when JIT support is available. It checks JIT complete and
|
||||||
partial modes, match-limiting under JIT, and other JIT-specific features.
|
partial modes, match-limiting under JIT, and other JIT-specific features.
|
||||||
|
|
||||||
Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
|
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
|
||||||
the 8-bit library, without and with Unicode support, respectively.
|
the 8-bit library, without and with Unicode support, respectively.
|
||||||
|
|
||||||
Test 19 checks the serialization functions by writing a set of compiled
|
Test 20 checks the serialization functions by writing a set of compiled
|
||||||
patterns to a file, and then reloading and checking them.
|
patterns to a file, and then reloading and checking them.
|
||||||
|
|
||||||
|
Tests 21 and 22 test \C support when the use of \C is not locked out, without
|
||||||
|
and with UTF support, respectively. Test 23 tests \C when it is locked out.
|
||||||
|
|
||||||
|
|
||||||
Character tables
|
Character tables
|
||||||
----------------
|
----------------
|
||||||
|
@@ -866,4 +865,4 @@ The distribution should contain the files listed below.
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
Email local part: ph10
|
Email local part: ph10
|
||||||
Email domain: cam.ac.uk
|
Email domain: cam.ac.uk
|
||||||
Last updated: 01 November 2016
|
Last updated: 17 March 2017
|
||||||
|
|
|
@@ -109,7 +109,7 @@ lose performance.
|
||||||
One way of guarding against this possibility is to use the
|
One way of guarding against this possibility is to use the
|
||||||
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
|
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
|
||||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
||||||
<b>pcre2_compile()</b>. This causes an compile time error if a pattern contains
|
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
|
||||||
a UTF-setting sequence.
|
a UTF-setting sequence.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@@ -137,7 +137,8 @@ large search tree against a string that will never match. Nested unlimited
|
||||||
repeats in a pattern are a common example. PCRE2 provides some protection
|
repeats in a pattern are a common example. PCRE2 provides some protection
|
||||||
against this: see the <b>pcre2_set_match_limit()</b> function in the
|
against this: see the <b>pcre2_set_match_limit()</b> function in the
|
||||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
page.
|
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
|
||||||
|
be used to restrict the amount of memory that is used.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
|
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@@ -166,7 +167,7 @@ listing), and the short pages for individual functions, are concatenated in
|
||||||
pcre2perform discussion of performance issues
|
pcre2perform discussion of performance issues
|
||||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||||
pcre2sample discussion of the pcre2demo program
|
pcre2sample discussion of the pcre2demo program
|
||||||
pcre2stack discussion of stack usage
|
pcre2stack discussion of stack and memory usage
|
||||||
pcre2syntax quick syntax reference
|
pcre2syntax quick syntax reference
|
||||||
pcre2test description of the <b>pcre2test</b> command
|
pcre2test description of the <b>pcre2test</b> command
|
||||||
pcre2unicode discussion of Unicode and UTF support
|
pcre2unicode discussion of Unicode and UTF support
|
||||||
|
@@ -189,9 +190,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 16 October 2015
|
Last updated: 27 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@@ -36,20 +36,21 @@ for success and non-zero otherwise. The arguments are:
|
||||||
<i>callout_data</i> User data that is passed to the callback
|
<i>callout_data</i> User data that is passed to the callback
|
||||||
</pre>
|
</pre>
|
||||||
The <i>callback()</i> function is passed a pointer to a data block containing
|
The <i>callback()</i> function is passed a pointer to a data block containing
|
||||||
the following fields:
|
the following fields (not necessarily in this order):
|
||||||
<pre>
|
<pre>
|
||||||
<i>version</i> Block version number
|
uint32_t <i>version</i> Block version number
|
||||||
<i>pattern_position</i> Offset to next item in pattern
|
uint32_t <i>callout_number</i> Number for numbered callouts
|
||||||
<i>next_item_length</i> Length of next item in pattern
|
PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
|
||||||
<i>callout_number</i> Number for numbered callouts
|
PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
|
||||||
<i>callout_string_offset</i> Offset to string within pattern
|
PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
|
||||||
<i>callout_string_length</i> Length of callout string
|
PCRE2_SIZE <i>callout_string_length</i> Length of callout string
|
||||||
<i>callout_string</i> Points to callout string or is NULL
|
PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
|
||||||
</pre>
|
</pre>
|
||||||
The second argument is the callout data that was passed to
|
The second argument passed to the <b>callback()</b> function is the callout data
|
||||||
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
|
that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
|
||||||
for success. Any other value causes the pattern scan to stop, with the value
|
function must return zero for success. Any other value causes the pattern scan
|
||||||
being passed back as the result of <b>pcre2_callout_enumerate()</b>.
|
to stop, with the value being passed back as the result of
|
||||||
|
<b>pcre2_callout_enumerate()</b>.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
|
|
|
@@ -26,7 +26,9 @@ DESCRIPTION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
This function frees the memory used for a compiled pattern, including any
|
This function frees the memory used for a compiled pattern, including any
|
||||||
memory used by the JIT compiler.
|
memory used by the JIT compiler. If the compiled pattern was created by a call
|
||||||
|
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
|
||||||
|
also freed.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
|
|
|
@@ -37,19 +37,24 @@ arguments are:
|
||||||
<i>erroffset</i> Where to put an error offset
|
<i>erroffset</i> Where to put an error offset
|
||||||
<i>ccontext</i> Pointer to a compile context or NULL
|
<i>ccontext</i> Pointer to a compile context or NULL
|
||||||
</pre>
|
</pre>
|
||||||
The length of the string and any error offset that is returned are in code
|
The length of the pattern and any error offset that is returned are in code
|
||||||
units, not characters. A compile context is needed only if you want to change
|
units, not characters. A compile context is needed only if you want to provide
|
||||||
|
custom memory allocation functions, or to provide an external function for
|
||||||
|
system stack size checking, or to change one or more of these parameters:
|
||||||
<pre>
|
<pre>
|
||||||
What \R matches (Unicode newlines or CR, LF, CRLF only)
|
What \R matches (Unicode newlines, or CR, LF, CRLF only);
|
||||||
PCRE2's character tables
|
PCRE2's character tables;
|
||||||
The newline character sequence
|
The newline character sequence;
|
||||||
The compile time nested parentheses limit
|
The compile time nested parentheses limit;
|
||||||
|
The maximum pattern length (in code units) that is allowed.
|
||||||
</pre>
|
</pre>
|
||||||
or provide an external function for stack size checking. The option bits are:
|
The option bits are:
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ANCHORED Force pattern anchoring
|
PCRE2_ANCHORED Force pattern anchoring
|
||||||
|
PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
|
||||||
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
|
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
|
||||||
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
||||||
|
PCRE2_ALT_VERBNAMES Process backslashes in verb names
|
||||||
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
||||||
PCRE2_CASELESS Do caseless matching
|
PCRE2_CASELESS Do caseless matching
|
||||||
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
||||||
|
@@ -71,19 +76,21 @@ or provide an external function for stack size checking. The option bits are:
|
||||||
(only relevant if PCRE2_UTF is set)
|
(only relevant if PCRE2_UTF is set)
|
||||||
PCRE2_UCP Use Unicode properties for \d, \w, etc.
|
PCRE2_UCP Use Unicode properties for \d, \w, etc.
|
||||||
PCRE2_UNGREEDY Invert greediness of quantifiers
|
PCRE2_UNGREEDY Invert greediness of quantifiers
|
||||||
|
PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
|
||||||
PCRE2_UTF Treat pattern and subjects as UTF strings
|
PCRE2_UTF Treat pattern and subjects as UTF strings
|
||||||
</pre>
|
</pre>
|
||||||
PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP
|
PCRE2 must be built with Unicode support (the default) in order to use
|
||||||
and related options.
|
PCRE2_UTF, PCRE2_UCP and related options.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The yield of the function is a pointer to a private data structure that
|
The yield of the function is a pointer to a private data structure that
|
||||||
contains the compiled pattern, or NULL if an error was detected.
|
contains the compiled pattern, or NULL if an error was detected.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API, with more detail on
|
||||||
|
each option, in the
|
||||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
page and a description of the POSIX API in the
|
page, and a description of the POSIX API in the
|
||||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||||
page.
|
page.
|
||||||
<p>
|
<p>
|
||||||
|
|
|
@@ -45,10 +45,9 @@ point to a uint32_t integer variable. The available codes are:
   PCRE2_CONFIG_BSR             Indicates what \R matches by default:
                                  PCRE2_BSR_UNICODE
                                  PCRE2_BSR_ANYCRLF
-  PCRE2_CONFIG_JIT             Availability of just-in-time compiler
-                                 support (1=yes 0=no)
-  PCRE2_CONFIG_JITTARGET       Information about the target archi-
-                                 tecture for the JIT compiler
+  PCRE2_CONFIG_DEPTHLIMIT      Default backtracking depth limit
+  PCRE2_CONFIG_JIT             Availability of just-in-time compiler support (1=yes 0=no)
+  PCRE2_CONFIG_JITTARGET       Information (a string) about the target architecture for the JIT compiler
   PCRE2_CONFIG_LINKSIZE        Configured internal link size (2, 3, 4)
   PCRE2_CONFIG_MATCHLIMIT      Default internal resource limit
   PCRE2_CONFIG_NEWLINE         Code for the default newline sequence:
@@ -58,11 +57,9 @@ point to a uint32_t integer variable. The available codes are:
                                  PCRE2_NEWLINE_ANY
                                  PCRE2_NEWLINE_ANYCRLF
   PCRE2_CONFIG_PARENSLIMIT     Default parentheses nesting limit
-  PCRE2_CONFIG_RECURSIONLIMIT  Internal recursion depth limit
-  PCRE2_CONFIG_STACKRECURSE    Recursion implementation (1=stack
-                                 0=heap)
-  PCRE2_CONFIG_UNICODE         Availability of Unicode support (1=yes
-                                 0=no)
+  PCRE2_CONFIG_RECURSIONLIMIT  Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
+  PCRE2_CONFIG_STACKRECURSE    Obsolete: always returns 0
+  PCRE2_CONFIG_UNICODE         Availability of Unicode support (1=yes 0=no)
   PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
   PCRE2_CONFIG_VERSION         The PCRE2 version (a string)
 </pre>
||||||
|
|
|
@@ -31,8 +31,9 @@ DESCRIPTION
|
||||||
<P>
|
<P>
|
||||||
This function matches a compiled regular expression against a given subject
|
This function matches a compiled regular expression against a given subject
|
||||||
string, using an alternative matching algorithm that scans the subject string
|
string, using an alternative matching algorithm that scans the subject string
|
||||||
just once (<i>not</i> Perl-compatible). (The Perl-compatible matching function
|
just once (except when processing lookaround assertions). This function is
|
||||||
is <b>pcre2_match()</b>.) The arguments for this function are:
|
<i>not</i> Perl-compatible (the Perl-compatible matching function is
|
||||||
|
<b>pcre2_match()</b>). The arguments for this function are:
|
||||||
<pre>
|
<pre>
|
||||||
<i>code</i> Points to the compiled pattern
|
<i>code</i> Points to the compiled pattern
|
||||||
<i>subject</i> Points to the subject string
|
<i>subject</i> Points to the subject string
|
||||||
|
@@ -45,22 +46,18 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
   <i>wscount</i>      Number of elements in the vector
 </pre>
 For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
-up a callout function or specify the recursion limit. The <i>length</i> and
-<i>startoffset</i> values are code units, not characters. The options are:
+up a callout function or specify the recursion depth limit. The <i>length</i>
+and <i>startoffset</i> values are code units, not characters. The options are:
 <pre>
   PCRE2_ANCHORED          Match only at the first position
   PCRE2_NOTBOL            Subject is not the beginning of a line
   PCRE2_NOTEOL            Subject is not the end of a line
   PCRE2_NOTEMPTY          An empty string is not a valid match
-  PCRE2_NOTEMPTY_ATSTART  An empty string at the start of the subject
-                           is not a valid match
-  PCRE2_NO_UTF_CHECK      Do not check the subject for UTF
-                           validity (only relevant if PCRE2_UTF
+  PCRE2_NOTEMPTY_ATSTART  An empty string at the start of the subject is not a valid match
+  PCRE2_NO_UTF_CHECK      Do not check the subject for UTF validity (only relevant if PCRE2_UTF
                            was set at compile time)
-  PCRE2_PARTIAL_SOFT      Return PCRE2_ERROR_PARTIAL for a partial
-                           match if no full matches are found
-  PCRE2_PARTIAL_HARD      Return PCRE2_ERROR_PARTIAL for a partial match
-                           even if there is a full match as well
+  PCRE2_PARTIAL_HARD      Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
+  PCRE2_PARTIAL_SOFT      Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
   PCRE2_DFA_RESTART       Restart after a partial match
   PCRE2_DFA_SHORTEST      Return only the shortest match
 </pre>
||||||
|
|
|
@@ -34,11 +34,11 @@ errors are negative numbers. The arguments are:
|
||||||
<i>buffer</i> where to put the message
|
<i>buffer</i> where to put the message
|
||||||
<i>bufflen</i> the length of the buffer (code units)
|
<i>bufflen</i> the length of the buffer (code units)
|
||||||
</pre>
|
</pre>
|
||||||
The function returns the length of the message, excluding the trailing zero, or
|
The function returns the length of the message in code units, excluding the
|
||||||
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
|
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
|
||||||
this case, the returned message is truncated (but still with a trailing zero).
|
too small. In this case, the returned message is truncated (but still with a
|
||||||
If <i>errorcode</i> does not contain a recognized error code number, the
|
trailing zero). If <i>errorcode</i> does not contain a recognized error code
|
||||||
negative value PCRE2_ERROR_BADDATA is returned.
|
number, the negative value PCRE2_ERROR_BADDATA is returned.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
|
|
|
@ -32,10 +32,9 @@ maximum size to which it is allowed to grow. The final argument is a general
|
||||||
context, for memory allocation functions, or NULL for standard memory
|
context, for memory allocation functions, or NULL for standard memory
|
||||||
allocation. The result can be passed to the JIT run-time code by calling
|
allocation. The result can be passed to the JIT run-time code by calling
|
||||||
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
|
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
|
||||||
which can then be processed by <b>pcre2_match()</b>. If the "fast path" JIT
|
which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
|
||||||
matcher, <b>pcre2_jit_match()</b> is used, the stack can be passed directly as
|
A maximum stack size of 512K to 1M should be more than enough for any pattern.
|
||||||
an argument. A maximum stack size of 512K to 1M should be more than enough for
|
For more details, see the
|
||||||
any pattern. For more details, see the
|
|
||||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||||
page.
|
page.
|
||||||
</P>
|
</P>
|
||||||
|
|
|
@ -25,10 +25,10 @@ SYNOPSIS
|
||||||
DESCRIPTION
|
DESCRIPTION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
This function builds a set of character tables for character values less than
|
This function builds a set of character tables for character code points that
|
||||||
256. These can be passed to <b>pcre2_compile()</b> in a compile context in order
|
are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
|
||||||
to override the internal, built-in tables (which were either defaulted or made
|
context in order to override the internal, built-in tables (which were either
|
||||||
by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
|
defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
|
||||||
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
|
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
|
||||||
page. You might want to do this if you are using a non-standard locale.
|
page. You might want to do this if you are using a non-standard locale.
|
||||||
</P>
|
</P>
|
||||||
|
|
|
@ -2575,8 +2575,8 @@ The internal recursion limit was reached.
|
||||||
A text message for an error code from any PCRE2 function (compile, match, or
|
A text message for an error code from any PCRE2 function (compile, match, or
|
||||||
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
|
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
|
||||||
is passed as the first argument, with the remaining two arguments specifying a
|
is passed as the first argument, with the remaining two arguments specifying a
|
||||||
code unit buffer and its length, into which the text message is placed. Note
|
code unit buffer and its length in code units, into which the text message is
|
||||||
that the message is returned in code units of the appropriate width for the
|
placed. The message is returned in code units of the appropriate width for the
|
||||||
library that is being used.
|
library that is being used.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -3265,9 +3265,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 December 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -280,6 +280,10 @@ operating systems the effect of reading a directory like this is an immediate
|
||||||
end-of-file; in others it may provoke an error.
|
end-of-file; in others it may provoke an error.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
<b>--depth-limit</b>=<i>number</i>
|
||||||
|
See <b>--match-limit</b> below.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
|
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
|
||||||
Specify a pattern to be matched. This option can be used multiple times in
|
Specify a pattern to be matched. This option can be used multiple times in
|
||||||
order to specify several patterns. It can also be used as a way of specifying a
|
order to specify several patterns. It can also be used as a way of specifying a
|
||||||
|
@ -498,29 +502,22 @@ used. There is no short form for this option.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
<b>--match-limit</b>=<i>number</i>
|
<b>--match-limit</b>=<i>number</i>
|
||||||
Processing some regular expression patterns can require a very large amount of
|
Processing some regular expression patterns may take a very long time to search
|
||||||
memory, leading in some cases to a program crash if not enough is available.
|
for all possible matching strings. Others may require a very large amount of
|
||||||
Other patterns may take a very long time to search for all possible matching
|
memory. There are two options that set resource limits for matching.
|
||||||
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
|
|
||||||
do the matching has two parameters that can limit the resources that it uses.
|
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
The <b>--match-limit</b> option provides a means of limiting resource usage
|
The <b>--match-limit</b> option provides a means of limiting computing resource
|
||||||
when processing patterns that are not going to match, but which have a very
|
usage when processing patterns that are not going to match, but which have a
|
||||||
large number of possibilities in their search trees. The classic example is a
|
very large number of possibilities in their search trees. The classic example
|
||||||
pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function
|
is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
|
||||||
called <b>match()</b> which it calls repeatedly (sometimes recursively). The
|
counter that is incremented each time around its main processing loop. If the
|
||||||
limit set by <b>--match-limit</b> is imposed on the number of times this
|
value set by <b>--match-limit</b> is reached, an error occurs.
|
||||||
function is called during a match, which has the effect of limiting the amount
|
|
||||||
of backtracking that can take place.
|
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but
|
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
|
||||||
instead of limiting the total number of times that <b>match()</b> is called, it
|
which in turn limits the amount of memory that is used. This limit is of use
|
||||||
limits the depth of recursive calls, which in turn limits the amount of memory
|
only if it is set smaller than <b>--match-limit</b>.
|
||||||
that can be used. The recursion depth is a smaller number than the total number
|
|
||||||
of calls, because not all calls to <b>match()</b> are recursive. This limit is
|
|
||||||
of use only if it is set smaller than <b>--match-limit</b>.
|
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
There are no short forms for these options. The default settings are specified
|
There are no short forms for these options. The default settings are specified
|
||||||
|
@ -843,9 +840,9 @@ there are more than 20 such errors, <b>pcre2grep</b> gives up.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
|
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
|
||||||
overall resource limit; there is a second option called <b>--recursion-limit</b>
|
overall resource limit; there is a second option called <b>--depth-limit</b>
|
||||||
that sets a limit on the amount of memory (usually stack) that is used (see the
|
that sets a limit on the amount of memory that is used (see the discussion of
|
||||||
discussion of these options above).
|
these options above).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
|
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -870,9 +867,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 31 December 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -170,20 +170,24 @@ the application to apply the JIT optimization by calling
|
||||||
<b>pcre2_jit_compile()</b> is ignored.
|
<b>pcre2_jit_compile()</b> is ignored.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting match and recursion limits
|
Setting match and backtracking depth limits
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
|
The <b>pcre2_match()</b> function contains a counter that is incremented every time it
|
||||||
internal <b>match()</b> function is called and on the maximum depth of
|
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
|
||||||
recursive calls. These facilities are provided to catch runaway matches that
|
this counter, which therefore limits the amount of computing resource used for
|
||||||
are provoked by patterns with huge matching trees (a typical example is a
|
a match. The maximum depth of nested backtracking can also be limited, and this
|
||||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
restricts the amount of heap memory that is used.
|
||||||
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
|
</P>
|
||||||
gives an error return. The limits can also be set by items at the start of the
|
<P>
|
||||||
pattern of the form
|
These facilities are provided to catch runaway matches that are provoked by
|
||||||
|
patterns with huge matching trees (a typical example is a pattern with nested
|
||||||
|
unlimited repeats applied to a long string that does not match). When one of
|
||||||
|
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
|
||||||
|
can also be set by items at the start of the pattern of the form
|
||||||
<pre>
|
<pre>
|
||||||
(*LIMIT_MATCH=d)
|
(*LIMIT_MATCH=d)
|
||||||
(*LIMIT_RECURSION=d)
|
(*LIMIT_DEPTH=d)
|
||||||
</pre>
|
</pre>
|
||||||
where d is any number of decimal digits. However, the value of the setting must
|
where d is any number of decimal digits. However, the value of the setting must
|
||||||
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||||
|
@ -192,10 +196,15 @@ limits set by the programmer, but not raise them. If there is more than one
|
||||||
setting of one of these limits, the lower value is used.
|
setting of one of these limits, the lower value is used.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||||
|
still recognized for backwards compatibility.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
The match limit is used (but in a different way) when JIT is being used, but it
|
The match limit is used (but in a different way) when JIT is being used, but it
|
||||||
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
|
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
|
||||||
However, the recursion limit is relevant for DFA matching, which does use some
|
However, the depth limit is relevant for DFA matching, which uses function
|
||||||
function recursion, in particular, for recursions within the pattern.
|
recursion for recursions within the pattern. In this case, the depth limit
|
||||||
|
controls the amount of system stack that is used.
|
||||||
<a name="newlines"></a></P>
|
<a name="newlines"></a></P>
|
||||||
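The two limits described above can be pictured with a toy model. The sketch below is not PCRE2 code: it is a hand-written backtracking matcher for the single anchored pattern ^(a+)+b (nested unlimited repeats), and the names match_nested_repeat, MatchLimitError and DepthLimitError are hypothetical. It assumes that the match limit counts visits to the matching routine and that the depth limit bounds how deeply backtracking points are nested, which is only an approximation of what the text describes.

```python
class MatchLimitError(Exception):
    """Raised when the step counter exceeds the match limit (hypothetical name)."""


class DepthLimitError(Exception):
    """Raised when nested backtracking exceeds the depth limit (hypothetical name)."""


def match_nested_repeat(subject: str, match_limit: int, depth_limit: int) -> bool:
    """Toy backtracking matcher for the anchored pattern ^(a+)+b."""
    steps = 0  # counts every visit to the matching routine

    def after_group(pos: int, depth: int) -> bool:
        nonlocal steps
        steps += 1
        if steps > match_limit:
            raise MatchLimitError(f"match limit {match_limit} exceeded")
        if depth > depth_limit:
            raise DepthLimitError(f"depth limit {depth_limit} exceeded")

        # Find how far one more repetition of (a+) could extend.
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1

        # Greedy: try the longest run first; shorter runs remain as
        # backtracking points that are revisited on failure.
        for stop in range(end, pos, -1):
            if after_group(stop, depth + 1):
                return True

        # No further repetition worked: the group must have matched at
        # least once (depth > 0) and the next character must be "b".
        return depth > 0 and pos < len(subject) and subject[pos] == "b"

    return after_group(0, 0)
```

In this model a subject of n letters "a" with no following "b" needs about 2^n visits before failing, so a modest match limit stops the search early, while a small depth limit stops it as soon as the repeats nest too deeply.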
<br><b>
|
<br><b>
|
||||||
Newline conventions
|
Newline conventions
|
||||||
|
@ -235,8 +244,8 @@ The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
||||||
what the \R escape sequence matches. By default, this is any Unicode newline
|
what the \R escape sequence matches. By default, this is any Unicode newline
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the
|
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||||
description of \R in the section entitled
|
section and the description of \R in the section entitled
|
||||||
<a href="#newlineseq">"Newline sequences"</a>
|
<a href="#newlineseq">"Newline sequences"</a>
|
||||||
below. A change of \R setting can be combined with a change of newline
|
below. A change of \R setting can be combined with a change of newline
|
||||||
convention.
|
convention.
|
||||||
|
@ -254,7 +263,7 @@ corresponding to PCRE2_BSR_UNICODE.
|
||||||
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
||||||
<P>
|
<P>
|
||||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||||
character code rather than ASCII or Unicode (typically a mainframe system). In
|
character code instead of ASCII or Unicode (typically a mainframe system). In
|
||||||
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
||||||
environment these characters may have different code values, and there are no
|
environment these characters may have different code values, and there are no
|
||||||
code points greater than 255.
|
code points greater than 255.
|
||||||
|
@ -318,11 +327,11 @@ that character may have. This use of backslash as an escape character applies
|
||||||
both inside and outside character classes.
|
both inside and outside character classes.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For example, if you want to match a * character, you write \* in the pattern.
|
For example, if you want to match a * character, you must write \* in the
|
||||||
This escaping action applies whether or not the following character would
|
pattern. This escaping action applies whether or not the following character
|
||||||
otherwise be interpreted as a metacharacter, so it is always safe to precede a
|
would otherwise be interpreted as a metacharacter, so it is always safe to
|
||||||
non-alphanumeric with backslash to specify that it stands for itself. In
|
precede a non-alphanumeric with backslash to specify that it stands for itself.
|
||||||
particular, if you want to match a backslash, you write \\.
|
In particular, if you want to match a backslash, you write \\.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
||||||
|
@ -353,7 +362,7 @@ An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
|
||||||
by \E later in the pattern, the literal interpretation continues to the end of
|
by \E later in the pattern, the literal interpretation continues to the end of
|
||||||
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
|
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
|
||||||
a character class, this causes an error, because the character class is not
|
a character class, this causes an error, because the character class is not
|
||||||
terminated.
|
terminated by a closing square bracket.
|
||||||
<a name="digitsafterbackslash"></a></P>
|
<a name="digitsafterbackslash"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Non-printing characters
|
Non-printing characters
|
||||||
|
@ -476,9 +485,9 @@ a hexadecimal digit appears between \x{ and }, or if there is no terminating
|
||||||
<P>
|
<P>
|
||||||
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
|
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
|
||||||
described only when it is followed by two hexadecimal digits. Otherwise, it
|
described only when it is followed by two hexadecimal digits. Otherwise, it
|
||||||
matches a literal "x" character. In this mode mode, support for code points
|
matches a literal "x" character. In this mode, support for code points greater
|
||||||
greater than 256 is provided by \u, which must be followed by four hexadecimal
|
than 255 is provided by \u, which must be followed by four hexadecimal digits;
|
||||||
digits; otherwise it matches a literal "u" character.
|
otherwise it matches a literal "u" character.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Characters whose value is less than 256 can be defined by either of the two
|
Characters whose value is less than 256 can be defined by either of the two
|
||||||
|
@ -493,12 +502,10 @@ Constraints on character values
|
||||||
Characters that are specified using octal or hexadecimal numbers are
|
Characters that are specified using octal or hexadecimal numbers are
|
||||||
limited to certain values, as follows:
|
limited to certain values, as follows:
|
||||||
<pre>
|
<pre>
|
||||||
8-bit non-UTF mode less than 0x100
|
8-bit non-UTF mode no greater than 0xff
|
||||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
16-bit non-UTF mode no greater than 0xffff
|
||||||
16-bit non-UTF mode less than 0x10000
|
32-bit non-UTF mode no greater than 0xffffffff
|
||||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||||
32-bit non-UTF mode less than 0x100000000
|
|
||||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
|
||||||
</pre>
|
</pre>
|
||||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
||||||
"surrogate" codepoints), and 0xffef.
|
"surrogate" codepoints), and 0xffef.
|
||||||
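The revised table of constraints can be restated as a small check. This is an illustrative sketch only, not part of PCRE2: the function name escape_value_allowed and its parameters are hypothetical, and it models just the upper bounds from the table plus the exclusion of the surrogate range 0xd800 to 0xdfff in UTF modes.

```python
# Largest value accepted in each non-UTF mode, keyed by code unit width.
NON_UTF_MAX = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}

# Surrogate code points, which are not valid in any UTF mode.
SURROGATES = range(0xD800, 0xE000)


def escape_value_allowed(value: int, width: int, utf: bool) -> bool:
    """Return True if an octal or hexadecimal escape may specify this value."""
    if value < 0:
        return False
    if utf:
        # All UTF modes share the Unicode limit, whatever the code unit width.
        return value <= 0x10FFFF and value not in SURROGATES
    return value <= NON_UTF_MAX[width]
```

For example, 0x100 is rejected in 8-bit non-UTF mode but accepted in 8-bit UTF mode, and 0xd800 is accepted in 16-bit non-UTF mode but rejected in any UTF mode.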
|
@ -525,7 +532,7 @@ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
|
||||||
handler and used to modify the case of following characters. By default, PCRE2
|
handler and used to modify the case of following characters. By default, PCRE2
|
||||||
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
||||||
is set, \U matches a "U" character, and \u can be used to define a character
|
is set, \U matches a "U" character, and \u can be used to define a character
|
||||||
by code point, as described in the previous section.
|
by code point, as described above.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Absolute and relative back references
|
Absolute and relative back references
|
||||||
|
@ -714,7 +721,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
|
||||||
sequences that match characters with specific properties are available. In
|
sequences that match characters with specific properties are available. In
|
||||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||||
characters whose codepoints are less than 256, but they do work in this mode.
|
characters whose codepoints are less than 256, but they do work in this mode.
|
||||||
The extra escape sequences are:
|
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
|
||||||
|
may be encountered. These are all treated as being in the Common script and
|
||||||
|
with an unassigned type. The extra escape sequences are:
|
||||||
<pre>
|
<pre>
|
||||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||||
|
@ -2214,16 +2223,8 @@ except that it does not cause the current matching position to be changed.
|
||||||
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
||||||
contains capturing subpatterns within it, these are counted for the purposes of
|
contains capturing subpatterns within it, these are counted for the purposes of
|
||||||
numbering the capturing subpatterns in the whole pattern. However, substring
|
numbering the capturing subpatterns in the whole pattern. However, substring
|
||||||
capturing is carried out only for positive assertions. (Perl sometimes, but not
|
capturing is normally carried out only for positive assertions (but see the
|
||||||
always, does do capturing in negative assertions.)
|
discussion of conditional subpatterns below).
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
|
||||||
succeeds, but failure to match later in the pattern causes backtracking over
|
|
||||||
this assertion, the captures within the assertion are reset only if no higher
|
|
||||||
numbered captures are already set. This is, unfortunately, a fundamental
|
|
||||||
limitation of the current implementation; it may get removed in a future
|
|
||||||
reworking.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||||
|
@ -2601,6 +2602,12 @@ presence of at least one letter in the subject. If a letter is found, the
|
||||||
subject is matched against the first alternative; otherwise it is matched
|
subject is matched against the first alternative; otherwise it is matched
|
||||||
against the second. This pattern matches strings in one of the two forms
|
against the second. This pattern matches strings in one of the two forms
|
||||||
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
For Perl compatibility, if an assertion that is a condition contains capturing
|
||||||
|
subpatterns, any capturing that occurs is retained afterwards, for both
|
||||||
|
positive and negative assertions. (Compare non-conditional assertions, when
|
||||||
|
captures are retained only for positive assertions.)
|
||||||
<a name="comments"></a></P>
|
<a name="comments"></a></P>
|
||||||
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
|
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2773,93 +2780,57 @@ is the actual recursive call.
|
||||||
Differences in recursion processing between PCRE2 and Perl
|
Differences in recursion processing between PCRE2 and Perl
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
|
Some former differences between PCRE2 and Perl no longer exist.
|
||||||
(like Python, but unlike Perl), a recursive subpattern call is always treated
|
|
||||||
as an atomic group. That is, once it has matched some of the subject string, it
|
|
||||||
is never re-entered, even if it contains untried alternatives and there is a
|
|
||||||
subsequent matching failure. This can be illustrated by the following pattern,
|
|
||||||
which purports to match a palindromic string that contains an odd number of
|
|
||||||
characters (for example, "a", "aba", "abcba", "abcdcba"):
|
|
||||||
<pre>
|
|
||||||
^(.|(.)(?1)\2)$
|
|
||||||
</pre>
|
|
||||||
The idea is that it either matches a single character, or two identical
|
|
||||||
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
|
|
||||||
it does not if the pattern is longer than three characters. Consider the
|
|
||||||
subject string "abcba":
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
At the top level, the first character is matched, but as it is not at the end
|
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
|
||||||
of the string, the first alternative fails; the second alternative is taken
|
a recursive subpattern call was always treated as an atomic group. That is,
|
||||||
and the recursion kicks in. The recursive call to subpattern 1 successfully
|
once it had matched some of the subject string, it was never re-entered, even
|
||||||
matches the next character ("b"). (Note that the beginning and end of line
|
if it contained untried alternatives and there was a subsequent matching
|
||||||
tests are not part of the recursion).
|
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Back at the top level, the next character ("c") is compared with what
|
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||||
subpattern 2 matched, which was "a". This fails. Because the recursion is
|
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||||
treated as an atomic group, there are now no backtracking points, and so the
|
is a matching failure later in the pattern. This is now compatible with the way
|
||||||
entire match fails. (Perl is able, at this point, to re-enter the recursion and
|
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||||
try the second alternative.) However, if the pattern is written with the
|
enclose it in an atomic group.
|
||||||
alternatives in the other order, things are different:
|
|
||||||
<pre>
|
|
||||||
^((.)(?1)\2|.)$
|
|
||||||
</pre>
|
|
||||||
This time, the recursing alternative is tried first, and continues to recurse
|
|
||||||
until it runs out of characters, at which point the recursion fails. But this
|
|
||||||
time we do have another alternative to try at the higher level. That is the big
|
|
||||||
difference: in the previous case the remaining alternative is at a deeper
|
|
||||||
recursion level, which PCRE2 cannot use.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
To change the pattern so that it matches all palindromic strings, not just
|
Supporting backtracking into recursions simplifies certain types of recursive
|
||||||
those with an odd number of characters, it is tempting to change the pattern to
|
pattern. For example, this pattern matches palindromic strings:
|
||||||
this:
|
|
||||||
<pre>
|
<pre>
|
||||||
^((.)(?1)\2|.?)$
|
^((.)(?1)\2|.?)$
|
||||||
</pre>
|
</pre>
|
||||||
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
|
The second branch in the group matches a single central character in the
|
||||||
deeper recursion has matched a single character, it cannot be entered again in
|
palindrome when there are an odd number of characters, or nothing when there
|
||||||
order to match an empty string. The solution is to separate the two cases, and
|
are an even number of characters, but in order to work it has to be able to try
|
||||||
write out the odd and even cases as alternatives at the higher level:
|
the second case when the rest of the pattern match fails. If you want to match
|
||||||
|
typical palindromic phrases, the pattern has to ignore all non-word characters,
|
||||||
|
which can be done like this:
|
||||||
<pre>
|
<pre>
|
||||||
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
|
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
|
||||||
</pre>
|
|
||||||
If you want to match typical palindromic phrases, the pattern has to ignore all
|
|
||||||
non-word characters, which can be done like this:
|
|
||||||
<pre>
|
|
||||||
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
|
|
||||||
</pre>
|
</pre>
|
||||||
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
||||||
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
|
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
|
||||||
use of the possessive quantifier *+ to avoid backtracking into sequences of
|
avoid backtracking into sequences of non-word characters. Without this, PCRE2
|
||||||
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
|
takes a great deal longer (ten times or more) to match typical phrases, and
|
||||||
or more) to match typical phrases, and Perl takes so long that you think it has
|
Perl takes so long that you think it has gone into a loop.
|
||||||
gone into a loop.
|
|
||||||
</P>
|
</P>
|
||||||
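The effect of backtracking into recursions on the pattern ^((.)(?1)\2|.?)$ can be traced with a small model. The sketch below is not PCRE2 code and does not use a regex engine: group1 and is_palindrome are hypothetical names, and the generator hand-codes the two alternatives of group 1. It assumes the subject contains no newlines and treats $ as the strict end of the subject. With atomic=True the recursive call keeps only its first result, which models the behaviour before release 10.30; with atomic=False every result stays available for backtracking.

```python
from itertools import islice
from typing import Iterator


def group1(s: str, i: int, atomic: bool) -> Iterator[int]:
    """Yield each position where group 1 of ^((.)(?1)\\2|.?)$ can end, starting at i."""
    # First alternative: (.)(?1)\2
    if i < len(s):
        c = s[i]  # what group 2 captured at this level
        ends = group1(s, i + 1, atomic)
        if atomic:
            # Atomic recursion: only the first way of matching is kept.
            ends = islice(ends, 1)
        for j in ends:
            # \2 must match the captured character after the recursion.
            if j < len(s) and s[j] == c:
                yield j + 1
    # Second alternative: .? (greedy, so one character is tried first)
    if i < len(s):
        yield i + 1
    yield i


def is_palindrome(s: str, atomic: bool = False) -> bool:
    """Return True if the modelled pattern matches the whole subject."""
    return any(end == len(s) for end in group1(s, 0, atomic))
```

In this model "abcba" matches either way, but "abba" matches only when atomic is False: the innermost recursion first consumes a character with .?, and the match succeeds only if that recursion can be re-entered to match the empty string instead.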
<P>
|
<P>
|
||||||
<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
|
Another way in which PCRE2 and Perl used to differ in their recursion
|
||||||
string does not start with a palindrome that is shorter than the entire string.
|
processing was in the handling of captured values. Formerly in Perl, when a
|
||||||
For example, although "abcba" is correctly matched, if the subject is "ababa",
|
subpattern was called recursively or as a subroutine (see the next section), it
|
||||||
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
|
had no access to any values that were captured outside the recursion, whereas
|
||||||
the end of the string does not follow. Once again, it cannot jump back into the
|
in PCRE2 these values can be referenced. Consider this pattern:
|
||||||
recursion to try other alternatives, so the entire match fails.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
The second way in which PCRE2 and Perl differ in their recursion processing is
|
|
||||||
in the handling of captured values. In Perl, when a subpattern is called
|
|
||||||
recursively or as a subpattern (see the next section), it has no access to any
|
|
||||||
values that were captured outside the recursion, whereas in PCRE2 these values
|
|
||||||
can be referenced. Consider this pattern:
|
|
||||||
<pre>
|
<pre>
|
||||||
^(.)(\1|a(?2))
|
^(.)(\1|a(?2))
|
||||||
</pre>
|
</pre>
|
||||||
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
|
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||||
then in the second group, when the back reference \1 fails to match "b", the
|
the second group, when the back reference \1 fails to match "b", the second
|
||||||
second alternative matches "a" and then recurses. In the recursion, \1 does
|
alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||||
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
|
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||||
match because inside the recursive call \1 cannot access the externally set
|
later versions (I tried 5.024) it now works.
|
||||||
value.
|
|
||||||
<a name="subpatternsassubroutines"></a></P>
|
<a name="subpatternsassubroutines"></a></P>
|
||||||
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2886,11 +2857,10 @@ is used, it does match "sense and responsibility" as well as the other two
|
||||||
strings. Another example is given in the discussion of DEFINE above.
|
strings. Another example is given in the discussion of DEFINE above.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
All subroutine calls, whether recursive or not, are always treated as atomic
|
Like recursions, subroutine calls used to be treated as atomic, but this
|
||||||
groups. That is, once a subroutine has matched some of the subject string, it
|
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
|
||||||
is never re-entered, even if it contains untried alternatives and there is a
|
occur. However, any capturing parentheses that are set during the subroutine
|
||||||
subsequent matching failure. Any capturing parentheses that are set during the
|
call revert to their previous values afterwards.
|
||||||
subroutine call revert to their previous values afterwards.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Processing options such as case-independence are fixed when a subpattern is
|
Processing options such as case-independence are fixed when a subpattern is
|
||||||
|
@ -2998,17 +2968,10 @@ The doubling is removed before the string is passed to the callout function.
|
||||||
<a name="backtrackcontrol"></a></P>
|
<a name="backtrackcontrol"></a></P>
|
||||||
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
<P>
|
<P>
|
||||||
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
|
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||||
are still described in the Perl documentation as "experimental and subject to
|
terminology) that modify the behaviour of backtracking during matching. They
|
||||||
change or removal in a future version of Perl". It goes on to say: "Their usage
|
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||||
in production code should be noted to avoid problems during upgrades." The same
|
possibly behaving differently depending on whether or not a name is present.
|
||||||
remarks apply to the PCRE2 features described in this section.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
The new verbs make use of what was previously invalid syntax: an opening
|
|
||||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
|
||||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
|
||||||
depending on whether or not a name is present.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
By default, for compatibility with Perl, a name is any sequence of characters
|
By default, for compatibility with Perl, a name is any sequence of characters
|
||||||
|
@ -3040,7 +3003,7 @@ not there. Any number of these verbs may occur in a pattern.
|
||||||
<P>
|
<P>
|
||||||
Since these verbs are specifically related to backtracking, most of them can be
|
Since these verbs are specifically related to backtracking, most of them can be
|
||||||
used only when the pattern is to be matched using the traditional matching
|
used only when the pattern is to be matched using the traditional matching
|
||||||
function, because these use a backtracking algorithm. With the exception of
|
function, because that uses a backtracking algorithm. With the exception of
|
||||||
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
||||||
control verbs cause an error if encountered by the DFA matching function.
|
control verbs cause an error if encountered by the DFA matching function.
|
||||||
</P>
|
</P>
|
||||||
|
@ -3178,11 +3141,11 @@ Verbs that act after backtracking
|
||||||
The following verbs do nothing when they are encountered. Matching continues
|
The following verbs do nothing when they are encountered. Matching continues
|
||||||
with what follows, but if there is no subsequent match, causing a backtrack to
|
with what follows, but if there is no subsequent match, causing a backtrack to
|
||||||
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
||||||
the verb. However, when one of these verbs appears inside an atomic group
|
the verb. However, when one of these verbs appears inside an atomic group or in
|
||||||
(which includes any group that is called as a subroutine) or in an assertion
|
an assertion that is true, its effect is confined to that group, because once
|
||||||
that is true, its effect is confined to that group, because once the group has
|
the group has been matched, there is never any backtracking into it. In this
|
||||||
been matched, there is never any backtracking into it. In this situation,
|
situation, backtracking has to jump to the left of the entire atomic group or
|
||||||
backtracking has to jump to the left of the entire atomic group or assertion.
|
assertion.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
These verbs differ in exactly what kind of failure occurs when backtracking
|
These verbs differ in exactly what kind of failure occurs when backtracking
|
||||||
|
@ -3246,8 +3209,8 @@ expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
|
||||||
as (*COMMIT).
|
as (*COMMIT).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
|
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
|
||||||
It is like (*MARK:NAME) in that the name is remembered for passing back to the
|
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||||
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
||||||
ignoring those set by (*PRUNE) or (*THEN).
|
ignoring those set by (*PRUNE) or (*THEN).
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -3452,9 +3415,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 27 December 2016
|
Last updated: 18 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -55,7 +55,10 @@ The facility for saving and restoring compiled patterns is intended for use
|
||||||
within individual applications. As such, the data supplied to
|
within individual applications. As such, the data supplied to
|
||||||
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
|
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
|
||||||
arbitrary external sources. There is only some simple consistency checking, not
|
arbitrary external sources. There is only some simple consistency checking, not
|
||||||
complete validation of what is being re-loaded.
|
complete validation of what is being re-loaded. Corrupted data may cause
|
||||||
|
undefined results. For example, if the length field of a pattern in the
|
||||||
|
serialized data is corrupted, the deserializing code may read beyond the end of
|
||||||
|
the byte stream that is passed to it.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -190,9 +193,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 24 May 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -126,12 +126,13 @@ character values up to 0x7fffffff. Each character is placed in one 16-bit or
|
||||||
to occur).
|
to occur).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
|
UTF-8 (in its original definition) is not capable of encoding values greater
|
||||||
values can be handled by the 32-bit library. When testing this library in
|
than 0x7fffffff, but such values can be handled by the 32-bit library. When
|
||||||
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
|
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
|
||||||
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
|
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
|
||||||
character's value. This is the only way of passing such code points in a
|
0x80000000 is added to the character's value. This is the only way of passing
|
||||||
pattern string. For subject strings, using an escape sequence is preferable.
|
such code points in a pattern string. For subject strings, using an escape
|
||||||
|
sequence is preferable.
|
||||||
</P>
|
</P>
|
||||||
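The 0xff convention described in this hunk can be pictured with a small C sketch. This is not code taken from pcre2test: the function name `decode_extended` is hypothetical, and the decoder follows the original UTF-8 definition (sequences of up to 6 bytes, values up to 0x7fffffff) without rejecting overlong forms. It only illustrates how a leading 0xff byte adds 0x80000000 to the value that follows.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption, not pcre2test source): decode one
   character from "original definition" UTF-8. A leading 0xff byte, which
   is never valid in UTF-8, adds 0x80000000 to the decoded value.
   Returns the number of bytes consumed, or 0 on malformed input. */
size_t decode_extended(const uint8_t *p, size_t len, uint32_t *out)
{
    size_t used = 0;
    uint32_t high = 0;

    if (len > 0 && p[0] == 0xff) {
        high = 0x80000000u;           /* remember the extra top bit */
        p++;
        len--;
        used++;
    }
    if (len == 0) return 0;           /* nothing left to decode */

    uint32_t c = p[0];
    size_t extra;                     /* number of continuation bytes */
    if (c < 0x80)      { extra = 0; }
    else if (c < 0xc0) { return 0; }  /* stray continuation byte */
    else if (c < 0xe0) { extra = 1; c &= 0x1f; }
    else if (c < 0xf0) { extra = 2; c &= 0x0f; }
    else if (c < 0xf8) { extra = 3; c &= 0x07; }
    else if (c < 0xfc) { extra = 4; c &= 0x03; }
    else if (c < 0xfe) { extra = 5; c &= 0x01; }
    else               { return 0; }  /* 0xfe, or a second 0xff */

    if (len < extra + 1) return 0;    /* truncated sequence */
    for (size_t i = 1; i <= extra; i++) {
        if ((p[i] & 0xc0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(p[i] & 0x3f);
    }

    *out = c | high;
    return used + extra + 1;
}
```

With this sketch, the two bytes 0xff 0x41 decode to 0x80000041, whereas 0x41 alone decodes to 0x41.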
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -602,6 +603,7 @@ about the pattern:
|
||||||
/B bincode show binary code without lengths
|
/B bincode show binary code without lengths
|
||||||
callout_info show callout information
|
callout_info show callout information
|
||||||
debug same as info,fullbincode
|
debug same as info,fullbincode
|
||||||
|
framesize show matching frame size
|
||||||
fullbincode show binary code with lengths
|
fullbincode show binary code with lengths
|
||||||
/I info show info about compiled pattern
|
/I info show info about compiled pattern
|
||||||
hex unquoted characters are hexadecimal
|
hex unquoted characters are hexadecimal
|
||||||
|
@ -689,6 +691,11 @@ not necessarily the last character. These lines are omitted if no starting or
|
||||||
ending code units are recorded.
|
ending code units are recorded.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
The <b>framesize</b> modifier shows the size, in bytes, of the storage frames
|
||||||
|
used by <b>pcre2_match()</b> for handling backtracking. The size depends on the
|
||||||
|
number of capturing parentheses in the pattern.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
The <b>callout_info</b> modifier requests information about all the callouts in
|
The <b>callout_info</b> modifier requests information about all the callouts in
|
||||||
the pattern. A list of them is output at the end of any other information that
|
the pattern. A list of them is output at the end of any other information that
|
||||||
is requested. For each callout, either its number or string is given, followed
|
is requested. For each callout, either its number or string is given, followed
|
||||||
|
@ -1073,6 +1080,7 @@ pattern.
|
||||||
callout_fail=<n>[:<m>] control callout failure
|
callout_fail=<n>[:<m>] control callout failure
|
||||||
callout_none do not supply a callout function
|
callout_none do not supply a callout function
|
||||||
copy=<number or name> copy captured substring
|
copy=<number or name> copy captured substring
|
||||||
|
depth_limit=<n> set a depth limit
|
||||||
dfa use <b>pcre2_dfa_match()</b>
|
dfa use <b>pcre2_dfa_match()</b>
|
||||||
find_limits find match and recursion limits
|
find_limits find match and recursion limits
|
||||||
get=<number or name> extract captured substring
|
get=<number or name> extract captured substring
|
||||||
|
@ -1086,7 +1094,7 @@ pattern.
|
||||||
offset=<n> set starting offset
|
offset=<n> set starting offset
|
||||||
offset_limit=<n> set offset limit
|
offset_limit=<n> set offset limit
|
||||||
ovector=<n> set size of output vector
|
ovector=<n> set size of output vector
|
||||||
recursion_limit=<n> set a recursion limit
|
recursion_limit=<n> obsolete synonym for depth_limit
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show startchar when relevant
|
startchar show startchar when relevant
|
||||||
startoffset=<n> same as offset=<n>
|
startoffset=<n> same as offset=<n>
|
||||||
|
@ -1320,10 +1328,10 @@ stack that is larger than the default 32K is necessary only for very
|
||||||
complicated patterns.
|
complicated patterns.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting match and recursion limits
|
Setting match and depth limits
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
|
The <b>match_limit</b> and <b>depth_limit</b> modifiers set the appropriate
|
||||||
limits in the match context. These values are ignored when the
|
limits in the match context. These values are ignored when the
|
||||||
<b>find_limits</b> modifier is specified.
|
<b>find_limits</b> modifier is specified.
|
||||||
</P>
|
</P>
|
||||||
|
@ -1333,23 +1341,23 @@ Finding minimum limits
|
||||||
<P>
|
<P>
|
||||||
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
|
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
|
||||||
<b>pcre2_match()</b> several times, setting different values in the match
|
<b>pcre2_match()</b> several times, setting different values in the match
|
||||||
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
|
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_depth_limit()</b>
|
||||||
until it finds the minimum values for each parameter that allow
|
until it finds the minimum values for each parameter that allow
|
||||||
<b>pcre2_match()</b> to complete without error.
|
<b>pcre2_match()</b> to complete without error.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If JIT is being used, only the match limit is relevant. If DFA matching is
|
If JIT is being used, only the match limit is relevant. If DFA matching is
|
||||||
being used, neither limit is relevant, and this modifier is ignored (with a
|
being used, only the depth limit is relevant, but at present this modifier is
|
||||||
warning message).
|
ignored (with a warning message).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <i>match_limit</i> number is a measure of the amount of backtracking
|
The <i>match_limit</i> number is a measure of the amount of backtracking
|
||||||
that takes place, and learning the minimum value can be instructive. For most
|
that takes place, and learning the minimum value can be instructive. For most
|
||||||
simple matches, the number is quite small, but for patterns with very large
|
simple matches, the number is quite small, but for patterns with very large
|
||||||
numbers of matching possibilities, it can become large very quickly with
|
numbers of matching possibilities, it can become large very quickly with
|
||||||
increasing length of subject string. The <i>match_limit_recursion</i> number is
|
increasing length of subject string. The <i>depth_limit</i> number is
|
||||||
a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
|
a measure of how much memory for recording backtracking points is needed to
|
||||||
heap) memory is needed to complete the match attempt.
|
complete the match attempt.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Showing MARK names
|
Showing MARK names
|
||||||
|
@ -1466,7 +1474,7 @@ code unit offset of the start of the failing character is also output. Here is
|
||||||
an example of an interactive <b>pcre2test</b> run.
|
an example of an interactive <b>pcre2test</b> run.
|
||||||
<pre>
|
<pre>
|
||||||
$ pcre2test
|
$ pcre2test
|
||||||
PCRE2 version 9.00 2014-05-10
|
PCRE2 version 10.22 2016-07-29
|
||||||
|
|
||||||
re> /^abc(\d+)/
|
re> /^abc(\d+)/
|
||||||
data> abc123
|
data> abc123
|
||||||
|
@ -1779,9 +1787,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 28 December 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
295
doc/pcre2.txt
|
@ -89,8 +89,8 @@ SECURITY CONSIDERATIONS
|
||||||
One way of guarding against this possibility is to use the pcre2_pat-
|
One way of guarding against this possibility is to use the pcre2_pat-
|
||||||
tern_info() function to check the compiled pattern's options for
|
tern_info() function to check the compiled pattern's options for
|
||||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
|
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
|
||||||
calling pcre2_compile(). This causes an compile time error if a pattern
|
calling pcre2_compile(). This causes a compile time error if the pat-
|
||||||
contains a UTF-setting sequence.
|
tern contains a UTF-setting sequence.
|
||||||
|
|
||||||
The use of Unicode properties for character types such as \d can also
|
The use of Unicode properties for character types such as \d can also
|
||||||
be enabled from within the pattern, by specifying "(*UCP)". This fea-
|
be enabled from within the pattern, by specifying "(*UCP)". This fea-
|
||||||
|
@ -112,7 +112,9 @@ SECURITY CONSIDERATIONS
|
||||||
has a very large search tree against a string that will never match.
|
has a very large search tree against a string that will never match.
|
||||||
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
|
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
|
||||||
vides some protection against this: see the pcre2_set_match_limit()
|
vides some protection against this: see the pcre2_set_match_limit()
|
||||||
function in the pcre2api page.
|
function in the pcre2api page. There is a similar function called
|
||||||
|
pcre2_set_depth_limit() that can be used to restrict the amount of mem-
|
||||||
|
ory that is used.
|
||||||
|
|
||||||
|
|
||||||
USER DOCUMENTATION
|
USER DOCUMENTATION
|
||||||
|
@ -144,7 +146,7 @@ USER DOCUMENTATION
|
||||||
pcre2perform discussion of performance issues
|
pcre2perform discussion of performance issues
|
||||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||||
pcre2sample discussion of the pcre2demo program
|
pcre2sample discussion of the pcre2demo program
|
||||||
pcre2stack discussion of stack usage
|
pcre2stack discussion of stack and memory usage
|
||||||
pcre2syntax quick syntax reference
|
pcre2syntax quick syntax reference
|
||||||
pcre2test description of the pcre2test command
|
pcre2test description of the pcre2test command
|
||||||
pcre2unicode discussion of Unicode and UTF support
|
pcre2unicode discussion of Unicode and UTF support
|
||||||
|
@ -166,8 +168,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 16 October 2015
|
Last updated: 27 March 2017
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -2533,9 +2535,10 @@ OBTAINING A TEXTUAL ERROR MESSAGE
|
||||||
A text message for an error code from any PCRE2 function (compile,
|
A text message for an error code from any PCRE2 function (compile,
|
||||||
match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
|
match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
|
||||||
sage(). The code is passed as the first argument, with the remaining
|
sage(). The code is passed as the first argument, with the remaining
|
||||||
two arguments specifying a code unit buffer and its length, into which
|
two arguments specifying a code unit buffer and its length in code
|
||||||
the text message is placed. Note that the message is returned in code
|
units, into which the text message is placed. The message is returned
|
||||||
units of the appropriate width for the library that is being used.
|
in code units of the appropriate width for the library that is being
|
||||||
|
used.
|
||||||
|
|
||||||
The returned message is terminated with a trailing zero, and the func-
|
The returned message is terminated with a trailing zero, and the func-
|
||||||
tion returns the number of code units used, excluding the trailing
|
tion returns the number of code units used, excluding the trailing
|
||||||
|
@ -3178,8 +3181,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 23 December 2016
|
Last updated: 21 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -5519,19 +5522,24 @@ SPECIAL START-OF-PATTERN ITEMS
|
||||||
attempt by the application to apply the JIT optimization by calling
|
attempt by the application to apply the JIT optimization by calling
|
||||||
pcre2_jit_compile() is ignored.
|
pcre2_jit_compile() is ignored.
|
||||||
|
|
||||||
Setting match and recursion limits
|
Setting match and backtracking depth limits
|
||||||
|
|
||||||
The caller of pcre2_match() can set a limit on the number of times the
|
The pcre2_match() function contains a counter that is incremented every
|
||||||
internal match() function is called and on the maximum depth of recur-
|
time it goes round its main loop. The caller of pcre2_match() can set a
|
||||||
sive calls. These facilities are provided to catch runaway matches that
|
limit on this counter, which therefore limits the amount of computing
|
||||||
are provoked by patterns with huge matching trees (a typical example is
|
resource used for a match. The maximum depth of nested backtracking can
|
||||||
a pattern with nested unlimited repeats) and to avoid running out of
|
also be limited, and this restricts the amount of heap memory that is
|
||||||
system stack by too much recursion. When one of these limits is
|
used.
|
||||||
reached, pcre2_match() gives an error return. The limits can also be
|
|
||||||
set by items at the start of the pattern of the form
|
These facilities are provided to catch runaway matches that are pro-
|
||||||
|
voked by patterns with huge matching trees (a typical example is a pat-
|
||||||
|
tern with nested unlimited repeats applied to a long string that does
|
||||||
|
not match). When one of these limits is reached, pcre2_match() gives an
|
||||||
|
error return. The limits can also be set by items at the start of the
|
||||||
|
pattern of the form
|
||||||
|
|
||||||
(*LIMIT_MATCH=d)
|
(*LIMIT_MATCH=d)
|
||||||
(*LIMIT_RECURSION=d)
|
(*LIMIT_DEPTH=d)
|
||||||
|
|
||||||
where d is any number of decimal digits. However, the value of the set-
|
where d is any number of decimal digits. However, the value of the set-
|
||||||
ting must be less than the value set (or defaulted) by the caller of
|
ting must be less than the value set (or defaulted) by the caller of
|
||||||
|
@ -5540,11 +5548,15 @@ SPECIAL START-OF-PATTERN ITEMS
|
||||||
If there is more than one setting of one of these limits, the lower
|
If there is more than one setting of one of these limits, the lower
|
||||||
value is used.
|
value is used.
|
||||||
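The rule for combining a caller's limit with (*LIMIT_MATCH=d) or (*LIMIT_DEPTH=d) items can be sketched in a few lines of C. The function `effective_limit` is a hypothetical illustration, not part of the PCRE2 API; it assumes that a pattern item has an effect only when it is lower than the caller's value, and that the lowest such item is the one used.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption, not PCRE2 source): combine the limit
   supplied by the caller of pcre2_match() with the values of any
   (*LIMIT_...=d) items at the start of a pattern. An item that is not
   lower than the current limit is ignored, so a pattern can lower the
   caller's limit but never raise it. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_items, size_t count)
{
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        if (pattern_items[i] < limit) {
            limit = pattern_items[i];   /* keep the lowest value seen */
        }
    }
    return limit;
}
```

For a caller limit of 1000 and pattern items 5000, 200 and 90000, the sketch yields 200: the first and last items are too large to have any effect.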
|
|
||||||
|
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
|
||||||
|
name is still recognized for backwards compatibility.
|
||||||
|
|
||||||
The match limit is used (but in a different way) when JIT is being
|
The match limit is used (but in a different way) when JIT is being
|
||||||
used, but it is not relevant, and is ignored, when matching with
|
used, but it is not relevant, and is ignored, when matching with
|
||||||
pcre2_dfa_match(). However, the recursion limit is relevant for DFA
|
pcre2_dfa_match(). However, the depth limit is relevant for DFA match-
|
||||||
matching, which does use some function recursion, in particular, for
|
ing, which uses function recursion for recursions within the pattern.
|
||||||
recursions within the pattern.
|
In this case, the depth limit controls the amount of system stack that
|
||||||
|
is used.
|
||||||
|
|
||||||
Newline conventions
|
Newline conventions
|
||||||
|
|
||||||
|
@ -5579,9 +5591,9 @@ SPECIAL START-OF-PATTERN ITEMS
|
||||||
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
|
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
|
||||||
it does not affect what the \R escape sequence matches. By default,
|
it does not affect what the \R escape sequence matches. By default,
|
||||||
this is any Unicode newline sequence, for Perl compatibility. However,
|
this is any Unicode newline sequence, for Perl compatibility. However,
|
||||||
this can be changed; see the description of \R in the section entitled
|
this can be changed; see the next section and the description of \R in
|
||||||
"Newline sequences" below. A change of \R setting can be combined with
|
the section entitled "Newline sequences" below. A change of \R setting
|
||||||
a change of newline convention.
|
can be combined with a change of newline convention.
|
||||||
|
|
||||||
Specifying what \R matches
|
Specifying what \R matches
|
||||||
|
|
||||||
|
@ -5595,7 +5607,7 @@ SPECIAL START-OF-PATTERN ITEMS
|
||||||
EBCDIC CHARACTER CODES
|
EBCDIC CHARACTER CODES
|
||||||
|
|
||||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||||
character code rather than ASCII or Unicode (typically a mainframe sys-
|
character code instead of ASCII or Unicode (typically a mainframe sys-
|
||||||
tem). In the sections below, character code values are ASCII or Uni-
|
tem). In the sections below, character code values are ASCII or Uni-
|
||||||
code; in an EBCDIC environment these characters may have different code
|
code; in an EBCDIC environment these characters may have different code
|
||||||
values, and there are no code points greater than 255.
|
values, and there are no code points greater than 255.
|
||||||
|
@ -5660,8 +5672,8 @@ BACKSLASH
|
||||||
meaning that character may have. This use of backslash as an escape
|
meaning that character may have. This use of backslash as an escape
|
||||||
character applies both inside and outside character classes.
|
character applies both inside and outside character classes.
|
||||||
|
|
||||||
For example, if you want to match a * character, you write \* in the
|
For example, if you want to match a * character, you must write \* in
|
||||||
pattern. This escaping action applies whether or not the following
|
the pattern. This escaping action applies whether or not the following
|
||||||
character would otherwise be interpreted as a metacharacter, so it is
|
character would otherwise be interpreted as a metacharacter, so it is
|
||||||
always safe to precede a non-alphanumeric with backslash to specify
|
always safe to precede a non-alphanumeric with backslash to specify
|
||||||
that it stands for itself. In particular, if you want to match a back-
|
that it stands for itself. In particular, if you want to match a back-
|
||||||
|
@ -5695,7 +5707,8 @@ BACKSLASH
|
||||||
is not followed by \E later in the pattern, the literal interpretation
|
is not followed by \E later in the pattern, the literal interpretation
|
||||||
continues to the end of the pattern (that is, \E is assumed at the
|
continues to the end of the pattern (that is, \E is assumed at the
|
||||||
end). If the isolated \Q is inside a character class, this causes an
|
end). If the isolated \Q is inside a character class, this causes an
|
||||||
error, because the character class is not terminated.
|
error, because the character class is not terminated by a closing
|
||||||
|
square bracket.
|
||||||
|
|
||||||
Non-printing characters
|
Non-printing characters
|
||||||
|
|
||||||
|
@ -5810,10 +5823,10 @@ BACKSLASH
|
||||||
|
|
||||||
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
|
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
|
||||||
just described only when it is followed by two hexadecimal digits. Oth-
|
just described only when it is followed by two hexadecimal digits. Oth-
|
||||||
erwise, it matches a literal "x" character. In this mode mode, support
|
erwise, it matches a literal "x" character. In this mode, support for
|
||||||
for code points greater than 256 is provided by \u, which must be fol-
|
code points greater than 256 is provided by \u, which must be followed
|
||||||
lowed by four hexadecimal digits; otherwise it matches a literal "u"
|
by four hexadecimal digits; otherwise it matches a literal "u" charac-
|
||||||
character.
|
ter.
|
||||||
|
|
||||||
Characters whose value is less than 256 can be defined by either of the
|
Characters whose value is less than 256 can be defined by either of the
|
||||||
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
|
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
|
||||||
|
@ -5825,12 +5838,10 @@ BACKSLASH
|
||||||
Characters that are specified using octal or hexadecimal numbers are
|
Characters that are specified using octal or hexadecimal numbers are
|
||||||
limited to certain values, as follows:
|
limited to certain values, as follows:
|
||||||
|
|
||||||
8-bit non-UTF mode less than 0x100
|
8-bit non-UTF mode no greater than 0xff
|
||||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
16-bit non-UTF mode no greater than 0xffff
|
||||||
16-bit non-UTF mode less than 0x10000
|
32-bit non-UTF mode no greater than 0xffffffff
|
||||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||||
32-bit non-UTF mode less than 0x100000000
|
|
||||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
|
||||||
|
|
||||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
|
||||||
called "surrogate" codepoints), and 0xffef.
|
called "surrogate" codepoints), and 0xffef.
|
||||||
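As a rough illustration of the limits in the revised table, the following Python sketch checks whether a numeric escape value would be acceptable in a given mode. The function name `escape_value_allowed` and its parameters are hypothetical and are not part of the PCRE2 API. For the UTF modes the sketch rejects only the surrogate range, and it assumes the value is non-negative.

```python
# Upper bounds for non-UTF modes, keyed by code unit width in bits.
NON_UTF_MAX = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}

# Upper bound shared by all UTF modes (the Unicode limit).
UTF_MAX = 0x10FFFF


def escape_value_allowed(value, width, utf):
    """Return True if `value` fits the limit for the given mode.

    value: non-negative integer from an octal or hexadecimal escape
    width: code unit width, one of 8, 16 or 32
    utf:   True for a UTF mode, False for a non-UTF mode
    """
    if utf:
        # Only the surrogate range is treated as invalid in this sketch.
        is_surrogate = 0xD800 <= value <= 0xDFFF
        return value <= UTF_MAX and not is_surrogate
    return value <= NON_UTF_MAX[width]
```

Note that a surrogate value such as 0xd800 is accepted by this sketch in 16-bit non-UTF mode, because only the width limit applies there.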
|
@ -5852,8 +5863,7 @@ BACKSLASH
|
||||||
handler and used to modify the case of following characters. By
|
handler and used to modify the case of following characters. By
|
||||||
default, PCRE2 does not support these escape sequences. However, if the
|
default, PCRE2 does not support these escape sequences. However, if the
|
||||||
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
|
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
|
||||||
used to define a character by code point, as described in the previous
|
used to define a character by code point, as described above.
|
||||||
section.
|
|
||||||
|
|
||||||
Absolute and relative back references
|
Absolute and relative back references
|
||||||
|
|
||||||
|
@ -6022,7 +6032,10 @@ BACKSLASH
|
||||||
tional escape sequences that match characters with specific properties
|
tional escape sequences that match characters with specific properties
|
||||||
are available. In 8-bit non-UTF-8 mode, these sequences are of course
|
are available. In 8-bit non-UTF-8 mode, these sequences are of course
|
||||||
limited to testing characters whose codepoints are less than 256, but
|
limited to testing characters whose codepoints are less than 256, but
|
||||||
they do work in this mode. The extra escape sequences are:
|
they do work in this mode. In 32-bit non-UTF mode, codepoints greater
|
||||||
|
than 0x10ffff (the Unicode limit) may be encountered. These are all
|
||||||
|
treated as being in the Common script and with an unassigned type. The
|
||||||
|
extra escape sequences are:
|
||||||
|
|
||||||
\p{xx} a character with the xx property
|
\p{xx} a character with the xx property
|
||||||
\P{xx} a character without the xx property
|
\P{xx} a character without the xx property
|
||||||
|
@ -7328,16 +7341,9 @@ ASSERTIONS
|
||||||
Assertion subpatterns are not capturing subpatterns. If such an asser-
|
Assertion subpatterns are not capturing subpatterns. If such an asser-
|
||||||
tion contains capturing subpatterns within it, these are counted for
|
tion contains capturing subpatterns within it, these are counted for
|
||||||
the purposes of numbering the capturing subpatterns in the whole pat-
|
the purposes of numbering the capturing subpatterns in the whole pat-
|
||||||
tern. However, substring capturing is carried out only for positive
|
tern. However, substring capturing is normally carried out only for
|
||||||
assertions. (Perl sometimes, but not always, does do capturing in nega-
|
positive assertions (but see the discussion of conditional subpatterns
|
||||||
tive assertions.)
|
below).
|
||||||
|
|
||||||
WARNING: If a positive assertion containing one or more capturing sub-
|
|
||||||
patterns succeeds, but failure to match later in the pattern causes
|
|
||||||
backtracking over this assertion, the captures within the assertion are
|
|
||||||
reset only if no higher numbered captures are already set. This is,
|
|
||||||
unfortunately, a fundamental limitation of the current implementation;
|
|
||||||
it may get removed in a future reworking.
|
|
||||||
|
|
||||||
For compatibility with Perl, most assertion subpatterns may be
|
For compatibility with Perl, most assertion subpatterns may be
|
||||||
repeated; though it makes no sense to assert the same thing several
|
repeated; though it makes no sense to assert the same thing several
|
||||||
|
@ -7686,6 +7692,12 @@ CONDITIONAL SUBPATTERNS
|
||||||
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
|
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
|
||||||
letters and dd are digits.
|
letters and dd are digits.
|
||||||
|
|
||||||
|
For Perl compatibility, if an assertion that is a condition contains
|
||||||
|
capturing subpatterns, any capturing that occurs is retained after-
|
||||||
|
wards, for both positive and negative assertions. (Compare non-condi-
|
||||||
|
tional assertions, when captures are retained only for positive asser-
|
||||||
|
tions.)
|
||||||
|
|
||||||
|
|
||||||
COMMENTS
|
COMMENTS
|
||||||
|
|
||||||
|
@ -7849,94 +7861,59 @@ RECURSIVE PATTERNS
|
||||||
|
|
||||||
Differences in recursion processing between PCRE2 and Perl
|
Differences in recursion processing between PCRE2 and Perl
|
||||||
|
|
||||||
Recursion processing in PCRE2 differs from Perl in two important ways.
|
Some former differences between PCRE2 and Perl no longer exist.
|
||||||
In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
|
|
||||||
always treated as an atomic group. That is, once it has matched some of
|
|
||||||
the subject string, it is never re-entered, even if it contains untried
|
|
||||||
alternatives and there is a subsequent matching failure. This can be
|
|
||||||
illustrated by the following pattern, which purports to match a palin-
|
|
||||||
dromic string that contains an odd number of characters (for example,
|
|
||||||
"a", "aba", "abcba", "abcdcba"):
|
|
||||||
|
|
||||||
^(.|(.)(?1)\2)$
|
Before release 10.30, recursion processing in PCRE2 differed from Perl
|
||||||
|
in that a recursive subpattern call was always treated as an atomic
|
||||||
|
group. That is, once it had matched some of the subject string, it was
|
||||||
|
never re-entered, even if it contained untried alternatives and there
|
||||||
|
was a subsequent matching failure. (Historical note: PCRE implemented
|
||||||
|
recursion before Perl did.)
|
||||||
|
|
||||||
The idea is that it either matches a single character, or two identical
|
Starting with release 10.30, recursive subroutine calls are no longer
|
||||||
characters surrounding a sub-palindrome. In Perl, this pattern works;
|
treated as atomic. That is, they can be re-entered to try unused alter-
|
||||||
in PCRE2 it does not if the pattern is longer than three characters.
|
natives if there is a matching failure later in the pattern. This is
|
||||||
Consider the subject string "abcba":
|
now compatible with the way Perl works. If you want a subroutine call
|
||||||
|
to be atomic, you must explicitly enclose it in an atomic group.
|
||||||
|
|
||||||
At the top level, the first character is matched, but as it is not at
|
Supporting backtracking into recursions simplifies certain types of
|
||||||
the end of the string, the first alternative fails; the second alterna-
|
recursive pattern. For example, this pattern matches palindromic
|
||||||
tive is taken and the recursion kicks in. The recursive call to subpat-
|
strings:
|
||||||
tern 1 successfully matches the next character ("b"). (Note that the
|
|
||||||
beginning and end of line tests are not part of the recursion).
|
|
||||||
|
|
||||||
Back at the top level, the next character ("c") is compared with what
|
|
||||||
subpattern 2 matched, which was "a". This fails. Because the recursion
|
|
||||||
is treated as an atomic group, there are now no backtracking points,
|
|
||||||
and so the entire match fails. (Perl is able, at this point, to re-
|
|
||||||
enter the recursion and try the second alternative.) However, if the
|
|
||||||
pattern is written with the alternatives in the other order, things are
|
|
||||||
different:
|
|
||||||
|
|
||||||
^((.)(?1)\2|.)$
|
|
||||||
|
|
||||||
This time, the recursing alternative is tried first, and continues to
|
|
||||||
recurse until it runs out of characters, at which point the recursion
|
|
||||||
fails. But this time we do have another alternative to try at the
|
|
||||||
higher level. That is the big difference: in the previous case the
|
|
||||||
remaining alternative is at a deeper recursion level, which PCRE2 can-
|
|
||||||
not use.
|
|
||||||
|
|
||||||
To change the pattern so that it matches all palindromic strings, not
|
|
||||||
just those with an odd number of characters, it is tempting to change
|
|
||||||
the pattern to this:
|
|
||||||
|
|
||||||
^((.)(?1)\2|.?)$
|
^((.)(?1)\2|.?)$
|
||||||
|
|
||||||
Again, this works in Perl, but not in PCRE2, and for the same reason.
|
The second branch in the group matches a single central character in
|
||||||
When a deeper recursion has matched a single character, it cannot be
|
the palindrome when there are an odd number of characters, or nothing
|
||||||
entered again in order to match an empty string. The solution is to
|
when there are an even number of characters, but in order to work it
|
||||||
separate the two cases, and write out the odd and even cases as alter-
|
has to be able to try the second case when the rest of the pattern
|
||||||
natives at the higher level:
|
match fails. If you want to match typical palindromic phrases, the pat-
|
||||||
|
tern has to ignore all non-word characters, which can be done like
|
||||||
|
this:
|
||||||
|
|
||||||
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
|
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
|
||||||
|
|
||||||
If you want to match typical palindromic phrases, the pattern has to
|
|
||||||
ignore all non-word characters, which can be done like this:
|
|
||||||
|
|
||||||
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
|
|
||||||
|
|
||||||
If run with the PCRE2_CASELESS option, this pattern matches phrases
|
If run with the PCRE2_CASELESS option, this pattern matches phrases
|
||||||
such as "A man, a plan, a canal: Panama!" and it works in both PCRE2
|
such as "A man, a plan, a canal: Panama!". Note the use of the posses-
|
||||||
and Perl. Note the use of the possessive quantifier *+ to avoid back-
|
sive quantifier *+ to avoid backtracking into sequences of non-word
|
||||||
tracking into sequences of non-word characters. Without this, PCRE2
|
characters. Without this, PCRE2 takes a great deal longer (ten times or
|
||||||
takes a great deal longer (ten times or more) to match typical phrases,
|
more) to match typical phrases, and Perl takes so long that you think
|
||||||
and Perl takes so long that you think it has gone into a loop.
|
it has gone into a loop.
|
||||||
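The effect of allowing backtracking into a recursion can be modelled outside PCRE2. The Python sketch below is only an illustration, not PCRE2's implementation: it hand-codes group 1 of the pattern ^((.)(?1)\2|.?)$ as a generator that yields every offset at which the group could end. The "atomic" variant keeps only the first result of each recursive call, which models the behaviour before release 10.30. All function names are hypothetical, and the subject is assumed to contain no newlines.

```python
def group1(s, i):
    """Yield each end offset for group 1 starting at i (backtracking model)."""
    if i < len(s):
        c = s[i]                      # (.) captured as \2
        for j in group1(s, i + 1):    # (?1): every way the recursion can end
            if j < len(s) and s[j] == c:
                yield j + 1           # \2 matched the captured character
    if i < len(s):
        yield i + 1                   # .? taking one character
    yield i                           # .? taking nothing


def group1_atomic(s, i):
    """As group1, but a recursive call is never re-entered (atomic model)."""
    if i < len(s):
        c = s[i]
        j = next(group1_atomic(s, i + 1), None)   # first result only
        if j is not None and j < len(s) and s[j] == c:
            yield j + 1
    if i < len(s):
        yield i + 1
    yield i


def matches_backtracking(s):
    # The anchors ^ and $ require the group to span the whole subject.
    return any(j == len(s) for j in group1(s, 0))


def matches_atomic(s):
    return any(j == len(s) for j in group1_atomic(s, 0))
```

In this model the even-length subject "abba" is accepted only by the backtracking variant, because the innermost call must be re-entered to try the empty alternative of .? after first consuming one character.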
|
|
||||||
WARNING: The palindrome-matching patterns above work only if the sub-
|
Another way in which PCRE2 and Perl used to differ in their recursion
|
||||||
ject string does not start with a palindrome that is shorter than the
|
processing is in the handling of captured values. Formerly in Perl,
|
||||||
entire string. For example, although "abcba" is correctly matched, if
|
when a subpattern was called recursively or as a subroutine (see the
|
||||||
the subject is "ababa", PCRE2 finds the palindrome "aba" at the start,
|
next section), it had no access to any values that were captured out-
|
||||||
then fails at top level because the end of the string does not follow.
|
side the recursion, whereas in PCRE2 these values can be referenced.
|
||||||
Once again, it cannot jump back into the recursion to try other alter-
|
Consider this pattern:
|
||||||
natives, so the entire match fails.
|
|
||||||
|
|
||||||
The second way in which PCRE2 and Perl differ in their recursion pro-
|
|
||||||
cessing is in the handling of captured values. In Perl, when a subpat-
|
|
||||||
tern is called recursively or as a subpattern (see the next section),
|
|
||||||
it has no access to any values that were captured outside the recur-
|
|
||||||
sion, whereas in PCRE2 these values can be referenced. Consider this
|
|
||||||
pattern:
|
|
||||||
|
|
||||||
^(.)(\1|a(?2))
|
^(.)(\1|a(?2))
|
||||||
|
|
||||||
In PCRE2, this pattern matches "bab". The first capturing parentheses
|
This pattern matches "bab". The first capturing parentheses match "b",
|
||||||
match "b", then in the second group, when the back reference \1 fails
|
then in the second group, when the back reference \1 fails to match
|
||||||
to match "b", the second alternative matches "a" and then recurses. In
|
"b", the second alternative matches "a" and then recurses. In the
|
||||||
the recursion, \1 does now match "b" and so the whole match succeeds.
|
recursion, \1 does now match "b" and so the whole match succeeds. This
|
||||||
In Perl, the pattern fails to match because inside the recursive call
|
match used to fail in Perl, but in later versions (I tried 5.024) it
|
||||||
\1 cannot access the externally set value.
|
now works.
|
||||||
|
|
||||||
|
|
||||||
SUBPATTERNS AS SUBROUTINES
|
SUBPATTERNS AS SUBROUTINES
|
||||||
|
@ -7964,12 +7941,10 @@ SUBPATTERNS AS SUBROUTINES
|
||||||
two strings. Another example is given in the discussion of DEFINE
|
two strings. Another example is given in the discussion of DEFINE
|
||||||
above.
|
above.
|
||||||
|
|
||||||
All subroutine calls, whether recursive or not, are always treated as
|
Like recursions, subroutine calls used to be treated as atomic, but
|
||||||
atomic groups. That is, once a subroutine has matched some of the sub-
|
this changed at PCRE2 release 10.30, so backtracking into subroutine
|
||||||
ject string, it is never re-entered, even if it contains untried alter-
|
calls can now occur. However, any capturing parentheses that are set
|
||||||
natives and there is a subsequent matching failure. Any capturing
|
during the subroutine call revert to their previous values afterwards.
|
||||||
parentheses that are set during the subroutine call revert to their
|
|
||||||
previous values afterwards.
|
|
||||||
|
|
||||||
Processing options such as case-independence are fixed when a subpat-
|
Processing options such as case-independence are fixed when a subpat-
|
||||||
tern is defined, so if it is used as a subroutine, such options cannot
|
tern is defined, so if it is used as a subroutine, such options cannot
|
||||||
|
@ -8076,17 +8051,11 @@ CALLOUTS
|
||||||
|
|
||||||
BACKTRACKING CONTROL
|
BACKTRACKING CONTROL
|
||||||
|
|
||||||
Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
|
There are a number of special "Backtracking Control Verbs" (to use
|
||||||
which are still described in the Perl documentation as "experimental
|
Perl's terminology) that modify the behaviour of backtracking during
|
||||||
and subject to change or removal in a future version of Perl". It goes
|
matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
|
||||||
on to say: "Their usage in production code should be noted to avoid
|
verbs take either form, possibly behaving differently depending on
|
||||||
problems during upgrades." The same remarks apply to the PCRE2 features
|
whether or not a name is present.
|
||||||
described in this section.
|
|
||||||
|
|
||||||
The new verbs make use of what was previously invalid syntax: an open-
|
|
||||||
ing parenthesis followed by an asterisk. They are generally of the form
|
|
||||||
(*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
|
|
||||||
differently depending on whether or not a name is present.
|
|
||||||
|
|
||||||
By default, for compatibility with Perl, a name is any sequence of
|
By default, for compatibility with Perl, a name is any sequence of
|
||||||
characters that does not include a closing parenthesis. The name is not
|
characters that does not include a closing parenthesis. The name is not
|
||||||
|
@ -8116,7 +8085,7 @@ BACKTRACKING CONTROL
|
||||||
|
|
||||||
Since these verbs are specifically related to backtracking, most of
|
Since these verbs are specifically related to backtracking, most of
|
||||||
them can be used only when the pattern is to be matched using the tra-
|
them can be used only when the pattern is to be matched using the tra-
|
||||||
ditional matching function, because these use a backtracking algorithm.
|
ditional matching function, because that uses a backtracking algorithm.
|
||||||
With the exception of (*FAIL), which behaves like a failing negative
|
With the exception of (*FAIL), which behaves like a failing negative
|
||||||
assertion, the backtracking control verbs cause an error if encountered
|
assertion, the backtracking control verbs cause an error if encountered
|
||||||
by the DFA matching function.
|
by the DFA matching function.
|
||||||
|
@ -8236,11 +8205,11 @@ BACKTRACKING CONTROL
|
||||||
tinues with what follows, but if there is no subsequent match, causing
|
tinues with what follows, but if there is no subsequent match, causing
|
||||||
a backtrack to the verb, a failure is forced. That is, backtracking
|
a backtrack to the verb, a failure is forced. That is, backtracking
|
||||||
cannot pass to the left of the verb. However, when one of these verbs
|
cannot pass to the left of the verb. However, when one of these verbs
|
||||||
appears inside an atomic group (which includes any group that is called
|
appears inside an atomic group or in an assertion that is true, its
|
||||||
as a subroutine) or in an assertion that is true, its effect is con-
|
effect is confined to that group, because once the group has been
|
||||||
fined to that group, because once the group has been matched, there is
|
matched, there is never any backtracking into it. In this situation,
|
||||||
never any backtracking into it. In this situation, backtracking has to
|
backtracking has to jump to the left of the entire atomic group or
|
||||||
jump to the left of the entire atomic group or assertion.
|
assertion.
|
||||||
|
|
||||||
These verbs differ in exactly what kind of failure occurs when back-
|
These verbs differ in exactly what kind of failure occurs when back-
|
||||||
tracking reaches them. The behaviour described below is what happens
|
tracking reaches them. The behaviour described below is what happens
|
||||||
|
@ -8303,11 +8272,10 @@ BACKTRACKING CONTROL
|
||||||
any other way. In an anchored pattern (*PRUNE) has the same effect as
|
any other way. In an anchored pattern (*PRUNE) has the same effect as
|
||||||
(*COMMIT).
|
(*COMMIT).
|
||||||
|
|
||||||
The behaviour of (*PRUNE:NAME) is the not the same as
|
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
|
||||||
(*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
|
It is like (*MARK:NAME) in that the name is remembered for passing back
|
||||||
remembered for passing back to the caller. However, (*SKIP:NAME)
|
to the caller. However, (*SKIP:NAME) searches only for names set with
|
||||||
searches only for names set with (*MARK), ignoring those set by
|
(*MARK), ignoring those set by (*PRUNE) or (*THEN).
|
||||||
(*PRUNE) or (*THEN).
|
|
||||||
|
|
||||||
(*SKIP)
|
(*SKIP)
|
||||||
|
|
||||||
|
@ -8496,8 +8464,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 27 December 2016
|
Last updated: 18 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -9078,7 +9046,10 @@ SECURITY CONCERNS
|
||||||
use within individual applications. As such, the data supplied to
|
use within individual applications. As such, the data supplied to
|
||||||
pcre2_serialize_decode() is expected to be trusted data, not data from
|
pcre2_serialize_decode() is expected to be trusted data, not data from
|
||||||
arbitrary external sources. There is only some simple consistency
|
arbitrary external sources. There is only some simple consistency
|
||||||
checking, not complete validation of what is being re-loaded.
|
checking, not complete validation of what is being re-loaded. Corrupted
|
||||||
|
data may cause undefined results. For example, if the length field of a
|
||||||
|
pattern in the serialized data is corrupted, the deserializing code may
|
||||||
|
read beyond the end of the byte stream that is passed to it.
|
||||||
|
|
||||||
|
|
||||||
SAVING COMPILED PATTERNS
|
SAVING COMPILED PATTERNS
|
||||||
|
@ -9211,8 +9182,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 24 May 2016
|
Last updated: 21 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_CONFIG 3 "20 April 2014" "PCRE2 10.0"
|
.TH PCRE2_CONFIG 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -31,10 +31,13 @@ point to a uint32_t integer variable. The available codes are:
|
||||||
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
|
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
|
||||||
PCRE2_BSR_UNICODE
|
PCRE2_BSR_UNICODE
|
||||||
PCRE2_BSR_ANYCRLF
|
PCRE2_BSR_ANYCRLF
|
||||||
|
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
|
||||||
|
.\" JOIN
|
||||||
PCRE2_CONFIG_JIT Availability of just-in-time compiler
|
PCRE2_CONFIG_JIT Availability of just-in-time compiler
|
||||||
support (1=yes 0=no)
|
support (1=yes 0=no)
|
||||||
PCRE2_CONFIG_JITTARGET Information about the target archi-
|
.\" JOIN
|
||||||
tecture for the JIT compiler
|
PCRE2_CONFIG_JITTARGET Information (a string) about the target
|
||||||
|
architecture for the JIT compiler
|
||||||
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
||||||
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
||||||
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
||||||
|
@ -44,9 +47,9 @@ point to a uint32_t integer variable. The available codes are:
|
||||||
PCRE2_NEWLINE_ANY
|
PCRE2_NEWLINE_ANY
|
||||||
PCRE2_NEWLINE_ANYCRLF
|
PCRE2_NEWLINE_ANYCRLF
|
||||||
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
||||||
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
|
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
|
||||||
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
|
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
|
||||||
0=heap)
|
.\" JOIN
|
||||||
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
|
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
|
||||||
0=no)
|
0=no)
|
||||||
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23"
|
.TH PCRE2_DFA_MATCH 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -19,8 +19,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
This function matches a compiled regular expression against a given subject
|
This function matches a compiled regular expression against a given subject
|
||||||
string, using an alternative matching algorithm that scans the subject string
|
string, using an alternative matching algorithm that scans the subject string
|
||||||
just once (\fInot\fP Perl-compatible). (The Perl-compatible matching function
|
just once (except when processing lookaround assertions). This function is
|
||||||
is \fBpcre2_match()\fP.) The arguments for this function are:
|
\fInot\fP Perl-compatible (the Perl-compatible matching function is
|
||||||
|
\fBpcre2_match()\fP). The arguments for this function are:
|
||||||
.sp
|
.sp
|
||||||
\fIcode\fP Points to the compiled pattern
|
\fIcode\fP Points to the compiled pattern
|
||||||
\fIsubject\fP Points to the subject string
|
\fIsubject\fP Points to the subject string
|
||||||
|
@ -33,22 +34,26 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
|
||||||
\fIwscount\fP Number of elements in the vector
|
\fIwscount\fP Number of elements in the vector
|
||||||
.sp
|
.sp
|
||||||
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
||||||
up a callout function or specify the recursion limit. The \fIlength\fP and
|
up a callout function or specify the recursion depth limit. The \fIlength\fP
|
||||||
\fIstartoffset\fP values are code units, not characters. The options are:
|
and \fIstartoffset\fP values are code units, not characters. The options are:
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ANCHORED Match only at the first position
|
PCRE2_ANCHORED Match only at the first position
|
||||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||||
PCRE2_NOTEOL Subject is not the end of a line
|
PCRE2_NOTEOL Subject is not the end of a line
|
||||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||||
|
.\" JOIN
|
||||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
||||||
is not a valid match
|
is not a valid match
|
||||||
|
.\" JOIN
|
||||||
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
|
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
|
||||||
validity (only relevant if PCRE2_UTF
|
validity (only relevant if PCRE2_UTF
|
||||||
was set at compile time)
|
was set at compile time)
|
||||||
|
.\" JOIN
|
||||||
|
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
|
||||||
|
match even if there is a full match
|
||||||
|
.\" JOIN
|
||||||
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
|
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
|
||||||
match if no full matches are found
|
match if no full matches are found
|
||||||
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
|
|
||||||
even if there is a full match as well
|
|
||||||
PCRE2_DFA_RESTART Restart after a partial match
|
PCRE2_DFA_RESTART Restart after a partial match
|
||||||
PCRE2_DFA_SHORTEST Return only the shortest match
|
PCRE2_DFA_SHORTEST Return only the shortest match
|
||||||
.sp
|
.sp
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
|
.TH PCRE2_GET_ERROR_MESSAGE 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -22,11 +22,11 @@ errors are negative numbers. The arguments are:
|
||||||
\fIbuffer\fP where to put the message
|
\fIbuffer\fP where to put the message
|
||||||
\fIbufflen\fP the length of the buffer (code units)
|
\fIbufflen\fP the length of the buffer (code units)
|
||||||
.sp
|
.sp
|
||||||
The function returns the length of the message, excluding the trailing zero, or
|
The function returns the length of the message in code units, excluding the
|
||||||
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
|
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
|
||||||
this case, the returned message is truncated (but still with a trailing zero).
|
too small. In this case, the returned message is truncated (but still with a
|
||||||
If \fIerrorcode\fP does not contain a recognized error code number, the
|
trailing zero). If \fIerrorcode\fP does not contain a recognized error code
|
||||||
negative value PCRE2_ERROR_BADDATA is returned.
|
number, the negative value PCRE2_ERROR_BADDATA is returned.
|
||||||
.P
|
.P
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_JIT_STACK_CREATE 3 "03 November 2014" "PCRE2 10.00"
|
.TH PCRE2_JIT_STACK_CREATE 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -20,10 +20,9 @@ maximum size to which it is allowed to grow. The final argument is a general
|
||||||
context, for memory allocation functions, or NULL for standard memory
|
context, for memory allocation functions, or NULL for standard memory
|
||||||
allocation. The result can be passed to the JIT run-time code by calling
|
allocation. The result can be passed to the JIT run-time code by calling
|
||||||
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
|
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
|
||||||
which can then be processed by \fBpcre2_match()\fP. If the "fast path" JIT
|
which can then be processed by \fBpcre2_match()\fP or \fBpcre2_jit_match()\fP.
|
||||||
matcher, \fBpcre2_jit_match()\fP is used, the stack can be passed directly as
|
A maximum stack size of 512K to 1M should be more than enough for any pattern.
|
||||||
an argument. A maximum stack size of 512K to 1M should be more than enough for
|
For more details, see the
|
||||||
any pattern. For more details, see the
|
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2jit\fP
|
\fBpcre2jit\fP
|
||||||
.\"
|
.\"
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_MAKETABLES 3 "21 October 2014" "PCRE2 10.00"
|
.TH PCRE2_MAKETABLES 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -12,10 +12,10 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH DESCRIPTION
|
.SH DESCRIPTION
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
This function builds a set of character tables for character values less than
|
This function builds a set of character tables for character code points that
|
||||||
256. These can be passed to \fBpcre2_compile()\fP in a compile context in order
|
are less than 256. These can be passed to \fBpcre2_compile()\fP in a compile
|
||||||
to override the internal, built-in tables (which were either defaulted or made
|
context in order to override the internal, built-in tables (which were either
|
||||||
by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
|
defaulted or made by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2_set_character_tables()\fP
|
\fBpcre2_set_character_tables()\fP
|
||||||
.\"
|
.\"
|
||||||
|
|
|
@ -255,6 +255,9 @@ OPTIONS
|
||||||
directory like this is an immediate end-of-file; in others it
|
directory like this is an immediate end-of-file; in others it
|
||||||
may provoke an error.
|
may provoke an error.
|
||||||
|
|
||||||
|
--depth-limit=number
|
||||||
|
See --match-limit below.
|
||||||
|
|
||||||
-e pattern, --regex=pattern, --regexp=pattern
|
-e pattern, --regex=pattern, --regexp=pattern
|
||||||
Specify a pattern to be matched. This option can be used mul-
|
Specify a pattern to be matched. This option can be used mul-
|
||||||
tiple times in order to specify several patterns. It can also
|
tiple times in order to specify several patterns. It can also
|
||||||
|
@ -477,32 +480,24 @@ OPTIONS
|
||||||
no short form for this option.
|
no short form for this option.
|
||||||
|
|
||||||
--match-limit=number
|
--match-limit=number
|
||||||
Processing some regular expression patterns can require a
|
Processing some regular expression patterns may take a very
|
||||||
very large amount of memory, leading in some cases to a pro-
|
long time to search for all possible matching strings. Others
|
||||||
gram crash if not enough is available. Other patterns may
|
may require a very large amount of memory. There are two
|
||||||
take a very long time to search for all possible matching
|
options that set resource limits for matching.
|
||||||
strings. The pcre2_match() function that is called by
|
|
||||||
pcre2grep to do the matching has two parameters that can
|
|
||||||
limit the resources that it uses.
|
|
||||||
|
|
||||||
The --match-limit option provides a means of limiting
|
The --match-limit option provides a means of limiting comput-
|
||||||
resource usage when processing patterns that are not going to
|
ing resource usage when processing patterns that are not
|
||||||
match, but which have a very large number of possibilities in
|
going to match, but which have a very large number of possi-
|
||||||
their search trees. The classic example is a pattern that
|
bilities in their search trees. The classic example is a pat-
|
||||||
uses nested unlimited repeats. Internally, PCRE2 uses a func-
|
tern that uses nested unlimited repeats. Internally, PCRE2
|
||||||
tion called match() which it calls repeatedly (sometimes
|
has a counter that is incremented each time around its main
|
||||||
recursively). The limit set by --match-limit is imposed on
|
processing loop. If the value set by --match-limit is
|
||||||
the number of times this function is called during a match,
|
reached, an error occurs.
|
||||||
which has the effect of limiting the amount of backtracking
|
|
||||||
that can take place.
|
|
||||||
|
|
||||||
The --recursion-limit option is similar to --match-limit, but
|
The --depth-limit option limits the depth of nested back-
|
||||||
instead of limiting the total number of times that match() is
|
tracking points, which in turn limits the amount of memory
|
||||||
called, it limits the depth of recursive calls, which in turn
|
that is used. This limit is of use only if it is set smaller
|
||||||
limits the amount of memory that can be used. The recursion
|
than --match-limit.
|
||||||
depth is a smaller number than the total number of calls,
|
|
||||||
because not all calls to match() are recursive. This limit is
|
|
||||||
of use only if it is set smaller than --match-limit.
|
|
||||||
|
|
||||||
There are no short forms for these options. The default set-
|
There are no short forms for these options. The default set-
|
||||||
tings are specified when the PCRE2 library is compiled, with
|
tings are specified when the PCRE2 library is compiled, with
|
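Reviewer note on the hunk above: the new wording describes --match-limit as a cap on a counter that is incremented each time around the main processing loop, and --depth-limit as a cap on the depth of nested backtracking points. The sketch below is a toy model of those two limits, not PCRE2's implementation. The matcher supports only literals, `.`, and `x*`, and every name in it (`LimitExceeded`, `match_here`, `limited_match`) is hypothetical. Recursion depth stands in for "nested backtracking points".

```python
class LimitExceeded(Exception):
    """Raised when a resource limit is hit (stand-in for a match error)."""


def match_here(pat, text, state, depth=0):
    """Toy backtracking match of pat against a prefix of text.

    Supports literal characters, '.', and 'x*'. This is an illustration
    only and is unrelated to PCRE2's real matching engine.
    """
    # Match limit: one counter, bumped on every pass through the matcher.
    state["steps"] += 1
    if state["steps"] > state["match_limit"]:
        raise LimitExceeded("match limit")
    # Depth limit: how deeply nested the pending backtracking points are.
    if depth > state["depth_limit"]:
        raise LimitExceeded("depth limit")

    if not pat:
        return True
    if len(pat) >= 2 and pat[1] == "*":
        # Greedy repeat: take the longest run first, then back off.
        n = 0
        while n < len(text) and pat[0] in (".", text[n]):
            n += 1
        while n >= 0:
            if match_here(pat[2:], text[n:], state, depth + 1):
                return True
            n -= 1
        return False
    if text and pat[0] in (".", text[0]):
        return match_here(pat[1:], text[1:], state, depth + 1)
    return False


def limited_match(pat, text, match_limit=10_000_000, depth_limit=10_000_000):
    """Run the toy matcher under both limits; return (matched, steps used)."""
    state = {"steps": 0, "match_limit": match_limit, "depth_limit": depth_limit}
    matched = match_here(pat, text, state)
    return matched, state["steps"]
```

In this model a pattern with several unlimited repeats that cannot match, such as `a*a*a*a*b` against a run of `a` characters, exhausts a small match limit quickly, while a long chain of literals trips a small depth limit first. The depth can never exceed the step count, which mirrors the statement that the depth limit is of use only if it is set smaller than the match limit.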
||||||
|
@ -834,9 +829,9 @@ MATCHING ERRORS
|
||||||
such errors, pcre2grep gives up.
|
such errors, pcre2grep gives up.
|
||||||
|
|
||||||
The --match-limit option of pcre2grep can be used to set the overall
|
The --match-limit option of pcre2grep can be used to set the overall
|
||||||
resource limit; there is a second option called --recursion-limit that
|
resource limit; there is a second option called --depth-limit that sets
|
||||||
sets a limit on the amount of memory (usually stack) that is used (see
|
a limit on the amount of memory that is used (see the discussion of
|
||||||
the discussion of these options above).
|
these options above).
|
||||||
|
|
||||||
|
|
||||||
DIAGNOSTICS
|
DIAGNOSTICS
|
||||||
|
@ -862,5 +857,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 31 December 2016
|
Last updated: 21 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
|
|
|
@ -91,13 +91,13 @@ INPUT ENCODING
|
||||||
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
|
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
|
||||||
values greater than 0xffff cause an error to occur).
|
values greater than 0xffff cause an error to occur).
|
||||||
|
|
||||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
|
UTF-8 (in its original definition) is not capable of encoding values
|
||||||
such values can be handled by the 32-bit library. When testing this
|
greater than 0x7fffffff, but such values can be handled by the 32-bit
|
||||||
library in non-UTF mode with utf8_input set, if any character is pre-
|
library. When testing this library in non-UTF mode with utf8_input set,
|
||||||
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
|
if any character is preceded by the byte 0xff (which is an illegal byte
|
||||||
is added to the character's value. This is the only way of passing such
|
in UTF-8) 0x80000000 is added to the character's value. This is the
|
||||||
code points in a pattern string. For subject strings, using an escape
|
only way of passing such code points in a pattern string. For subject
|
||||||
sequence is preferable.
|
strings, using an escape sequence is preferable.
|
||||||
|
|
||||||
|
|
||||||
COMMAND LINE OPTIONS
|
COMMAND LINE OPTIONS
|
||||||
|
@ -544,6 +544,7 @@ PATTERN MODIFIERS
|
||||||
/B bincode show binary code without lengths
|
/B bincode show binary code without lengths
|
||||||
callout_info show callout information
|
callout_info show callout information
|
||||||
debug same as info,fullbincode
|
debug same as info,fullbincode
|
||||||
|
framesize show matching frame size
|
||||||
fullbincode show binary code with lengths
|
fullbincode show binary code with lengths
|
||||||
/I info show info about compiled pattern
|
/I info show info about compiled pattern
|
||||||
hex unquoted characters are hexadecimal
|
hex unquoted characters are hexadecimal
|
||||||
|
@ -624,6 +625,10 @@ PATTERN MODIFIERS
|
||||||
last character. These lines are omitted if no starting or ending code
|
last character. These lines are omitted if no starting or ending code
|
||||||
units are recorded.
|
units are recorded.
|
||||||
|
|
||||||
|
The framesize modifier shows the size, in bytes, of the storage frames
|
||||||
|
used by pcre2_match() for handling backtracking. The size depends on
|
||||||
|
the number of capturing parentheses in the pattern.
|
||||||
|
|
||||||
The callout_info modifier requests information about all the callouts
|
The callout_info modifier requests information about all the callouts
|
||||||
in the pattern. A list of them is output at the end of any other infor-
|
in the pattern. A list of them is output at the end of any other infor-
|
||||||
mation that is requested. For each callout, either its number or string
|
mation that is requested. For each callout, either its number or string
|
||||||
|
@ -959,6 +964,7 @@ SUBJECT MODIFIERS
|
||||||
callout_fail=<n>[:<m>] control callout failure
|
callout_fail=<n>[:<m>] control callout failure
|
||||||
callout_none do not supply a callout function
|
callout_none do not supply a callout function
|
||||||
copy=<number or name> copy captured substring
|
copy=<number or name> copy captured substring
|
||||||
|
depth_limit=<n> set a depth limit
|
||||||
dfa use pcre2_dfa_match()
|
dfa use pcre2_dfa_match()
|
||||||
find_limits find match and recursion limits
|
find_limits find match and recursion limits
|
||||||
get=<number or name> extract captured substring
|
get=<number or name> extract captured substring
|
||||||
|
@ -972,7 +978,7 @@ SUBJECT MODIFIERS
|
||||||
offset=<n> set starting offset
|
offset=<n> set starting offset
|
||||||
offset_limit=<n> set offset limit
|
offset_limit=<n> set offset limit
|
||||||
ovector=<n> set size of output vector
|
ovector=<n> set size of output vector
|
||||||
recursion_limit=<n> set a recursion limit
|
recursion_limit=<n> obsolete synonym for depth_limit
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show startchar when relevant
|
startchar show startchar when relevant
|
||||||
startoffset=<n> same as offset=<n>
|
startoffset=<n> same as offset=<n>
|
||||||
|
@ -1188,32 +1194,31 @@ SUBJECT MODIFIERS
|
||||||
Providing a stack that is larger than the default 32K is necessary only
|
Providing a stack that is larger than the default 32K is necessary only
|
||||||
for very complicated patterns.
|
for very complicated patterns.
|
||||||
|
|
||||||
Setting match and recursion limits
|
Setting match and depth limits
|
||||||
|
|
||||||
The match_limit and recursion_limit modifiers set the appropriate lim-
|
The match_limit and depth_limit modifiers set the appropriate limits in
|
||||||
its in the match context. These values are ignored when the find_limits
|
the match context. These values are ignored when the find_limits modi-
|
||||||
modifier is specified.
|
fier is specified.
|
||||||
|
|
||||||
Finding minimum limits
|
Finding minimum limits
|
||||||
|
|
||||||
If the find_limits modifier is present, pcre2test calls pcre2_match()
|
If the find_limits modifier is present, pcre2test calls pcre2_match()
|
||||||
several times, setting different values in the match context via
|
several times, setting different values in the match context via
|
||||||
pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
|
pcre2_set_match_limit() and pcre2_set_depth_limit() until it finds the
|
||||||
the minimum values for each parameter that allow pcre2_match() to com-
|
minimum values for each parameter that allow pcre2_match() to complete
|
||||||
plete without error.
|
without error.
|
||||||
|
|
||||||
If JIT is being used, only the match limit is relevant. If DFA matching
|
If JIT is being used, only the match limit is relevant. If DFA matching
|
||||||
is being used, neither limit is relevant, and this modifier is ignored
|
is being used, only the depth limit is relevant, but at present this
|
||||||
(with a warning message).
|
modifier is ignored (with a warning message).
|
||||||
|
|
||||||
The match_limit number is a measure of the amount of backtracking that
|
The match_limit number is a measure of the amount of backtracking that
|
||||||
takes place, and learning the minimum value can be instructive. For
|
takes place, and learning the minimum value can be instructive. For
|
||||||
most simple matches, the number is quite small, but for patterns with
|
most simple matches, the number is quite small, but for patterns with
|
||||||
very large numbers of matching possibilities, it can become large very
|
very large numbers of matching possibilities, it can become large very
|
||||||
quickly with increasing length of subject string. The
|
quickly with increasing length of subject string. The depth_limit num-
|
||||||
match_limit_recursion number is a measure of how much stack (or, if
|
ber is a measure of how much memory for recording backtracking points
|
||||||
PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to
|
is needed to complete the match attempt.
|
||||||
complete the match attempt.
|
|
||||||
|
|
||||||
Showing MARK names
|
Showing MARK names
|
||||||
|
|
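Reviewer note on the find_limits text above: it says pcre2test calls pcre2_match() several times with different limit values until it finds the minimum that lets the match complete. The source does not say how the values are chosen, so the following is only one plausible search strategy (doubling, then bisection), written against a hypothetical `completes(limit)` callback. In a real harness that callback would set the limit with pcre2_set_match_limit() or pcre2_set_depth_limit() and report whether pcre2_match() finished without a limit error; here it is simulated.

```python
def find_min_limit(completes):
    """Return the smallest positive limit for which completes(limit) is True.

    Assumes completes() is monotonic: if a limit is large enough, every
    larger limit is large enough too. `completes` is a hypothetical
    callback, not part of any PCRE2 API.
    """
    lo, hi = 1, 1
    # Phase 1: double the limit until the match attempt completes.
    while not completes(hi):
        lo = hi + 1
        hi *= 2
    # Phase 2: bisect between the last failing and first passing values.
    while lo < hi:
        mid = (lo + hi) // 2
        if completes(mid):
            hi = mid
        else:
            lo = mid + 1
    return hi


def needs(n):
    """Simulated match attempt that completes only when limit >= n."""
    return lambda limit: limit >= n


minimum = find_min_limit(needs(1234))
```

The number of trial matches grows only logarithmically with the minimum value, which matters because each trial is a full match attempt.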
||||||
|
@ -1314,7 +1319,7 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
also output. Here is an example of an interactive pcre2test run.
|
also output. Here is an example of an interactive pcre2test run.
|
||||||
|
|
||||||
$ pcre2test
|
$ pcre2test
|
||||||
PCRE2 version 9.00 2014-05-10
|
PCRE2 version 10.22 2016-07-29
|
||||||
|
|
||||||
re> /^abc(\d+)/
|
re> /^abc(\d+)/
|
||||||
data> abc123
|
data> abc123
|
||||||
|
@ -1614,5 +1619,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 28 December 2016
|
Last updated: 21 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
|
|