Documentation update.
parent 32bab50c01
commit 3aeb812180
|
@ -1,10 +1,6 @@
|
||||||
Building PCRE2 without using autotools
|
Building PCRE2 without using autotools
|
||||||
--------------------------------------
|
--------------------------------------
|
||||||
|
|
||||||
This document has been converted from the PCRE1 document. I have removed a
|
|
||||||
number of sections about building in various environments, as they applied only
|
|
||||||
to PCRE1 and are probably out of date.
|
|
||||||
|
|
||||||
This document contains the following sections:
|
This document contains the following sections:
|
||||||
|
|
||||||
General
|
General
|
||||||
|
@ -183,21 +179,9 @@ can skip ahead to the CMake section.
|
||||||
|
|
||||||
STACK SIZE IN WINDOWS ENVIRONMENTS
|
STACK SIZE IN WINDOWS ENVIRONMENTS
|
||||||
|
|
||||||
The default processor stack size of 1Mb in some Windows environments is too
|
Prior to release 10.30 the default system stack size of 1Mb in some Windows
|
||||||
small for matching patterns that need much recursion. In particular, test 2 may
|
environments caused issues with some tests. This should no longer be the case
|
||||||
fail because of this. Normally, running out of stack causes a crash, but there
|
for 10.30 and later releases.
|
||||||
have been cases where the test program has just died silently. See your linker
|
|
||||||
documentation for how to increase stack size if you experience problems. If you
|
|
||||||
are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
|
|
||||||
compiler, you can increase the stack size for pcre2test and pcre2grep by
|
|
||||||
setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
|
|
||||||
example). The Linux default of 8Mb is a reasonable choice for the stack, though
|
|
||||||
even that can be too small for some pattern/subject combinations.
|
|
||||||
|
|
||||||
PCRE2 has a compile configuration option to disable the use of stack for
|
|
||||||
recursion so that heap is used instead. However, pattern matching is
|
|
||||||
significantly slower when this is done. There is more about stack usage in the
|
|
||||||
"pcre2stack" documentation.
|
|
||||||
|
|
||||||
|
|
||||||
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
|
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
|
||||||
|
@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
|
||||||
recommended download site.
|
recommended download site.
|
||||||
|
|
||||||
=============================
|
=============================
|
||||||
Last Updated: 13 October 2016
|
Last Updated: 17 March 2017
|
||||||
|
|
|
@ -15,8 +15,8 @@ subscribe or manage your subscription here:
|
||||||
|
|
||||||
https://lists.exim.org/mailman/listinfo/pcre-dev
|
https://lists.exim.org/mailman/listinfo/pcre-dev
|
||||||
|
|
||||||
Please read the NEWS file if you are upgrading from a previous release.
|
Please read the NEWS file if you are upgrading from a previous release. The
|
||||||
The contents of this README file are:
|
contents of this README file are:
|
||||||
|
|
||||||
The PCRE2 APIs
|
The PCRE2 APIs
|
||||||
Documentation for PCRE2
|
Documentation for PCRE2
|
||||||
|
@ -44,8 +44,8 @@ wrappers.
|
||||||
|
|
||||||
The distribution does contain a set of C wrapper functions for the 8-bit
|
The distribution does contain a set of C wrapper functions for the 8-bit
|
||||||
library that are based on the POSIX regular expression API (see the pcre2posix
|
library that are based on the POSIX regular expression API (see the pcre2posix
|
||||||
man page). These can be found in a library called libpcre2-posix. Note that this
|
man page). These can be found in a library called libpcre2-posix. Note that
|
||||||
just provides a POSIX calling interface to PCRE2; the regular expressions
|
this just provides a POSIX calling interface to PCRE2; the regular expressions
|
||||||
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
|
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
|
||||||
and does not give full access to all of PCRE2's facilities.
|
and does not give full access to all of PCRE2's facilities.
|
||||||
|
|
||||||
|
@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
|
||||||
Building PCRE2 on non-Unix-like systems
|
Building PCRE2 on non-Unix-like systems
|
||||||
---------------------------------------
|
---------------------------------------
|
||||||
|
|
||||||
For a non-Unix-like system, please read the comments in the file
|
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
|
||||||
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
|
your system supports the use of "configure" and "make" you may be able to build
|
||||||
"make" you may be able to build PCRE2 using autotools in the same way as for
|
PCRE2 using autotools in the same way as for many Unix-like systems.
|
||||||
many Unix-like systems.
|
|
||||||
|
|
||||||
PCRE2 can also be configured using CMake, which can be run in various ways
|
PCRE2 can also be configured using CMake, which can be run in various ways
|
||||||
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
|
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
|
||||||
|
@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
|
||||||
architectures. If you try to enable it on an unsupported architecture, there
|
architectures. If you try to enable it on an unsupported architecture, there
|
||||||
will be a compile time error.
|
will be a compile time error.
|
||||||
|
|
||||||
. If you do not want to make use of the support for UTF-8 Unicode character
|
. If you do not want to make use of the default support for UTF-8 Unicode
|
||||||
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
|
character strings in the 8-bit library, UTF-16 Unicode character strings in
|
||||||
library, or UTF-32 Unicode character strings in the 32-bit library, you can
|
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
|
||||||
add --disable-unicode to the "configure" command. This reduces the size of
|
library, you can add --disable-unicode to the "configure" command. This
|
||||||
the libraries. It is not possible to configure one library with Unicode
|
reduces the size of the libraries. It is not possible to configure one
|
||||||
support, and another without, in the same configuration.
|
library with Unicode support, and another without, in the same configuration.
|
||||||
|
It is also not possible to use --enable-ebcdic (see below) with Unicode
|
||||||
|
support, so if this option is set, you must also use --disable-unicode.
|
||||||
|
|
||||||
When Unicode support is available, the use of a UTF encoding still has to be
|
When Unicode support is available, the use of a UTF encoding still has to be
|
||||||
enabled by setting the PCRE2_UTF option at run time or starting a pattern
|
enabled by setting the PCRE2_UTF option at run time or starting a pattern
|
||||||
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
|
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
|
||||||
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
|
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
|
||||||
not possible to use both --enable-unicode and --enable-ebcdic at the same
|
|
||||||
time.
|
|
||||||
|
|
||||||
As well as supporting UTF strings, Unicode support includes support for the
|
As well as supporting UTF strings, Unicode support includes support for the
|
||||||
\P, \p, and \X sequences that recognize Unicode character properties.
|
\P, \p, and \X sequences that recognize Unicode character properties.
|
||||||
|
@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
|
||||||
--with-match-limit=500000
|
--with-match-limit=500000
|
||||||
|
|
||||||
on the "configure" command. This is just the default; individual calls to
|
on the "configure" command. This is just the default; individual calls to
|
||||||
pcre2_match() can supply their own value. There is more discussion on the
|
pcre2_match() can supply their own value. There is more discussion in the
|
||||||
pcre2api man page.
|
pcre2api man page (search for pcre2_set_match_limit).
|
||||||
|
|
||||||
. There is a separate counter that limits the depth of recursive function calls
|
. There is a separate counter that limits the depth of nested backtracking
|
||||||
during a matching process. This also has a default of ten million, which is
|
during a matching process, which in turn limits the amount of memory that is
|
||||||
essentially "unlimited". You can change the default by setting, for example,
|
used. This also has a default of ten million, which is essentially
|
||||||
|
"unlimited". You can change the default by setting, for example,
|
||||||
|
|
||||||
--with-match-limit-recursion=500000
|
--with-match-limit-depth=5000
|
||||||
|
|
||||||
Recursive function calls use up the runtime stack; running out of stack can
|
There is more discussion in the pcre2api man page (search for
|
||||||
cause programs to crash in strange ways. There is a discussion about stack
|
pcre2_set_depth_limit).
|
||||||
sizes in the pcre2stack man page.
|
|
||||||
|
|
||||||
. In the 8-bit library, the default maximum compiled pattern size is around
|
. In the 8-bit library, the default maximum compiled pattern size is around
|
||||||
64K bytes. You can increase this by adding --with-link-size=3 to the
|
64K bytes. You can increase this by adding --with-link-size=3 to the
|
||||||
|
@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
|
||||||
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
|
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
|
||||||
link size setting is ignored, as 4-byte offsets are always used.
|
link size setting is ignored, as 4-byte offsets are always used.
|
||||||
|
|
||||||
. You can build PCRE2 so that its internal match() function that is called from
|
|
||||||
pcre2_match() does not call itself recursively. Instead, it uses memory
|
|
||||||
blocks obtained from the heap to save data that would otherwise be saved on
|
|
||||||
the stack. To build PCRE2 like this, use
|
|
||||||
|
|
||||||
--disable-stack-for-recursion
|
|
||||||
|
|
||||||
on the "configure" command. PCRE2 runs more slowly in this mode, but it may
|
|
||||||
be necessary in environments with limited stack sizes. This applies only to
|
|
||||||
the normal execution of the pcre2_match() function; if JIT support is being
|
|
||||||
successfully used, it is not relevant. Equally, it does not apply to
|
|
||||||
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
|
|
||||||
discussion about stack sizes in the pcre2stack man page.
|
|
||||||
|
|
||||||
. For speed, PCRE2 uses four tables for manipulating and identifying characters
|
. For speed, PCRE2 uses four tables for manipulating and identifying characters
|
||||||
whose code point values are less than 256. By default, it uses a set of
|
whose code point values are less than 256. By default, it uses a set of
|
||||||
tables for ASCII encoding that is part of the distribution. If you specify
|
tables for ASCII encoding that is part of the distribution. If you specify
|
||||||
|
@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
|
||||||
string. Otherwise, it is assumed to be a file name, and the contents of the
|
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||||
file are the test string.
|
file are the test string.
|
||||||
|
|
||||||
|
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
|
||||||
|
which caused pcre2_match() to use individual blocks on the heap for
|
||||||
|
backtracking instead of recursive function calls (which use the stack). This
|
||||||
|
is now obsolete since pcre2_match() was refactored always to use the heap (in
|
||||||
|
a much more efficient way than before). This option is retained for backwards
|
||||||
|
compatibility, but has no effect other than to output a warning.
|
||||||
|
|
||||||
The "configure" script builds the following files for the basic C library:
|
The "configure" script builds the following files for the basic C library:
|
||||||
|
|
||||||
. Makefile the makefile that builds the library
|
. Makefile the makefile that builds the library
|
||||||
|
@ -662,25 +654,32 @@ Unicode support is enabled.
|
||||||
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
|
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
|
||||||
16-bit and 32-bit modes. These are tests that generate different output in
|
16-bit and 32-bit modes. These are tests that generate different output in
|
||||||
8-bit mode. Each pair are for general cases and Unicode support, respectively.
|
8-bit mode. Each pair are for general cases and Unicode support, respectively.
|
||||||
|
|
||||||
Test 13 checks the handling of non-UTF characters greater than 255 by
|
Test 13 checks the handling of non-UTF characters greater than 255 by
|
||||||
pcre2_dfa_match() in 16-bit and 32-bit modes.
|
pcre2_dfa_match() in 16-bit and 32-bit modes.
|
||||||
|
|
||||||
Test 14 contains a number of tests that must not be run with JIT. They check,
|
Test 14 contains some special UTF and UCP tests that give different output for
|
||||||
|
the different widths.
|
||||||
|
|
||||||
|
Test 15 contains a number of tests that must not be run with JIT. They check,
|
||||||
among other non-JIT things, the match-limiting features of the interpretive
|
among other non-JIT things, the match-limiting features of the interpretive
|
||||||
matcher.
|
matcher.
|
||||||
|
|
||||||
Test 15 is run only when JIT support is not available. It checks that an
|
Test 16 is run only when JIT support is not available. It checks that an
|
||||||
attempt to use JIT has the expected behaviour.
|
attempt to use JIT has the expected behaviour.
|
||||||
|
|
||||||
Test 16 is run only when JIT support is available. It checks JIT complete and
|
Test 17 is run only when JIT support is available. It checks JIT complete and
|
||||||
partial modes, match-limiting under JIT, and other JIT-specific features.
|
partial modes, match-limiting under JIT, and other JIT-specific features.
|
||||||
|
|
||||||
Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
|
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
|
||||||
the 8-bit library, without and with Unicode support, respectively.
|
the 8-bit library, without and with Unicode support, respectively.
|
||||||
|
|
||||||
Test 19 checks the serialization functions by writing a set of compiled
|
Test 20 checks the serialization functions by writing a set of compiled
|
||||||
patterns to a file, and then reloading and checking them.
|
patterns to a file, and then reloading and checking them.
|
||||||
|
|
||||||
|
Tests 21 and 22 test \C support when the use of \C is not locked out, without
|
||||||
|
and with UTF support, respectively. Test 23 tests \C when it is locked out.
|
||||||
|
|
||||||
|
|
||||||
Character tables
|
Character tables
|
||||||
----------------
|
----------------
|
||||||
|
@ -866,4 +865,4 @@ The distribution should contain the files listed below.
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
Email local part: ph10
|
Email local part: ph10
|
||||||
Email domain: cam.ac.uk
|
Email domain: cam.ac.uk
|
||||||
Last updated: 01 November 2016
|
Last updated: 17 March 2017
|
||||||
|
|
|
@ -109,7 +109,7 @@ lose performance.
|
||||||
One way of guarding against this possibility is to use the
|
One way of guarding against this possibility is to use the
|
||||||
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
|
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
|
||||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
||||||
<b>pcre2_compile()</b>. This causes an compile time error if a pattern contains
|
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
|
||||||
a UTF-setting sequence.
|
a UTF-setting sequence.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -137,7 +137,8 @@ large search tree against a string that will never match. Nested unlimited
|
||||||
repeats in a pattern are a common example. PCRE2 provides some protection
|
repeats in a pattern are a common example. PCRE2 provides some protection
|
||||||
against this: see the <b>pcre2_set_match_limit()</b> function in the
|
against this: see the <b>pcre2_set_match_limit()</b> function in the
|
||||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
page.
|
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
|
||||||
|
be used to restrict the amount of memory that is used.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
|
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -166,7 +167,7 @@ listing), and the short pages for individual functions, are concatenated in
|
||||||
pcre2perform discussion of performance issues
|
pcre2perform discussion of performance issues
|
||||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||||
pcre2sample discussion of the pcre2demo program
|
pcre2sample discussion of the pcre2demo program
|
||||||
pcre2stack discussion of stack usage
|
pcre2stack discussion of stack and memory usage
|
||||||
pcre2syntax quick syntax reference
|
pcre2syntax quick syntax reference
|
||||||
pcre2test description of the <b>pcre2test</b> command
|
pcre2test description of the <b>pcre2test</b> command
|
||||||
pcre2unicode discussion of Unicode and UTF support
|
pcre2unicode discussion of Unicode and UTF support
|
||||||
|
@ -189,9 +190,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 16 October 2015
|
Last updated: 27 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -36,20 +36,21 @@ for success and non-zero otherwise. The arguments are:
|
||||||
<i>callout_data</i> User data that is passed to the callback
|
<i>callout_data</i> User data that is passed to the callback
|
||||||
</pre>
|
</pre>
|
||||||
The <i>callback()</i> function is passed a pointer to a data block containing
|
The <i>callback()</i> function is passed a pointer to a data block containing
|
||||||
the following fields:
|
the following fields (not necessarily in this order):
|
||||||
<pre>
|
<pre>
|
||||||
<i>version</i> Block version number
|
uint32_t <i>version</i> Block version number
|
||||||
<i>pattern_position</i> Offset to next item in pattern
|
uint32_t <i>callout_number</i> Number for numbered callouts
|
||||||
<i>next_item_length</i> Length of next item in pattern
|
PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
|
||||||
<i>callout_number</i> Number for numbered callouts
|
PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
|
||||||
<i>callout_string_offset</i> Offset to string within pattern
|
PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
|
||||||
<i>callout_string_length</i> Length of callout string
|
PCRE2_SIZE <i>callout_string_length</i> Length of callout string
|
||||||
<i>callout_string</i> Points to callout string or is NULL
|
PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
|
||||||
</pre>
|
</pre>
|
||||||
The second argument is the callout data that was passed to
|
The second argument passed to the <b>callback()</b> function is the callout data
|
||||||
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
|
that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
|
||||||
for success. Any other value causes the pattern scan to stop, with the value
|
function must return zero for success. Any other value causes the pattern scan
|
||||||
being passed back as the result of <b>pcre2_callout_enumerate()</b>.
|
to stop, with the value being passed back as the result of
|
||||||
|
<b>pcre2_callout_enumerate()</b>.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
|
|
|
@ -26,7 +26,9 @@ DESCRIPTION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
This function frees the memory used for a compiled pattern, including any
|
This function frees the memory used for a compiled pattern, including any
|
||||||
memory used by the JIT compiler.
|
memory used by the JIT compiler. If the compiled pattern was created by a call
|
||||||
|
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
|
||||||
|
also freed.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
|
|
|
@ -37,19 +37,24 @@ arguments are:
|
||||||
<i>erroffset</i> Where to put an error offset
|
<i>erroffset</i> Where to put an error offset
|
||||||
<i>ccontext</i> Pointer to a compile context or NULL
|
<i>ccontext</i> Pointer to a compile context or NULL
|
||||||
</pre>
|
</pre>
|
||||||
The length of the string and any error offset that is returned are in code
|
The length of the pattern and any error offset that is returned are in code
|
||||||
units, not characters. A compile context is needed only if you want to change
|
units, not characters. A compile context is needed only if you want to provide
|
||||||
|
custom memory allocation functions, or to provide an external function for
|
||||||
|
system stack size checking, or to change one or more of these parameters:
|
||||||
<pre>
|
<pre>
|
||||||
What \R matches (Unicode newlines or CR, LF, CRLF only)
|
What \R matches (Unicode newlines, or CR, LF, CRLF only);
|
||||||
PCRE2's character tables
|
PCRE2's character tables;
|
||||||
The newline character sequence
|
The newline character sequence;
|
||||||
The compile time nested parentheses limit
|
The compile time nested parentheses limit;
|
||||||
|
The maximum pattern length (in code units) that is allowed.
|
||||||
</pre>
|
</pre>
|
||||||
or provide an external function for stack size checking. The option bits are:
|
The option bits are:
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ANCHORED Force pattern anchoring
|
PCRE2_ANCHORED Force pattern anchoring
|
||||||
|
PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
|
||||||
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
|
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
|
||||||
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
||||||
|
PCRE2_ALT_VERBNAMES Process backslashes in verb names
|
||||||
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
||||||
PCRE2_CASELESS Do caseless matching
|
PCRE2_CASELESS Do caseless matching
|
||||||
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
||||||
|
@ -71,19 +76,21 @@ or provide an external function for stack size checking. The option bits are:
|
||||||
(only relevant if PCRE2_UTF is set)
|
(only relevant if PCRE2_UTF is set)
|
||||||
PCRE2_UCP Use Unicode properties for \d, \w, etc.
|
PCRE2_UCP Use Unicode properties for \d, \w, etc.
|
||||||
PCRE2_UNGREEDY Invert greediness of quantifiers
|
PCRE2_UNGREEDY Invert greediness of quantifiers
|
||||||
|
PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
|
||||||
PCRE2_UTF Treat pattern and subjects as UTF strings
|
PCRE2_UTF Treat pattern and subjects as UTF strings
|
||||||
</pre>
|
</pre>
|
||||||
PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP
|
PCRE2 must be built with Unicode support (the default) in order to use
|
||||||
and related options.
|
PCRE2_UTF, PCRE2_UCP and related options.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The yield of the function is a pointer to a private data structure that
|
The yield of the function is a pointer to a private data structure that
|
||||||
contains the compiled pattern, or NULL if an error was detected.
|
contains the compiled pattern, or NULL if an error was detected.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API, with more detail on
|
||||||
|
each option, in the
|
||||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
page and a description of the POSIX API in the
|
page, and a description of the POSIX API in the
|
||||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||||
page.
|
page.
|
||||||
<p>
|
<p>
|
||||||
|
|
|
@ -45,10 +45,9 @@ point to a uint32_t integer variable. The available codes are:
|
||||||
PCRE2_CONFIG_BSR Indicates what \R matches by default:
|
PCRE2_CONFIG_BSR Indicates what \R matches by default:
|
||||||
PCRE2_BSR_UNICODE
|
PCRE2_BSR_UNICODE
|
||||||
PCRE2_BSR_ANYCRLF
|
PCRE2_BSR_ANYCRLF
|
||||||
PCRE2_CONFIG_JIT Availability of just-in-time compiler
|
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
|
||||||
support (1=yes 0=no)
|
PCRE2_CONFIG_JIT Availability of just-in-time compiler support (1=yes 0=no)
|
||||||
PCRE2_CONFIG_JITTARGET Information about the target archi-
|
PCRE2_CONFIG_JITTARGET Information (a string) about the target architecture for the JIT compiler
|
||||||
tecture for the JIT compiler
|
|
||||||
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
||||||
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
||||||
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
||||||
|
@ -58,11 +57,9 @@ point to a uint32_t integer variable. The available codes are:
|
||||||
PCRE2_NEWLINE_ANY
|
PCRE2_NEWLINE_ANY
|
||||||
PCRE2_NEWLINE_ANYCRLF
|
PCRE2_NEWLINE_ANYCRLF
|
||||||
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
||||||
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
|
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
|
||||||
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
|
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
|
||||||
0=heap)
|
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes 0=no)
|
||||||
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
|
|
||||||
0=no)
|
|
||||||
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
||||||
PCRE2_CONFIG_VERSION The PCRE2 version (a string)
|
PCRE2_CONFIG_VERSION The PCRE2 version (a string)
|
||||||
</pre>
|
</pre>
|
||||||
|
|
|
@ -31,8 +31,9 @@ DESCRIPTION
|
||||||
<P>
|
<P>
|
||||||
This function matches a compiled regular expression against a given subject
|
This function matches a compiled regular expression against a given subject
|
||||||
string, using an alternative matching algorithm that scans the subject string
|
string, using an alternative matching algorithm that scans the subject string
|
||||||
just once (<i>not</i> Perl-compatible). (The Perl-compatible matching function
|
just once (except when processing lookaround assertions). This function is
|
||||||
is <b>pcre2_match()</b>.) The arguments for this function are:
|
<i>not</i> Perl-compatible (the Perl-compatible matching function is
|
||||||
|
<b>pcre2_match()</b>). The arguments for this function are:
|
||||||
<pre>
|
<pre>
|
||||||
<i>code</i> Points to the compiled pattern
|
<i>code</i> Points to the compiled pattern
|
||||||
<i>subject</i> Points to the subject string
|
<i>subject</i> Points to the subject string
|
||||||
|
@ -45,22 +46,18 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
|
||||||
<i>wscount</i> Number of elements in the vector
|
<i>wscount</i> Number of elements in the vector
|
||||||
</pre>
|
</pre>
|
||||||
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
|
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
|
||||||
up a callout function or specify the recursion limit. The <i>length</i> and
|
up a callout function or specify the recursion depth limit. The <i>length</i>
|
||||||
<i>startoffset</i> values are code units, not characters. The options are:
|
and <i>startoffset</i> values are code units, not characters. The options are:
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ANCHORED Match only at the first position
|
PCRE2_ANCHORED Match only at the first position
|
||||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||||
PCRE2_NOTEOL Subject is not the end of a line
|
PCRE2_NOTEOL Subject is not the end of a line
|
||||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
|
||||||
is not a valid match
|
PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
|
||||||
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
|
|
||||||
validity (only relevant if PCRE2_UTF
|
|
||||||
was set at compile time)
|
was set at compile time)
|
||||||
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
|
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
|
||||||
match if no full matches are found
|
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
|
||||||
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
|
|
||||||
even if there is a full match as well
|
|
||||||
PCRE2_DFA_RESTART Restart after a partial match
|
PCRE2_DFA_RESTART Restart after a partial match
|
||||||
PCRE2_DFA_SHORTEST Return only the shortest match
|
PCRE2_DFA_SHORTEST Return only the shortest match
|
||||||
</pre>
|
</pre>
|
||||||
|
|
|
@ -34,11 +34,11 @@ errors are negative numbers. The arguments are:
|
||||||
<i>buffer</i> where to put the message
|
<i>buffer</i> where to put the message
|
||||||
<i>bufflen</i> the length of the buffer (code units)
|
<i>bufflen</i> the length of the buffer (code units)
|
||||||
</pre>
|
</pre>
|
||||||
The function returns the length of the message, excluding the trailing zero, or
|
The function returns the length of the message in code units, excluding the
|
||||||
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
|
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
|
||||||
this case, the returned message is truncated (but still with a trailing zero).
|
too small. In this case, the returned message is truncated (but still with a
|
||||||
If <i>errorcode</i> does not contain a recognized error code number, the
|
trailing zero). If <i>errorcode</i> does not contain a recognized error code
|
||||||
negative value PCRE2_ERROR_BADDATA is returned.
|
number, the negative value PCRE2_ERROR_BADDATA is returned.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
|
|
|
@ -32,10 +32,9 @@ maximum size to which it is allowed to grow. The final argument is a general
|
||||||
context, for memory allocation functions, or NULL for standard memory
|
context, for memory allocation functions, or NULL for standard memory
|
||||||
allocation. The result can be passed to the JIT run-time code by calling
|
allocation. The result can be passed to the JIT run-time code by calling
|
||||||
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
|
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
|
||||||
which can then be processed by <b>pcre2_match()</b>. If the "fast path" JIT
|
which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
|
||||||
matcher, <b>pcre2_jit_match()</b> is used, the stack can be passed directly as
|
A maximum stack size of 512K to 1M should be more than enough for any pattern.
|
||||||
an argument. A maximum stack size of 512K to 1M should be more than enough for
|
For more details, see the
|
||||||
any pattern. For more details, see the
|
|
||||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||||
page.
|
page.
|
||||||
</P>
|
</P>
|
||||||
|
|
|
@ -25,10 +25,10 @@ SYNOPSIS
|
||||||
DESCRIPTION
|
DESCRIPTION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
This function builds a set of character tables for character values less than
|
This function builds a set of character tables for character code points that
|
||||||
256. These can be passed to <b>pcre2_compile()</b> in a compile context in order
|
are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
|
||||||
to override the internal, built-in tables (which were either defaulted or made
|
context in order to override the internal, built-in tables (which were either
|
||||||
by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
|
defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
|
||||||
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
|
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
|
||||||
page. You might want to do this if you are using a non-standard locale.
|
page. You might want to do this if you are using a non-standard locale.
|
||||||
</P>
|
</P>
|
||||||
|
|
|
@ -2575,8 +2575,8 @@ The internal recursion limit was reached.
|
||||||
A text message for an error code from any PCRE2 function (compile, match, or
|
A text message for an error code from any PCRE2 function (compile, match, or
|
||||||
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
|
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
|
||||||
is passed as the first argument, with the remaining two arguments specifying a
|
is passed as the first argument, with the remaining two arguments specifying a
|
||||||
code unit buffer and its length, into which the text message is placed. Note
|
code unit buffer and its length in code units, into which the text message is
|
||||||
that the message is returned in code units of the appropriate width for the
|
placed. The message is returned in code units of the appropriate width for the
|
||||||
library that is being used.
|
library that is being used.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -3265,9 +3265,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 December 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -280,6 +280,10 @@ operating systems the effect of reading a directory like this is an immediate
|
||||||
end-of-file; in others it may provoke an error.
|
end-of-file; in others it may provoke an error.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
<b>--depth-limit</b>=<i>number</i>
|
||||||
|
See <b>--match-limit</b> below.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
|
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
|
||||||
Specify a pattern to be matched. This option can be used multiple times in
|
Specify a pattern to be matched. This option can be used multiple times in
|
||||||
order to specify several patterns. It can also be used as a way of specifying a
|
order to specify several patterns. It can also be used as a way of specifying a
|
||||||
|
@ -498,29 +502,22 @@ used. There is no short form for this option.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
<b>--match-limit</b>=<i>number</i>
|
<b>--match-limit</b>=<i>number</i>
|
||||||
Processing some regular expression patterns can require a very large amount of
|
Processing some regular expression patterns may take a very long time to search
|
||||||
memory, leading in some cases to a program crash if not enough is available.
|
for all possible matching strings. Others may require a very large amount of
|
||||||
Other patterns may take a very long time to search for all possible matching
|
memory. There are two options that set resource limits for matching.
|
||||||
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
|
|
||||||
do the matching has two parameters that can limit the resources that it uses.
|
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
The <b>--match-limit</b> option provides a means of limiting resource usage
|
The <b>--match-limit</b> option provides a means of limiting computing resource
|
||||||
when processing patterns that are not going to match, but which have a very
|
usage when processing patterns that are not going to match, but which have a
|
||||||
large number of possibilities in their search trees. The classic example is a
|
very large number of possibilities in their search trees. The classic example
|
||||||
pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function
|
is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
|
||||||
called <b>match()</b> which it calls repeatedly (sometimes recursively). The
|
counter that is incremented each time around its main processing loop. If the
|
||||||
limit set by <b>--match-limit</b> is imposed on the number of times this
|
value set by <b>--match-limit</b> is reached, an error occurs.
|
||||||
function is called during a match, which has the effect of limiting the amount
|
|
||||||
of backtracking that can take place.
|
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but
|
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
|
||||||
instead of limiting the total number of times that <b>match()</b> is called, it
|
which in turn limits the amount of memory that is used. This limit is of use
|
||||||
limits the depth of recursive calls, which in turn limits the amount of memory
|
only if it is set smaller than <b>--match-limit</b>.
|
||||||
that can be used. The recursion depth is a smaller number than the total number
|
|
||||||
of calls, because not all calls to <b>match()</b> are recursive. This limit is
|
|
||||||
of use only if it is set smaller than <b>--match-limit</b>.
|
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
There are no short forms for these options. The default settings are specified
|
There are no short forms for these options. The default settings are specified
|
||||||
|
@ -843,9 +840,9 @@ there are more than 20 such errors, <b>pcre2grep</b> gives up.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
|
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
|
||||||
overall resource limit; there is a second option called <b>--recursion-limit</b>
|
overall resource limit; there is a second option called <b>--depth-limit</b>
|
||||||
that sets a limit on the amount of memory (usually stack) that is used (see the
|
that sets a limit on the amount of memory that is used (see the discussion of
|
||||||
discussion of these options above).
|
these options above).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
|
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -870,9 +867,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 31 December 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -170,20 +170,24 @@ the application to apply the JIT optimization by calling
|
||||||
<b>pcre2_jit_compile()</b> is ignored.
|
<b>pcre2_jit_compile()</b> is ignored.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting match and recursion limits
|
Setting match and backtracking depth limits
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
|
The <b>pcre2_match()</b> function contains a counter that is incremented every time it
|
||||||
internal <b>match()</b> function is called and on the maximum depth of
|
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
|
||||||
recursive calls. These facilities are provided to catch runaway matches that
|
this counter, which therefore limits the amount of computing resource used for
|
||||||
are provoked by patterns with huge matching trees (a typical example is a
|
a match. The maximum depth of nested backtracking can also be limited, and this
|
||||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
restricts the amount of heap memory that is used.
|
||||||
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
|
</P>
|
||||||
gives an error return. The limits can also be set by items at the start of the
|
<P>
|
||||||
pattern of the form
|
These facilities are provided to catch runaway matches that are provoked by
|
||||||
|
patterns with huge matching trees (a typical example is a pattern with nested
|
||||||
|
unlimited repeats applied to a long string that does not match). When one of
|
||||||
|
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
|
||||||
|
can also be set by items at the start of the pattern of the form
|
||||||
<pre>
|
<pre>
|
||||||
(*LIMIT_MATCH=d)
|
(*LIMIT_MATCH=d)
|
||||||
(*LIMIT_RECURSION=d)
|
(*LIMIT_DEPTH=d)
|
||||||
</pre>
|
</pre>
|
||||||
where d is any number of decimal digits. However, the value of the setting must
|
where d is any number of decimal digits. However, the value of the setting must
|
||||||
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||||
|
@ -192,10 +196,15 @@ limits set by the programmer, but not raise them. If there is more than one
|
||||||
setting of one of these limits, the lower value is used.
|
setting of one of these limits, the lower value is used.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||||
|
still recognized for backwards compatibility.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
The match limit is used (but in a different way) when JIT is being used, but it
|
The match limit is used (but in a different way) when JIT is being used, but it
|
||||||
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
|
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
|
||||||
However, the recursion limit is relevant for DFA matching, which does use some
|
However, the depth limit is relevant for DFA matching, which uses function
|
||||||
function recursion, in particular, for recursions within the pattern.
|
recursion for recursions within the pattern. In this case, the depth limit
|
||||||
|
controls the amount of system stack that is used.
|
||||||
<a name="newlines"></a></P>
|
<a name="newlines"></a></P>
|
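The two limits described in this section can be illustrated with a toy matcher. The following is a minimal sketch in plain C, not PCRE2 code: the names toy_match, match_here, and the TOY_* return values are invented for illustration, and the matcher understands only literals, the dot, and a trailing star, anchored at both ends. It counts every step against a match limit and tracks nesting depth against a depth limit, returning a distinct error for each.

```c
#include <assert.h>
#include <stddef.h>

/* Toy backtracking matcher with two resource limits.
   NOT PCRE2 code: all names and return values here are hypothetical. */
enum {
    TOY_NOMATCH        =  0,
    TOY_MATCH          =  1,
    TOY_ERR_MATCHLIMIT = -1,   /* step counter exceeded its limit */
    TOY_ERR_DEPTHLIMIT = -2    /* nesting depth exceeded its limit */
};

typedef struct {
    unsigned long match_limit;  /* maximum number of matching steps */
    unsigned long depth_limit;  /* maximum nesting depth */
    unsigned long steps;        /* steps taken so far */
} toy_ctx;

static int match_here(toy_ctx *c, const char *re, const char *s,
                      unsigned long depth)
{
    /* Every step is counted, whether or not it leads to a match. */
    if (++c->steps > c->match_limit) return TOY_ERR_MATCHLIMIT;
    if (depth > c->depth_limit) return TOY_ERR_DEPTHLIMIT;

    /* End of pattern: succeed only at end of subject (anchored). */
    if (re[0] == '\0') return (*s == '\0') ? TOY_MATCH : TOY_NOMATCH;

    if (re[1] == '*') {
        /* Greedy repeat: take as much as possible, then back off. */
        const char *t = s;
        while (*t != '\0' && (re[0] == '.' || re[0] == *t)) t++;
        for (;;) {
            int rc = match_here(c, re + 2, t, depth + 1);
            if (rc != TOY_NOMATCH) return rc;  /* match or limit error */
            if (t == s) return TOY_NOMATCH;
            t--;                               /* backtrack one character */
        }
    }

    if (*s != '\0' && (re[0] == '.' || re[0] == *s))
        return match_here(c, re + 1, s + 1, depth + 1);
    return TOY_NOMATCH;
}

int toy_match(const char *re, const char *s,
              unsigned long match_limit, unsigned long depth_limit)
{
    assert(re != NULL && s != NULL);
    toy_ctx c = { match_limit, depth_limit, 0 };
    return match_here(&c, re, s, 1);
}
```

In this sketch a pattern with several unlimited repeats applied to a subject that cannot match hits the step limit long before the depth limit, because depth never exceeds the number of steps. That is the same reason the text gives for a depth limit being useful only when it is smaller than the match limit.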
||||||
<br><b>
|
<br><b>
|
||||||
Newline conventions
|
Newline conventions
|
||||||
|
@ -235,8 +244,8 @@ The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
||||||
what the \R escape sequence matches. By default, this is any Unicode newline
|
what the \R escape sequence matches. By default, this is any Unicode newline
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the
|
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||||
description of \R in the section entitled
|
section and the description of \R in the section entitled
|
||||||
<a href="#newlineseq">"Newline sequences"</a>
|
<a href="#newlineseq">"Newline sequences"</a>
|
||||||
below. A change of \R setting can be combined with a change of newline
|
below. A change of \R setting can be combined with a change of newline
|
||||||
convention.
|
convention.
|
||||||
|
@ -254,7 +263,7 @@ corresponding to PCRE2_BSR_UNICODE.
|
||||||
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
||||||
<P>
|
<P>
|
||||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||||
character code rather than ASCII or Unicode (typically a mainframe system). In
|
character code instead of ASCII or Unicode (typically a mainframe system). In
|
||||||
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
||||||
environment these characters may have different code values, and there are no
|
environment these characters may have different code values, and there are no
|
||||||
code points greater than 255.
|
code points greater than 255.
|
||||||
|
@ -318,11 +327,11 @@ that character may have. This use of backslash as an escape character applies
|
||||||
both inside and outside character classes.
|
both inside and outside character classes.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For example, if you want to match a * character, you write \* in the pattern.
|
For example, if you want to match a * character, you must write \* in the
|
||||||
This escaping action applies whether or not the following character would
|
pattern. This escaping action applies whether or not the following character
|
||||||
otherwise be interpreted as a metacharacter, so it is always safe to precede a
|
would otherwise be interpreted as a metacharacter, so it is always safe to
|
||||||
non-alphanumeric with backslash to specify that it stands for itself. In
|
precede a non-alphanumeric with backslash to specify that it stands for itself.
|
||||||
particular, if you want to match a backslash, you write \\.
|
In particular, if you want to match a backslash, you write \\.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
||||||
|
@ -353,7 +362,7 @@ An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
|
||||||
by \E later in the pattern, the literal interpretation continues to the end of
|
by \E later in the pattern, the literal interpretation continues to the end of
|
||||||
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
|
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
|
||||||
a character class, this causes an error, because the character class is not
|
a character class, this causes an error, because the character class is not
|
||||||
terminated.
|
terminated by a closing square bracket.
|
||||||
<a name="digitsafterbackslash"></a></P>
|
<a name="digitsafterbackslash"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Non-printing characters
|
Non-printing characters
|
||||||
|
@ -476,9 +485,9 @@ a hexadecimal digit appears between \x{ and }, or if there is no terminating
|
||||||
<P>
|
<P>
|
||||||
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
|
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
|
||||||
described only when it is followed by two hexadecimal digits. Otherwise, it
|
described only when it is followed by two hexadecimal digits. Otherwise, it
|
||||||
matches a literal "x" character. In this mode mode, support for code points
|
matches a literal "x" character. In this mode, support for code points greater
|
||||||
greater than 256 is provided by \u, which must be followed by four hexadecimal
|
than 256 is provided by \u, which must be followed by four hexadecimal digits;
|
||||||
digits; otherwise it matches a literal "u" character.
|
otherwise it matches a literal "u" character.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Characters whose value is less than 256 can be defined by either of the two
|
Characters whose value is less than 256 can be defined by either of the two
|
||||||
|
@ -493,12 +502,10 @@ Constraints on character values
|
||||||
Characters that are specified using octal or hexadecimal numbers are
|
Characters that are specified using octal or hexadecimal numbers are
|
||||||
limited to certain values, as follows:
|
limited to certain values, as follows:
|
||||||
<pre>
|
<pre>
|
||||||
8-bit non-UTF mode less than 0x100
|
8-bit non-UTF mode no greater than 0xff
|
||||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
16-bit non-UTF mode no greater than 0xffff
|
||||||
16-bit non-UTF mode less than 0x10000
|
32-bit non-UTF mode no greater than 0xffffffff
|
||||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||||
32-bit non-UTF mode less than 0x100000000
|
|
||||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
|
||||||
</pre>
|
</pre>
|
||||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
||||||
"surrogate" codepoints), and 0xffef.
|
"surrogate" codepoints), and 0xffef.
|
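The revised table of constraints can be read as a small predicate. The sketch below is a hypothetical helper in plain C, not part of the PCRE2 API: the name codepoint_allowed and its parameters are assumptions made for illustration. It models only the limits in the table and the surrogate range; the additional value 0xffef mentioned in the text is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): reports whether a character
   value given in octal or hexadecimal is within the limits in the table.
   width is the code unit width (8, 16, or 32); utf is true in a UTF mode. */
bool codepoint_allowed(int width, bool utf, uint32_t cp)
{
    assert(width == 8 || width == 16 || width == 32);

    if (utf) {
        /* All UTF modes: no greater than 0x10ffff and not a surrogate. */
        if (cp > 0x10ffffu) return false;
        if (cp >= 0xd800u && cp <= 0xdfffu) return false;
        return true;
    }

    /* Non-UTF modes: limited only by the code unit width. */
    if (width == 8)  return cp <= 0xffu;
    if (width == 16) return cp <= 0xffffu;
    return true;  /* 32-bit: any value up to 0xffffffff */
}
```

For example, 0xd800 is acceptable in 16-bit non-UTF mode but not in any UTF mode, and 0x110000 is acceptable only in 32-bit non-UTF mode.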
||||||
|
@ -525,7 +532,7 @@ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
|
||||||
handler and used to modify the case of following characters. By default, PCRE2
|
handler and used to modify the case of following characters. By default, PCRE2
|
||||||
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
||||||
is set, \U matches a "U" character, and \u can be used to define a character
|
is set, \U matches a "U" character, and \u can be used to define a character
|
||||||
by code point, as described in the previous section.
|
by code point, as described above.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Absolute and relative back references
|
Absolute and relative back references
|
||||||
|
@ -714,7 +721,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
|
||||||
sequences that match characters with specific properties are available. In
|
sequences that match characters with specific properties are available. In
|
||||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||||
characters whose codepoints are less than 256, but they do work in this mode.
|
characters whose codepoints are less than 256, but they do work in this mode.
|
||||||
The extra escape sequences are:
|
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
|
||||||
|
may be encountered. These are all treated as being in the Common script and
|
||||||
|
with an unassigned type. The extra escape sequences are:
|
||||||
<pre>
|
<pre>
|
||||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||||
|
@ -2214,16 +2223,8 @@ except that it does not cause the current matching position to be changed.
|
||||||
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
||||||
contains capturing subpatterns within it, these are counted for the purposes of
|
contains capturing subpatterns within it, these are counted for the purposes of
|
||||||
numbering the capturing subpatterns in the whole pattern. However, substring
|
numbering the capturing subpatterns in the whole pattern. However, substring
|
||||||
capturing is carried out only for positive assertions. (Perl sometimes, but not
|
capturing is normally carried out only for positive assertions (but see the
|
||||||
always, does do capturing in negative assertions.)
|
discussion of conditional subpatterns below).
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
|
||||||
succeeds, but failure to match later in the pattern causes backtracking over
|
|
||||||
this assertion, the captures within the assertion are reset only if no higher
|
|
||||||
numbered captures are already set. This is, unfortunately, a fundamental
|
|
||||||
limitation of the current implementation; it may get removed in a future
|
|
||||||
reworking.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||||
|
@ -2601,6 +2602,12 @@ presence of at least one letter in the subject. If a letter is found, the
|
||||||
subject is matched against the first alternative; otherwise it is matched
|
subject is matched against the first alternative; otherwise it is matched
|
||||||
against the second. This pattern matches strings in one of the two forms
|
against the second. This pattern matches strings in one of the two forms
|
||||||
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
For Perl compatibility, if an assertion that is a condition contains capturing
|
||||||
|
subpatterns, any capturing that occurs is retained afterwards, for both
|
||||||
|
positive and negative assertions. (Compare non-conditional assertions, when
|
||||||
|
captures are retained only for positive assertions.)
|
||||||
<a name="comments"></a></P>
|
<a name="comments"></a></P>
|
||||||
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
|
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2773,93 +2780,57 @@ is the actual recursive call.
|
||||||
Differences in recursion processing between PCRE2 and Perl
|
Differences in recursion processing between PCRE2 and Perl
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
|
Some former differences between PCRE2 and Perl no longer exist.
|
||||||
(like Python, but unlike Perl), a recursive subpattern call is always treated
|
|
||||||
as an atomic group. That is, once it has matched some of the subject string, it
|
|
||||||
is never re-entered, even if it contains untried alternatives and there is a
|
|
||||||
subsequent matching failure. This can be illustrated by the following pattern,
|
|
||||||
which purports to match a palindromic string that contains an odd number of
|
|
||||||
characters (for example, "a", "aba", "abcba", "abcdcba"):
|
|
||||||
<pre>
|
|
||||||
^(.|(.)(?1)\2)$
|
|
||||||
</pre>
|
|
||||||
The idea is that it either matches a single character, or two identical
|
|
||||||
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
|
|
||||||
it does not if the pattern is longer than three characters. Consider the
|
|
||||||
subject string "abcba":
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
At the top level, the first character is matched, but as it is not at the end
|
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
|
||||||
of the string, the first alternative fails; the second alternative is taken
|
a recursive subpattern call was always treated as an atomic group. That is,
|
||||||
and the recursion kicks in. The recursive call to subpattern 1 successfully
|
once it had matched some of the subject string, it was never re-entered, even
|
||||||
matches the next character ("b"). (Note that the beginning and end of line
|
if it contained untried alternatives and there was a subsequent matching
|
||||||
tests are not part of the recursion).
|
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Back at the top level, the next character ("c") is compared with what
|
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||||
subpattern 2 matched, which was "a". This fails. Because the recursion is
|
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||||
treated as an atomic group, there are now no backtracking points, and so the
|
is a matching failure later in the pattern. This is now compatible with the way
|
||||||
entire match fails. (Perl is able, at this point, to re-enter the recursion and
|
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||||
try the second alternative.) However, if the pattern is written with the
|
enclose it in an atomic group.
|
||||||
alternatives in the other order, things are different:
|
|
||||||
<pre>
|
|
||||||
^((.)(?1)\2|.)$
|
|
||||||
</pre>
|
|
||||||
This time, the recursing alternative is tried first, and continues to recurse
|
|
||||||
until it runs out of characters, at which point the recursion fails. But this
|
|
||||||
time we do have another alternative to try at the higher level. That is the big
|
|
||||||
difference: in the previous case the remaining alternative is at a deeper
|
|
||||||
recursion level, which PCRE2 cannot use.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
To change the pattern so that it matches all palindromic strings, not just
|
Supporting backtracking into recursions simplifies certain types of recursive
|
||||||
those with an odd number of characters, it is tempting to change the pattern to
|
pattern. For example, this pattern matches palindromic strings:
|
||||||
this:
|
|
||||||
<pre>
|
<pre>
|
||||||
^((.)(?1)\2|.?)$
|
^((.)(?1)\2|.?)$
|
||||||
</pre>
|
</pre>
|
||||||
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
|
The second branch in the group matches a single central character in the
|
||||||
deeper recursion has matched a single character, it cannot be entered again in
|
palindrome when there are an odd number of characters, or nothing when there
|
||||||
order to match an empty string. The solution is to separate the two cases, and
|
are an even number of characters, but in order to work it has to be able to try
|
||||||
write out the odd and even cases as alternatives at the higher level:
|
the second case when the rest of the pattern match fails. If you want to match
|
||||||
|
typical palindromic phrases, the pattern has to ignore all non-word characters,
|
||||||
|
which can be done like this:
|
||||||
<pre>
|
<pre>
|
||||||
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
|
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
|
||||||
</pre>
|
|
||||||
If you want to match typical palindromic phrases, the pattern has to ignore all
|
|
||||||
non-word characters, which can be done like this:
|
|
||||||
<pre>
|
|
||||||
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
|
|
||||||
</pre>
|
</pre>
|
||||||
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
||||||
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
|
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
|
||||||
use of the possessive quantifier *+ to avoid backtracking into sequences of
|
avoid backtracking into sequences of non-word characters. Without this, PCRE2
|
||||||
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
|
takes a great deal longer (ten times or more) to match typical phrases, and
|
||||||
or more) to match typical phrases, and Perl takes so long that you think it has
|
Perl takes so long that you think it has gone into a loop.
|
||||||
gone into a loop.
|
|
||||||
</P>
|
</P>
|
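As a cross-check on what the palindromic-phrase pattern above is intended to accept when run caselessly, the same rule can be written directly in C. This is a minimal sketch under stated assumptions, not a reimplementation of the regular expression: the function name is_palindromic_phrase is invented, word characters are taken to be ASCII letters, digits, and underscore, and case is folded with tolower() in the default C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Word character in the ASCII sense assumed here: letter, digit, or '_'. */
static bool is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: true if the word characters of s read the same
   in both directions, ignoring case and all non-word characters. */
bool is_palindromic_phrase(const char *s)
{
    assert(s != NULL);
    size_t i = 0, j = strlen(s);   /* j is one past the last unchecked char */

    while (i < j) {
        unsigned char left  = (unsigned char)s[i];
        unsigned char right = (unsigned char)s[j - 1];

        if (!is_word_char(left))  { i++; continue; }   /* skip from the left */
        if (!is_word_char(right)) { j--; continue; }   /* skip from the right */

        if (tolower(left) != tolower(right)) return false;
        i++;
        j--;
    }
    return true;
}
```

Working inwards from both ends avoids backtracking altogether, which is the cost the possessive quantifiers in the pattern are there to reduce.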
||||||
<P>
|
<P>
|
||||||
<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
|
Another way in which PCRE2 and Perl used to differ in their recursion
|
||||||
string does not start with a palindrome that is shorter than the entire string.
|
processing was in the handling of captured values. Formerly in Perl, when a
|
||||||
For example, although "abcba" is correctly matched, if the subject is "ababa",
|
subpattern was called recursively or as a subroutine (see the next section), it
|
||||||
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
|
had no access to any values that were captured outside the recursion, whereas
|
||||||
the end of the string does not follow. Once again, it cannot jump back into the
|
in PCRE2 these values can be referenced. Consider this pattern:
|
||||||
recursion to try other alternatives, so the entire match fails.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
The second way in which PCRE2 and Perl differ in their recursion processing is
|
|
||||||
in the handling of captured values. In Perl, when a subpattern is called
|
|
||||||
recursively or as a subpattern (see the next section), it has no access to any
|
|
||||||
values that were captured outside the recursion, whereas in PCRE2 these values
|
|
||||||
can be referenced. Consider this pattern:
|
|
||||||
<pre>
|
<pre>
|
||||||
^(.)(\1|a(?2))
|
^(.)(\1|a(?2))
|
||||||
</pre>
|
</pre>
|
||||||
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
|
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||||
then in the second group, when the back reference \1 fails to match "b", the
|
the second group, when the back reference \1 fails to match "b", the second
|
||||||
second alternative matches "a" and then recurses. In the recursion, \1 does
|
alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||||
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
|
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||||
match because inside the recursive call \1 cannot access the externally set
|
later versions (I tried 5.024) it now works.
|
||||||
value.
|
|
||||||
<a name="subpatternsassubroutines"></a></P>
|
<a name="subpatternsassubroutines"></a></P>
|
||||||
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -2886,11 +2857,10 @@ is used, it does match "sense and responsibility" as well as the other two
|
||||||
strings. Another example is given in the discussion of DEFINE above.
|
strings. Another example is given in the discussion of DEFINE above.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
All subroutine calls, whether recursive or not, are always treated as atomic
|
Like recursions, subroutine calls used to be treated as atomic, but this
|
||||||
groups. That is, once a subroutine has matched some of the subject string, it
|
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
|
||||||
is never re-entered, even if it contains untried alternatives and there is a
|
occur. However, any capturing parentheses that are set during the subroutine
|
||||||
subsequent matching failure. Any capturing parentheses that are set during the
|
call revert to their previous values afterwards.
|
||||||
subroutine call revert to their previous values afterwards.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Processing options such as case-independence are fixed when a subpattern is
|
Processing options such as case-independence are fixed when a subpattern is
|
||||||
|
@ -2998,17 +2968,10 @@ The doubling is removed before the string is passed to the callout function.
|
||||||
<a name="backtrackcontrol"></a></P>
|
<a name="backtrackcontrol"></a></P>
|
||||||
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
<P>
|
<P>
|
||||||
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
|
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||||
are still described in the Perl documentation as "experimental and subject to
|
terminology) that modify the behaviour of backtracking during matching. They
|
||||||
change or removal in a future version of Perl". It goes on to say: "Their usage
|
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||||
in production code should be noted to avoid problems during upgrades." The same
|
possibly behaving differently depending on whether or not a name is present.
|
||||||
remarks apply to the PCRE2 features described in this section.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
The new verbs make use of what was previously invalid syntax: an opening
|
|
||||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
|
||||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
|
||||||
depending on whether or not a name is present.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
By default, for compatibility with Perl, a name is any sequence of characters
|
By default, for compatibility with Perl, a name is any sequence of characters
|
||||||
|
@ -3040,7 +3003,7 @@ not there. Any number of these verbs may occur in a pattern.
|
||||||
<P>
|
<P>
|
||||||
Since these verbs are specifically related to backtracking, most of them can be
|
Since these verbs are specifically related to backtracking, most of them can be
|
||||||
used only when the pattern is to be matched using the traditional matching
|
used only when the pattern is to be matched using the traditional matching
|
||||||
function, because these use a backtracking algorithm. With the exception of
|
function, because that uses a backtracking algorithm. With the exception of
|
||||||
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
||||||
control verbs cause an error if encountered by the DFA matching function.
|
control verbs cause an error if encountered by the DFA matching function.
|
||||||
</P>
|
</P>
|
||||||
|
@ -3178,11 +3141,11 @@ Verbs that act after backtracking
|
||||||
The following verbs do nothing when they are encountered. Matching continues
|
The following verbs do nothing when they are encountered. Matching continues
|
||||||
with what follows, but if there is no subsequent match, causing a backtrack to
|
with what follows, but if there is no subsequent match, causing a backtrack to
|
||||||
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
||||||
the verb. However, when one of these verbs appears inside an atomic group
|
the verb. However, when one of these verbs appears inside an atomic group or in
|
||||||
(which includes any group that is called as a subroutine) or in an assertion
|
an assertion that is true, its effect is confined to that group, because once
|
||||||
that is true, its effect is confined to that group, because once the group has
|
the group has been matched, there is never any backtracking into it. In this
|
||||||
been matched, there is never any backtracking into it. In this situation,
|
situation, backtracking has to jump to the left of the entire atomic group or
|
||||||
backtracking has to jump to the left of the entire atomic group or assertion.
|
assertion.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
These verbs differ in exactly what kind of failure occurs when backtracking
|
These verbs differ in exactly what kind of failure occurs when backtracking
|
||||||
|
@ -3246,8 +3209,8 @@ expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
|
||||||
as (*COMMIT).
|
as (*COMMIT).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
|
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
|
||||||
It is like (*MARK:NAME) in that the name is remembered for passing back to the
|
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||||
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
||||||
ignoring those set by (*PRUNE) or (*THEN).
|
ignoring those set by (*PRUNE) or (*THEN).
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -3452,9 +3415,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 27 December 2016
|
Last updated: 18 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -55,7 +55,10 @@ The facility for saving and restoring compiled patterns is intended for use
|
||||||
within individual applications. As such, the data supplied to
|
within individual applications. As such, the data supplied to
|
||||||
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
|
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
|
||||||
arbitrary external sources. There is only some simple consistency checking, not
|
arbitrary external sources. There is only some simple consistency checking, not
|
||||||
complete validation of what is being re-loaded.
|
complete validation of what is being re-loaded. Corrupted data may cause
|
||||||
|
undefined results. For example, if the length field of a pattern in the
|
||||||
|
serialized data is corrupted, the deserializing code may read beyond the end of
|
||||||
|
the byte stream that is passed to it.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -190,9 +193,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 24 May 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -126,12 +126,13 @@ character values up to 0x7fffffff. Each character is placed in one 16-bit or
|
||||||
to occur).
|
to occur).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
|
UTF-8 (in its original definition) is not capable of encoding values greater
|
||||||
values can be handled by the 32-bit library. When testing this library in
|
than 0x7fffffff, but such values can be handled by the 32-bit library. When
|
||||||
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
|
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
|
||||||
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
|
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
|
||||||
character's value. This is the only way of passing such code points in a
|
0x80000000 is added to the character's value. This is the only way of passing
|
||||||
pattern string. For subject strings, using an escape sequence is preferable.
|
such code points in a pattern string. For subject strings, using an escape
|
||||||
|
sequence is preferable.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -602,6 +603,7 @@ about the pattern:
|
||||||
/B bincode show binary code without lengths
|
/B bincode show binary code without lengths
|
||||||
callout_info show callout information
|
callout_info show callout information
|
||||||
debug same as info,fullbincode
|
debug same as info,fullbincode
|
||||||
|
framesize show matching frame size
|
||||||
fullbincode show binary code with lengths
|
fullbincode show binary code with lengths
|
||||||
/I info show info about compiled pattern
|
/I info show info about compiled pattern
|
||||||
hex unquoted characters are hexadecimal
|
hex unquoted characters are hexadecimal
|
||||||
|
@ -689,6 +691,11 @@ not necessarily the last character. These lines are omitted if no starting or
|
||||||
ending code units are recorded.
|
ending code units are recorded.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
The <b>framesize</b> modifier shows the size, in bytes, of the storage frames
|
||||||
|
used by <b>pcre2_match()</b> for handling backtracking. The size depends on the
|
||||||
|
number of capturing parentheses in the pattern.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
The <b>callout_info</b> modifier requests information about all the callouts in
|
The <b>callout_info</b> modifier requests information about all the callouts in
|
||||||
the pattern. A list of them is output at the end of any other information that
|
the pattern. A list of them is output at the end of any other information that
|
||||||
is requested. For each callout, either its number or string is given, followed
|
is requested. For each callout, either its number or string is given, followed
|
||||||
|
@ -1073,6 +1080,7 @@ pattern.
|
||||||
callout_fail=<n>[:<m>] control callout failure
|
callout_fail=<n>[:<m>] control callout failure
|
||||||
callout_none do not supply a callout function
|
callout_none do not supply a callout function
|
||||||
copy=<number or name> copy captured substring
|
copy=<number or name> copy captured substring
|
||||||
|
depth_limit=<n> set a depth limit
|
||||||
dfa use <b>pcre2_dfa_match()</b>
|
dfa use <b>pcre2_dfa_match()</b>
|
||||||
find_limits find match and recursion limits
|
find_limits find match and recursion limits
|
||||||
get=<number or name> extract captured substring
|
get=<number or name> extract captured substring
|
||||||
|
@ -1086,7 +1094,7 @@ pattern.
|
||||||
offset=<n> set starting offset
|
offset=<n> set starting offset
|
||||||
offset_limit=<n> set offset limit
|
offset_limit=<n> set offset limit
|
||||||
ovector=<n> set size of output vector
|
ovector=<n> set size of output vector
|
||||||
recursion_limit=<n> set a recursion limit
|
recursion_limit=<n> obsolete synonym for depth_limit
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show startchar when relevant
|
startchar show startchar when relevant
|
||||||
startoffset=<n> same as offset=<n>
|
startoffset=<n> same as offset=<n>
|
||||||
|
@ -1320,10 +1328,10 @@ stack that is larger than the default 32K is necessary only for very
|
||||||
complicated patterns.
|
complicated patterns.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting match and recursion limits
|
Setting match and depth limits
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
|
The <b>match_limit</b> and <b>depth_limit</b> modifiers set the appropriate
|
||||||
limits in the match context. These values are ignored when the
|
limits in the match context. These values are ignored when the
|
||||||
<b>find_limits</b> modifier is specified.
|
<b>find_limits</b> modifier is specified.
|
||||||
</P>
|
</P>
|
||||||
|
@ -1333,23 +1341,23 @@ Finding minimum limits
|
||||||
<P>
|
<P>
|
||||||
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
|
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
|
||||||
<b>pcre2_match()</b> several times, setting different values in the match
|
<b>pcre2_match()</b> several times, setting different values in the match
|
||||||
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
|
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_depth_limit()</b>
|
||||||
until it finds the minimum values for each parameter that allow
|
until it finds the minimum values for each parameter that allow
|
||||||
<b>pcre2_match()</b> to complete without error.
|
<b>pcre2_match()</b> to complete without error.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If JIT is being used, only the match limit is relevant. If DFA matching is
|
If JIT is being used, only the match limit is relevant. If DFA matching is
|
||||||
being used, neither limit is relevant, and this modifier is ignored (with a
|
being used, only the depth limit is relevant, but at present this modifier is
|
||||||
warning message).
|
ignored (with a warning message).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <i>match_limit</i> number is a measure of the amount of backtracking
|
The <i>match_limit</i> number is a measure of the amount of backtracking
|
||||||
that takes place, and learning the minimum value can be instructive. For most
|
that takes place, and learning the minimum value can be instructive. For most
|
||||||
simple matches, the number is quite small, but for patterns with very large
|
simple matches, the number is quite small, but for patterns with very large
|
||||||
numbers of matching possibilities, it can become large very quickly with
|
numbers of matching possibilities, it can become large very quickly with
|
||||||
increasing length of subject string. The <i>match_limit_recursion</i> number is
|
increasing length of subject string. The <i>depth_limit</i> number is
|
||||||
a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
|
a measure of how much memory for recording backtracking points is needed to
|
||||||
heap) memory is needed to complete the match attempt.
|
complete the match attempt.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Showing MARK names
|
Showing MARK names
|
||||||
|
@ -1466,7 +1474,7 @@ code unit offset of the start of the failing character is also output. Here is
|
||||||
an example of an interactive <b>pcre2test</b> run.
|
an example of an interactive <b>pcre2test</b> run.
|
||||||
<pre>
|
<pre>
|
||||||
$ pcre2test
|
$ pcre2test
|
||||||
PCRE2 version 9.00 2014-05-10
|
PCRE2 version 10.22 2016-07-29
|
||||||
|
|
||||||
re> /^abc(\d+)/
|
re> /^abc(\d+)/
|
||||||
data> abc123
|
data> abc123
|
||||||
|
@ -1779,9 +1787,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 28 December 2016
|
Last updated: 21 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
1667
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_CONFIG 3 "20 April 2014" "PCRE2 10.0"
|
.TH PCRE2_CONFIG 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -31,10 +31,13 @@ point to a uint32_t integer variable. The available codes are:
|
||||||
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
|
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
|
||||||
PCRE2_BSR_UNICODE
|
PCRE2_BSR_UNICODE
|
||||||
PCRE2_BSR_ANYCRLF
|
PCRE2_BSR_ANYCRLF
|
||||||
|
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
|
||||||
|
.\" JOIN
|
||||||
PCRE2_CONFIG_JIT Availability of just-in-time compiler
|
PCRE2_CONFIG_JIT Availability of just-in-time compiler
|
||||||
support (1=yes 0=no)
|
support (1=yes 0=no)
|
||||||
PCRE2_CONFIG_JITTARGET Information about the target archi-
|
.\" JOIN
|
||||||
tecture for the JIT compiler
|
PCRE2_CONFIG_JITTARGET Information (a string) about the target
|
||||||
|
architecture for the JIT compiler
|
||||||
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
||||||
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
||||||
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
||||||
|
@ -44,9 +47,9 @@ point to a uint32_t integer variable. The available codes are:
|
||||||
PCRE2_NEWLINE_ANY
|
PCRE2_NEWLINE_ANY
|
||||||
PCRE2_NEWLINE_ANYCRLF
|
PCRE2_NEWLINE_ANYCRLF
|
||||||
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
||||||
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
|
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
|
||||||
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
|
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
|
||||||
0=heap)
|
.\" JOIN
|
||||||
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
|
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
|
||||||
0=no)
|
0=no)
|
||||||
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23"
|
.TH PCRE2_DFA_MATCH 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -19,8 +19,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
This function matches a compiled regular expression against a given subject
|
This function matches a compiled regular expression against a given subject
|
||||||
string, using an alternative matching algorithm that scans the subject string
|
string, using an alternative matching algorithm that scans the subject string
|
||||||
just once (\fInot\fP Perl-compatible). (The Perl-compatible matching function
|
just once (except when processing lookaround assertions). This function is
|
||||||
is \fBpcre2_match()\fP.) The arguments for this function are:
|
\fInot\fP Perl-compatible (the Perl-compatible matching function is
|
||||||
|
\fBpcre2_match()\fP). The arguments for this function are:
|
||||||
.sp
|
.sp
|
||||||
\fIcode\fP Points to the compiled pattern
|
\fIcode\fP Points to the compiled pattern
|
||||||
\fIsubject\fP Points to the subject string
|
\fIsubject\fP Points to the subject string
|
||||||
|
@ -33,22 +34,26 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
|
||||||
\fIwscount\fP Number of elements in the vector
|
\fIwscount\fP Number of elements in the vector
|
||||||
.sp
|
.sp
|
||||||
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
||||||
up a callout function or specify the recursion limit. The \fIlength\fP and
|
up a callout function or specify the recursion depth limit. The \fIlength\fP
|
||||||
\fIstartoffset\fP values are code units, not characters. The options are:
|
and \fIstartoffset\fP values are code units, not characters. The options are:
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ANCHORED Match only at the first position
|
PCRE2_ANCHORED Match only at the first position
|
||||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||||
PCRE2_NOTEOL Subject is not the end of a line
|
PCRE2_NOTEOL Subject is not the end of a line
|
||||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||||
|
.\" JOIN
|
||||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
||||||
is not a valid match
|
is not a valid match
|
||||||
|
.\" JOIN
|
||||||
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
|
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
|
||||||
validity (only relevant if PCRE2_UTF
|
validity (only relevant if PCRE2_UTF
|
||||||
was set at compile time)
|
was set at compile time)
|
||||||
|
.\" JOIN
|
||||||
|
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
|
||||||
|
match even if there is a full match
|
||||||
|
.\" JOIN
|
||||||
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
|
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
|
||||||
match if no full matches are found
|
match if no full matches are found
|
||||||
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
|
|
||||||
even if there is a full match as well
|
|
||||||
PCRE2_DFA_RESTART Restart after a partial match
|
PCRE2_DFA_RESTART Restart after a partial match
|
||||||
PCRE2_DFA_SHORTEST Return only the shortest match
|
PCRE2_DFA_SHORTEST Return only the shortest match
|
||||||
.sp
|
.sp
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
|
.TH PCRE2_GET_ERROR_MESSAGE 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -22,11 +22,11 @@ errors are negative numbers. The arguments are:
|
||||||
\fIbuffer\fP where to put the message
|
\fIbuffer\fP where to put the message
|
||||||
\fIbufflen\fP the length of the buffer (code units)
|
\fIbufflen\fP the length of the buffer (code units)
|
||||||
.sp
|
.sp
|
||||||
The function returns the length of the message, excluding the trailing zero, or
|
The function returns the length of the message in code units, excluding the
|
||||||
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
|
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
|
||||||
this case, the returned message is truncated (but still with a trailing zero).
|
too small. In this case, the returned message is truncated (but still with a
|
||||||
If \fIerrorcode\fP does not contain a recognized error code number, the
|
trailing zero). If \fIerrorcode\fP does not contain a recognized error code
|
||||||
negative value PCRE2_ERROR_BADDATA is returned.
|
number, the negative value PCRE2_ERROR_BADDATA is returned.
|
||||||
.P
|
.P
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_JIT_STACK_CREATE 3 "03 November 2014" "PCRE2 10.00"
|
.TH PCRE2_JIT_STACK_CREATE 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -20,10 +20,9 @@ maximum size to which it is allowed to grow. The final argument is a general
|
||||||
context, for memory allocation functions, or NULL for standard memory
|
context, for memory allocation functions, or NULL for standard memory
|
||||||
allocation. The result can be passed to the JIT run-time code by calling
|
allocation. The result can be passed to the JIT run-time code by calling
|
||||||
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
|
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
|
||||||
which can then be processed by \fBpcre2_match()\fP. If the "fast path" JIT
|
which can then be processed by \fBpcre2_match()\fP or \fBpcre2_jit_match()\fP.
|
||||||
matcher, \fBpcre2_jit_match()\fP is used, the stack can be passed directly as
|
A maximum stack size of 512K to 1M should be more than enough for any pattern.
|
||||||
an argument. A maximum stack size of 512K to 1M should be more than enough for
|
For more details, see the
|
||||||
any pattern. For more details, see the
|
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2jit\fP
|
\fBpcre2jit\fP
|
||||||
.\"
|
.\"
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_MAKETABLES 3 "21 October 2014" "PCRE2 10.00"
|
.TH PCRE2_MAKETABLES 3 "24 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -12,10 +12,10 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH DESCRIPTION
|
.SH DESCRIPTION
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
This function builds a set of character tables for character values less than
|
This function builds a set of character tables for character code points that
|
||||||
256. These can be passed to \fBpcre2_compile()\fP in a compile context in order
|
are less than 256. These can be passed to \fBpcre2_compile()\fP in a compile
|
||||||
to override the internal, built-in tables (which were either defaulted or made
|
context in order to override the internal, built-in tables (which were either
|
||||||
by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
|
defaulted or made by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2_set_character_tables()\fP
|
\fBpcre2_set_character_tables()\fP
|
||||||
.\"
|
.\"
|
||||||
|
|
|
@ -255,6 +255,9 @@ OPTIONS
|
||||||
directory like this is an immediate end-of-file; in others it
|
directory like this is an immediate end-of-file; in others it
|
||||||
may provoke an error.
|
may provoke an error.
|
||||||
|
|
||||||
|
--depth-limit=number
|
||||||
|
See --match-limit below.
|
||||||
|
|
||||||
-e pattern, --regex=pattern, --regexp=pattern
|
-e pattern, --regex=pattern, --regexp=pattern
|
||||||
Specify a pattern to be matched. This option can be used mul-
|
Specify a pattern to be matched. This option can be used mul-
|
||||||
tiple times in order to specify several patterns. It can also
|
tiple times in order to specify several patterns. It can also
|
||||||
|
@ -477,32 +480,24 @@ OPTIONS
|
||||||
no short form for this option.
|
no short form for this option.
|
||||||
|
|
||||||
--match-limit=number
|
--match-limit=number
|
||||||
Processing some regular expression patterns can require a
|
Processing some regular expression patterns may take a very
|
||||||
very large amount of memory, leading in some cases to a pro-
|
long time to search for all possible matching strings. Others
|
||||||
gram crash if not enough is available. Other patterns may
|
may require a very large amount of memory. There are two
|
||||||
take a very long time to search for all possible matching
|
options that set resource limits for matching.
|
||||||
strings. The pcre2_match() function that is called by
|
|
||||||
pcre2grep to do the matching has two parameters that can
|
|
||||||
limit the resources that it uses.
|
|
||||||
|
|
||||||
The --match-limit option provides a means of limiting
|
The --match-limit option provides a means of limiting comput-
|
||||||
resource usage when processing patterns that are not going to
|
ing resource usage when processing patterns that are not
|
||||||
match, but which have a very large number of possibilities in
|
going to match, but which have a very large number of possi-
|
||||||
their search trees. The classic example is a pattern that
|
bilities in their search trees. The classic example is a pat-
|
||||||
uses nested unlimited repeats. Internally, PCRE2 uses a func-
|
tern that uses nested unlimited repeats. Internally, PCRE2
|
||||||
tion called match() which it calls repeatedly (sometimes
|
has a counter that is incremented each time around its main
|
||||||
recursively). The limit set by --match-limit is imposed on
|
processing loop. If the value set by --match-limit is
|
||||||
the number of times this function is called during a match,
|
reached, an error occurs.
|
||||||
which has the effect of limiting the amount of backtracking
|
|
||||||
that can take place.
|
|
||||||
|
|
||||||
The --recursion-limit option is similar to --match-limit, but
|
The --depth-limit option limits the depth of nested back-
|
||||||
instead of limiting the total number of times that match() is
|
tracking points, which in turn limits the amount of memory
|
||||||
called, it limits the depth of recursive calls, which in turn
|
that is used. This limit is of use only if it is set smaller
|
||||||
limits the amount of memory that can be used. The recursion
|
than --match-limit.
|
||||||
depth is a smaller number than the total number of calls,
|
|
||||||
because not all calls to match() are recursive. This limit is
|
|
||||||
of use only if it is set smaller than --match-limit.
|
|
||||||
|
|
||||||
There are no short forms for these options. The default set-
|
There are no short forms for these options. The default set-
|
||||||
tings are specified when the PCRE2 library is compiled, with
|
tings are specified when the PCRE2 library is compiled, with
|
||||||
|
@ -834,9 +829,9 @@ MATCHING ERRORS
|
||||||
such errors, pcre2grep gives up.
|
such errors, pcre2grep gives up.
|
||||||
|
|
||||||
The --match-limit option of pcre2grep can be used to set the overall
|
The --match-limit option of pcre2grep can be used to set the overall
|
||||||
resource limit; there is a second option called --recursion-limit that
|
resource limit; there is a second option called --depth-limit that sets
|
||||||
sets a limit on the amount of memory (usually stack) that is used (see
|
a limit on the amount of memory that is used (see the discussion of
|
||||||
the discussion of these options above).
|
these options above).
|
||||||
|
|
||||||
|
|
||||||
DIAGNOSTICS
|
DIAGNOSTICS
|
||||||
|
@ -862,5 +857,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 31 December 2016
|
Last updated: 21 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
|
|
|
@ -91,13 +91,13 @@ INPUT ENCODING
|
||||||
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
|
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
|
||||||
values greater than 0xffff cause an error to occur).
|
values greater than 0xffff cause an error to occur).
|
||||||
|
|
||||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
|
UTF-8 (in its original definition) is not capable of encoding values
|
||||||
such values can be handled by the 32-bit library. When testing this
|
greater than 0x7fffffff, but such values can be handled by the 32-bit
|
||||||
library in non-UTF mode with utf8_input set, if any character is pre-
|
library. When testing this library in non-UTF mode with utf8_input set,
|
||||||
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
|
if any character is preceded by the byte 0xff (which is an illegal byte
|
||||||
is added to the character's value. This is the only way of passing such
|
in UTF-8) 0x80000000 is added to the character's value. This is the
|
||||||
code points in a pattern string. For subject strings, using an escape
|
only way of passing such code points in a pattern string. For subject
|
||||||
sequence is preferable.
|
strings, using an escape sequence is preferable.
|
||||||
|
|
||||||
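The 0xff convention described above can be sketched as a small decoder. This is an illustration under stated assumptions, not pcre2test's own code: the function name is hypothetical, it accepts the original UTF-8 definition (sequences of up to six bytes), and it does almost no validation of the input.

```python
def decode_extended_utf8(data):
    """Decode original-definition UTF-8, honouring the 0xff prefix.

    A character preceded by the byte 0xff has 0x80000000 added to its
    value. Hypothetical helper for illustration; malformed input is not
    diagnosed.
    """
    values = []
    i = 0
    while i < len(data):
        add = 0
        if data[i] == 0xFF:             # illegal in UTF-8, used as a flag
            add = 0x80000000
            i += 1
        lead = data[i]
        if lead < 0x80:                 # single-byte character
            length, value = 1, lead
        else:
            # The number of leading 1 bits gives the sequence length.
            length, mask = 0, 0x80
            while lead & mask:
                length += 1
                mask >>= 1
            value = lead & (mask - 1)   # payload bits of the lead byte
        for byte in data[i + 1:i + length]:
            value = (value << 6) | (byte & 0x3F)
        i += length
        values.append(value + add)
    return values
```

For instance, the two bytes 0xff 0x41 decode to the single value 0x80000041, which only the 32-bit library can hold.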
|
|
||||||
COMMAND LINE OPTIONS
|
COMMAND LINE OPTIONS
|
||||||
|
@ -544,6 +544,7 @@ PATTERN MODIFIERS
|
||||||
/B bincode show binary code without lengths
|
/B bincode show binary code without lengths
|
||||||
callout_info show callout information
|
callout_info show callout information
|
||||||
debug same as info,fullbincode
|
debug same as info,fullbincode
|
||||||
|
framesize show matching frame size
|
||||||
fullbincode show binary code with lengths
|
fullbincode show binary code with lengths
|
||||||
/I info show info about compiled pattern
|
/I info show info about compiled pattern
|
||||||
hex unquoted characters are hexadecimal
|
hex unquoted characters are hexadecimal
|
||||||
|
@ -624,6 +625,10 @@ PATTERN MODIFIERS
|
||||||
last character. These lines are omitted if no starting or ending code
|
last character. These lines are omitted if no starting or ending code
|
||||||
units are recorded.
|
units are recorded.
|
||||||
|
|
||||||
|
The framesize modifier shows the size, in bytes, of the storage frames
|
||||||
|
used by pcre2_match() for handling backtracking. The size depends on
|
||||||
|
the number of capturing parentheses in the pattern.
|
||||||
|
|
||||||
The callout_info modifier requests information about all the callouts
|
The callout_info modifier requests information about all the callouts
|
||||||
in the pattern. A list of them is output at the end of any other infor-
|
in the pattern. A list of them is output at the end of any other infor-
|
||||||
mation that is requested. For each callout, either its number or string
|
mation that is requested. For each callout, either its number or string
|
||||||
|
@ -959,6 +964,7 @@ SUBJECT MODIFIERS
|
||||||
callout_fail=<n>[:<m>] control callout failure
|
callout_fail=<n>[:<m>] control callout failure
|
||||||
callout_none do not supply a callout function
|
callout_none do not supply a callout function
|
||||||
copy=<number or name> copy captured substring
|
copy=<number or name> copy captured substring
|
||||||
|
depth_limit=<n> set a depth limit
|
||||||
dfa use pcre2_dfa_match()
|
dfa use pcre2_dfa_match()
|
||||||
find_limits find match and recursion limits
|
find_limits find match and recursion limits
|
||||||
get=<number or name> extract captured substring
|
get=<number or name> extract captured substring
|
||||||
|
@ -972,7 +978,7 @@ SUBJECT MODIFIERS
|
||||||
offset=<n> set starting offset
|
offset=<n> set starting offset
|
||||||
offset_limit=<n> set offset limit
|
offset_limit=<n> set offset limit
|
||||||
ovector=<n> set size of output vector
|
ovector=<n> set size of output vector
|
||||||
recursion_limit=<n> set a recursion limit
|
recursion_limit=<n> obsolete synonym for depth_limit
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show startchar when relevant
|
startchar show startchar when relevant
|
||||||
startoffset=<n> same as offset=<n>
|
startoffset=<n> same as offset=<n>
|
||||||
|
@ -1188,133 +1194,132 @@ SUBJECT MODIFIERS
|
||||||
Providing a stack that is larger than the default 32K is necessary only
|
Providing a stack that is larger than the default 32K is necessary only
|
||||||
for very complicated patterns.
|
for very complicated patterns.
|
||||||
|
|
||||||
Setting match and recursion limits
|
Setting match and depth limits
|
||||||
|
|
||||||
The match_limit and recursion_limit modifiers set the appropriate lim-
|
The match_limit and depth_limit modifiers set the appropriate limits in
|
||||||
its in the match context. These values are ignored when the find_limits
|
the match context. These values are ignored when the find_limits modi-
|
||||||
modifier is specified.
|
fier is specified.
|
||||||
|
|
||||||
Finding minimum limits
|
Finding minimum limits
|
||||||
|
|
||||||
If the find_limits modifier is present, pcre2test calls pcre2_match()
|
If the find_limits modifier is present, pcre2test calls pcre2_match()
|
||||||
several times, setting different values in the match context via
|
several times, setting different values in the match context via
|
||||||
pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
|
pcre2_set_match_limit() and pcre2_set_depth_limit() until it finds the
|
||||||
the minimum values for each parameter that allow pcre2_match() to com-
|
minimum values for each parameter that allow pcre2_match() to complete
|
||||||
plete without error.
|
without error.
|
||||||
|
|
||||||
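One plausible way to search for such a minimum is to double a trial value until the match completes and then bisect. The sketch below is an assumption about the general technique, not a copy of what pcre2test does; the callback completes(limit) is a hypothetical stand-in for "set the limit in the match context, call pcre2_match(), and report whether it finished without a limit error".

```python
def find_minimum_limit(completes, upper_bound=2**32 - 1):
    """Return the smallest limit >= 1 for which completes(limit) is true.

    Assumes completes() is monotonic: once a limit is large enough,
    every larger limit also works.
    """
    low, high = 0, 1
    # Phase 1: double the trial value until the match completes.
    while not completes(high):
        if high >= upper_bound:
            raise ValueError("no limit up to upper_bound is sufficient")
        low = high
        high = min(high * 2, upper_bound)
    # Phase 2: bisect between the last failure and the first success.
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high
```

A search of this shape needs only a number of match attempts proportional to the logarithm of the final value, which matters because the minimum match limit can become very large for patterns with many matching possibilities.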
If JIT is being used, only the match limit is relevant. If DFA matching
|
If JIT is being used, only the match limit is relevant. If DFA matching
|
||||||
is being used, neither limit is relevant, and this modifier is ignored
|
is being used, only the depth limit is relevant, but at present this
|
||||||
(with a warning message).
|
modifier is ignored (with a warning message).
|
||||||
|
|
||||||
The match_limit number is a measure of the amount of backtracking that
|
The match_limit number is a measure of the amount of backtracking that
|
||||||
takes place, and learning the minimum value can be instructive. For
|
takes place, and learning the minimum value can be instructive. For
|
||||||
most simple matches, the number is quite small, but for patterns with
|
most simple matches, the number is quite small, but for patterns with
|
||||||
very large numbers of matching possibilities, it can become large very
|
very large numbers of matching possibilities, it can become large very
|
||||||
quickly with increasing length of subject string. The
|
quickly with increasing length of subject string. The depth_limit num-
|
||||||
match_limit_recursion number is a measure of how much stack (or, if
|
ber is a measure of how much memory for recording backtracking points
|
||||||
PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to
|
is needed to complete the match attempt.
|
||||||
complete the match attempt.
|
|
||||||
|
|
||||||
Showing MARK names
|
Showing MARK names
|
||||||
|
|
||||||
|
|
||||||
The mark modifier causes the names from backtracking control verbs that
|
The mark modifier causes the names from backtracking control verbs that
|
||||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||||
it is added to the non-match message.
|
it is added to the non-match message.
|
||||||
|
|
||||||
Showing memory usage
|
Showing memory usage
|
||||||
|
|
||||||
The memory modifier causes pcre2test to log all memory allocation and
|
The memory modifier causes pcre2test to log all memory allocation and
|
||||||
freeing calls that occur during a match operation.
|
freeing calls that occur during a match operation.
|
||||||
|
|
||||||
Setting a starting offset
|
Setting a starting offset
|
||||||
|
|
||||||
The offset modifier sets an offset in the subject string at which
|
The offset modifier sets an offset in the subject string at which
|
||||||
matching starts. Its value is a number of code units, not characters.
|
matching starts. Its value is a number of code units, not characters.
|
||||||
|
|
||||||
Setting an offset limit
|
Setting an offset limit
|
||||||
|
|
||||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||||
match cannot be found starting at or before this offset in the subject,
|
match cannot be found starting at or before this offset in the subject,
|
||||||
a "no match" return is given. The data value is a number of code units,
|
a "no match" return is given. The data value is a number of code units,
|
||||||
not characters. When this modifier is used, the use_offset_limit modi-
|
not characters. When this modifier is used, the use_offset_limit modi-
|
||||||
fier must have been set for the pattern; if not, an error is generated.
|
fier must have been set for the pattern; if not, an error is generated.
|
||||||
|
|
||||||
Setting the size of the output vector
|
Setting the size of the output vector
|
||||||
|
|
||||||
The ovector modifier applies only to the subject line in which it
|
The ovector modifier applies only to the subject line in which it
|
||||||
appears, though of course it can also be used to set a default in a
|
appears, though of course it can also be used to set a default in a
|
||||||
#subject command. It specifies the number of pairs of offsets that are
|
#subject command. It specifies the number of pairs of offsets that are
|
||||||
available for storing matching information. The default is 15.
|
available for storing matching information. The default is 15.
|
||||||
|
|
||||||
A value of zero is useful when testing the POSIX API because it causes
|
A value of zero is useful when testing the POSIX API because it causes
|
||||||
regexec() to be called with a NULL capture vector. When not testing the
|
regexec() to be called with a NULL capture vector. When not testing the
|
||||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||||
ate_from_pattern() to be called, in order to create a match block of
|
ate_from_pattern() to be called, in order to create a match block of
|
||||||
exactly the right size for the pattern. (It is not possible to create a
|
exactly the right size for the pattern. (It is not possible to create a
|
||||||
match block with a zero-length ovector; there is always at least one
|
match block with a zero-length ovector; there is always at least one
|
||||||
pair of offsets.)
|
pair of offsets.)
|
||||||
|
|
||||||
Passing the subject as zero-terminated
|
Passing the subject as zero-terminated
|
||||||
|
|
||||||
By default, the subject string is passed to a native API matching func-
|
By default, the subject string is passed to a native API matching func-
|
||||||
tion with its correct length. In order to test the facility for passing
|
tion with its correct length. In order to test the facility for passing
|
||||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
||||||
via the POSIX interface, this modifier has no effect, as there is no
|
via the POSIX interface, this modifier has no effect, as there is no
|
||||||
facility for passing a length.)
|
facility for passing a length.)
|
||||||
|
|
||||||
When testing pcre2_substitute(), this modifier also has the effect of
|
When testing pcre2_substitute(), this modifier also has the effect of
|
||||||
passing the replacement string as zero-terminated.
|
passing the replacement string as zero-terminated.
|
||||||
|
|
||||||
Passing a NULL context
|
Passing a NULL context
|
||||||
|
|
||||||
Normally, pcre2test passes a context block to pcre2_match(),
|
Normally, pcre2test passes a context block to pcre2_match(),
|
||||||
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
|
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
|
||||||
set, however, NULL is passed. This is for testing that the matching
|
set, however, NULL is passed. This is for testing that the matching
|
||||||
functions behave correctly in this case (they use default values). This
|
functions behave correctly in this case (they use default values). This
|
||||||
modifier cannot be used with the find_limits modifier or when testing
|
modifier cannot be used with the find_limits modifier or when testing
|
||||||
the substitution function.
|
the substitution function.
|
||||||
|
|
||||||
|
|
||||||
THE ALTERNATIVE MATCHING FUNCTION
|
THE ALTERNATIVE MATCHING FUNCTION
|
||||||
|
|
||||||
By default, pcre2test uses the standard PCRE2 matching function,
|
By default, pcre2test uses the standard PCRE2 matching function,
|
||||||
pcre2_match() to match each subject line. PCRE2 also supports an alter-
|
pcre2_match() to match each subject line. PCRE2 also supports an alter-
|
||||||
native matching function, pcre2_dfa_match(), which operates in a dif-
|
native matching function, pcre2_dfa_match(), which operates in a dif-
|
||||||
ferent way, and has some restrictions. The differences between the two
|
ferent way, and has some restrictions. The differences between the two
|
||||||
functions are described in the pcre2matching documentation.
|
functions are described in the pcre2matching documentation.
|
||||||
|
|
||||||
If the dfa modifier is set, the alternative matching function is used.
|
If the dfa modifier is set, the alternative matching function is used.
|
||||||
This function finds all possible matches at a given point in the sub-
|
This function finds all possible matches at a given point in the sub-
|
||||||
ject. If, however, the dfa_shortest modifier is set, processing stops
|
ject. If, however, the dfa_shortest modifier is set, processing stops
|
||||||
after the first match is found. This is always the shortest possible
|
after the first match is found. This is always the shortest possible
|
||||||
match.
|
match.
|
||||||
|
|
||||||
|
|
||||||
DEFAULT OUTPUT FROM pcre2test
|
DEFAULT OUTPUT FROM pcre2test
|
||||||
|
|
||||||
This section describes the output when the normal matching function,
|
This section describes the output when the normal matching function,
|
||||||
pcre2_match(), is being used.
|
pcre2_match(), is being used.
|
||||||
|
|
||||||
When a match succeeds, pcre2test outputs the list of captured sub-
|
When a match succeeds, pcre2test outputs the list of captured sub-
|
||||||
strings, starting with number 0 for the string that matched the whole
|
strings, starting with number 0 for the string that matched the whole
|
||||||
pattern. Otherwise, it outputs "No match" when the return is
|
pattern. Otherwise, it outputs "No match" when the return is
|
||||||
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
||||||
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
||||||
this is the entire substring that was inspected during the partial
|
this is the entire substring that was inspected during the partial
|
||||||
match; it may include characters before the actual match start if a
|
match; it may include characters before the actual match start if a
|
||||||
lookbehind assertion, \K, \b, or \B was involved.)
|
lookbehind assertion, \K, \b, or \B was involved.)
|
||||||
|
|
||||||
For any other return, pcre2test outputs the PCRE2 negative error number
|
For any other return, pcre2test outputs the PCRE2 negative error number
|
||||||
and a short descriptive phrase. If the error is a failed UTF string
|
and a short descriptive phrase. If the error is a failed UTF string
|
||||||
check, the code unit offset of the start of the failing character is
|
check, the code unit offset of the start of the failing character is
|
||||||
also output. Here is an example of an interactive pcre2test run.
|
also output. Here is an example of an interactive pcre2test run.
|
||||||
|
|
||||||
$ pcre2test
|
$ pcre2test
|
||||||
PCRE2 version 9.00 2014-05-10
|
PCRE2 version 10.22 2016-07-29
|
||||||
|
|
||||||
re> /^abc(\d+)/
|
re> /^abc(\d+)/
|
||||||
data> abc123
|
data> abc123
|
||||||
|
@ -1326,8 +1331,8 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
Unset capturing substrings that are not followed by one that is set are
|
Unset capturing substrings that are not followed by one that is set are
|
||||||
not shown by pcre2test unless the allcaptures modifier is specified. In
|
not shown by pcre2test unless the allcaptures modifier is specified. In
|
||||||
the following example, there are two capturing substrings, but when the
|
the following example, there are two capturing substrings, but when the
|
||||||
first data line is matched, the second, unset substring is not shown.
|
first data line is matched, the second, unset substring is not shown.
|
||||||
An "internal" unset substring is shown as "<unset>", as for the second
|
An "internal" unset substring is shown as "<unset>", as for the second
|
||||||
data line.
|
data line.
|
||||||
|
|
||||||
re> /(a)|(b)/
|
re> /(a)|(b)/
|
||||||
|
@ -1339,11 +1344,11 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
1: <unset>
|
1: <unset>
|
||||||
2: b
|
2: b
|
||||||
|
|
||||||
If the strings contain any non-printing characters, they are output as
|
If the strings contain any non-printing characters, they are output as
|
||||||
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
||||||
Otherwise they are output as \x{hh...} escapes. See below for the defi-
|
Otherwise they are output as \x{hh...} escapes. See below for the defi-
|
||||||
nition of non-printing characters. If the aftertext modifier is set,
|
nition of non-printing characters. If the aftertext modifier is set,
|
||||||
the output for substring 0 is followed by the rest of the subject
|
the output for substring 0 is followed by the rest of the subject
|
||||||
string, identified by "0+" like this:
|
string, identified by "0+" like this:
|
||||||
|
|
||||||
re> /cat/aftertext
|
re> /cat/aftertext
|
||||||
|
@ -1351,7 +1356,7 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
0: cat
|
0: cat
|
||||||
0+ aract
|
0+ aract
|
||||||
|
|
||||||
If global matching is requested, the results of successive matching
|
If global matching is requested, the results of successive matching
|
||||||
attempts are output in sequence, like this:
|
attempts are output in sequence, like this:
|
||||||
|
|
||||||
re> /\Bi(\w\w)/g
|
re> /\Bi(\w\w)/g
|
||||||
|
@ -1363,8 +1368,8 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
0: ipp
|
0: ipp
|
||||||
1: pp
|
1: pp
|
||||||
|
|
||||||
"No match" is output only if the first match attempt fails. Here is an
|
"No match" is output only if the first match attempt fails. Here is an
|
||||||
example of a failure message (the offset 4 that is specified by the
|
example of a failure message (the offset 4 that is specified by the
|
||||||
offset modifier is past the end of the subject string):
|
offset modifier is past the end of the subject string):
|
||||||
|
|
||||||
re> /xyz/
|
re> /xyz/
|
||||||
|
@ -1372,7 +1377,7 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
Error -24 (bad offset value)
|
Error -24 (bad offset value)
|
||||||
|
|
||||||
Note that whereas patterns can be continued over several lines (a plain
|
Note that whereas patterns can be continued over several lines (a plain
|
||||||
">" prompt is used for continuations), subject lines may not. However
|
">" prompt is used for continuations), subject lines may not. However
|
||||||
newlines can be included in a subject by means of the \n escape (or \r,
|
newlines can be included in a subject by means of the \n escape (or \r,
|
||||||
\r\n, etc., depending on the newline sequence setting).
|
\r\n, etc., depending on the newline sequence setting).
|
||||||
|
|
||||||
|
@ -1380,7 +1385,7 @@ DEFAULT OUTPUT FROM pcre2test
|
||||||
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||||
|
|
||||||
When the alternative matching function, pcre2_dfa_match(), is used, the
|
When the alternative matching function, pcre2_dfa_match(), is used, the
|
||||||
output consists of a list of all the matches that start at the first
|
output consists of a list of all the matches that start at the first
|
||||||
point in the subject where there is at least one match. For example:
|
point in the subject where there is at least one match. For example:
|
||||||
|
|
||||||
re> /(tang|tangerine|tan)/
|
re> /(tang|tangerine|tan)/
|
||||||
|
@ -1389,11 +1394,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||||
1: tang
|
1: tang
|
||||||
2: tan
|
2: tan
|
||||||
|
|
||||||
Using the normal matching function on this data finds only "tang". The
|
Using the normal matching function on this data finds only "tang". The
|
||||||
longest matching string is always given first (and numbered zero).
|
longest matching string is always given first (and numbered zero).
|
||||||
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
||||||
followed by the partially matching substring. Note that this is the
|
followed by the partially matching substring. Note that this is the
|
||||||
entire substring that was inspected during the partial match; it may
|
entire substring that was inspected during the partial match; it may
|
||||||
include characters before the actual match start if a lookbehind asser-
|
include characters before the actual match start if a lookbehind asser-
|
||||||
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
|
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
|
||||||
|
|
||||||
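The "all matches at the first matching point, longest first" behaviour can be mimicked for a plain list of literal alternatives. This sketch is not how pcre2_dfa_match() is implemented; the function name and the sample subject used in the note below are assumptions chosen for illustration.

```python
def all_matches_at_first_point(alternatives, subject):
    """List every literal alternative that matches at the first position
    in subject where at least one of them matches, longest first."""
    for start in range(len(subject)):
        found = {alt for alt in alternatives
                 if subject.startswith(alt, start)}
        if found:
            # Entry 0 is the longest match, as in the output above.
            return sorted(found, key=len, reverse=True)
    return []
```

Given the alternatives tang, tangerine and tan and a subject such as "yellow tangerine", the list comes back as tangerine, tang, tan. A backtracking matcher would instead stop at the first alternative that succeeds, which is "tang".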
|
@ -1409,16 +1414,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||||
1: tan
|
1: tan
|
||||||
0: tan
|
0: tan
|
||||||
|
|
||||||
The alternative matching function does not support substring capture,
|
The alternative matching function does not support substring capture,
|
||||||
so the modifiers that are concerned with captured substrings are not
|
so the modifiers that are concerned with captured substrings are not
|
||||||
relevant.
|
relevant.
|
||||||
|
|
||||||
|
|
||||||
RESTARTING AFTER A PARTIAL MATCH
|
RESTARTING AFTER A PARTIAL MATCH
|
||||||
|
|
||||||
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
||||||
TIAL return, indicating that the subject partially matched the pattern,
|
TIAL return, indicating that the subject partially matched the pattern,
|
||||||
you can restart the match with additional subject data by means of the
|
you can restart the match with additional subject data by means of the
|
||||||
dfa_restart modifier. For example:
|
dfa_restart modifier. For example:
|
||||||
|
|
||||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
|
@ -1427,45 +1432,45 @@ RESTARTING AFTER A PARTIAL MATCH
|
||||||
data> n05\=dfa,dfa_restart
|
data> n05\=dfa,dfa_restart
|
||||||
0: n05
|
0: n05
|
||||||
|
|
||||||
For further information about partial matching, see the pcre2partial
|
For further information about partial matching, see the pcre2partial
|
||||||
documentation.
|
documentation.
|
||||||
|
|
||||||
|
|
||||||
CALLOUTS
|
CALLOUTS
|
||||||
|
|
||||||
If the pattern contains any callout requests, pcre2test's callout func-
|
If the pattern contains any callout requests, pcre2test's callout func-
|
||||||
tion is called during matching unless callout_none is specified. This
|
tion is called during matching unless callout_none is specified. This
|
||||||
works with both matching functions.
|
works with both matching functions.
|
||||||
|
|
||||||
The callout function in pcre2test returns zero (carry on matching) by
|
The callout function in pcre2test returns zero (carry on matching) by
|
||||||
default, but you can use a callout_fail modifier in a subject line (as
|
default, but you can use a callout_fail modifier in a subject line (as
|
||||||
described above) to change this and other parameters of the callout.
|
described above) to change this and other parameters of the callout.
|
||||||
|
|
||||||
Inserting callouts can be helpful when using pcre2test to check compli-
|
Inserting callouts can be helpful when using pcre2test to check compli-
|
||||||
cated regular expressions. For further information about callouts, see
|
cated regular expressions. For further information about callouts, see
|
||||||
the pcre2callout documentation.
|
the pcre2callout documentation.
|
||||||
|
|
||||||
The output for callouts with numerical arguments and those with string
|
The output for callouts with numerical arguments and those with string
|
||||||
arguments is slightly different.
|
arguments is slightly different.
|
||||||
|
|
||||||
Callouts with numerical arguments
|
Callouts with numerical arguments
|
||||||
|
|
||||||
By default, the callout function displays the callout number, the start
|
By default, the callout function displays the callout number, the start
|
||||||
and current positions in the subject text at the callout time, and the
|
and current positions in the subject text at the callout time, and the
|
||||||
next pattern item to be tested. For example:
|
next pattern item to be tested. For example:
|
||||||
|
|
||||||
--->pqrabcdef
|
--->pqrabcdef
|
||||||
0 ^ ^ \d
|
0 ^ ^ \d
|
||||||
|
|
||||||
This output indicates that callout number 0 occurred for a match
|
This output indicates that callout number 0 occurred for a match
|
||||||
attempt starting at the fourth character of the subject string, when
|
attempt starting at the fourth character of the subject string, when
|
||||||
the pointer was at the seventh character, and when the next pattern
|
the pointer was at the seventh character, and when the next pattern
|
||||||
item was \d. Just one circumflex is output if the start and current
|
item was \d. Just one circumflex is output if the start and current
|
||||||
positions are the same, or if the current position precedes the start
|
positions are the same, or if the current position precedes the start
|
||||||
position, which can happen if the callout is in a lookbehind assertion.
|
position, which can happen if the callout is in a lookbehind assertion.
|
||||||
|
|
||||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
||||||
a result of the /auto_callout pattern modifier. In this case, instead
|
a result of the /auto_callout pattern modifier. In this case, instead
|
||||||
of showing the callout number, the offset in the pattern, preceded by a
|
of showing the callout number, the offset in the pattern, preceded by a
|
||||||
plus, is output. For example:
|
plus, is output. For example:
|
||||||
|
|
||||||
|
@ -1479,7 +1484,7 @@ CALLOUTS
|
||||||
0: E*
|
0: E*
|
||||||
|
|
||||||
If a pattern contains (*MARK) items, an additional line is output when-
|
If a pattern contains (*MARK) items, an additional line is output when-
|
||||||
ever a change of latest mark is passed to the callout function. For
|
ever a change of latest mark is passed to the callout function. For
|
||||||
example:
|
example:
|
||||||
|
|
||||||
re> /a(*MARK:X)bc/auto_callout
|
re> /a(*MARK:X)bc/auto_callout
|
||||||
|
@ -1493,17 +1498,17 @@ CALLOUTS
|
||||||
+12 ^ ^
|
+12 ^ ^
|
||||||
0: abc
|
0: abc
|
||||||
|
|
||||||
The mark changes between matching "a" and "b", but stays the same for
|
The mark changes between matching "a" and "b", but stays the same for
|
||||||
the rest of the match, so nothing more is output. If, as a result of
|
the rest of the match, so nothing more is output. If, as a result of
|
||||||
backtracking, the mark reverts to being unset, the text "<unset>" is
|
backtracking, the mark reverts to being unset, the text "<unset>" is
|
||||||
output.
|
output.
|
||||||
|
|
||||||
Callouts with string arguments
|
Callouts with string arguments
|
||||||
|
|
||||||
The output for a callout with a string argument is similar, except that
|
The output for a callout with a string argument is similar, except that
|
||||||
instead of outputting a callout number before the position indicators,
|
instead of outputting a callout number before the position indicators,
|
||||||
the callout string and its offset in the pattern string are output
|
the callout string and its offset in the pattern string are output
|
||||||
before the reflection of the subject string, and the subject string is
|
before the reflection of the subject string, and the subject string is
|
||||||
reflected for each callout. For example:
|
reflected for each callout. For example:
|
||||||
|
|
||||||
re> /^ab(?C'first')cd(?C"second")ef/
|
re> /^ab(?C'first')cd(?C"second")ef/
|
||||||
|
@ -1520,43 +1525,43 @@ CALLOUTS
|
||||||
NON-PRINTING CHARACTERS
|
NON-PRINTING CHARACTERS
|
||||||
|
|
||||||
When pcre2test is outputting text in the compiled version of a pattern,
|
When pcre2test is outputting text in the compiled version of a pattern,
|
||||||
bytes other than 32-126 are always treated as non-printing characters
|
bytes other than 32-126 are always treated as non-printing characters
|
||||||
and are therefore shown as hex escapes.
|
and are therefore shown as hex escapes.
|
||||||
|
|
||||||
When pcre2test is outputting text that is a matched part of a subject
|
When pcre2test is outputting text that is a matched part of a subject
|
||||||
string, it behaves in the same way, unless a different locale has been
|
string, it behaves in the same way, unless a different locale has been
|
||||||
set for the pattern (using the locale modifier). In this case, the
|
set for the pattern (using the locale modifier). In this case, the
|
||||||
isprint() function is used to distinguish printing and non-printing
|
isprint() function is used to distinguish printing and non-printing
|
||||||
characters.
|
characters.
|
||||||
|
|
||||||
|
|
||||||
SAVING AND RESTORING COMPILED PATTERNS
|
SAVING AND RESTORING COMPILED PATTERNS
|
||||||
|
|
||||||
It is possible to save compiled patterns on disc or elsewhere, and
|
It is possible to save compiled patterns on disc or elsewhere, and
|
||||||
reload them later, subject to a number of restrictions. JIT data cannot
|
reload them later, subject to a number of restrictions. JIT data cannot
|
||||||
be saved. The host on which the patterns are reloaded must be running
|
be saved. The host on which the patterns are reloaded must be running
|
||||||
the same version of PCRE2, with the same code unit width, and must also
|
the same version of PCRE2, with the same code unit width, and must also
|
||||||
have the same endianness, pointer width and PCRE2_SIZE type. Before
|
have the same endianness, pointer width and PCRE2_SIZE type. Before
|
||||||
compiled patterns can be saved they must be serialized, that is, con-
|
compiled patterns can be saved they must be serialized, that is, con-
|
||||||
verted to a stream of bytes. A single byte stream may contain any num-
|
verted to a stream of bytes. A single byte stream may contain any num-
|
||||||
ber of compiled patterns, but they must all use the same character
|
ber of compiled patterns, but they must all use the same character
|
||||||
tables. A single copy of the tables is included in the byte stream (its
|
tables. A single copy of the tables is included in the byte stream (its
|
||||||
size is 1088 bytes).
|
size is 1088 bytes).
|
||||||
|
|
||||||
The functions whose names begin with pcre2_serialize_ are used for
|
The functions whose names begin with pcre2_serialize_ are used for
|
||||||
serializing and de-serializing. They are described in the pcre2serial-
|
serializing and de-serializing. They are described in the pcre2serial-
|
||||||
ize documentation. In this section we describe the features of
|
ize documentation. In this section we describe the features of
|
||||||
pcre2test that can be used to test these functions.
|
pcre2test that can be used to test these functions.
|
||||||
|
|
||||||
When a pattern with push modifier is successfully compiled, it is
|
When a pattern with push modifier is successfully compiled, it is
|
||||||
pushed onto a stack of compiled patterns, and pcre2test expects the
|
pushed onto a stack of compiled patterns, and pcre2test expects the
|
||||||
next line to contain a new pattern (or command) instead of a subject
|
next line to contain a new pattern (or command) instead of a subject
|
||||||
line. By contrast, the pushcopy modifier causes a copy of the compiled
|
line. By contrast, the pushcopy modifier causes a copy of the compiled
|
||||||
pattern to be stacked, leaving the original available for immediate
|
pattern to be stacked, leaving the original available for immediate
|
||||||
matching. By using push and/or pushcopy, a number of patterns can be
|
matching. By using push and/or pushcopy, a number of patterns can be
|
||||||
compiled and retained. These modifiers are incompatible with posix, and
|
compiled and retained. These modifiers are incompatible with posix, and
|
||||||
control modifiers that act at match time are ignored (with a message)
|
control modifiers that act at match time are ignored (with a message)
|
||||||
for the stacked patterns. The jitverify modifier applies only at com-
|
for the stacked patterns. The jitverify modifier applies only at com-
|
||||||
pile time.
|
pile time.
|
||||||
|
|
||||||
The command
|
The command
|
||||||
|
@ -1564,21 +1569,21 @@ SAVING AND RESTORING COMPILED PATTERNS
|
||||||
#save <filename>
|
#save <filename>
|
||||||
|
|
||||||
causes all the stacked patterns to be serialized and the result written
|
causes all the stacked patterns to be serialized and the result written
|
||||||
to the named file. Afterwards, all the stacked patterns are freed. The
|
to the named file. Afterwards, all the stacked patterns are freed. The
|
||||||
command
|
command
|
||||||
|
|
||||||
#load <filename>
|
#load <filename>
|
||||||
|
|
||||||
reads the data in the file, and then arranges for it to be de-serial-
|
reads the data in the file, and then arranges for it to be de-serial-
|
||||||
ized, with the resulting compiled patterns added to the pattern stack.
|
ized, with the resulting compiled patterns added to the pattern stack.
|
||||||
The pattern on the top of the stack can be retrieved by the #pop com-
|
The pattern on the top of the stack can be retrieved by the #pop com-
|
||||||
mand, which must be followed by lines of subjects that are to be
|
mand, which must be followed by lines of subjects that are to be
|
||||||
matched with the pattern, terminated as usual by an empty line or end
|
matched with the pattern, terminated as usual by an empty line or end
|
||||||
of file. This command may be followed by a modifier list containing
|
of file. This command may be followed by a modifier list containing
|
||||||
only control modifiers that act after a pattern has been compiled. In
|
only control modifiers that act after a pattern has been compiled. In
|
||||||
particular, hex, posix, posix_nosub, push, and pushcopy are not
|
particular, hex, posix, posix_nosub, push, and pushcopy are not
|
||||||
allowed, nor are any option-setting modifiers. The JIT modifiers are,
|
allowed, nor are any option-setting modifiers. The JIT modifiers are,
|
||||||
however, permitted. Here is an example that saves and reloads two
|
however, permitted. Here is an example that saves and reloads two
|
||||||
patterns.
|
patterns.
|
||||||
|
|
||||||
/abc/push
|
/abc/push
|
||||||
|
@ -1591,10 +1596,10 @@ SAVING AND RESTORING COMPILED PATTERNS
|
||||||
#pop jit,bincode
|
#pop jit,bincode
|
||||||
abc
|
abc
|
||||||
|
|
||||||
If jitverify is used with #pop, it does not automatically imply jit,
|
If jitverify is used with #pop, it does not automatically imply jit,
|
||||||
which is different behaviour from when it is used on a pattern.
|
which is different behaviour from when it is used on a pattern.
|
||||||
|
|
||||||
The #popcopy command is analogous to the pushcopy modifier in that it
|
The #popcopy command is analogous to the pushcopy modifier in that it
|
||||||
makes current a copy of the topmost stack pattern, leaving the original
|
makes current a copy of the topmost stack pattern, leaving the original
|
||||||
still on the stack.
|
still on the stack.
|
||||||
|
|
||||||
|
@ -1614,5 +1619,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 28 December 2016
|
Last updated: 21 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
|
|