Documentation update.
parent 32bab50c01
commit 3aeb812180
@@ -1,10 +1,6 @@
Building PCRE2 without using autotools
--------------------------------------

This document has been converted from the PCRE1 document. I have removed a
number of sections about building in various environments, as they applied only
to PCRE1 and are probably out of date.

This document contains the following sections:

General
@@ -183,21 +179,9 @@ can skip ahead to the CMake section.

STACK SIZE IN WINDOWS ENVIRONMENTS

The default processor stack size of 1Mb in some Windows environments is too
small for matching patterns that need much recursion. In particular, test 2 may
fail because of this. Normally, running out of stack causes a crash, but there
have been cases where the test program has just died silently. See your linker
documentation for how to increase stack size if you experience problems. If you
are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
compiler, you can increase the stack size for pcre2test and pcre2grep by
setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
example). The Linux default of 8Mb is a reasonable choice for the stack, though
even that can be too small for some pattern/subject combinations.

PCRE2 has a compile configuration option to disable the use of stack for
recursion so that heap is used instead. However, pattern matching is
significantly slower when this is done. There is more about stack usage in the
"pcre2stack" documentation.
Prior to release 10.30 the default system stack size of 1Mb in some Windows
environments caused issues with some tests. This should no longer be the case
for 10.30 and later releases.


LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
@@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
recommended download site.

=============================
Last Updated: 13 October 2016
Last Updated: 17 March 2017
@@ -15,8 +15,8 @@ subscribe or manage your subscription here:

https://lists.exim.org/mailman/listinfo/pcre-dev

Please read the NEWS file if you are upgrading from a previous release.
The contents of this README file are:
Please read the NEWS file if you are upgrading from a previous release. The
contents of this README file are:

The PCRE2 APIs
Documentation for PCRE2
@@ -44,8 +44,8 @@ wrappers.

The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
man page). These can be found in a library called libpcre2-posix. Note that this
just provides a POSIX calling interface to PCRE2; the regular expressions
man page). These can be found in a library called libpcre2-posix. Note that
this just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.

@@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
Building PCRE2 on non-Unix-like systems
---------------------------------------

For a non-Unix-like system, please read the comments in the file
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
"make" you may be able to build PCRE2 using autotools in the same way as for
many Unix-like systems.
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
your system supports the use of "configure" and "make" you may be able to build
PCRE2 using autotools in the same way as for many Unix-like systems.

PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error.

. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
. If you do not want to make use of the default support for UTF-8 Unicode
character strings in the 8-bit library, UTF-16 Unicode character strings in
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
library, you can add --disable-unicode to the "configure" command. This
reduces the size of the libraries. It is not possible to configure one
library with Unicode support, and another without, in the same configuration.
It is also not possible to use --enable-ebcdic (see below) with Unicode
support, so if this option is set, you must also use --disable-unicode.

When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
not possible to use both --enable-unicode and --enable-ebcdic at the same
time.
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.

As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
@@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
--with-match-limit=500000

on the "configure" command. This is just the default; individual calls to
pcre2_match() can supply their own value. There is more discussion on the
pcre2api man page.
pcre2_match() can supply their own value. There is more discussion in the
pcre2api man page (search for pcre2_set_match_limit).

. There is a separate counter that limits the depth of recursive function calls
during a matching process. This also has a default of ten million, which is
essentially "unlimited". You can change the default by setting, for example,
. There is a separate counter that limits the depth of nested backtracking
during a matching process, which in turn limits the amount of memory that is
used. This also has a default of ten million, which is essentially
"unlimited". You can change the default by setting, for example,

--with-match-limit-recursion=500000
--with-match-limit-depth=5000

Recursive function calls use up the runtime stack; running out of stack can
cause programs to crash in strange ways. There is a discussion about stack
sizes in the pcre2stack man page.
There is more discussion in the pcre2api man page (search for
pcre2_set_depth_limit).

. In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the
@@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used.

. You can build PCRE2 so that its internal match() function that is called from
pcre2_match() does not call itself recursively. Instead, it uses memory
blocks obtained from the heap to save data that would otherwise be saved on
the stack. To build PCRE2 like this, use

--disable-stack-for-recursion

on the "configure" command. PCRE2 runs more slowly in this mode, but it may
be necessary in environments with limited stack sizes. This applies only to
the normal execution of the pcre2_match() function; if JIT support is being
successfully used, it is not relevant. Equally, it does not apply to
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
discussion about stack sizes in the pcre2stack man page.

. For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of
tables for ASCII encoding that is part of the distribution. If you specify
@@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.

. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
which caused pcre2_match() to use individual blocks on the heap for
backtracking instead of recursive function calls (which use the stack). This
is now obsolete since pcre2_match() was refactored always to use the heap (in
a much more efficient way than before). This option is retained for backwards
compatibility, but has no effect other than to output a warning.

The "configure" script builds the following files for the basic C library:

. Makefile the makefile that builds the library
@@ -662,25 +654,32 @@ Unicode support is enabled.
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair are for general cases and Unicode support, respectively.

Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.

Test 14 contains a number of tests that must not be run with JIT. They check,
Test 14 contains some special UTF and UCP tests that give different output for
the different widths.

Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the intepretive
matcher.

Test 15 is run only when JIT support is not available. It checks that an
Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.

Test 16 is run only when JIT support is available. It checks JIT complete and
Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.

Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.

Test 19 checks the serialization functions by writing a set of compiled
Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.

Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.


Character tables
----------------
@@ -866,4 +865,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 01 November 2016
Last updated: 17 March 2017
@@ -109,7 +109,7 @@ lose performance.
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
<b>pcre2_compile()</b>. This causes an compile time error if a pattern contains
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
a UTF-setting sequence.
</P>
<P>
@@ -137,7 +137,8 @@ large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page.
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
be used to restrict the amount of memory that is used.
</P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
@@ -166,7 +167,7 @@ listing), and the short pages for individual functions, are concatenated in
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage
pcre2stack discussion of stack and memory usage
pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support
@@ -189,9 +190,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 16 October 2015
Last updated: 27 March 2017
<br>
Copyright © 1997-2015 University of Cambridge.
Copyright © 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
@@ -36,20 +36,21 @@ for success and non-zero otherwise. The arguments are:
<i>callout_data</i> User data that is passed to the callback
</pre>
The <i>callback()</i> function is passed a pointer to a data block containing
the following fields:
the following fields (not necessarily in this order):
<pre>
<i>version</i> Block version number
<i>pattern_position</i> Offset to next item in pattern
<i>next_item_length</i> Length of next item in pattern
<i>callout_number</i> Number for numbered callouts
<i>callout_string_offset</i> Offset to string within pattern
<i>callout_string_length</i> Length of callout string
<i>callout_string</i> Points to callout string or is NULL
uint32_t <i>version</i> Block version number
uint32_t <i>callout_number</i> Number for numbered callouts
PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
PCRE2_SIZE <i>callout_string_length</i> Length of callout string
PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
</pre>
The second argument is the callout data that was passed to
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
for success. Any other value causes the pattern scan to stop, with the value
being passed back as the result of <b>pcre2_callout_enumerate()</b>.
The second argument passed to the <b>callback()</b> function is the callout data
that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
function must return zero for success. Any other value causes the pattern scan
to stop, with the value being passed back as the result of
<b>pcre2_callout_enumerate()</b>.
</P>
<P>
There is a complete description of the PCRE2 native API in the
@@ -26,7 +26,9 @@ DESCRIPTION
</b><br>
<P>
This function frees the memory used for a compiled pattern, including any
memory used by the JIT compiler.
memory used by the JIT compiler. If the compiled pattern was created by a call
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
also freed.
</P>
<P>
There is a complete description of the PCRE2 native API in the
@@ -37,19 +37,24 @@ arguments are:
<i>erroffset</i> Where to put an error offset
<i>ccontext</i> Pointer to a compile context or NULL
</pre>
The length of the string and any error offset that is returned are in code
units, not characters. A compile context is needed only if you want to change
The length of the pattern and any error offset that is returned are in code
units, not characters. A compile context is needed only if you want to provide
custom memory allocation functions, or to provide an external function for
system stack size checking, or to change one or more of these parameters:
<pre>
What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
What \R matches (Unicode newlines, or CR, LF, CRLF only);
PCRE2's character tables;
The newline character sequence;
The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed.
</pre>
or provide an external function for stack size checking. The option bits are:
The option bits are:
<pre>
PCRE2_ANCHORED Force pattern anchoring
PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
PCRE2_ALT_VERBNAMES Process backslashes in verb names
PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
@@ -71,19 +76,21 @@ or provide an external function for stack size checking. The option bits are:
(only relevant if PCRE2_UTF is set)
PCRE2_UCP Use Unicode properties for \d, \w, etc.
PCRE2_UNGREEDY Invert greediness of quantifiers
PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
PCRE2_UTF Treat pattern and subjects as UTF strings
</pre>
PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP
and related options.
PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options.
</P>
<P>
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
</P>
<P>
There is a complete description of the PCRE2 native API in the
There is a complete description of the PCRE2 native API, with more detail on
each option, in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
page, and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
@@ -45,10 +45,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \R matches by default:
PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF
PCRE2_CONFIG_JIT Availability of just-in-time compiler
support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information about the target
architecture for the JIT compiler
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
PCRE2_CONFIG_JIT Availability of just-in-time compiler support (1=yes 0=no)
PCRE2_CONFIG_JITTARGET Information (a string) about the target architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
@@ -58,11 +57,9 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
0=heap)
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no)
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes 0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
PCRE2_CONFIG_VERSION The PCRE2 version (a string)
</pre>
@@ -31,8 +31,9 @@ DESCRIPTION
<P>
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
just once (<i>not</i> Perl-compatible). (The Perl-compatible matching function
is <b>pcre2_match()</b>.) The arguments for this function are:
just once (except when processing lookaround assertions). This function is
<i>not</i> Perl-compatible (the Perl-compatible matching function is
<b>pcre2_match()</b>). The arguments for this function are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
@@ -45,22 +46,18 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
<i>wscount</i> Number of elements in the vector
</pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
up a callout function or specify the recursion limit. The <i>length</i> and
<i>startoffset</i> values are code units, not characters. The options are:
up a callout function or specify the recursion depth limit. The <i>length</i>
and <i>startoffset</i> values are code units, not characters. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
was set at compile time)
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
even if there is a full match as well
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match
</pre>
@@ -34,11 +34,11 @@ errors are negative numbers. The arguments are:
<i>buffer</i> where to put the message
<i>bufflen</i> the length of the buffer (code units)
</pre>
The function returns the length of the message, excluding the trailing zero, or
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
this case, the returned message is truncated (but still with a trailing zero).
If <i>errorcode</i> does not contain a recognized error code number, the
negative value PCRE2_ERROR_BADDATA is returned.
The function returns the length of the message in code units, excluding the
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
too small. In this case, the returned message is truncated (but still with a
trailing zero). If <i>errorcode</i> does not contain a recognized error code
number, the negative value PCRE2_ERROR_BADDATA is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the
@@ -32,10 +32,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
which can then be processed by <b>pcre2_match()</b>. If the "fast path" JIT
matcher, <b>pcre2_jit_match()</b> is used, the stack can be passed directly as
an argument. A maximum stack size of 512K to 1M should be more than enough for
any pattern. For more details, see the
which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
A maximum stack size of 512K to 1M should be more than enough for any pattern.
For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
</P>
@@ -25,10 +25,10 @@ SYNOPSIS
DESCRIPTION
</b><br>
<P>
This function builds a set of character tables for character values less than
256. These can be passed to <b>pcre2_compile()</b> in a compile context in order
to override the internal, built-in tables (which were either defaulted or made
by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
This function builds a set of character tables for character code points that
are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
context in order to override the internal, built-in tables (which were either
defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
page. You might want to do this if you are using a non-standard locale.
</P>
@@ -2575,8 +2575,8 @@ The internal recursion limit was reached.
A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
is passed as the first argument, with the remaining two arguments specifying a
code unit buffer and its length, into which the text message is placed. Note
that the message is returned in code units of the appropriate width for the
code unit buffer and its length in code units, into which the text message is
placed. The message is returned in code units of the appropriate width for the
library that is being used.
</P>
<P>
@@ -3265,9 +3265,9 @@ Cambridge, England.
</P>
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 December 2016
Last updated: 21 March 2017
<br>
Copyright © 1997-2016 University of Cambridge.
Copyright © 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
@@ -280,6 +280,10 @@ operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error.
</P>
<P>
<b>--depth-limit</b>=<i>number</i>
See <b>--match-limit</b> below.
</P>
<P>
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a
@@ -498,29 +502,22 @@ used. There is no short form for this option.
</P>
<P>
<b>--match-limit</b>=<i>number</i>
Processing some regular expression patterns can require a very large amount of
memory, leading in some cases to a program crash if not enough is available.
Other patterns may take a very long time to search for all possible matching
strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
do the matching has two parameters that can limit the resources that it uses.
Processing some regular expression patterns may take a very long time to search
for all possible matching strings. Others may require a very large amount of
memory. There are two options that set resource limits for matching.
<br>
<br>
The <b>--match-limit</b> option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a very
large number of possibilities in their search trees. The classic example is a
pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function
called <b>match()</b> which it calls repeatedly (sometimes recursively). The
limit set by <b>--match-limit</b> is imposed on the number of times this
function is called during a match, which has the effect of limiting the amount
of backtracking that can take place.
The <b>--match-limit</b> option provides a means of limiting computing resource
usage when processing patterns that are not going to match, but which have a
very large number of possibilities in their search trees. The classic example
is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
counter that is incremented each time around its main processing loop. If the
value set by <b>--match-limit</b> is reached, an error occurs.
<br>
<br>
The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but
instead of limiting the total number of times that <b>match()</b> is called, it
limits the depth of recursive calls, which in turn limits the amount of memory
that can be used. The recursion depth is a smaller number than the total number
of calls, because not all calls to <b>match()</b> are recursive. This limit is
of use only if it is set smaller than <b>--match-limit</b>.
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
which in turn limits the amount of memory that is used. This limit is of use
only if it is set smaller than <b>--match-limit</b>.
<br>
<br>
There are no short forms for these options. The default settings are specified
@@ -843,9 +840,9 @@ there are more than 20 such errors, <b>pcre2grep</b> gives up.
</P>
<P>
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
overall resource limit; there is a second option called <b>--recursion-limit</b>
that sets a limit on the amount of memory (usually stack) that is used (see the
discussion of these options above).
overall resource limit; there is a second option called <b>--depth-limit</b>
that sets a limit on the amount of memory that is used (see the discussion of
these options above).
</P>
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
<P>
|
@ -870,9 +867,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 31 December 2016
|
||||
Last updated: 21 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -170,20 +170,24 @@ the application to apply the JIT optimization by calling
|
|||
<b>pcre2_jit_compile()</b> is ignored.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match and recursion limits
|
||||
Setting match and backtracking depth limits
|
||||
</b><br>
|
||||
<P>
|
||||
The caller of <b>pcre2_match()</b> can set a limit on the number of times the
|
||||
internal <b>match()</b> function is called and on the maximum depth of
|
||||
recursive calls. These facilities are provided to catch runaway matches that
|
||||
are provoked by patterns with huge matching trees (a typical example is a
|
||||
pattern with nested unlimited repeats) and to avoid running out of system stack
|
||||
by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
|
||||
gives an error return. The limits can also be set by items at the start of the
|
||||
pattern of the form
|
||||
The pcre2_match() function contains a counter that is incremented every time it
|
||||
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
|
||||
this counter, which therefore limits the amount of computing resource used for
|
||||
a match. The maximum depth of nested backtracking can also be limited, and this
|
||||
restricts the amount of heap memory that is used.
|
||||
</P>
|
||||
<P>
|
||||
These facilities are provided to catch runaway matches that are provoked by
|
||||
patterns with huge matching trees (a typical example is a pattern with nested
|
||||
unlimited repeats applied to a long string that does not match). When one of
|
||||
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
|
||||
can also be set by items at the start of the pattern of the form
|
||||
<pre>
|
||||
(*LIMIT_MATCH=d)
|
||||
(*LIMIT_RECURSION=d)
|
||||
(*LIMIT_DEPTH=d)
|
||||
</pre>
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||
|
@ -192,10 +196,15 @@ limits set by the programmer, but not raise them. If there is more than one
|
|||
setting of one of these limits, the lower value is used.
|
||||
</P>
|
||||
<P>
|
||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||
still recognized for backwards compatibility.
|
||||
</P>
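The rule for limit items at the start of a pattern (a pattern can lower a limit set by the caller but not raise it, and the lowest of several settings is used) can be sketched in Python. This is only an illustration of the rule as described, not PCRE2's own parser: the function name `effective_limits` is hypothetical, and the sketch assumes the limit items appear consecutively at the very start of the pattern.

```python
import re

# Matches one leading limit item; LIMIT_RECURSION is the pre-10.30 name
# for LIMIT_DEPTH and is treated as a synonym here.
_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(MATCH|DEPTH|RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_match_limit, caller_depth_limit):
    """Hypothetical helper: return (match_limit, depth_limit, rest_of_pattern).

    Starts from the caller's limits and applies each leading limit item,
    keeping the lower value so that a pattern can never raise a limit.
    """
    limits = {"MATCH": caller_match_limit, "DEPTH": caller_depth_limit}
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        name = "DEPTH" if item.group(1) == "RECURSION" else item.group(1)
        limits[name] = min(limits[name], int(item.group(2)))
        pos = item.end()
    return limits["MATCH"], limits["DEPTH"], pattern[pos:]
```

For instance, with caller limits of 1000 and 100, a pattern beginning `(*LIMIT_DEPTH=5000)` leaves the depth limit at 100, because the requested value is higher than the one the caller set.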
|
||||
<P>
|
||||
The match limit is used (but in a different way) when JIT is being used, but it
|
||||
is not relevant, and is ignored, when matching with <b>pcre2_dfa_match()</b>.
|
||||
However, the recursion limit is relevant for DFA matching, which does use some
|
||||
function recursion, in particular, for recursions within the pattern.
|
||||
However, the depth limit is relevant for DFA matching, which uses function
|
||||
recursion for recursions within the pattern. In this case, the depth limit
|
||||
controls the amount of system stack that is used.
|
||||
<a name="newlines"></a></P>
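To make the effect of the two limits concrete, here is a toy backtracking matcher in Python for the single pattern `^(a+)+b`, a typical case of nested unlimited repeats. It is a sketch under stated assumptions and not how pcre2_match() is implemented: the names `match_nested_repeat` and `LimitExceeded` are invented for illustration, the step counter only approximates the idea of counting trips round a main loop, and the depth counter stands in for the nesting of backtracking points.

```python
class LimitExceeded(Exception):
    """Raised by the toy matcher when one of its limits is reached."""


def match_nested_repeat(subject, match_limit=10_000_000, depth_limit=10_000_000):
    """Toy matcher for ^(a+)+b. Returns (matched, steps_used)."""
    steps = 0

    def group(pos, depth):
        nonlocal steps
        if depth > depth_limit:
            raise LimitExceeded("depth limit")
        # Greedy a+: find the longest run of 'a', then back off one at a time.
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        for stop in range(end, pos, -1):
            steps += 1
            if steps > match_limit:
                raise LimitExceeded("match limit")
            # After one repetition of the group, either 'b' follows or the
            # group is repeated again from the current position.
            if subject.startswith("b", stop) or group(stop, depth + 1):
                return True
        return False

    return group(0, 1), steps
```

A matching subject such as "aaab" needs a single step, whereas a non-matching run of n letters "a" needs 2^n - 1 steps in this toy model, so a modest match limit stops the search early. A small depth limit stops it as soon as the nesting becomes too deep.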
|
||||
<br><b>
|
||||
Newline conventions
|
||||
|
@ -235,8 +244,8 @@ The newline convention affects where the circumflex and dollar assertions are
|
|||
true. It also affects the interpretation of the dot metacharacter when
|
||||
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
||||
what the \R escape sequence matches. By default, this is any Unicode newline
|
||||
sequence, for Perl compatibility. However, this can be changed; see the
|
||||
description of \R in the section entitled
|
||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||
section and the description of \R in the section entitled
|
||||
<a href="#newlineseq">"Newline sequences"</a>
|
||||
below. A change of \R setting can be combined with a change of newline
|
||||
convention.
|
||||
|
@ -254,7 +263,7 @@ corresponding to PCRE2_BSR_UNICODE.
|
|||
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
||||
<P>
|
||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||
character code rather than ASCII or Unicode (typically a mainframe system). In
|
||||
character code instead of ASCII or Unicode (typically a mainframe system). In
|
||||
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
||||
environment these characters may have different code values, and there are no
|
||||
code points greater than 255.
|
||||
|
@ -318,11 +327,11 @@ that character may have. This use of backslash as an escape character applies
|
|||
both inside and outside character classes.
|
||||
</P>
|
||||
<P>
|
||||
For example, if you want to match a * character, you write \* in the pattern.
|
||||
This escaping action applies whether or not the following character would
|
||||
otherwise be interpreted as a metacharacter, so it is always safe to precede a
|
||||
non-alphanumeric with backslash to specify that it stands for itself. In
|
||||
particular, if you want to match a backslash, you write \\.
|
||||
For example, if you want to match a * character, you must write \* in the
|
||||
pattern. This escaping action applies whether or not the following character
|
||||
would otherwise be interpreted as a metacharacter, so it is always safe to
|
||||
precede a non-alphanumeric with backslash to specify that it stands for itself.
|
||||
In particular, if you want to match a backslash, you write \\.
|
||||
</P>
|
||||
<P>
|
||||
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
||||
|
@ -353,7 +362,7 @@ An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
|
|||
by \E later in the pattern, the literal interpretation continues to the end of
|
||||
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
|
||||
a character class, this causes an error, because the character class is not
|
||||
terminated.
|
||||
terminated by a closing square bracket.
|
||||
<a name="digitsafterbackslash"></a></P>
|
||||
<br><b>
|
||||
Non-printing characters
|
||||
|
@ -476,9 +485,9 @@ a hexadecimal digit appears between \x{ and }, or if there is no terminating
|
|||
<P>
|
||||
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
|
||||
described only when it is followed by two hexadecimal digits. Otherwise, it
|
||||
matches a literal "x" character. In this mode mode, support for code points
|
||||
greater than 256 is provided by \u, which must be followed by four hexadecimal
|
||||
digits; otherwise it matches a literal "u" character.
|
||||
matches a literal "x" character. In this mode, support for code points greater
|
||||
than 256 is provided by \u, which must be followed by four hexadecimal digits;
|
||||
otherwise it matches a literal "u" character.
|
||||
</P>
|
||||
<P>
|
||||
Characters whose value is less than 256 can be defined by either of the two
|
||||
|
@ -493,12 +502,10 @@ Constraints on character values
|
|||
Characters that are specified using octal or hexadecimal numbers are
|
||||
limited to certain values, as follows:
|
||||
<pre>
|
||||
8-bit non-UTF mode less than 0x100
|
||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
||||
16-bit non-UTF mode less than 0x10000
|
||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
||||
32-bit non-UTF mode less than 0x100000000
|
||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
||||
8-bit non-UTF mode no greater than 0xff
|
||||
16-bit non-UTF mode no greater than 0xffff
|
||||
32-bit non-UTF mode no greater than 0xffffffff
|
||||
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||
</pre>
|
||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
||||
"surrogate" codepoints), and 0xffef.
|
||||
|
@ -525,7 +532,7 @@ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
|
|||
handler and used to modify the case of following characters. By default, PCRE2
|
||||
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
|
||||
is set, \U matches a "U" character, and \u can be used to define a character
|
||||
by code point, as described in the previous section.
|
||||
by code point, as described above.
|
||||
</P>
|
||||
<br><b>
|
||||
Absolute and relative back references
|
||||
|
@ -714,7 +721,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
|
|||
sequences that match characters with specific properties are available. In
|
||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||
characters whose codepoints are less than 256, but they do work in this mode.
|
||||
The extra escape sequences are:
|
||||
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
|
||||
may be encountered. These are all treated as being in the Common script and
|
||||
with an unassigned type. The extra escape sequences are:
|
||||
<pre>
|
||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||
|
@ -2214,16 +2223,8 @@ except that it does not cause the current matching position to be changed.
|
|||
Assertion subpatterns are not capturing subpatterns. If such an assertion
|
||||
contains capturing subpatterns within it, these are counted for the purposes of
|
||||
numbering the capturing subpatterns in the whole pattern. However, substring
|
||||
capturing is carried out only for positive assertions. (Perl sometimes, but not
|
||||
always, does do capturing in negative assertions.)
|
||||
</P>
|
||||
<P>
|
||||
WARNING: If a positive assertion containing one or more capturing subpatterns
|
||||
succeeds, but failure to match later in the pattern causes backtracking over
|
||||
this assertion, the captures within the assertion are reset only if no higher
|
||||
numbered captures are already set. This is, unfortunately, a fundamental
|
||||
limitation of the current implementation; it may get removed in a future
|
||||
reworking.
|
||||
capturing is normally carried out only for positive assertions (but see the
|
||||
discussion of conditional subpatterns below).
|
||||
</P>
|
||||
<P>
|
||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||
|
@ -2601,6 +2602,12 @@ presence of at least one letter in the subject. If a letter is found, the
|
|||
subject is matched against the first alternative; otherwise it is matched
|
||||
against the second. This pattern matches strings in one of the two forms
|
||||
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
||||
</P>
|
||||
<P>
|
||||
For Perl compatibility, if an assertion that is a condition contains capturing
|
||||
subpatterns, any capturing that occurs is retained afterwards, for both
|
||||
positive and negative assertions. (Compare non-conditional assertions, when
|
||||
captures are retained only for positive assertions.)
|
||||
<a name="comments"></a></P>
|
||||
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
|
||||
<P>
|
||||
|
@ -2773,93 +2780,57 @@ is the actual recursive call.
|
|||
Differences in recursion processing between PCRE2 and Perl
|
||||
</b><br>
|
||||
<P>
|
||||
Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
|
||||
(like Python, but unlike Perl), a recursive subpattern call is always treated
|
||||
as an atomic group. That is, once it has matched some of the subject string, it
|
||||
is never re-entered, even if it contains untried alternatives and there is a
|
||||
subsequent matching failure. This can be illustrated by the following pattern,
|
||||
which purports to match a palindromic string that contains an odd number of
|
||||
characters (for example, "a", "aba", "abcba", "abcdcba"):
|
||||
<pre>
|
||||
^(.|(.)(?1)\2)$
|
||||
</pre>
|
||||
The idea is that it either matches a single character, or two identical
|
||||
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
|
||||
it does not if the pattern is longer than three characters. Consider the
|
||||
subject string "abcba":
|
||||
Some former differences between PCRE2 and Perl no longer exist.
|
||||
</P>
|
||||
<P>
|
||||
At the top level, the first character is matched, but as it is not at the end
|
||||
of the string, the first alternative fails; the second alternative is taken
|
||||
and the recursion kicks in. The recursive call to subpattern 1 successfully
|
||||
matches the next character ("b"). (Note that the beginning and end of line
|
||||
tests are not part of the recursion).
|
||||
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
|
||||
a recursive subpattern call was always treated as an atomic group. That is,
|
||||
once it had matched some of the subject string, it was never re-entered, even
|
||||
if it contained untried alternatives and there was a subsequent matching
|
||||
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||
</P>
|
||||
<P>
|
||||
Back at the top level, the next character ("c") is compared with what
|
||||
subpattern 2 matched, which was "a". This fails. Because the recursion is
|
||||
treated as an atomic group, there are now no backtracking points, and so the
|
||||
entire match fails. (Perl is able, at this point, to re-enter the recursion and
|
||||
try the second alternative.) However, if the pattern is written with the
|
||||
alternatives in the other order, things are different:
|
||||
<pre>
|
||||
^((.)(?1)\2|.)$
|
||||
</pre>
|
||||
This time, the recursing alternative is tried first, and continues to recurse
|
||||
until it runs out of characters, at which point the recursion fails. But this
|
||||
time we do have another alternative to try at the higher level. That is the big
|
||||
difference: in the previous case the remaining alternative is at a deeper
|
||||
recursion level, which PCRE2 cannot use.
|
||||
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||
is a matching failure later in the pattern. This is now compatible with the way
|
||||
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||
enclose it in an atomic group.
|
||||
</P>
|
||||
<P>
|
||||
To change the pattern so that it matches all palindromic strings, not just
|
||||
those with an odd number of characters, it is tempting to change the pattern to
|
||||
this:
|
||||
Supporting backtracking into recursions simplifies certain types of recursive
|
||||
pattern. For example, this pattern matches palindromic strings:
|
||||
<pre>
|
||||
^((.)(?1)\2|.?)$
|
||||
</pre>
|
||||
Again, this works in Perl, but not in PCRE2, and for the same reason. When a
|
||||
deeper recursion has matched a single character, it cannot be entered again in
|
||||
order to match an empty string. The solution is to separate the two cases, and
|
||||
write out the odd and even cases as alternatives at the higher level:
|
||||
The second branch in the group matches a single central character in the
|
||||
palindrome when there are an odd number of characters, or nothing when there
|
||||
are an even number of characters, but in order to work it has to be able to try
|
||||
the second case when the rest of the pattern match fails. If you want to match
|
||||
typical palindromic phrases, the pattern has to ignore all non-word characters,
|
||||
which can be done like this:
|
||||
<pre>
|
||||
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
|
||||
</pre>
|
||||
If you want to match typical palindromic phrases, the pattern has to ignore all
|
||||
non-word characters, which can be done like this:
|
||||
<pre>
|
||||
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
|
||||
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
|
||||
</pre>
|
||||
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
||||
man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
|
||||
use of the possessive quantifier *+ to avoid backtracking into sequences of
|
||||
non-word characters. Without this, PCRE2 takes a great deal longer (ten times
|
||||
or more) to match typical phrases, and Perl takes so long that you think it has
|
||||
gone into a loop.
|
||||
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
|
||||
avoid backtracking into sequences of non-word characters. Without this, PCRE2
|
||||
takes a great deal longer (ten times or more) to match typical phrases, and
|
||||
Perl takes so long that you think it has gone into a loop.
|
||||
</P>
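The difference between atomic and non-atomic recursive calls can be modelled with a small Python sketch of the pattern `^((.)(?1)\2|.?)$`. This is an illustrative model only, not PCRE2 code: the function names are hypothetical, a generator yields each end position that group 1 can reach (that is, each backtracking alternative), and the `atomic` flag makes a recursive call keep only its first result, as releases before 10.30 are described as doing.

```python
from itertools import islice


def group1_ends(s, pos, atomic):
    """Yield, in backtracking order, every position at which group 1 of
    ^((.)(?1)\\2|.?)$ can end when it starts matching at pos."""
    if pos < len(s):
        ch = s[pos]                              # (.) is capture group 2
        inner = group1_ends(s, pos + 1, atomic)  # (?1) is the recursive call
        # An atomic call is never re-entered: only its first result is used.
        candidates = islice(inner, 1) if atomic else inner
        for mid in candidates:
            if mid < len(s) and s[mid] == ch:    # \2 must repeat group 2
                yield mid + 1
        yield pos + 1                            # second branch: .? takes one character
    yield pos                                    # second branch: .? matches nothing


def matches_palindrome(s, atomic=False):
    """True if the modelled pattern matches the whole of s."""
    return any(end == len(s) for end in group1_ends(s, 0, atomic))
```

In this model "abba" matches only when recursive calls can be re-entered: with `atomic=True` the innermost call has already consumed a character and cannot be asked to match the empty string instead, which is the failure the older text describes for even-length palindromes.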
|
||||
<P>
|
||||
<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
|
||||
string does not start with a palindrome that is shorter than the entire string.
|
||||
For example, although "abcba" is correctly matched, if the subject is "ababa",
|
||||
PCRE2 finds the palindrome "aba" at the start, then fails at top level because
|
||||
the end of the string does not follow. Once again, it cannot jump back into the
|
||||
recursion to try other alternatives, so the entire match fails.
|
||||
</P>
|
||||
<P>
|
||||
The second way in which PCRE2 and Perl differ in their recursion processing is
|
||||
in the handling of captured values. In Perl, when a subpattern is called
|
||||
recursively or as a subpattern (see the next section), it has no access to any
|
||||
values that were captured outside the recursion, whereas in PCRE2 these values
|
||||
can be referenced. Consider this pattern:
|
||||
Another way in which PCRE2 and Perl used to differ in their recursion
|
||||
processing is in the handling of captured values. Formerly in Perl, when a
|
||||
subpattern was called recursively or as a subpattern (see the next section), it
|
||||
had no access to any values that were captured outside the recursion, whereas
|
||||
in PCRE2 these values can be referenced. Consider this pattern:
|
||||
<pre>
|
||||
^(.)(\1|a(?2))
|
||||
</pre>
|
||||
In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
|
||||
then in the second group, when the back reference \1 fails to match "b", the
|
||||
second alternative matches "a" and then recurses. In the recursion, \1 does
|
||||
now match "b" and so the whole match succeeds. In Perl, the pattern fails to
|
||||
match because inside the recursive call \1 cannot access the externally set
|
||||
value.
|
||||
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||
the second group, when the back reference \1 fails to match "b", the second
|
||||
alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||
later versions (I tried 5.024) it now works.
|
||||
<a name="subpatternsassubroutines"></a></P>
|
||||
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
||||
<P>
|
||||
|
@ -2886,11 +2857,10 @@ is used, it does match "sense and responsibility" as well as the other two
|
|||
strings. Another example is given in the discussion of DEFINE above.
|
||||
</P>
|
||||
<P>
|
||||
All subroutine calls, whether recursive or not, are always treated as atomic
|
||||
groups. That is, once a subroutine has matched some of the subject string, it
|
||||
is never re-entered, even if it contains untried alternatives and there is a
|
||||
subsequent matching failure. Any capturing parentheses that are set during the
|
||||
subroutine call revert to their previous values afterwards.
|
||||
Like recursions, subroutine calls used to be treated as atomic, but this
|
||||
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
|
||||
occur. However, any capturing parentheses that are set during the subroutine
|
||||
call revert to their previous values afterwards.
|
||||
</P>
|
||||
<P>
|
||||
Processing options such as case-independence are fixed when a subpattern is
|
||||
|
@ -2998,17 +2968,10 @@ The doubling is removed before the string is passed to the callout function.
|
|||
<a name="backtrackcontrol"></a></P>
|
||||
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
|
||||
are still described in the Perl documentation as "experimental and subject to
|
||||
change or removal in a future version of Perl". It goes on to say: "Their usage
|
||||
in production code should be noted to avoid problems during upgrades." The same
|
||||
remarks apply to the PCRE2 features described in this section.
|
||||
</P>
|
||||
<P>
|
||||
The new verbs make use of what was previously invalid syntax: an opening
|
||||
parenthesis followed by an asterisk. They are generally of the form (*VERB) or
|
||||
(*VERB:NAME). Some verbs take either form, possibly behaving differently
|
||||
depending on whether or not a name is present.
|
||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||
terminology) that modify the behaviour of backtracking during matching. They
|
||||
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||
possibly behaving differently depending on whether or not a name is present.
|
||||
</P>
|
||||
<P>
|
||||
By default, for compatibility with Perl, a name is any sequence of characters
|
||||
|
@ -3040,7 +3003,7 @@ not there. Any number of these verbs may occur in a pattern.
|
|||
<P>
|
||||
Since these verbs are specifically related to backtracking, most of them can be
|
||||
used only when the pattern is to be matched using the traditional matching
|
||||
function, because these use a backtracking algorithm. With the exception of
|
||||
function, because that uses a backtracking algorithm. With the exception of
|
||||
(*FAIL), which behaves like a failing negative assertion, the backtracking
|
||||
control verbs cause an error if encountered by the DFA matching function.
|
||||
</P>
|
||||
|
@ -3178,11 +3141,11 @@ Verbs that act after backtracking
|
|||
The following verbs do nothing when they are encountered. Matching continues
|
||||
with what follows, but if there is no subsequent match, causing a backtrack to
|
||||
the verb, a failure is forced. That is, backtracking cannot pass to the left of
|
||||
the verb. However, when one of these verbs appears inside an atomic group
|
||||
(which includes any group that is called as a subroutine) or in an assertion
|
||||
that is true, its effect is confined to that group, because once the group has
|
||||
been matched, there is never any backtracking into it. In this situation,
|
||||
backtracking has to jump to the left of the entire atomic group or assertion.
|
||||
the verb. However, when one of these verbs appears inside an atomic group or in
|
||||
an assertion that is true, its effect is confined to that group, because once
|
||||
the group has been matched, there is never any backtracking into it. In this
|
||||
situation, backtracking has to jump to the left of the entire atomic group or
|
||||
assertion.
|
||||
</P>
|
||||
<P>
|
||||
These verbs differ in exactly what kind of failure occurs when backtracking
|
||||
|
@ -3246,8 +3209,8 @@ expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
|
|||
as (*COMMIT).
|
||||
</P>
|
||||
<P>
|
||||
The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
|
||||
It is like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
|
||||
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
||||
ignoring those set by (*PRUNE) or (*THEN).
|
||||
<pre>
|
||||
|
@ -3452,9 +3415,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 27 December 2016
|
||||
Last updated: 18 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -55,7 +55,10 @@ The facility for saving and restoring compiled patterns is intended for use
|
|||
within individual applications. As such, the data supplied to
|
||||
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
|
||||
arbitrary external sources. There is only some simple consistency checking, not
|
||||
complete validation of what is being re-loaded.
|
||||
complete validation of what is being re-loaded. Corrupted data may cause
|
||||
undefined results. For example, if the length field of a pattern in the
|
||||
serialized data is corrupted, the deserializing code may read beyond the end of
|
||||
the byte stream that is passed to it.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
||||
<P>
|
||||
|
@ -190,9 +193,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 24 May 2016
|
||||
Last updated: 21 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -126,12 +126,13 @@ character values up to 0x7fffffff. Each character is placed in one 16-bit or
|
|||
to occur).
|
||||
</P>
|
||||
<P>
|
||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but such
|
||||
values can be handled by the 32-bit library. When testing this library in
|
||||
non-UTF mode with <b>utf8_input</b> set, if any character is preceded by the
|
||||
byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the
|
||||
character's value. This is the only way of passing such code points in a
|
||||
pattern string. For subject strings, using an escape sequence is preferable.
|
||||
UTF-8 (in its original definition) is not capable of encoding values greater
|
||||
than 0x7fffffff, but such values can be handled by the 32-bit library. When
|
||||
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
|
||||
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
|
||||
0x80000000 is added to the character's value. This is the only way of passing
|
||||
such code points in a pattern string. For subject strings, using an escape
|
||||
sequence is preferable.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
|
||||
<P>
|
||||
|
@ -602,6 +603,7 @@ about the pattern:
|
|||
/B bincode show binary code without lengths
|
||||
callout_info show callout information
|
||||
debug same as info,fullbincode
|
||||
framesize show matching frame size
|
||||
fullbincode show binary code with lengths
|
||||
/I info show info about compiled pattern
|
||||
hex unquoted characters are hexadecimal
|
||||
|
@ -689,6 +691,11 @@ not necessarily the last character. These lines are omitted if no starting or
|
|||
ending code units are recorded.
|
||||
</P>
|
||||
<P>
|
||||
The <b>framesize</b> modifier shows the size, in bytes, of the storage frames
|
||||
used by <b>pcre2_match()</b> for handling backtracking. The size depends on the
|
||||
number of capturing parentheses in the pattern.
|
||||
</P>
|
||||
<P>
|
||||
The <b>callout_info</b> modifier requests information about all the callouts in
|
||||
the pattern. A list of them is output at the end of any other information that
|
||||
is requested. For each callout, either its number or string is given, followed
|
||||
|
@ -1073,6 +1080,7 @@ pattern.
|
|||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
depth_limit=<n> set a depth limit
|
||||
dfa use <b>pcre2_dfa_match()</b>
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
|
@ -1086,7 +1094,7 @@ pattern.
|
|||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
recursion_limit=<n> obsolete synonym for depth_limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
|
@ -1320,10 +1328,10 @@ stack that is larger than the default 32K is necessary only for very
|
|||
complicated patterns.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match and recursion limits
|
||||
Setting match and depth limits
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
|
||||
The <b>match_limit</b> and <b>depth_limit</b> modifiers set the appropriate
|
||||
limits in the match context. These values are ignored when the
|
||||
<b>find_limits</b> modifier is specified.
|
||||
</P>
|
||||
|
@ -1333,23 +1341,23 @@ Finding minimum limits
|
|||
<P>
|
||||
If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
|
||||
<b>pcre2_match()</b> several times, setting different values in the match
|
||||
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
|
||||
context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_depth_limit()</b>
|
||||
until it finds the minimum values for each parameter that allow
|
||||
<b>pcre2_match()</b> to complete without error.
|
||||
</P>
|
||||
<P>
|
||||
If JIT is being used, only the match limit is relevant. If DFA matching is
|
||||
being used, neither limit is relevant, and this modifier is ignored (with a
|
||||
warning message).
|
||||
being used, only the depth limit is relevant, but at present this modifier is
|
||||
ignored (with a warning message).
|
||||
</P>
|
||||
<P>
|
||||
The <i>match_limit</i> number is a measure of the amount of backtracking
|
||||
that takes place, and learning the minimum value can be instructive. For most
|
||||
simple matches, the number is quite small, but for patterns with very large
|
||||
numbers of matching possibilities, it can become large very quickly with
|
||||
increasing length of subject string. The <i>match_limit_recursion</i> number is
|
||||
a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
|
||||
heap) memory is needed to complete the match attempt.
|
||||
increasing length of subject string. The <i>depth_limit</i> number is
|
||||
a measure of how much memory for recording backtracking points is needed to
|
||||
complete the match attempt.
|
||||
</P>
|
||||
<br><b>
|
||||
Showing MARK names
|
||||
|
@ -1466,7 +1474,7 @@ code unit offset of the start of the failing character is also output. Here is
|
|||
an example of an interactive <b>pcre2test</b> run.
|
||||
<pre>
|
||||
$ pcre2test
|
||||
PCRE2 version 9.00 2014-05-10
|
||||
PCRE2 version 10.22 2016-07-29
|
||||
|
||||
re> /^abc(\d+)/
|
||||
data> abc123
|
||||
|
@ -1779,9 +1787,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 December 2016
|
||||
Last updated: 21 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
1603
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_CONFIG 3 "20 April 2014" "PCRE2 10.0"
|
||||
.TH PCRE2_CONFIG 3 "24 March 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -31,10 +31,13 @@ point to a uint32_t integer variable. The available codes are:
|
|||
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
|
||||
PCRE2_BSR_UNICODE
|
||||
PCRE2_BSR_ANYCRLF
|
||||
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
|
||||
.\" JOIN
|
||||
PCRE2_CONFIG_JIT Availability of just-in-time compiler
|
||||
support (1=yes 0=no)
|
||||
PCRE2_CONFIG_JITTARGET Information about the target archi-
|
||||
tecture for the JIT compiler
|
||||
.\" JOIN
|
||||
PCRE2_CONFIG_JITTARGET Information (a string) about the target
|
||||
architecture for the JIT compiler
|
||||
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
||||
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
||||
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
||||
|
@ -44,9 +47,9 @@ point to a uint32_t integer variable. The available codes are:
|
|||
PCRE2_NEWLINE_ANY
|
||||
PCRE2_NEWLINE_ANYCRLF
|
||||
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
||||
PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
|
||||
PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
|
||||
0=heap)
|
||||
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
|
||||
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
|
||||
.\" JOIN
|
||||
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
|
||||
0=no)
|
||||
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_DFA_MATCH 3 "23 December 2016" "PCRE2 10.23"
|
||||
.TH PCRE2_DFA_MATCH 3 "24 March 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -19,8 +19,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
|||
.sp
|
||||
This function matches a compiled regular expression against a given subject
|
||||
string, using an alternative matching algorithm that scans the subject string
|
||||
just once (\fInot\fP Perl-compatible). (The Perl-compatible matching function
|
||||
is \fBpcre2_match()\fP.) The arguments for this function are:
|
||||
just once (except when processing lookaround assertions). This function is
|
||||
\fInot\fP Perl-compatible (the Perl-compatible matching function is
|
||||
\fBpcre2_match()\fP). The arguments for this function are:
|
||||
.sp
|
||||
\fIcode\fP Points to the compiled pattern
|
||||
\fIsubject\fP Points to the subject string
|
||||
|
@ -33,22 +34,26 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
|
|||
\fIwscount\fP Number of elements in the vector
|
||||
.sp
|
||||
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
||||
up a callout function or specify the recursion limit. The \fIlength\fP and
|
||||
\fIstartoffset\fP values are code units, not characters. The options are:
|
||||
up a callout function or specify the recursion depth limit. The \fIlength\fP
|
||||
and \fIstartoffset\fP values are code units, not characters. The options are:
|
||||
.sp
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
.\" JOIN
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
|
||||
is not a valid match
|
||||
.\" JOIN
|
||||
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
|
||||
validity (only relevant if PCRE2_UTF
|
||||
was set at compile time)
|
||||
.\" JOIN
|
||||
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
|
||||
match even if there is a full match
|
||||
.\" JOIN
|
||||
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
|
||||
match if no full matches are found
|
||||
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
|
||||
even if there is a full match as well
|
||||
match if no full matches are found
|
||||
PCRE2_DFA_RESTART Restart after a partial match
|
||||
PCRE2_DFA_SHORTEST Return only the shortest match
|
||||
.sp
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
|
||||
.TH PCRE2_GET_ERROR_MESSAGE 3 "24 March 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -22,11 +22,11 @@ errors are negative numbers. The arguments are:
|
|||
\fIbuffer\fP where to put the message
|
||||
\fIbufflen\fP the length of the buffer (code units)
|
||||
.sp
|
||||
The function returns the length of the message, excluding the trailing zero, or
|
||||
the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
|
||||
this case, the returned message is truncated (but still with a trailing zero).
|
||||
If \fIerrorcode\fP does not contain a recognized error code number, the
|
||||
negative value PCRE2_ERROR_BADDATA is returned.
|
||||
The function returns the length of the message in code units, excluding the
|
||||
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
|
||||
too small. In this case, the returned message is truncated (but still with a
|
||||
trailing zero). If \fIerrorcode\fP does not contain a recognized error code
|
||||
number, the negative value PCRE2_ERROR_BADDATA is returned.
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_JIT_STACK_CREATE 3 "03 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_JIT_STACK_CREATE 3 "24 March 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -20,10 +20,9 @@ maximum size to which it is allowed to grow. The final argument is a general
|
|||
context, for memory allocation functions, or NULL for standard memory
|
||||
allocation. The result can be passed to the JIT run-time code by calling
|
||||
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
|
||||
which can then be processed by \fBpcre2_match()\fP. If the "fast path" JIT
|
||||
matcher, \fBpcre2_jit_match()\fP is used, the stack can be passed directly as
|
||||
an argument. A maximum stack size of 512K to 1M should be more than enough for
|
||||
any pattern. For more details, see the
|
||||
which can then be processed by \fBpcre2_match()\fP or \fBpcre2_jit_match()\fP.
|
||||
A maximum stack size of 512K to 1M should be more than enough for any pattern.
|
||||
For more details, see the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
.\"
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_MAKETABLES 3 "21 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_MAKETABLES 3 "24 March 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -12,10 +12,10 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
|||
.SH DESCRIPTION
|
||||
.rs
|
||||
.sp
|
||||
This function builds a set of character tables for character values less than
|
||||
256. These can be passed to \fBpcre2_compile()\fP in a compile context in order
|
||||
to override the internal, built-in tables (which were either defaulted or made
|
||||
by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
|
||||
This function builds a set of character tables for character code points that
|
||||
are less than 256. These can be passed to \fBpcre2_compile()\fP in a compile
|
||||
context in order to override the internal, built-in tables (which were either
|
||||
defaulted or made by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
|
||||
.\" HREF
|
||||
\fBpcre2_set_character_tables()\fP
|
||||
.\"
|
||||
|
|
|
@ -255,6 +255,9 @@ OPTIONS
|
|||
directory like this is an immediate end-of-file; in others it
|
||||
may provoke an error.
|
||||
|
||||
--depth-limit=number
|
||||
See --match-limit below.
|
||||
|
||||
-e pattern, --regex=pattern, --regexp=pattern
|
||||
Specify a pattern to be matched. This option can be used mul-
|
||||
tiple times in order to specify several patterns. It can also
|
||||
|
@ -477,32 +480,24 @@ OPTIONS
|
|||
no short form for this option.
|
||||
|
||||
--match-limit=number
|
||||
Processing some regular expression patterns can require a
|
||||
very large amount of memory, leading in some cases to a pro-
|
||||
gram crash if not enough is available. Other patterns may
|
||||
take a very long time to search for all possible matching
|
||||
strings. The pcre2_match() function that is called by
|
||||
pcre2grep to do the matching has two parameters that can
|
||||
limit the resources that it uses.
|
||||
Processing some regular expression patterns may take a very
|
||||
long time to search for all possible matching strings. Others
|
||||
may require a very large amount of memory. There are two
|
||||
options that set resource limits for matching.
|
||||
|
||||
The --match-limit option provides a means of limiting
|
||||
resource usage when processing patterns that are not going to
|
||||
match, but which have a very large number of possibilities in
|
||||
their search trees. The classic example is a pattern that
|
||||
uses nested unlimited repeats. Internally, PCRE2 uses a func-
|
||||
tion called match() which it calls repeatedly (sometimes
|
||||
recursively). The limit set by --match-limit is imposed on
|
||||
the number of times this function is called during a match,
|
||||
which has the effect of limiting the amount of backtracking
|
||||
that can take place.
|
||||
The --match-limit option provides a means of limiting comput-
|
||||
ing resource usage when processing patterns that are not
|
||||
going to match, but which have a very large number of possi-
|
||||
bilities in their search trees. The classic example is a pat-
|
||||
tern that uses nested unlimited repeats. Internally, PCRE2
|
||||
has a counter that is incremented each time around its main
|
||||
processing loop. If the value set by --match-limit is
|
||||
reached, an error occurs.
|
||||
|
||||
The --recursion-limit option is similar to --match-limit, but
|
||||
instead of limiting the total number of times that match() is
|
||||
called, it limits the depth of recursive calls, which in turn
|
||||
limits the amount of memory that can be used. The recursion
|
||||
depth is a smaller number than the total number of calls,
|
||||
because not all calls to match() are recursive. This limit is
|
||||
of use only if it is set smaller than --match-limit.
|
||||
The --depth-limit option limits the depth of nested back-
|
||||
tracking points, which in turn limits the amount of memory
|
||||
that is used. This limit is of use only if it is set smaller
|
||||
than --match-limit.
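
The difference between the two limits can be illustrated with a toy backtracking matcher. This is only a sketch under stated assumptions, not PCRE2's implementation: `toy_match`, `toy_limits`, and the `TOY_*` codes are hypothetical names, the pattern language is restricted to literals, `.`, and `x*`, and matching is anchored at the start of the subject. The step counter plays the role of the match limit, and the number of pending repeat choices plays the role of the depth limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch (not PCRE2 constants). */
enum {
    TOY_MATCH = 1,
    TOY_NOMATCH = 0,
    TOY_ERR_MATCHLIMIT = -1,
    TOY_ERR_DEPTHLIMIT = -2
};

typedef struct {
    unsigned long match_limit;  /* maximum number of matching steps */
    unsigned long depth_limit;  /* maximum nesting of backtracking points */
    unsigned long steps;        /* steps used by the last call */
} toy_limits;

/* Match pattern p at subject position s. Each call counts as one step;
   depth is the number of repeats that still have alternatives to try. */
static int toy_here(const char *p, const char *s, unsigned long depth,
                    toy_limits *lim)
{
    if (++lim->steps > lim->match_limit)
        return TOY_ERR_MATCHLIMIT;
    if (p[0] == '\0')
        return TOY_MATCH;

    if (p[1] == '*') {
        const char *t = s;
        /* A repeat creates a new backtracking point. */
        if (depth + 1 > lim->depth_limit)
            return TOY_ERR_DEPTHLIMIT;
        /* Greedy: take as many characters as possible first. */
        while (*t != '\0' && (p[0] == '.' || *t == p[0]))
            t++;
        for (;;) {
            int rc = toy_here(p + 2, t, depth + 1, lim);
            if (rc != TOY_NOMATCH)
                return rc;          /* match, or a limit was hit */
            if (t == s)
                return TOY_NOMATCH; /* no shorter repeat left to try */
            t--;                    /* backtrack: give one character back */
        }
    }

    if (*s != '\0' && (p[0] == '.' || p[0] == *s))
        return toy_here(p + 1, s + 1, depth, lim);
    return TOY_NOMATCH;
}

/* Entry point: resets the step counter and matches at the subject start. */
int toy_match(const char *pattern, const char *subject, toy_limits *lim)
{
    assert(pattern != NULL && subject != NULL && lim != NULL);
    lim->steps = 0;
    return toy_here(pattern, subject, 0, lim);
}
```

In this sketch the depth can never exceed the number of steps, which mirrors the remark that a depth limit is useful only when it is smaller than the match limit. Nested repeats applied to a long non-matching subject drive the step count up quickly while the depth stays small.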
|
||||
|
||||
There are no short forms for these options. The default set-
|
||||
tings are specified when the PCRE2 library is compiled, with
|
||||
|
@ -834,9 +829,9 @@ MATCHING ERRORS
|
|||
such errors, pcre2grep gives up.
|
||||
|
||||
The --match-limit option of pcre2grep can be used to set the overall
|
||||
resource limit; there is a second option called --recursion-limit that
|
||||
sets a limit on the amount of memory (usually stack) that is used (see
|
||||
the discussion of these options above).
|
||||
resource limit; there is a second option called --depth-limit that sets
|
||||
a limit on the amount of memory that is used (see the discussion of
|
||||
these options above).
|
||||
|
||||
|
||||
DIAGNOSTICS
|
||||
|
@ -862,5 +857,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 31 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
Last updated: 21 March 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
|
|
|
@ -91,13 +91,13 @@ INPUT ENCODING
|
|||
ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
|
||||
values greater than 0xffff cause an error to occur).
|
||||
|
||||
UTF-8 is not capable of encoding values greater than 0x7fffffff, but
|
||||
such values can be handled by the 32-bit library. When testing this
|
||||
library in non-UTF mode with utf8_input set, if any character is pre-
|
||||
ceded by the byte 0xff (which is an illegal byte in UTF-8) 0x80000000
|
||||
is added to the character's value. This is the only way of passing such
|
||||
code points in a pattern string. For subject strings, using an escape
|
||||
sequence is preferable.
|
||||
UTF-8 (in its original definition) is not capable of encoding values
|
||||
greater than 0x7fffffff, but such values can be handled by the 32-bit
|
||||
library. When testing this library in non-UTF mode with utf8_input set,
|
||||
if any character is preceded by the byte 0xff (which is an illegal byte
|
||||
in UTF-8) 0x80000000 is added to the character's value. This is the
|
||||
only way of passing such code points in a pattern string. For subject
|
||||
strings, using an escape sequence is preferable.
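
The 0xff convention described above can be sketched as a small decoder. This is a hedged illustration, not pcre2test's own code: `decode_test_char` is a hypothetical helper that accepts the original 1-to-6-byte UTF-8 forms (values up to 0x7fffffff), treats a leading 0xff byte as a request to add 0x80000000, and does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character from buf (len bytes available).
   Returns the number of bytes consumed, or 0 if the input is malformed
   or truncated. On success the code point is stored in *value. */
size_t decode_test_char(const unsigned char *buf, size_t len, uint32_t *value)
{
    size_t i = 0;
    size_t extra;
    uint32_t v;
    uint32_t high = 0;
    unsigned char b;

    if (len == 0)
        return 0;

    /* 0xff is never valid in UTF-8, so it can act as a marker byte. */
    if (buf[0] == 0xff) {
        high = 0x80000000u;
        i = 1;
        if (len == 1)
            return 0;
    }

    b = buf[i++];
    if (b < 0x80)      { v = b;        extra = 0; }
    else if (b < 0xc0) { return 0; }               /* stray continuation byte */
    else if (b < 0xe0) { v = b & 0x1f; extra = 1; }
    else if (b < 0xf0) { v = b & 0x0f; extra = 2; }
    else if (b < 0xf8) { v = b & 0x07; extra = 3; }
    else if (b < 0xfc) { v = b & 0x03; extra = 4; }
    else if (b < 0xfe) { v = b & 0x01; extra = 5; }
    else               { return 0; }               /* 0xfe and 0xff cannot lead */

    if (len - i < extra)
        return 0;

    while (extra-- > 0) {
        unsigned char c = buf[i++];
        if ((c & 0xc0) != 0x80)
            return 0;                              /* not a continuation byte */
        v = (v << 6) | (uint32_t)(c & 0x3f);
    }

    /* v is at most 0x7fffffff, so adding the high bit cannot overflow. */
    *value = v + high;
    return i;
}
```

With this scheme the six-byte sequence fd bf bf bf bf bf decodes to 0x7fffffff, and the same sequence preceded by ff yields 0xffffffff, which covers the full 32-bit range.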
|
||||
|
||||
|
||||
COMMAND LINE OPTIONS
|
||||
|
@ -544,6 +544,7 @@ PATTERN MODIFIERS
|
|||
/B bincode show binary code without lengths
|
||||
callout_info show callout information
|
||||
debug same as info,fullbincode
|
||||
framesize show matching frame size
|
||||
fullbincode show binary code with lengths
|
||||
/I info show info about compiled pattern
|
||||
hex unquoted characters are hexadecimal
|
||||
|
@ -624,6 +625,10 @@ PATTERN MODIFIERS
|
|||
last character. These lines are omitted if no starting or ending code
|
||||
units are recorded.
|
||||
|
||||
The framesize modifier shows the size, in bytes, of the storage frames
|
||||
used by pcre2_match() for handling backtracking. The size depends on
|
||||
the number of capturing parentheses in the pattern.
|
||||
|
||||
The callout_info modifier requests information about all the callouts
|
||||
in the pattern. A list of them is output at the end of any other infor-
|
||||
mation that is requested. For each callout, either its number or string
|
||||
|
@ -959,6 +964,7 @@ SUBJECT MODIFIERS
|
|||
callout_fail=<n>[:<m>] control callout failure
|
||||
callout_none do not supply a callout function
|
||||
copy=<number or name> copy captured substring
|
||||
depth_limit=<n> set a depth limit
|
||||
dfa use pcre2_dfa_match()
|
||||
find_limits find match and recursion limits
|
||||
get=<number or name> extract captured substring
|
||||
|
@ -972,7 +978,7 @@ SUBJECT MODIFIERS
|
|||
offset=<n> set starting offset
|
||||
offset_limit=<n> set offset limit
|
||||
ovector=<n> set size of output vector
|
||||
recursion_limit=<n> set a recursion limit
|
||||
recursion_limit=<n> obsolete synonym for depth_limit
|
||||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
|
@ -1188,133 +1194,132 @@ SUBJECT MODIFIERS
|
|||
Providing a stack that is larger than the default 32K is necessary only
|
||||
for very complicated patterns.
|
||||
|
||||
Setting match and recursion limits
|
||||
Setting match and depth limits
|
||||
|
||||
The match_limit and recursion_limit modifiers set the appropriate lim-
|
||||
its in the match context. These values are ignored when the find_limits
|
||||
modifier is specified.
|
||||
The match_limit and depth_limit modifiers set the appropriate limits in
|
||||
the match context. These values are ignored when the find_limits modi-
|
||||
fier is specified.
|
||||
|
||||
Finding minimum limits
|
||||
|
||||
If the find_limits modifier is present, pcre2test calls pcre2_match()
|
||||
several times, setting different values in the match context via
|
||||
pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
|
||||
the minimum values for each parameter that allow pcre2_match() to com-
|
||||
plete without error.
|
||||
pcre2_set_match_limit() and pcre2_set_depth_limit() until it finds the
|
||||
minimum values for each parameter that allow pcre2_match() to complete
|
||||
without error.
|
||||
|
||||
If JIT is being used, only the match limit is relevant. If DFA matching
|
||||
is being used, neither limit is relevant, and this modifier is ignored
|
||||
(with a warning message).
|
||||
is being used, only the depth limit is relevant, but at present this
|
||||
modifier is ignored (with a warning message).
|
||||
|
||||
The match_limit number is a measure of the amount of backtracking that
|
||||
takes place, and learning the minimum value can be instructive. For
|
||||
most simple matches, the number is quite small, but for patterns with
|
||||
very large numbers of matching possibilities, it can become large very
|
||||
quickly with increasing length of subject string. The
|
||||
match_limit_recursion number is a measure of how much stack (or, if
|
||||
PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to
|
||||
complete the match attempt.
|
||||
quickly with increasing length of subject string. The depth_limit num-
|
||||
ber is a measure of how much memory for recording backtracking points
|
||||
is needed to complete the match attempt.
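
The repeated-call idea behind find_limits can be sketched as a search for the smallest limit that still succeeds. The text above does not say which search strategy pcre2test uses, so the doubling followed by bisection here is an assumption, as are the names `find_min_limit`, `try_fn`, and `demo_try`. The sketch also assumes that a match which succeeds under one limit succeeds under every larger limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one match attempt under a given limit:
   returns 0 if the match completed, nonzero if the limit was too small. */
typedef int (*try_fn)(uint32_t limit, void *ctx);

/* Find the smallest limit in 1..max for which try_match() returns 0.
   Returns 0 if even max is not enough. */
uint32_t find_min_limit(try_fn try_match, void *ctx, uint32_t max)
{
    uint32_t lo = 1;
    uint32_t hi = 1;

    assert(try_match != NULL && max >= 1);

    /* Phase 1: double the limit until an attempt succeeds. */
    while (try_match(hi, ctx) != 0) {
        if (hi >= max)
            return 0;
        lo = hi + 1;                       /* everything up to hi failed */
        hi = (hi > max / 2) ? max : hi * 2;
    }

    /* Phase 2: bisect between the last failure and the first success. */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (try_match(mid, ctx) == 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return hi;
}

/* Demo stand-in: succeeds once the limit reaches the value stored in ctx. */
int demo_try(uint32_t limit, void *ctx)
{
    const uint32_t *needed = ctx;
    return (limit >= *needed) ? 0 : 1;
}
```

In a real program the callback would set the limit in a match context (for example with pcre2_set_match_limit() or pcre2_set_depth_limit()) and then call pcre2_match(), treating a limit-exceeded error as failure. The search would be run once per limit.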
|
||||
|
||||
Showing MARK names
|
||||
|
||||
|
||||
The mark modifier causes the names from backtracking control verbs that
|
||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||
it is added to the non-match message.
|
||||
|
||||
Showing memory usage
|
||||
|
||||
The memory modifier causes pcre2test to log all memory allocation and
|
||||
The memory modifier causes pcre2test to log all memory allocation and
|
||||
freeing calls that occur during a match operation.
|
||||
|
||||
Setting a starting offset
|
||||
|
||||
The offset modifier sets an offset in the subject string at which
|
||||
The offset modifier sets an offset in the subject string at which
|
||||
matching starts. Its value is a number of code units, not characters.
|
||||
|
||||
Setting an offset limit
|
||||
|
||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||
match cannot be found starting at or before this offset in the subject,
|
||||
a "no match" return is given. The data value is a number of code units,
|
||||
not characters. When this modifier is used, the use_offset_limit modi-
|
||||
not characters. When this modifier is used, the use_offset_limit modi-
|
||||
fier must have been set for the pattern; if not, an error is generated.
|
||||
|
||||
Setting the size of the output vector
|
||||
|
||||
The ovector modifier applies only to the subject line in which it
|
||||
appears, though of course it can also be used to set a default in a
|
||||
#subject command. It specifies the number of pairs of offsets that are
|
||||
The ovector modifier applies only to the subject line in which it
|
||||
appears, though of course it can also be used to set a default in a
|
||||
#subject command. It specifies the number of pairs of offsets that are
|
||||
available for storing matching information. The default is 15.
|
||||
|
||||
A value of zero is useful when testing the POSIX API because it causes
|
||||
A value of zero is useful when testing the POSIX API because it causes
|
||||
regexec() to be called with a NULL capture vector. When not testing the
|
||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||
ate_from_pattern() to be called, in order to create a match block of
|
||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||
ate_from_pattern() to be called, in order to create a match block of
|
||||
exactly the right size for the pattern. (It is not possible to create a
|
||||
match block with a zero-length ovector; there is always at least one
|
||||
match block with a zero-length ovector; there is always at least one
|
||||
pair of offsets.)
|
||||
|
||||
Passing the subject as zero-terminated
|
||||
|
||||
By default, the subject string is passed to a native API matching func-
|
||||
tion with its correct length. In order to test the facility for passing
|
||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
||||
via the POSIX interface, this modifier has no effect, as there is no
|
||||
via the POSIX interface, this modifier has no effect, as there is no
|
||||
facility for passing a length.)
|
||||
|
||||
When testing pcre2_substitute(), this modifier also has the effect of
|
||||
When testing pcre2_substitute(), this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
|
||||
Passing a NULL context
|
||||
|
||||
Normally, pcre2test passes a context block to pcre2_match(),
|
||||
Normally, pcre2test passes a context block to pcre2_match(),
|
||||
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
|
||||
set, however, NULL is passed. This is for testing that the matching
|
||||
set, however, NULL is passed. This is for testing that the matching
|
||||
functions behave correctly in this case (they use default values). This
|
||||
modifier cannot be used with the find_limits modifier or when testing
|
||||
modifier cannot be used with the find_limits modifier or when testing
|
||||
the substitution function.
|
||||
|
||||
|
||||
THE ALTERNATIVE MATCHING FUNCTION
|
||||
|
||||
By default, pcre2test uses the standard PCRE2 matching function,
|
||||
By default, pcre2test uses the standard PCRE2 matching function,
|
||||
pcre2_match() to match each subject line. PCRE2 also supports an alter-
|
||||
native matching function, pcre2_dfa_match(), which operates in a dif-
|
||||
ferent way, and has some restrictions. The differences between the two
|
||||
native matching function, pcre2_dfa_match(), which operates in a dif-
|
||||
ferent way, and has some restrictions. The differences between the two
|
||||
functions are described in the pcre2matching documentation.
|
||||
|
||||
If the dfa modifier is set, the alternative matching function is used.
|
||||
This function finds all possible matches at a given point in the sub-
|
||||
ject. If, however, the dfa_shortest modifier is set, processing stops
|
||||
after the first match is found. This is always the shortest possible
|
||||
If the dfa modifier is set, the alternative matching function is used.
|
||||
This function finds all possible matches at a given point in the sub-
|
||||
ject. If, however, the dfa_shortest modifier is set, processing stops
|
||||
after the first match is found. This is always the shortest possible
|
||||
match.
|
||||
|
||||
|
||||
DEFAULT OUTPUT FROM pcre2test
|
||||
|
||||
This section describes the output when the normal matching function,
|
||||
This section describes the output when the normal matching function,
|
||||
pcre2_match(), is being used.
|
||||
|
||||
When a match succeeds, pcre2test outputs the list of captured sub-
|
||||
strings, starting with number 0 for the string that matched the whole
|
||||
pattern. Otherwise, it outputs "No match" when the return is
|
||||
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
||||
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
||||
this is the entire substring that was inspected during the partial
|
||||
match; it may include characters before the actual match start if a
|
||||
When a match succeeds, pcre2test outputs the list of captured sub-
|
||||
strings, starting with number 0 for the string that matched the whole
|
||||
pattern. Otherwise, it outputs "No match" when the return is
|
||||
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
||||
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
||||
this is the entire substring that was inspected during the partial
|
||||
match; it may include characters before the actual match start if a
|
||||
lookbehind assertion, \K, \b, or \B was involved.)
|
||||
|
||||
For any other return, pcre2test outputs the PCRE2 negative error number
|
||||
and a short descriptive phrase. If the error is a failed UTF string
|
||||
check, the code unit offset of the start of the failing character is
|
||||
and a short descriptive phrase. If the error is a failed UTF string
|
||||
check, the code unit offset of the start of the failing character is
|
||||
also output. Here is an example of an interactive pcre2test run.
|
||||
|
||||
$ pcre2test
|
||||
PCRE2 version 9.00 2014-05-10
|
||||
PCRE2 version 10.22 2016-07-29
|
||||
|
||||
re> /^abc(\d+)/
|
||||
data> abc123
|
||||
|
@ -1326,8 +1331,8 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
Unset capturing substrings that are not followed by one that is set are
|
||||
not shown by pcre2test unless the allcaptures modifier is specified. In
|
||||
the following example, there are two capturing substrings, but when the
|
||||
first data line is matched, the second, unset substring is not shown.
|
||||
An "internal" unset substring is shown as "<unset>", as for the second
|
||||
first data line is matched, the second, unset substring is not shown.
|
||||
An "internal" unset substring is shown as "<unset>", as for the second
|
||||
data line.
|
||||
|
||||
re> /(a)|(b)/
|
||||
|
@ -1339,11 +1344,11 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
1: <unset>
|
||||
2: b
|
||||
|
||||
If the strings contain any non-printing characters, they are output as
|
||||
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
||||
If the strings contain any non-printing characters, they are output as
|
||||
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
||||
Otherwise they are output as \x{hh...} escapes. See below for the defi-
|
||||
nition of non-printing characters. If the aftertext modifier is set,
|
||||
the output for substring 0 is followed by the rest of the subject
|
||||
nition of non-printing characters. If the aftertext modifier is set,
|
||||
the output for substring 0 is followed by the rest of the subject
|
||||
string, identified by "0+" like this:
|
||||
|
||||
re> /cat/aftertext
|
||||
|
@ -1351,7 +1356,7 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
0: cat
|
||||
0+ aract
|
||||
|
||||
If global matching is requested, the results of successive matching
|
||||
If global matching is requested, the results of successive matching
|
||||
attempts are output in sequence, like this:
|
||||
|
||||
re> /\Bi(\w\w)/g
|
||||
|
@ -1363,8 +1368,8 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
0: ipp
|
||||
1: pp
|
||||
|
||||
"No match" is output only if the first match attempt fails. Here is an
|
||||
example of a failure message (the offset 4 that is specified by the
|
||||
"No match" is output only if the first match attempt fails. Here is an
|
||||
example of a failure message (the offset 4 that is specified by the
|
||||
offset modifier is past the end of the subject string):
|
||||
|
||||
re> /xyz/
|
||||
|
@ -1372,7 +1377,7 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
Error -24 (bad offset value)
|
||||
|
||||
Note that whereas patterns can be continued over several lines (a plain
|
||||
">" prompt is used for continuations), subject lines may not. However
|
||||
">" prompt is used for continuations), subject lines may not. However
|
||||
newlines can be included in a subject by means of the \n escape (or \r,
|
||||
\r\n, etc., depending on the newline sequence setting).
|
||||
|
||||
|
@ -1380,7 +1385,7 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||
|
||||
When the alternative matching function, pcre2_dfa_match(), is used, the
|
||||
output consists of a list of all the matches that start at the first
|
||||
output consists of a list of all the matches that start at the first
|
||||
point in the subject where there is at least one match. For example:
|
||||
|
||||
re> /(tang|tangerine|tan)/
|
||||
|
@ -1389,11 +1394,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
|||
1: tang
|
||||
2: tan
|
||||
|
||||
Using the normal matching function on this data finds only "tang". The
|
||||
longest matching string is always given first (and numbered zero).
|
||||
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
||||
followed by the partially matching substring. Note that this is the
|
||||
entire substring that was inspected during the partial match; it may
|
||||
Using the normal matching function on this data finds only "tang". The
|
||||
longest matching string is always given first (and numbered zero).
|
||||
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
||||
followed by the partially matching substring. Note that this is the
|
||||
entire substring that was inspected during the partial match; it may
|
||||
include characters before the actual match start if a lookbehind asser-
|
||||
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
|
||||
|
||||
|
@ -1409,16 +1414,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
|||
1: tan
|
||||
0: tan
|
||||
|
||||
The alternative matching function does not support substring capture,
|
||||
so the modifiers that are concerned with captured substrings are not
|
||||
The alternative matching function does not support substring capture,
|
||||
so the modifiers that are concerned with captured substrings are not
|
||||
relevant.
|
||||
|
||||
|
||||
RESTARTING AFTER A PARTIAL MATCH
|
||||
|
||||
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
||||
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
||||
TIAL return, indicating that the subject partially matched the pattern,
|
||||
you can restart the match with additional subject data by means of the
|
||||
you can restart the match with additional subject data by means of the
|
||||
dfa_restart modifier. For example:
|
||||
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
|
@ -1427,45 +1432,45 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\=dfa,dfa_restart
0: n05

For further information about partial matching, see the pcre2partial
documentation.


CALLOUTS

If the pattern contains any callout requests, pcre2test's callout
function is called during matching unless callout_none is specified.
This works with both matching functions.

The callout function in pcre2test returns zero (carry on matching) by
default, but you can use a callout_fail modifier in a subject line (as
described above) to change this and other parameters of the callout.

Inserting callouts can be helpful when using pcre2test to check
complicated regular expressions. For further information about
callouts, see the pcre2callout documentation.

The output for callouts with numerical arguments and those with string
arguments is slightly different.

Callouts with numerical arguments

By default, the callout function displays the callout number, the start
and current positions in the subject text at the callout time, and the
next pattern item to be tested. For example:

--->pqrabcdef
  0    ^  ^     \d

This output indicates that callout number 0 occurred for a match
attempt starting at the fourth character of the subject string, when
the pointer was at the seventh character, and when the next pattern
item was \d. Just one circumflex is output if the start and current
positions are the same, or if the current position precedes the start
position, which can happen if the callout is in a lookbehind assertion.
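
The layout described above can be sketched in a few lines of Python.
This is only an illustration, not pcre2test's own code: the function
name callout_lines, the three-column number field and the spacing
before the pattern item are assumptions chosen to line up with the
"--->" prefix, and placing the single circumflex at the start position
is likewise an assumption.

```python
def callout_lines(subject, number, start, current, next_item):
    """Return the two display lines for one numbered callout.

    start and current are zero-based offsets into subject.
    """
    echo = "--->" + subject
    # One circumflex always marks the start of the match attempt.
    marks = [" "] * (max(start, current) + 1)
    marks[start] = "^"
    # A second circumflex appears only when the current position lies
    # beyond the start position.
    if current > start:
        marks[current] = "^"
    pointer = "".join(marks).ljust(len(subject))
    # Number field and separator widths are assumed, not taken from
    # the pcre2test source.
    return echo, f"{number:>3} {pointer}   {next_item}".rstrip()


for line in callout_lines("pqrabcdef", 0, 3, 6, r"\d"):
    print(line)
```

With start offset 3 and current offset 6, the circumflexes fall under
the fourth and seventh characters of the subject, as in the example.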

Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the /auto_callout pattern modifier. In this case, instead
of showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example:

@ -1479,7 +1484,7 @@ CALLOUTS
0: E*

If a pattern contains (*MARK) items, an additional line is output
whenever a change of latest mark is passed to the callout function. For
example:

re> /a(*MARK:X)bc/auto_callout
@ -1493,17 +1498,17 @@ CALLOUTS
+12 ^ ^
0: abc

The mark changes between matching "a" and "b", but stays the same for
the rest of the match, so nothing more is output. If, as a result of
backtracking, the mark reverts to being unset, the text "<unset>" is
output.

Callouts with string arguments

The output for a callout with a string argument is similar, except that
instead of outputting a callout number before the position indicators,
the callout string and its offset in the pattern string are output
before the reflection of the subject string, and the subject string is
reflected for each callout. For example:

re> /^ab(?C'first')cd(?C"second")ef/
@ -1520,43 +1525,43 @@ CALLOUTS
NON-PRINTING CHARACTERS

When pcre2test is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.

When pcre2test is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been
set for the pattern (using the locale modifier). In this case, the
isprint() function is used to distinguish printing and non-printing
characters.
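
The rule above can be illustrated with a short Python sketch. It is a
model of the described behaviour, not pcre2test's implementation: the
name show_bytes and the is_printing parameter are hypothetical, and the
exact spelling of the hex escape (\xhh here) is an assumption. The
optional predicate stands in for a locale-dependent isprint().

```python
def show_bytes(data, is_printing=None):
    """Render printing bytes as themselves and all others as hex escapes."""
    if is_printing is None:
        # Default rule from the text: only 32-126 count as printing.
        is_printing = lambda b: 32 <= b <= 126
    out = []
    for b in data:
        if is_printing(b):
            out.append(chr(b))
        else:
            # Escape spelling is an assumption made for this sketch.
            out.append(f"\\x{b:02x}")
    return "".join(out)


print(show_bytes(b"a\x00\x7f\xe9"))
```

Passing a different predicate models the case where a locale has been
set and more bytes are considered printing.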


SAVING AND RESTORING COMPILED PATTERNS

It is possible to save compiled patterns on disc or elsewhere, and
reload them later, subject to a number of restrictions. JIT data cannot
be saved. The host on which the patterns are reloaded must be running
the same version of PCRE2, with the same code unit width, and must also
have the same endianness, pointer width and PCRE2_SIZE type. Before
compiled patterns can be saved they must be serialized, that is,
converted to a stream of bytes. A single byte stream may contain any
number of compiled patterns, but they must all use the same character
tables. A single copy of the tables is included in the byte stream (its
size is 1088 bytes).

The functions whose names begin with pcre2_serialize_ are used for
serializing and de-serializing. They are described in the pcre2serialize
documentation. In this section we describe the features of pcre2test
that can be used to test these functions.

When a pattern with the push modifier is successfully compiled, it is
pushed onto a stack of compiled patterns, and pcre2test expects the
next line to contain a new pattern (or command) instead of a subject
line. By contrast, the pushcopy modifier causes a copy of the compiled
pattern to be stacked, leaving the original available for immediate
matching. By using push and/or pushcopy, a number of patterns can be
compiled and retained. These modifiers are incompatible with posix, and
control modifiers that act at match time are ignored (with a message)
for the stacked patterns. The jitverify modifier applies only at
compile time.

The command

@ -1564,21 +1569,21 @@ SAVING AND RESTORING COMPILED PATTERNS
#save <filename>

causes all the stacked patterns to be serialized and the result written
to the named file. Afterwards, all the stacked patterns are freed. The
command

#load <filename>

reads the data in the file, and then arranges for it to be
de-serialized, with the resulting compiled patterns added to the
pattern stack. The pattern on the top of the stack can be retrieved by
the #pop command, which must be followed by lines of subjects that are
to be matched with the pattern, terminated as usual by an empty line or
end of file. This command may be followed by a modifier list containing
only control modifiers that act after a pattern has been compiled. In
particular, hex, posix, posix_nosub, push, and pushcopy are not
allowed, nor are any option-setting modifiers. The JIT modifiers are,
however, permitted. Here is an example that saves and reloads two
patterns.

/abc/push
@ -1591,10 +1596,10 @@ SAVING AND RESTORING COMPILED PATTERNS
#pop jit,bincode
abc

If jitverify is used with #pop, it does not automatically imply jit,
which is different behaviour from when it is used on a pattern.

The #popcopy command is analogous to the pushcopy modifier in that it
makes current a copy of the topmost stack pattern, leaving the original
still on the stack.
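
The stack behaviour described in this section can be modelled with a
toy Python class. This is a sketch under stated assumptions, not a
description of how pcre2test is implemented: the class PatternStack and
its method names are hypothetical, patterns are plain strings, JSON
stands in for the pcre2_serialize_ functions, and loaded patterns are
assumed to land on the stack in the order in which they were saved.

```python
import copy
import json


class PatternStack:
    """Toy model of a stack of compiled patterns."""

    def __init__(self):
        self.stack = []

    def push(self, pattern):
        # push: the pattern goes onto the stack.
        self.stack.append(pattern)

    def pushcopy(self, pattern):
        # pushcopy: a copy is stacked; the original stays usable.
        self.stack.append(copy.deepcopy(pattern))
        return pattern

    def pop(self):
        # #pop: remove the top pattern and make it current.
        return self.stack.pop()

    def popcopy(self):
        # #popcopy: make current a copy of the top pattern, leaving
        # the original on the stack.
        return copy.deepcopy(self.stack[-1])

    def save(self):
        # #save: serialize every stacked pattern, then free the stack.
        data = json.dumps(self.stack).encode("utf-8")
        self.stack.clear()
        return data

    def load(self, data):
        # #load: de-serialize and add the patterns to the stack.
        self.stack.extend(json.loads(data.decode("utf-8")))


s = PatternStack()
s.push("abc")
s.push("def")
blob = s.save()      # the stack is empty after saving
s.load(blob)
top = s.popcopy()    # the top pattern is copied, not removed
```

The difference between pop and popcopy in the model mirrors the
difference between push and pushcopy: only the non-copy forms change
which object the stack holds.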
@ -1614,5 +1619,5 @@ AUTHOR

REVISION

-Last updated: 28 December 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 21 March 2017
+Copyright (c) 1997-2017 University of Cambridge.