Documentation update.
parent
380738d981
commit
7fe5e441ff
|
@ -23,18 +23,18 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC8" href="#SEC8">NEWLINE RECOGNITION</a>
|
<li><a name="TOC8" href="#SEC8">NEWLINE RECOGNITION</a>
|
||||||
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
|
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
|
||||||
<li><a name="TOC10" href="#SEC10">HANDLING VERY LARGE PATTERNS</a>
|
<li><a name="TOC10" href="#SEC10">HANDLING VERY LARGE PATTERNS</a>
|
||||||
<li><a name="TOC11" href="#SEC11">AVOIDING EXCESSIVE STACK USAGE</a>
|
<li><a name="TOC11" href="#SEC11">LIMITING PCRE2 RESOURCE USAGE</a>
|
||||||
<li><a name="TOC12" href="#SEC12">LIMITING PCRE2 RESOURCE USAGE</a>
|
<li><a name="TOC12" href="#SEC12">CREATING CHARACTER TABLES AT BUILD TIME</a>
|
||||||
<li><a name="TOC13" href="#SEC13">CREATING CHARACTER TABLES AT BUILD TIME</a>
|
<li><a name="TOC13" href="#SEC13">USING EBCDIC CODE</a>
|
||||||
<li><a name="TOC14" href="#SEC14">USING EBCDIC CODE</a>
|
<li><a name="TOC14" href="#SEC14">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
|
||||||
<li><a name="TOC15" href="#SEC15">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
|
<li><a name="TOC15" href="#SEC15">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
|
||||||
<li><a name="TOC16" href="#SEC16">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
|
<li><a name="TOC16" href="#SEC16">PCRE2GREP BUFFER SIZE</a>
|
||||||
<li><a name="TOC17" href="#SEC17">PCRE2GREP BUFFER SIZE</a>
|
<li><a name="TOC17" href="#SEC17">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
|
||||||
<li><a name="TOC18" href="#SEC18">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
|
<li><a name="TOC18" href="#SEC18">INCLUDING DEBUGGING CODE</a>
|
||||||
<li><a name="TOC19" href="#SEC19">INCLUDING DEBUGGING CODE</a>
|
<li><a name="TOC19" href="#SEC19">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||||
<li><a name="TOC20" href="#SEC20">DEBUGGING WITH VALGRIND SUPPORT</a>
|
<li><a name="TOC20" href="#SEC20">CODE COVERAGE REPORTING</a>
|
||||||
<li><a name="TOC21" href="#SEC21">CODE COVERAGE REPORTING</a>
|
<li><a name="TOC21" href="#SEC21">SUPPORT FOR FUZZERS</a>
|
||||||
<li><a name="TOC22" href="#SEC22">SUPPORT FOR FUZZERS</a>
|
<li><a name="TOC22" href="#SEC22">OBSOLETE OPTION</a>
|
||||||
<li><a name="TOC23" href="#SEC23">SEE ALSO</a>
|
<li><a name="TOC23" href="#SEC23">SEE ALSO</a>
|
||||||
<li><a name="TOC24" href="#SEC24">AUTHOR</a>
|
<li><a name="TOC24" href="#SEC24">AUTHOR</a>
|
||||||
<li><a name="TOC25" href="#SEC25">REVISION</a>
|
<li><a name="TOC25" href="#SEC25">REVISION</a>
|
||||||
|
@ -78,11 +78,11 @@ running
|
||||||
<pre>
|
<pre>
|
||||||
./configure --help
|
./configure --help
|
||||||
</pre>
|
</pre>
|
||||||
The following sections include descriptions of options whose names begin with
|
The following sections include descriptions of "on/off" options whose names
|
||||||
--enable or --disable. These settings specify changes to the defaults for the
|
begin with --enable or --disable. Because of the way that <b>configure</b>
|
||||||
<b>configure</b> command. Because of the way that <b>configure</b> works,
|
works, --enable and --disable always come in pairs, so the complementary option
|
||||||
--enable and --disable always come in pairs, so the complementary option always
|
always exists as well, but as it specifies the default, it is not described.
|
||||||
exists as well, but as it specifies the default, it is not described.
|
Options that specify values have names that start with --with.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC3" href="#TOC1">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
|
<br><a name="SEC3" href="#TOC1">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -138,10 +138,10 @@ locked this out by setting PCRE2_NEVER_UTF.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
0x10ffff in the strings that they handle. It also provides support for
|
0x10ffff in the strings that they handle. Unicode support also gives access to
|
||||||
accessing the Unicode properties of such characters, using pattern escapes such
|
the Unicode properties of characters, using pattern escapes such as \P, \p,
|
||||||
as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
|
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
|
||||||
<i>Nd</i> are supported. Details are given in the
|
supported. Details are given in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation.
|
documentation.
|
||||||
</P>
|
</P>
|
||||||
|
@ -165,7 +165,7 @@ out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
Just-in-time compiler support is included in the build by specifying
|
Just-in-time (JIT) compiler support is included in the build by specifying
|
||||||
<pre>
|
<pre>
|
||||||
--enable-jit
|
--enable-jit
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -227,7 +227,7 @@ specify
|
||||||
</pre>
|
</pre>
|
||||||
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
|
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
|
||||||
selected when PCRE2 is built can be overridden by applications that use the
|
selected when PCRE2 is built can be overridden by applications that use the
|
||||||
called.
|
library.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC10" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
<br><a name="SEC10" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -248,36 +248,12 @@ longer offsets slows down the operation of PCRE2 because it has to load
|
||||||
additional data when handling them. For the 32-bit library the value is always
|
additional data when handling them. For the 32-bit library the value is always
|
||||||
4 and cannot be overridden; the value of --with-link-size is ignored.
|
4 and cannot be overridden; the value of --with-link-size is ignored.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC11" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
|
<br><a name="SEC11" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
|
||||||
<P>
|
<P>
|
||||||
When matching with the <b>pcre2_match()</b> function, PCRE2 implements
|
The <b>pcre2_match()</b> function increments a counter each time it goes round
|
||||||
backtracking by making recursive calls to an internal function called
|
its main loop. Putting a limit on this counter controls the amount of computing
|
||||||
<b>match()</b>. In environments where the size of the stack is limited, this can
|
resource used by a single call to <b>pcre2_match()</b>. The limit can be changed
|
||||||
severely limit PCRE2's operation. (The Unix environment does not usually suffer
|
at run time, as described in the
|
||||||
from this problem, but it may sometimes be necessary to increase the maximum
|
|
||||||
stack size. There is a discussion in the
|
|
||||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
|
||||||
documentation.) An alternative approach to recursion that uses memory from the
|
|
||||||
heap to remember data, instead of using recursive function calls, has been
|
|
||||||
implemented to work round the problem of limited stack size. If you want to
|
|
||||||
build a version of PCRE2 that works this way, add
|
|
||||||
<pre>
|
|
||||||
--disable-stack-for-recursion
|
|
||||||
</pre>
|
|
||||||
to the <b>configure</b> command. By default, the system functions <b>malloc()</b>
|
|
||||||
and <b>free()</b> are called to manage the heap memory that is required, but
|
|
||||||
custom memory management functions can be called instead. PCRE2 runs noticeably
|
|
||||||
more slowly when built in this way. This option affects only the
|
|
||||||
<b>pcre2_match()</b> function; it is not relevant for <b>pcre2_dfa_match()</b>.
|
|
||||||
</P>
|
|
||||||
<br><a name="SEC12" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
|
|
||||||
<P>
|
|
||||||
Internally, PCRE2 has a function called <b>match()</b>, which it calls
|
|
||||||
repeatedly (sometimes recursively) when matching a pattern with the
|
|
||||||
<b>pcre2_match()</b> function. By controlling the maximum number of times this
|
|
||||||
function may be called during a single matching operation, a limit can be
|
|
||||||
placed on the resources used by a single call to <b>pcre2_match()</b>. The limit
|
|
||||||
can be changed at run time, as described in the
|
|
||||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
documentation. The default is 10 million, but this can be changed by adding a
|
documentation. The default is 10 million, but this can be changed by adding a
|
||||||
setting such as
|
setting such as
|
||||||
|
@ -285,21 +261,23 @@ setting such as
|
||||||
--with-match-limit=500000
|
--with-match-limit=500000
|
||||||
</pre>
|
</pre>
|
||||||
to the <b>configure</b> command. This setting has no effect on the
|
to the <b>configure</b> command. This setting has no effect on the
|
||||||
<b>pcre2_dfa_match()</b> matching function.
|
<b>pcre2_dfa_match()</b> matching function, but it does also limit JIT matching
|
||||||
|
(though the counting is done differently).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In some environments it is desirable to limit the depth of recursive calls of
|
In some environments it is desirable to limit the depth of nested backtracking
|
||||||
<b>match()</b> more strictly than the total number of calls, in order to
|
in order to restrict the maximum amount of heap memory that is used. A second
|
||||||
restrict the maximum amount of stack (or heap, if --disable-stack-for-recursion
|
limit controls this; it defaults to the value that is set for
|
||||||
is specified) that is used. A second limit controls this; it defaults to the
|
--with-match-limit. You can set a lower default limit by adding, for example,
|
||||||
value that is set for --with-match-limit, which imposes no additional
|
|
||||||
constraints. However, you can set a lower limit by adding, for example,
|
|
||||||
<pre>
|
<pre>
|
||||||
--with-match-limit-recursion=10000
|
  --with-match-limit-depth=10000
|
||||||
</pre>
|
</pre>
|
||||||
to the <b>configure</b> command. This value can also be overridden at run time.
|
to the <b>configure</b> command. This value can also be overridden at run time.
|
||||||
|
As well as applying to <b>pcre2_match()</b>, this limit also controls the depth
|
||||||
|
of recursive function calls in <b>pcre2_dfa_match()</b>. These are used for
|
||||||
|
lookaround assertions and recursion within patterns.
|
||||||
</P>
|
</P>
|
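The two limits described above can be pictured with a toy backtracking matcher. This is a minimal sketch and not PCRE2's implementation: the names toy_match, toy_here, toy_limits and the TOY_* result codes are hypothetical, and the pattern syntax is restricted to literals, '.', and postfix '*'. Every matching step is counted against one limit, and the nesting depth is checked against a second limit. In a real application the corresponding run-time settings are made with pcre2_set_match_limit() and pcre2_set_depth_limit() on a match context, as described in the pcre2api documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this toy matcher (not PCRE2 error codes). */
enum { TOY_NOMATCH = 0, TOY_MATCH = 1, TOY_MATCHLIMIT = -1, TOY_DEPTHLIMIT = -2 };

typedef struct {
    unsigned long steps;        /* incremented on every matching step */
    unsigned long match_limit;  /* cap on the total number of steps */
    unsigned long depth_limit;  /* cap on the nesting depth */
} toy_limits;

/* Match pattern p (literals, '.', postfix '*') at the start of s. */
static int toy_here(const char *p, const char *s, unsigned long depth,
                    toy_limits *lim)
{
    if (++lim->steps > lim->match_limit) return TOY_MATCHLIMIT;
    if (depth > lim->depth_limit) return TOY_DEPTHLIMIT;
    if (p[0] == '\0') return TOY_MATCH;

    if (p[1] == '*') {
        /* Try the rest of the pattern after zero, one, two, ... repeats
           of p[0]; each failed try is a backtracking point. */
        for (;;) {
            int rc = toy_here(p + 2, s, depth + 1, lim);
            if (rc != TOY_NOMATCH) return rc;  /* match or a limit error */
            if (*s == '\0' || (p[0] != '.' && p[0] != *s)) return TOY_NOMATCH;
            s++;
        }
    }

    if (*s != '\0' && (p[0] == '.' || p[0] == *s))
        return toy_here(p + 1, s + 1, depth + 1, lim);
    return TOY_NOMATCH;
}

/* Anchored match of pattern against subject under both limits. */
int toy_match(const char *pattern, const char *subject,
              unsigned long match_limit, unsigned long depth_limit)
{
    toy_limits lim = { 0, match_limit, depth_limit };
    assert(pattern != NULL && subject != NULL);
    return toy_here(pattern, subject, 0, &lim);
}
```

For example, toy_match("a*b", "aaab", 1000, 1000) reports a match, while the same call with a match limit of 3 stops early with TOY_MATCHLIMIT, and toy_match("abc", "abc", 1000, 1) stops with TOY_DEPTHLIMIT.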
||||||
<br><a name="SEC13" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
|
<br><a name="SEC12" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
|
||||||
<P>
|
<P>
|
||||||
PCRE2 uses fixed tables for processing characters whose code points are less
|
PCRE2 uses fixed tables for processing characters whose code points are less
|
||||||
than 256. By default, PCRE2 is built with a set of tables that are distributed
|
than 256. By default, PCRE2 is built with a set of tables that are distributed
|
||||||
|
@ -311,12 +289,12 @@ only. If you add
|
||||||
to the <b>configure</b> command, the distributed tables are no longer used.
|
to the <b>configure</b> command, the distributed tables are no longer used.
|
||||||
Instead, a program called <b>dftables</b> is compiled and run. This outputs the
|
Instead, a program called <b>dftables</b> is compiled and run. This outputs the
|
||||||
source for a new set of tables, created in the default locale of your C run-time
|
source for a new set of tables, created in the default locale of your C run-time
|
||||||
system. (This method of replacing the tables does not work if you are cross
|
system. This method of replacing the tables does not work if you are cross
|
||||||
compiling, because <b>dftables</b> is run on the local host. If you need to
|
compiling, because <b>dftables</b> is run on the local host. If you need to
|
||||||
create alternative tables when cross compiling, you will have to do so "by
|
create alternative tables when cross compiling, you will have to do so "by
|
||||||
hand".)
|
hand".
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC14" href="#TOC1">USING EBCDIC CODE</a><br>
|
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
|
||||||
<P>
|
<P>
|
||||||
PCRE2 assumes by default that it will run in an environment where the character
|
PCRE2 assumes by default that it will run in an environment where the character
|
||||||
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
||||||
|
@ -351,7 +329,7 @@ The options that select newline behaviour, such as --enable-newline-is-cr,
|
||||||
and equivalent run-time options, refer to these character values in an EBCDIC
|
and equivalent run-time options, refer to these character values in an EBCDIC
|
||||||
environment.
|
environment.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC15" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
|
<br><a name="SEC14" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
By default, on non-Windows systems, <b>pcre2grep</b> supports the use of
|
By default, on non-Windows systems, <b>pcre2grep</b> supports the use of
|
||||||
callouts with string arguments within the patterns it is matching, in order to
|
callouts with string arguments within the patterns it is matching, in order to
|
||||||
|
@ -360,7 +338,7 @@ run external scripts. For details, see the
|
||||||
documentation. This support can be disabled by adding
|
documentation. This support can be disabled by adding
|
||||||
--disable-pcre2grep-callout to the <b>configure</b> command.
|
--disable-pcre2grep-callout to the <b>configure</b> command.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC16" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
|
<br><a name="SEC15" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
|
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
|
||||||
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
|
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
|
||||||
|
@ -373,7 +351,7 @@ to the <b>configure</b> command. These options naturally require that the
|
||||||
relevant libraries are installed on your system. Configuration will fail if
|
relevant libraries are installed on your system. Configuration will fail if
|
||||||
they are not.
|
they are not.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC17" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
|
<br><a name="SEC16" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
|
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
|
||||||
scanning, in order to be able to output "before" and "after" lines when it
|
scanning, in order to be able to output "before" and "after" lines when it
|
||||||
|
@ -391,7 +369,7 @@ the larger. You can change the default parameter values by adding, for example,
|
||||||
to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override
|
to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override
|
||||||
these values by using --buffer-size and --max-buffer-size on the command line.
|
these values by using --buffer-size and --max-buffer-size on the command line.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC18" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
<br><a name="SEC17" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
If you add one of
|
If you add one of
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -425,7 +403,7 @@ automatically included, you may need to add something like
|
||||||
</pre>
|
</pre>
|
||||||
immediately before the <b>configure</b> command.
|
immediately before the <b>configure</b> command.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC19" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
|
<br><a name="SEC18" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
|
||||||
<P>
|
<P>
|
||||||
If you add
|
If you add
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -434,7 +412,7 @@ If you add
|
||||||
to the <b>configure</b> command, additional debugging code is included in the
|
to the <b>configure</b> command, additional debugging code is included in the
|
||||||
build. This feature is intended for use by the PCRE2 maintainers.
|
build. This feature is intended for use by the PCRE2 maintainers.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC20" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
<br><a name="SEC19" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
If you add
|
If you add
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -444,7 +422,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
|
||||||
certain memory regions as unaddressable. This allows it to detect invalid
|
certain memory regions as unaddressable. This allows it to detect invalid
|
||||||
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
<br><a name="SEC20" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||||
<P>
|
<P>
|
||||||
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
|
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
|
||||||
code coverage report for its test suite. To enable this, you must install
|
code coverage report for its test suite. To enable this, you must install
|
||||||
|
@ -501,7 +479,7 @@ This cleans all coverage data including the generated coverage report. For more
|
||||||
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
|
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
|
||||||
documentation.
|
documentation.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC22" href="#TOC1">SUPPORT FOR FUZZERS</a><br>
|
<br><a name="SEC21" href="#TOC1">SUPPORT FOR FUZZERS</a><br>
|
||||||
<P>
|
<P>
|
||||||
There is a special option for use by people who want to run fuzzing tests on
|
There is a special option for use by people who want to run fuzzing tests on
|
||||||
PCRE2:
|
PCRE2:
|
||||||
|
@ -514,13 +492,28 @@ contains a single function called LLVMFuzzerTestOneInput() whose arguments are
|
||||||
a pointer to a string and the length of the string. When called, this function
|
a pointer to a string and the length of the string. When called, this function
|
||||||
tries to compile the string as a pattern, and if that succeeds, to match it.
|
tries to compile the string as a pattern, and if that succeeds, to match it.
|
||||||
This is done both with no options and with some random options bits that are
|
This is done both with no options and with some random options bits that are
|
||||||
generated from the string. Setting --enable-fuzz-support also causes a binary
|
generated from the string.
|
||||||
called <b>pcre2fuzzcheck</b> to be created. This is normally run under valgrind
|
</P>
|
||||||
or used when PCRE2 is compiled with address sanitizing enabled. It calls the
|
<P>
|
||||||
fuzzing function and outputs information about what it is doing. The input strings
|
Setting --enable-fuzz-support also causes a binary called <b>pcre2fuzzcheck</b>
|
||||||
are specified by arguments: if an argument starts with "=" the rest of it is a
|
to be created. This is normally run under valgrind or used when PCRE2 is
|
||||||
literal input string. Otherwise, it is assumed to be a file name, and the
|
compiled with address sanitizing enabled. It calls the fuzzing function and
|
||||||
contents of the file are the test string.
|
outputs information about what it is doing. The input strings are specified by
|
||||||
|
arguments: if an argument starts with "=" the rest of it is a literal input
|
||||||
|
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||||
|
file are the test string.
|
||||||
|
</P>
|
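The argument convention described for pcre2fuzzcheck can be sketched as a small helper. This is an illustration only, assuming nothing about the actual pcre2fuzzcheck source: the names sketch_input_kind and sketch_classify_arg are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of one command-line argument. */
typedef enum { SKETCH_INPUT_LITERAL, SKETCH_INPUT_FILE } sketch_input_kind;

/* If arg starts with '=', the rest is a literal input string;
   otherwise the whole argument is taken to be a file name.
   *value is set to the string or file name respectively. */
sketch_input_kind sketch_classify_arg(const char *arg, const char **value)
{
    assert(arg != NULL && value != NULL);
    if (arg[0] == '=') {
        *value = arg + 1;            /* skip the leading '=' */
        return SKETCH_INPUT_LITERAL;
    }
    *value = arg;
    return SKETCH_INPUT_FILE;
}
```

With this helper, the argument "=a(b)*c" would be treated as the literal test string a(b)*c, whereas "patterns.txt" would be treated as the name of a file whose contents are the test string.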
||||||
|
<br><a name="SEC22" href="#TOC1">OBSOLETE OPTION</a><br>
|
||||||
|
<P>
|
||||||
|
In versions of PCRE2 prior to 10.30, there were two ways of handling
|
||||||
|
backtracking in the <b>pcre2_match()</b> function. The default was to use the
|
||||||
|
system stack, but if
|
||||||
|
<pre>
|
||||||
|
--disable-stack-for-recursion
|
||||||
|
</pre>
|
||||||
|
was set, memory on the heap was used. From release 10.30 onwards this has
|
||||||
|
changed (the stack is no longer used) and this option now does nothing except
|
||||||
|
give a warning.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -537,9 +530,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC25" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC25" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 01 November 2016
|
Last updated: 29 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -57,8 +57,8 @@ two callout points:
|
||||||
</pre>
|
</pre>
|
||||||
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
||||||
automatically inserts callouts, all with number 255, before each item in the
|
automatically inserts callouts, all with number 255, before each item in the
|
||||||
pattern except for immediately before or after a callout item in the pattern.
|
pattern except for immediately before or after an explicit callout. For
|
||||||
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||||
<pre>
|
<pre>
|
||||||
A(?C3)B
|
A(?C3)B
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -71,11 +71,9 @@ Here is a more complicated example:
|
||||||
A(\d{2}|--)
|
A(\d{2}|--)
|
||||||
</pre>
|
</pre>
|
||||||
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
||||||
<br>
|
<pre>
|
||||||
<br>
|
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
||||||
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
</pre>
|
||||||
<br>
|
|
||||||
<br>
|
|
||||||
Notice that there is a callout before and after each parenthesis and
|
Notice that there is a callout before and after each parenthesis and
|
||||||
alternation bar. If the pattern contains a conditional group whose condition is
|
alternation bar. If the pattern contains a conditional group whose condition is
|
||||||
an assertion, an automatic callout is inserted immediately before the
|
an assertion, an automatic callout is inserted immediately before the
|
||||||
|
@ -140,10 +138,14 @@ By default, an optimization is applied when .* is the first significant item in
|
||||||
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
||||||
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
||||||
start only after an internal newline or at the beginning of the subject, and
|
start only after an internal newline or at the beginning of the subject, and
|
||||||
<b>pcre2_compile()</b> remembers this. This optimization is disabled, however,
|
<b>pcre2_compile()</b> remembers this. If a pattern has more than one top-level
|
||||||
if .* is in an atomic group or if there is a back reference to the capturing
|
branch, automatic anchoring occurs if all branches are anchorable.
|
||||||
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
|
</P>
|
||||||
or (*SKIP). However, the presence of callouts does not affect it.
|
<P>
|
||||||
|
This optimization is disabled, however, if .* is in an atomic group or if there
|
||||||
|
is a back reference to the capturing group in which it appears. It is also
|
||||||
|
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
|
||||||
|
callouts does not affect it.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
|
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
|
||||||
|
@ -175,10 +177,6 @@ This shows more match attempts, starting at the second subject character.
|
||||||
Another optimization, described in the next section, means that there is no
|
Another optimization, described in the next section, means that there is no
|
||||||
subsequent attempt to match with an empty subject.
|
subsequent attempt to match with an empty subject.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
|
||||||
If a pattern has more than one top-level branch, automatic anchoring occurs if
|
|
||||||
all branches are anchorable.
|
|
||||||
</P>
|
|
||||||
<br><b>
|
<br><b>
|
||||||
Other optimizations
|
Other optimizations
|
||||||
</b><br>
|
</b><br>
|
||||||
|
@ -194,9 +192,10 @@ start, and the callout is never reached. However, with "abyd", though the
|
||||||
result is still no match, the callout is obeyed.
|
result is still no match, the callout is obeyed.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
PCRE2 also knows the minimum length of a matching string, and will immediately
|
For most patterns PCRE2 also knows the minimum length of a matching string, and
|
||||||
give a "no match" return without actually running a match if the subject is not
|
will immediately give a "no match" return without actually running a match if
|
||||||
long enough, or, for unanchored patterns, if it has been scanned far enough.
|
the subject is not long enough, or, for unanchored patterns, if it has been
|
||||||
|
scanned far enough.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
|
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
|
||||||
|
@ -276,12 +275,41 @@ The remaining fields in the callout block are the same for both kinds of
|
||||||
callout.
|
callout.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
|
The <i>offset_vector</i> field is a pointer to a vector of capturing offsets
|
||||||
(the "ovector") that was passed to the matching function in the match data
|
(the "ovector"). You may read certain elements in this vector, but you must not
|
||||||
block. When <b>pcre2_match()</b> is used, the contents can be inspected in
|
change any of them.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
For calls to <b>pcre2_match()</b>, the <i>offset_vector</i> field is not (since
|
||||||
|
release 10.30) a pointer to the actual ovector that was passed to the matching
|
||||||
|
function in the match data block. Instead it points to an internal ovector of a
|
||||||
|
size large enough to hold all possible captured substrings in the pattern. Note
|
||||||
|
that whenever a recursion or subroutine call within a pattern completes, the
|
||||||
|
capturing state is reset to what it was before.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The <i>capture_last</i> field contains the number of the most recently captured
|
||||||
|
substring, and the <i>capture_top</i> field contains one more than the number of
|
||||||
|
the highest numbered captured substring so far. If no substrings have yet been
|
||||||
|
captured, the value of <i>capture_last</i> is 0 and the value of
|
||||||
|
<i>capture_top</i> is 1. The values of these fields do not always differ by one;
|
||||||
|
for example, when the callout in the pattern ((a)(b))(?C2) is taken,
|
||||||
|
<i>capture_last</i> is 1 but <i>capture_top</i> is 4.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The contents of ovector[2] to ovector[<capture_top>*2-1] can be inspected in
|
||||||
order to extract substrings that have been matched so far, in the same way as
|
order to extract substrings that have been matched so far, in the same way as
|
||||||
for extracting substrings after a match has completed. For the DFA matching
|
extracting substrings after a match has completed. The values in ovector[0] and
|
||||||
function, this field is not useful.
|
ovector[1] are undefined and should not be used in any way. Substrings that
|
||||||
|
have not been captured (but whose numbers are less than <i>capture_top</i>) have
|
||||||
|
both of their ovector slots set to PCRE2_UNSET.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
For DFA matching, the <i>offset_vector</i> field points to the ovector that was
|
||||||
|
passed to the matching function in the match data block, but it holds no useful
|
||||||
|
information at callout time because <b>pcre2_dfa_match()</b> does not support
|
||||||
|
substring capturing. The value of <i>capture_top</i> is always 1 and the value
|
||||||
|
of <i>capture_last</i> is always 0 for DFA matching.
|
||||||
</P>
|
</P>
|
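The rules above for reading the ovector inside a callout can be sketched as follows. This is a minimal, self-contained illustration: sketch_callout_block is a hypothetical stand-in that mirrors only the fields discussed here (it is not the real pcre2_callout_block), SKETCH_UNSET stands in for PCRE2_UNSET and is assumed to be the all-ones size value, and sketch_copy_group is an invented helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET (assumed here to be the all-ones value). */
#define SKETCH_UNSET (~(size_t)0)

/* Hypothetical mirror of the callout fields discussed in the text. */
typedef struct {
    const char   *subject;        /* the subject string */
    const size_t *offset_vector;  /* read-only inside a callout */
    unsigned int  capture_top;    /* one more than highest group so far */
    unsigned int  capture_last;   /* most recently captured group */
} sketch_callout_block;

/* Copy captured group n into buf as a NUL-terminated string.
   Returns the substring length, or -1 if the group is not available:
   group 0 (ovector[0] and ovector[1] are undefined during a callout),
   a group number >= capture_top, an unset group, or a buffer that is
   too small. */
long sketch_copy_group(const sketch_callout_block *cb, unsigned int n,
                       char *buf, size_t buflen)
{
    size_t start, end, len;

    assert(cb != NULL && buf != NULL && buflen > 0);
    if (n == 0 || n >= cb->capture_top) return -1;

    start = cb->offset_vector[2 * n];
    end = cb->offset_vector[2 * n + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET) return -1;
    if (end < start) return -1;

    len = end - start;
    if (len >= buflen) return -1;
    memcpy(buf, cb->subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

Taking the pattern ((a)(b))(?C2) from the text with the subject "ab", capture_top is 4 at the callout, so groups 1 to 3 can be read this way, giving "ab", "a", and "b"; group 0 is rejected because its slots are undefined at callout time.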
||||||
<P>
|
<P>
|
||||||
The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
|
The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
|
||||||
|
@ -300,20 +328,6 @@ The <i>current_position</i> field contains the offset within the subject of the
|
||||||
current match pointer.
|
current match pointer.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When <b>pcre2_match()</b> is used, the <i>capture_top</i> field contains one
|
|
||||||
more than the number of the highest numbered captured substring so far. If no
|
|
||||||
substrings have been captured, the value of <i>capture_top</i> is one. This is
|
|
||||||
always the case when the DFA functions are used, because they do not support
|
|
||||||
captured substrings.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
The <i>capture_last</i> field contains the number of the most recently captured
|
|
||||||
substring. However, when a recursion exits, the value reverts to what it was
|
|
||||||
outside the recursion, as do the values of all captured substrings. If no
|
|
||||||
substrings have been captured, the value of <i>capture_last</i> is 0. This is
|
|
||||||
always the case for the DFA matching functions.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
The <i>pattern_position</i> field contains the offset in the pattern string to
|
The <i>pattern_position</i> field contains the offset in the pattern string to
|
||||||
the next item to be matched.
|
the next item to be matched.
|
||||||
</P>
|
</P>
|
||||||
|
@ -413,9 +427,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 29 September 2016
|
Last updated: 29 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
362
doc/pcre2.txt
|
@ -3221,12 +3221,12 @@ PCRE2 BUILD-TIME OPTIONS
|
||||||
|
|
||||||
./configure --help
|
./configure --help
|
||||||
|
|
||||||
The following sections include descriptions of options whose names
|
The following sections include descriptions of "on/off" options whose
|
||||||
begin with --enable or --disable. These settings specify changes to the
|
names begin with --enable or --disable. Because of the way that config-
|
||||||
defaults for the configure command. Because of the way that configure
|
ure works, --enable and --disable always come in pairs, so the comple-
|
||||||
works, --enable and --disable always come in pairs, so the complemen-
|
mentary option always exists as well, but as it specifies the default,
|
||||||
tary option always exists as well, but as it specifies the default, it
|
it is not described. Options that specify values have names that start
|
||||||
is not described.
|
with --with.
|
||||||
|
|
||||||
|
|
||||||
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
||||||
|
@ -3283,11 +3283,11 @@ UNICODE AND UTF SUPPORT
|
||||||
application has locked this out by setting PCRE2_NEVER_UTF.
|
application has locked this out by setting PCRE2_NEVER_UTF.
|
||||||
|
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
0x10ffff in the strings that they handle. It also provides support for
|
0x10ffff in the strings that they handle. Unicode support also gives
|
||||||
accessing the Unicode properties of such characters, using pattern
|
access to the Unicode properties of characters, using pattern escapes
|
||||||
escapes such as \P, \p, and \X. Only the general category properties
|
such as \P, \p, and \X. Only the general category properties such as Lu
|
||||||
such as Lu and Nd are supported. Details are given in the pcre2pattern
|
and Nd are supported. Details are given in the pcre2pattern documenta-
|
||||||
documentation.
|
tion.
|
||||||
|
|
||||||
Pattern escapes such as \d and \w do not by default make use of Unicode
|
Pattern escapes such as \d and \w do not by default make use of Unicode
|
||||||
properties. The application can request that they do by setting the
|
properties. The application can request that they do by setting the
|
||||||
|
@ -3310,14 +3310,15 @@ DISABLING THE USE OF \C
|
||||||
|
|
||||||
JUST-IN-TIME COMPILER SUPPORT
|
JUST-IN-TIME COMPILER SUPPORT
|
||||||
|
|
||||||
Just-in-time compiler support is included in the build by specifying
|
Just-in-time (JIT) compiler support is included in the build by speci-
|
||||||
|
fying
|
||||||
|
|
||||||
--enable-jit
|
--enable-jit
|
||||||
|
|
||||||
This support is available only for certain hardware architectures. If
|
This support is available only for certain hardware architectures. If
|
||||||
this option is set for an unsupported architecture, a building error
|
this option is set for an unsupported architecture, a building error
|
||||||
occurs. See the pcre2jit documentation for a discussion of JIT usage.
|
occurs. See the pcre2jit documentation for a discussion of JIT usage.
|
||||||
When JIT support is enabled, pcre2grep automatically makes use of it,
|
When JIT support is enabled, pcre2grep automatically makes use of it,
|
||||||
unless you add
|
unless you add
|
||||||
|
|
||||||
--disable-pcre2grep-jit
|
--disable-pcre2grep-jit
|
||||||
|
@ -3327,14 +3328,14 @@ JUST-IN-TIME COMPILER SUPPORT
|
||||||
|
|
||||||
NEWLINE RECOGNITION
|
NEWLINE RECOGNITION
|
||||||
|
|
||||||
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
||||||
the end of a line. This is the normal newline character on Unix-like
|
the end of a line. This is the normal newline character on Unix-like
|
||||||
systems. You can compile PCRE2 to use carriage return (CR) instead, by
|
systems. You can compile PCRE2 to use carriage return (CR) instead, by
|
||||||
adding
|
adding
|
||||||
|
|
||||||
--enable-newline-is-cr
|
--enable-newline-is-cr
|
||||||
|
|
||||||
to the configure command. There is also an --enable-newline-is-lf
|
to the configure command. There is also an --enable-newline-is-lf
|
||||||
option, which explicitly specifies linefeed as the newline character.
|
option, which explicitly specifies linefeed as the newline character.
|
||||||
|
|
||||||
Alternatively, you can specify that line endings are to be indicated by
|
Alternatively, you can specify that line endings are to be indicated by
|
||||||
|
@ -3347,108 +3348,84 @@ NEWLINE RECOGNITION
|
||||||
|
|
||||||
--enable-newline-is-anycrlf
|
--enable-newline-is-anycrlf
|
||||||
|
|
||||||
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
||||||
CRLF as indicating a line ending. Finally, a fifth option, specified by
|
CRLF as indicating a line ending. Finally, a fifth option, specified by
|
||||||
|
|
||||||
--enable-newline-is-any
|
--enable-newline-is-any
|
||||||
|
|
||||||
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
||||||
newline sequences are the three just mentioned, plus the single charac-
|
newline sequences are the three just mentioned, plus the single charac-
|
||||||
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
|
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
|
||||||
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
||||||
U+2029).
|
U+2029).
|
||||||
|
|
||||||
Whatever default line ending convention is selected when PCRE2 is built
|
Whatever default line ending convention is selected when PCRE2 is built
|
||||||
can be overridden by applications that use the library. At build time
|
can be overridden by applications that use the library. At build time
|
||||||
it is conventional to use the standard for your operating system.
|
it is conventional to use the standard for your operating system.
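The five conventions described above can be illustrated with a small sketch. This is not PCRE2's own code: the enum, the function name, and the choice to work on an array of code points are assumptions made for the example. It only shows which sequences each convention treats as a line ending.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the five build-time newline conventions. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } sketch_newline;

/* Returns the length, in code points, of the newline sequence that starts
   at s[pos] under the given convention, or 0 if there is none. */
static size_t newline_length(const uint32_t *s, size_t len, size_t pos,
                             sketch_newline conv)
{
    assert(s != NULL);
    if (pos >= len) return 0;

    uint32_t c = s[pos];
    int crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

    switch (conv) {
    case NL_CR:      return c == 0x0D ? 1 : 0;
    case NL_LF:      return c == 0x0A ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: /* CR, LF, or CRLF */
        if (crlf) return 2;
        return (c == 0x0D || c == 0x0A) ? 1 : 0;
    case NL_ANY:     /* the three above plus VT, FF, NEL, LS, PS */
        if (crlf) return 2;
        switch (c) {
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:
        case 0x85: case 0x2028: case 0x2029:
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

In this sketch CRLF is checked before the single characters, so a CR followed by LF counts as one two-character line ending rather than two separate ones.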
|
||||||
|
|
||||||
|
|
||||||
WHAT \R MATCHES
|
WHAT \R MATCHES
|
||||||
|
|
||||||
By default, the sequence \R in a pattern matches any Unicode newline
|
By default, the sequence \R in a pattern matches any Unicode newline
|
||||||
sequence, independently of what has been selected as the line ending
|
sequence, independently of what has been selected as the line ending
|
||||||
sequence. If you specify
|
sequence. If you specify
|
||||||
|
|
||||||
--enable-bsr-anycrlf
|
--enable-bsr-anycrlf
|
||||||
|
|
||||||
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
||||||
ever is selected when PCRE2 is built can be overridden by applications
|
ever is selected when PCRE2 is built can be overridden by applications
|
||||||
that use the called.
|
that use the library.
|
||||||
|
|
||||||
|
|
||||||
HANDLING VERY LARGE PATTERNS
|
HANDLING VERY LARGE PATTERNS
|
||||||
|
|
||||||
Within a compiled pattern, offset values are used to point from one
|
Within a compiled pattern, offset values are used to point from one
|
||||||
part to another (for example, from an opening parenthesis to an alter-
|
part to another (for example, from an opening parenthesis to an alter-
|
||||||
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
||||||
two-byte values are used for these offsets, leading to a maximum size
|
two-byte values are used for these offsets, leading to a maximum size
|
||||||
for a compiled pattern of around 64K code units. This is sufficient to
|
for a compiled pattern of around 64K code units. This is sufficient to
|
||||||
handle all but the most gigantic patterns. Nevertheless, some people do
|
handle all but the most gigantic patterns. Nevertheless, some people do
|
||||||
want to process truly enormous patterns, so it is possible to compile
|
want to process truly enormous patterns, so it is possible to compile
|
||||||
PCRE2 to use three-byte or four-byte offsets by adding a setting such
|
PCRE2 to use three-byte or four-byte offsets by adding a setting such
|
||||||
as
|
as
|
||||||
|
|
||||||
--with-link-size=3
|
--with-link-size=3
|
||||||
|
|
||||||
to the configure command. The value given must be 2, 3, or 4. For the
|
to the configure command. The value given must be 2, 3, or 4. For the
|
||||||
16-bit library, a value of 3 is rounded up to 4. In these libraries,
|
16-bit library, a value of 3 is rounded up to 4. In these libraries,
|
||||||
using longer offsets slows down the operation of PCRE2 because it has
|
using longer offsets slows down the operation of PCRE2 because it has
|
||||||
to load additional data when handling them. For the 32-bit library the
|
to load additional data when handling them. For the 32-bit library the
|
||||||
value is always 4 and cannot be overridden; the value of --with-link-
|
value is always 4 and cannot be overridden; the value of --with-link-
|
||||||
size is ignored.
|
size is ignored.
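The rounding rules in the paragraph above can be summarized in a short sketch. The helper name and its parameters are hypothetical and are not part of PCRE2 or its build system; the function only restates how the requested value is treated for each library width.

```c
#include <assert.h>

/* Hypothetical helper: returns the link size (in bytes) that takes effect
   for a library with the given code unit width (8, 16 or 32) when
   --with-link-size=requested is passed to configure. */
static int effective_link_size(int code_unit_width, int requested)
{
    assert(requested >= 2 && requested <= 4);  /* value must be 2, 3, or 4 */

    if (code_unit_width == 32)
        return 4;                 /* always 4; the setting is ignored */
    if (code_unit_width == 16 && requested == 3)
        return 4;                 /* 3 is rounded up to 4 */
    return requested;             /* 8-bit library, or 16-bit with 2 or 4 */
}
```

For example, under these assumptions a request of 3 stays 3 for the 8-bit library but becomes 4 for the 16-bit library.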
|
||||||
|
|
||||||
|
|
||||||
AVOIDING EXCESSIVE STACK USAGE
|
|
||||||
|
|
||||||
When matching with the pcre2_match() function, PCRE2 implements back-
|
|
||||||
tracking by making recursive calls to an internal function called
|
|
||||||
match(). In environments where the size of the stack is limited, this
|
|
||||||
can severely limit PCRE2's operation. (The Unix environment does not
|
|
||||||
usually suffer from this problem, but it may sometimes be necessary to
|
|
||||||
increase the maximum stack size. There is a discussion in the
|
|
||||||
pcre2stack documentation.) An alternative approach to recursion that
|
|
||||||
uses memory from the heap to remember data, instead of using recursive
|
|
||||||
function calls, has been implemented to work round the problem of lim-
|
|
||||||
ited stack size. If you want to build a version of PCRE2 that works
|
|
||||||
this way, add
|
|
||||||
|
|
||||||
--disable-stack-for-recursion
|
|
||||||
|
|
||||||
to the configure command. By default, the system functions malloc() and
|
|
||||||
free() are called to manage the heap memory that is required, but cus-
|
|
||||||
tom memory management functions can be called instead. PCRE2 runs
|
|
||||||
noticeably more slowly when built in this way. This option affects only
|
|
||||||
the pcre2_match() function; it is not relevant for pcre2_dfa_match().
|
|
||||||
|
|
||||||
|
|
||||||
LIMITING PCRE2 RESOURCE USAGE
|
LIMITING PCRE2 RESOURCE USAGE
|
||||||
|
|
||||||
Internally, PCRE2 has a function called match(), which it calls repeat-
|
The pcre2_match() function increments a counter each time it goes round
|
||||||
edly (sometimes recursively) when matching a pattern with the
|
its main loop. Putting a limit on this counter controls the amount of
|
||||||
pcre2_match() function. By controlling the maximum number of times this
|
computing resource used by a single call to pcre2_match(). The limit
|
||||||
function may be called during a single matching operation, a limit can
|
can be changed at run time, as described in the pcre2api documentation.
|
||||||
be placed on the resources used by a single call to pcre2_match(). The
|
The default is 10 million, but this can be changed by adding a setting
|
||||||
limit can be changed at run time, as described in the pcre2api documen-
|
such as
|
||||||
tation. The default is 10 million, but this can be changed by adding a
|
|
||||||
setting such as
|
|
||||||
|
|
||||||
--with-match-limit=500000
|
--with-match-limit=500000
|
||||||
|
|
||||||
to the configure command. This setting has no effect on the
|
to the configure command. This setting has no effect on the
|
||||||
pcre2_dfa_match() matching function.
|
pcre2_dfa_match() matching function, but it does also limit JIT match-
|
||||||
|
ing (though the counting is done differently).
|
||||||
|
|
||||||
In some environments it is desirable to limit the depth of recursive
|
In some environments it is desirable to limit the depth of nested back-
|
||||||
calls of match() more strictly than the total number of calls, in order
|
tracking in order to restrict the maximum amount of heap memory that is
|
||||||
to restrict the maximum amount of stack (or heap, if --disable-stack-
|
used. A second limit controls this; it defaults to the value that is
|
||||||
for-recursion is specified) that is used. A second limit controls this;
|
set for --with-match-limit. You can set a lower default limit by
|
||||||
it defaults to the value that is set for --with-match-limit, which
|
adding, for example,
|
||||||
imposes no additional constraints. However, you can set a lower limit
|
|
||||||
by adding, for example,
|
|
||||||
|
|
||||||
--with-match-limit-recursion=10000
|
--with-match-limit-depth=10000
|
||||||
|
|
||||||
to the configure command. This value can also be overridden at run
|
to the configure command. This value can also be overridden at run
|
||||||
time.
|
time. As well as applying to pcre2_match(), this limit also controls
|
||||||
|
the depth of recursive function calls in pcre2_dfa_match(). These are
|
||||||
|
used for lookaround assertions and recursion within patterns.
|
||||||
|
|
||||||
|
|
||||||
CREATING CHARACTER TABLES AT BUILD TIME
|
CREATING CHARACTER TABLES AT BUILD TIME
|
||||||
|
@ -3463,10 +3440,10 @@ CREATING CHARACTER TABLES AT BUILD TIME
|
||||||
to the configure command, the distributed tables are no longer used.
|
to the configure command, the distributed tables are no longer used.
|
||||||
Instead, a program called dftables is compiled and run. This outputs
|
Instead, a program called dftables is compiled and run. This outputs
|
||||||
the source for a new set of tables, created in the default locale of your
|
the source for a new set of tables, created in the default locale of your
|
||||||
C run-time system. (This method of replacing the tables does not work
|
C run-time system. This method of replacing the tables does not work if
|
||||||
if you are cross compiling, because dftables is run on the local host.
|
you are cross compiling, because dftables is run on the local host. If
|
||||||
If you need to create alternative tables when cross compiling, you will
|
you need to create alternative tables when cross compiling, you will
|
||||||
have to do so "by hand".)
|
have to do so "by hand".
|
||||||
|
|
||||||
|
|
||||||
USING EBCDIC CODE
|
USING EBCDIC CODE
|
||||||
|
@ -3672,13 +3649,28 @@ SUPPORT FOR FUZZERS
|
||||||
string. When called, this function tries to compile the string as a
|
string. When called, this function tries to compile the string as a
|
||||||
pattern, and if that succeeds, to match it. This is done both with no
|
pattern, and if that succeeds, to match it. This is done both with no
|
||||||
options and with some random options bits that are generated from the
|
options and with some random options bits that are generated from the
|
||||||
string. Setting --enable-fuzz-support also causes a binary called
|
string.
|
||||||
pcre2fuzzcheck to be created. This is normally run under valgrind or
|
|
||||||
used when PCRE2 is compiled with address sanitizing enabled. It calls
|
Setting --enable-fuzz-support also causes a binary called pcre2fuz-
|
||||||
the fuzzing function and outputs information about what it is doing. The
|
zcheck to be created. This is normally run under valgrind or used when
|
||||||
input strings are specified by arguments: if an argument starts with
|
PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
|
||||||
"=" the rest of it is a literal input string. Otherwise, it is assumed
|
function and outputs information about what it is doing. The input strings
|
||||||
to be a file name, and the contents of the file are the test string.
|
are specified by arguments: if an argument starts with "=" the rest of
|
||||||
|
it is a literal input string. Otherwise, it is assumed to be a file
|
||||||
|
name, and the contents of the file are the test string.
|
||||||
|
|
||||||
|
|
||||||
|
OBSOLETE OPTION
|
||||||
|
|
||||||
|
In versions of PCRE2 prior to 10.30, there were two ways of handling
|
||||||
|
backtracking in the pcre2_match() function. The default was to use the
|
||||||
|
system stack, but if
|
||||||
|
|
||||||
|
--disable-stack-for-recursion
|
||||||
|
|
||||||
|
was set, memory on the heap was used. From release 10.30 onwards this
|
||||||
|
has changed (the stack is no longer used) and this option now does
|
||||||
|
nothing except give a warning.
|
||||||
|
|
||||||
|
|
||||||
SEE ALSO
|
SEE ALSO
|
||||||
|
@ -3695,8 +3687,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 01 November 2016
|
Last updated: 29 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -3740,9 +3732,8 @@ DESCRIPTION
|
||||||
|
|
||||||
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
|
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
|
||||||
PCRE2 automatically inserts callouts, all with number 255, before each
|
PCRE2 automatically inserts callouts, all with number 255, before each
|
||||||
item in the pattern except for immediately before or after a callout
|
item in the pattern except for immediately before or after an explicit
|
||||||
item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with
|
callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||||
the pattern
|
|
||||||
|
|
||||||
A(?C3)B
|
A(?C3)B
|
||||||
|
|
||||||
|
@ -3756,38 +3747,38 @@ DESCRIPTION
|
||||||
|
|
||||||
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
||||||
|
|
||||||
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
||||||
|
|
||||||
Notice that there is a callout before and after each parenthesis and
|
Notice that there is a callout before and after each parenthesis and
|
||||||
alternation bar. If the pattern contains a conditional group whose con-
|
alternation bar. If the pattern contains a conditional group whose con-
|
||||||
dition is an assertion, an automatic callout is inserted immediately
|
dition is an assertion, an automatic callout is inserted immediately
|
||||||
before the condition. Such a callout may also be inserted explicitly,
|
before the condition. Such a callout may also be inserted explicitly,
|
||||||
for example:
|
for example:
|
||||||
|
|
||||||
(?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
|
(?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
|
||||||
|
|
||||||
This applies only to assertion conditions (because they are themselves
|
This applies only to assertion conditions (because they are themselves
|
||||||
independent groups).
|
independent groups).
|
||||||
|
|
||||||
Callouts can be useful for tracking the progress of pattern matching.
|
Callouts can be useful for tracking the progress of pattern matching.
|
||||||
The pcre2test program has a pattern qualifier (/auto_callout) that sets
|
The pcre2test program has a pattern qualifier (/auto_callout) that sets
|
||||||
automatic callouts. When any callouts are present, the output from
|
automatic callouts. When any callouts are present, the output from
|
||||||
pcre2test indicates how the pattern is being matched. This is useful
|
pcre2test indicates how the pattern is being matched. This is useful
|
||||||
information when you are trying to optimize the performance of a par-
|
information when you are trying to optimize the performance of a par-
|
||||||
ticular pattern.
|
ticular pattern.
|
||||||
|
|
||||||
|
|
||||||
MISSING CALLOUTS
|
MISSING CALLOUTS
|
||||||
|
|
||||||
You should be aware that, because of optimizations in the way PCRE2
|
You should be aware that, because of optimizations in the way PCRE2
|
||||||
compiles and matches patterns, callouts sometimes do not happen exactly
|
compiles and matches patterns, callouts sometimes do not happen exactly
|
||||||
as you might expect.
|
as you might expect.
|
||||||
|
|
||||||
Auto-possessification
|
Auto-possessification
|
||||||
|
|
||||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
||||||
that what follows cannot be part of the repeat. For example, a+[bc] is
|
that what follows cannot be part of the repeat. For example, a+[bc] is
|
||||||
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
||||||
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
|
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
|
||||||
to the string "aaaa" is:
|
to the string "aaaa" is:
|
||||||
|
|
||||||
|
@ -3796,11 +3787,11 @@ MISSING CALLOUTS
|
||||||
+2 ^ ^ [bc]
|
+2 ^ ^ [bc]
|
||||||
No match
|
No match
|
||||||
|
|
||||||
This indicates that when matching [bc] fails, there is no backtracking
|
This indicates that when matching [bc] fails, there is no backtracking
|
||||||
into a+ (because it is being treated as a++) and therefore the callouts
|
into a+ (because it is being treated as a++) and therefore the callouts
|
||||||
that would be taken for the backtracks do not occur. You can disable
|
that would be taken for the backtracks do not occur. You can disable
|
||||||
the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||||
pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
|
pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
|
||||||
this case, the output changes to this:
|
this case, the output changes to this:
|
||||||
|
|
||||||
--->aaaa
|
--->aaaa
|
||||||
|
@ -3817,14 +3808,17 @@ MISSING CALLOUTS
|
||||||
Automatic .* anchoring
|
Automatic .* anchoring
|
||||||
|
|
||||||
By default, an optimization is applied when .* is the first significant
|
By default, an optimization is applied when .* is the first significant
|
||||||
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
|
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
|
||||||
any character, the pattern is automatically anchored. If PCRE2_DOTALL
|
any character, the pattern is automatically anchored. If PCRE2_DOTALL
|
||||||
is not set, a match can start only after an internal newline or at the
|
is not set, a match can start only after an internal newline or at the
|
||||||
beginning of the subject, and pcre2_compile() remembers this. This
|
beginning of the subject, and pcre2_compile() remembers this. If a pat-
|
||||||
optimization is disabled, however, if .* is in an atomic group or if
|
tern has more than one top-level branch, automatic anchoring occurs if
|
||||||
there is a back reference to the capturing group in which it appears.
|
all branches are anchorable.
|
||||||
It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
|
|
||||||
ever, the presence of callouts does not affect it.
|
This optimization is disabled, however, if .* is in an atomic group or
|
||||||
|
if there is a back reference to the capturing group in which it
|
||||||
|
appears. It is also disabled if the pattern contains (*PRUNE) or
|
||||||
|
(*SKIP). However, the presence of callouts does not affect it.
|
||||||
|
|
||||||
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
|
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
|
||||||
and applied to the string "aa", the pcre2test output is:
|
and applied to the string "aa", the pcre2test output is:
|
||||||
|
@ -3856,39 +3850,36 @@ MISSING CALLOUTS
|
||||||
ter. Another optimization, described in the next section, means that
|
ter. Another optimization, described in the next section, means that
|
||||||
there is no subsequent attempt to match with an empty subject.
|
there is no subsequent attempt to match with an empty subject.
|
||||||
|
|
||||||
If a pattern has more than one top-level branch, automatic anchoring
|
|
||||||
occurs if all branches are anchorable.
|
|
||||||
|
|
||||||
Other optimizations
|
Other optimizations
|
||||||
|
|
||||||
Other optimizations that provide fast "no match" results also affect
|
Other optimizations that provide fast "no match" results also affect
|
||||||
callouts. For example, if the pattern is
|
callouts. For example, if the pattern is
|
||||||
|
|
||||||
ab(?C4)cd
|
ab(?C4)cd
|
||||||
|
|
||||||
PCRE2 knows that any matching string must contain the letter "d". If
|
PCRE2 knows that any matching string must contain the letter "d". If
|
||||||
the subject string is "abyz", the lack of "d" means that matching
|
the subject string is "abyz", the lack of "d" means that matching
|
||||||
doesn't ever start, and the callout is never reached. However, with
|
doesn't ever start, and the callout is never reached. However, with
|
||||||
"abyd", though the result is still no match, the callout is obeyed.
|
"abyd", though the result is still no match, the callout is obeyed.
|
||||||
|
|
||||||
PCRE2 also knows the minimum length of a matching string, and will
|
For most patterns PCRE2 also knows the minimum length of a matching
|
||||||
immediately give a "no match" return without actually running a match
|
string, and will immediately give a "no match" return without actually
|
||||||
if the subject is not long enough, or, for unanchored patterns, if it
|
running a match if the subject is not long enough, or, for unanchored
|
||||||
has been scanned far enough.
|
patterns, if it has been scanned far enough.
|
||||||
|
|
||||||
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
|
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
|
||||||
MIZE option to pcre2_compile(), or by starting the pattern with
|
MIZE option to pcre2_compile(), or by starting the pattern with
|
||||||
(*NO_START_OPT). This slows down the matching process, but does ensure
|
(*NO_START_OPT). This slows down the matching process, but does ensure
|
||||||
that callouts such as the example above are obeyed.
|
that callouts such as the example above are obeyed.
|
||||||
|
|
||||||
|
|
||||||
THE CALLOUT INTERFACE
|
THE CALLOUT INTERFACE
|
||||||
|
|
||||||
During matching, when PCRE2 reaches a callout point, if an external
|
During matching, when PCRE2 reaches a callout point, if an external
|
||||||
function is set in the match context, it is called. This applies to
|
function is set in the match context, it is called. This applies to
|
||||||
both normal and DFA matching. The first argument to the callout func-
|
both normal and DFA matching. The first argument to the callout func-
|
||||||
tion is a pointer to a pcre2_callout block. The second argument is the
|
tion is a pointer to a pcre2_callout block. The second argument is the
|
||||||
void * callout data that was supplied when the callout was set up by
|
void * callout data that was supplied when the callout was set up by
|
||||||
calling pcre2_set_callout() (see the pcre2api documentation). The call-
|
calling pcre2_set_callout() (see the pcre2api documentation). The call-
|
||||||
out block structure contains the following fields:
|
out block structure contains the following fields:
|
||||||
|
|
||||||
|
@ -3908,50 +3899,77 @@ THE CALLOUT INTERFACE
|
||||||
PCRE2_SIZE callout_string_length;
|
PCRE2_SIZE callout_string_length;
|
||||||
PCRE2_SPTR callout_string;
|
PCRE2_SPTR callout_string;
|
||||||
|
|
||||||
The version field contains the version number of the block format. The
|
The version field contains the version number of the block format. The
|
||||||
current version is 1; the three callout string fields were added for
|
current version is 1; the three callout string fields were added for
|
||||||
this version. If you are writing an application that might use an ear-
|
this version. If you are writing an application that might use an ear-
|
||||||
lier release of PCRE2, you should check the version number before
|
lier release of PCRE2, you should check the version number before
|
||||||
accessing any of these fields. The version number will increase in
|
accessing any of these fields. The version number will increase in
|
||||||
future if more fields are added, but the intention is never to remove
|
future if more fields are added, but the intention is never to remove
|
||||||
any of the existing fields.
|
any of the existing fields.
|
||||||
|
|
||||||
Fields for numerical callouts
|
Fields for numerical callouts
|
||||||
|
|
||||||
For a numerical callout, callout_string is NULL, and callout_number
|
For a numerical callout, callout_string is NULL, and callout_number
|
||||||
contains the number of the callout, in the range 0-255. This is the
|
contains the number of the callout, in the range 0-255. This is the
|
||||||
number that follows (?C for callouts that are part of the pattern; it is
|
number that follows (?C for callouts that are part of the pattern; it is
|
||||||
255 for automatically generated callouts.
|
255 for automatically generated callouts.
|
||||||
|
|
||||||
Fields for string callouts
|
Fields for string callouts
|
||||||
|
|
||||||
For callouts with string arguments, callout_number is always zero, and
|
For callouts with string arguments, callout_number is always zero, and
|
||||||
callout_string points to the string that is contained within the com-
|
callout_string points to the string that is contained within the com-
|
||||||
piled pattern. Its length is given by callout_string_length. Duplicated
|
piled pattern. Its length is given by callout_string_length. Duplicated
|
||||||
ending delimiters that were present in the original pattern string have
|
ending delimiters that were present in the original pattern string have
|
||||||
been turned into single characters, but there is no other processing of
|
been turned into single characters, but there is no other processing of
|
||||||
the callout string argument. An additional code unit containing binary
|
the callout string argument. An additional code unit containing binary
|
||||||
zero is present after the string, but is not included in the length.
|
zero is present after the string, but is not included in the length.
|
||||||
The delimiter that was used to start the string is also stored within
|
The delimiter that was used to start the string is also stored within
|
||||||
the pattern, immediately before the string itself. You can access this
|
the pattern, immediately before the string itself. You can access this
|
||||||
delimiter as callout_string[-1] if you need it.
|
delimiter as callout_string[-1] if you need it.
|
||||||
|
|
||||||
The callout_string_offset field is the code unit offset to the start of
|
The callout_string_offset field is the code unit offset to the start of
|
||||||
the callout argument string within the original pattern string. This is
|
the callout argument string within the original pattern string. This is
|
||||||
provided for the benefit of applications such as script languages that
|
provided for the benefit of applications such as script languages that
|
||||||
might need to report errors in the callout string within the pattern.
|
might need to report errors in the callout string within the pattern.
|
||||||
|
|
||||||
Fields for all callouts
|
Fields for all callouts
|
||||||
|
|
||||||
The remaining fields in the callout block are the same for both kinds
|
The remaining fields in the callout block are the same for both kinds
|
||||||
of callout.
|
of callout.
|
||||||
|
|
||||||
The offset_vector field is a pointer to the vector of capturing offsets
|
The offset_vector field is a pointer to a vector of capturing offsets
|
||||||
(the "ovector") that was passed to the matching function in the match
|
(the "ovector"). You may read certain elements in this vector, but you
|
||||||
data block. When pcre2_match() is used, the contents can be inspected
|
must not change any of them.
|
||||||
in order to extract substrings that have been matched so far, in the
|
|
||||||
same way as for extracting substrings after a match has completed. For
|
For calls to pcre2_match(), the offset_vector field is not (since
|
||||||
the DFA matching function, this field is not useful.
|
release 10.30) a pointer to the actual ovector that was passed to the
|
||||||
|
matching function in the match data block. Instead it points to an
|
||||||
|
internal ovector of a size large enough to hold all possible captured
|
||||||
|
substrings in the pattern. Note that whenever a recursion or subroutine
|
||||||
|
call within a pattern completes, the capturing state is reset to what
|
||||||
|
it was before.
|
||||||
|
|
||||||
|
The capture_last field contains the number of the most recently cap-
|
||||||
|
tured substring, and the capture_top field contains one more than the
|
||||||
|
number of the highest numbered captured substring so far. If no sub-
|
||||||
|
strings have yet been captured, the value of capture_last is 0 and the
|
||||||
|
value of capture_top is 1. The values of these fields do not always
|
||||||
|
differ by one; for example, when the callout in the pattern
|
||||||
|
((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
|
||||||
|
|
||||||
|
The contents of ovector[2] to ovector[<capture_top>*2-1] can be
|
||||||
|
inspected in order to extract substrings that have been matched so far,
|
||||||
|
in the same way as extracting substrings after a match has completed.
|
||||||
|
The values in ovector[0] and ovector[1] are undefined and should not be
|
||||||
|
used in any way. Substrings that have not been captured (but whose num-
|
||||||
|
bers are less than capture_top) have both of their ovector slots set to
|
||||||
|
PCRE2_UNSET.
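
The rules above for reading the ovector at callout time can be sketched as follows. This is an illustration under stated assumptions, not code from PCRE2: the function names are hypothetical, size_t stands in for PCRE2_SIZE, and UNSET_OFFSET stands in for PCRE2_UNSET (which pcre2.h defines as ~(PCRE2_SIZE)0). In a real callout function the arguments would come from the offset_vector and capture_top fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET; real code should use the macro from pcre2.h. */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical helper: length of captured group n at callout time, or
   UNSET_OFFSET if the group cannot be read. Group 0 is rejected because
   ovector[0] and ovector[1] are undefined during a callout. */
size_t capture_length(const size_t *ovector, uint32_t capture_top, uint32_t n)
{
    assert(ovector != NULL);
    if (n == 0 || n >= capture_top)
        return UNSET_OFFSET;
    if (ovector[2 * n] == UNSET_OFFSET || ovector[2 * n + 1] == UNSET_OFFSET)
        return UNSET_OFFSET;
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Hypothetical helper: count the groups 1 .. capture_top-1 that are set,
   reading only ovector[2] .. ovector[capture_top*2-1] and never writing. */
uint32_t count_set_captures(const size_t *ovector, uint32_t capture_top)
{
    uint32_t set = 0;
    for (uint32_t n = 1; n < capture_top; n++) {
        if (capture_length(ovector, capture_top, n) != UNSET_OFFSET)
            set++;
    }
    return set;
}
```

Both helpers take the ovector as a pointer to const, which reflects the rule that a callout function must not change any element.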
|
||||||
|
|
||||||
|
For DFA matching, the offset_vector field points to the ovector that
|
||||||
|
was passed to the matching function in the match data block, but it
|
||||||
|
holds no useful information at callout time because pcre2_dfa_match()
|
||||||
|
does not support substring capturing. The value of capture_top is
|
||||||
|
always 1 and the value of capture_last is always 0 for DFA matching.
|
||||||
|
|
||||||
The subject and subject_length fields contain copies of the values that
|
The subject and subject_length fields contain copies of the values that
|
||||||
were passed to the matching function.
|
were passed to the matching function.
|
||||||
|
@ -3966,18 +3984,6 @@ THE CALLOUT INTERFACE
|
||||||
The current_position field contains the offset within the subject of
|
The current_position field contains the offset within the subject of
|
||||||
the current match pointer.
|
the current match pointer.
|
||||||
|
|
||||||
When the pcre2_match() is used, the capture_top field contains one more
|
|
||||||
than the number of the highest numbered captured substring so far. If
|
|
||||||
no substrings have been captured, the value of capture_top is one. This
|
|
||||||
is always the case when the DFA functions are used, because they do not
|
|
||||||
support captured substrings.
|
|
||||||
|
|
||||||
The capture_last field contains the number of the most recently cap-
|
|
||||||
tured substring. However, when a recursion exits, the value reverts to
|
|
||||||
what it was outside the recursion, as do the values of all captured
|
|
||||||
substrings. If no substrings have been captured, the value of cap-
|
|
||||||
ture_last is 0. This is always the case for the DFA matching functions.
|
|
||||||
|
|
||||||
The pattern_position field contains the offset in the pattern string to
|
The pattern_position field contains the offset in the pattern string to
|
||||||
the next item to be matched.
|
the next item to be matched.
|
||||||
|
|
||||||
|
@ -4075,8 +4081,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 29 September 2016
|
Last updated: 29 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
116
doc/pcre2build.3
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2BUILD 3 "01 November 2016" "PCRE2 10.23"
|
.TH PCRE2BUILD 3 "29 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.
|
.
|
||||||
|
@ -55,11 +55,11 @@ running
|
||||||
.sp
|
.sp
|
||||||
./configure --help
|
./configure --help
|
||||||
.sp
|
.sp
|
||||||
The following sections include descriptions of options whose names begin with
|
The following sections include descriptions of "on/off" options whose names
|
||||||
--enable or --disable. These settings specify changes to the defaults for the
|
begin with --enable or --disable. Because of the way that \fBconfigure\fP
|
||||||
\fBconfigure\fP command. Because of the way that \fBconfigure\fP works,
|
works, --enable and --disable always come in pairs, so the complementary option
|
||||||
--enable and --disable always come in pairs, so the complementary option always
|
always exists as well, but as it specifies the default, it is not described.
|
||||||
exists as well, but as it specifies the default, it is not described.
|
Options that specify values have names that start with --with.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
|
.SH "BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
|
||||||
|
@ -119,10 +119,10 @@ Alternatively, patterns may be started with (*UTF) unless the application has
|
||||||
locked this out by setting PCRE2_NEVER_UTF.
|
locked this out by setting PCRE2_NEVER_UTF.
|
||||||
.P
|
.P
|
||||||
UTF support allows the libraries to process character code points up to
|
UTF support allows the libraries to process character code points up to
|
||||||
0x10ffff in the strings that they handle. It also provides support for
|
0x10ffff in the strings that they handle. Unicode support also gives access to
|
||||||
accessing the Unicode properties of such characters, using pattern escapes such
|
the Unicode properties of characters, using pattern escapes such as \eP, \ep,
|
||||||
as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
|
and \eX. Only the general category properties such as \fILu\fP and \fINd\fP are
|
||||||
\fINd\fP are supported. Details are given in the
|
supported. Details are given in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -151,7 +151,7 @@ out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
|
||||||
.SH "JUST-IN-TIME COMPILER SUPPORT"
|
.SH "JUST-IN-TIME COMPILER SUPPORT"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Just-in-time compiler support is included in the build by specifying
|
Just-in-time (JIT) compiler support is included in the build by specifying
|
||||||
.sp
|
.sp
|
||||||
--enable-jit
|
--enable-jit
|
||||||
.sp
|
.sp
|
||||||
|
@ -217,7 +217,7 @@ specify
|
||||||
.sp
|
.sp
|
||||||
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
|
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
|
||||||
selected when PCRE2 is built can be overridden by applications that use the
|
selected when PCRE2 is built can be overridden by applications that use the
|
||||||
called.
|
library.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "HANDLING VERY LARGE PATTERNS"
|
.SH "HANDLING VERY LARGE PATTERNS"
|
||||||
|
@ -241,41 +241,13 @@ additional data when handling them. For the 32-bit library the value is always
|
||||||
4 and cannot be overridden; the value of --with-link-size is ignored.
|
4 and cannot be overridden; the value of --with-link-size is ignored.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "AVOIDING EXCESSIVE STACK USAGE"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
When matching with the \fBpcre2_match()\fP function, PCRE2 implements
|
|
||||||
backtracking by making recursive calls to an internal function called
|
|
||||||
\fBmatch()\fP. In environments where the size of the stack is limited, this can
|
|
||||||
severely limit PCRE2's operation. (The Unix environment does not usually suffer
|
|
||||||
from this problem, but it may sometimes be necessary to increase the maximum
|
|
||||||
stack size. There is a discussion in the
|
|
||||||
.\" HREF
|
|
||||||
\fBpcre2stack\fP
|
|
||||||
.\"
|
|
||||||
documentation.) An alternative approach to recursion that uses memory from the
|
|
||||||
heap to remember data, instead of using recursive function calls, has been
|
|
||||||
implemented to work round the problem of limited stack size. If you want to
|
|
||||||
build a version of PCRE2 that works this way, add
|
|
||||||
.sp
|
|
||||||
--disable-stack-for-recursion
|
|
||||||
.sp
|
|
||||||
to the \fBconfigure\fP command. By default, the system functions \fBmalloc()\fP
|
|
||||||
and \fBfree()\fP are called to manage the heap memory that is required, but
|
|
||||||
custom memory management functions can be called instead. PCRE2 runs noticeably
|
|
||||||
more slowly when built in this way. This option affects only the
|
|
||||||
\fBpcre2_match()\fP function; it is not relevant for \fBpcre2_dfa_match()\fP.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SH "LIMITING PCRE2 RESOURCE USAGE"
|
.SH "LIMITING PCRE2 RESOURCE USAGE"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Internally, PCRE2 has a function called \fBmatch()\fP, which it calls
|
The \fBpcre2_match()\fP function increments a counter each time it goes round
|
||||||
repeatedly (sometimes recursively) when matching a pattern with the
|
its main loop. Putting a limit on this counter controls the amount of computing
|
||||||
\fBpcre2_match()\fP function. By controlling the maximum number of times this
|
resource used by a single call to \fBpcre2_match()\fP. The limit can be changed
|
||||||
function may be called during a single matching operation, a limit can be
|
at run time, as described in the
|
||||||
placed on the resources used by a single call to \fBpcre2_match()\fP. The limit
|
|
||||||
can be changed at run time, as described in the
|
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2api\fP
|
\fBpcre2api\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -285,18 +257,20 @@ setting such as
|
||||||
--with-match-limit=500000
|
--with-match-limit=500000
|
||||||
.sp
|
.sp
|
||||||
to the \fBconfigure\fP command. This setting has no effect on the
|
to the \fBconfigure\fP command. This setting has no effect on the
|
||||||
\fBpcre2_dfa_match()\fP matching function.
|
\fBpcre2_dfa_match()\fP matching function, but it does also limit JIT matching
|
||||||
|
(though the counting is done differently).
|
||||||
.P
|
.P
|
||||||
In some environments it is desirable to limit the depth of recursive calls of
|
In some environments it is desirable to limit the depth of nested backtracking
|
||||||
\fBmatch()\fP more strictly than the total number of calls, in order to
|
in order to restrict the maximum amount of heap memory that is used. A second
|
||||||
restrict the maximum amount of stack (or heap, if --disable-stack-for-recursion
|
limit controls this; it defaults to the value that is set for
|
||||||
is specified) that is used. A second limit controls this; it defaults to the
|
--with-match-limit. You can set a lower default limit by adding, for example,
|
||||||
value that is set for --with-match-limit, which imposes no additional
|
|
||||||
constraints. However, you can set a lower limit by adding, for example,
|
|
||||||
.sp
|
.sp
|
||||||
--with-match-limit-recursion=10000
|
--with-match-limit-depth=10000
|
||||||
.sp
|
.sp
|
||||||
to the \fBconfigure\fP command. This value can also be overridden at run time.
|
to the \fBconfigure\fP command. This value can also be overridden at run time.
|
||||||
|
As well as applying to \fBpcre2_match()\fP, this limit also controls the depth
|
||||||
|
of recursive function calls in \fBpcre2_dfa_match()\fP. These are used for
|
||||||
|
lookaround assertions and recursion within patterns.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "CREATING CHARACTER TABLES AT BUILD TIME"
|
.SH "CREATING CHARACTER TABLES AT BUILD TIME"
|
||||||
|
@ -312,10 +286,10 @@ only. If you add
|
||||||
to the \fBconfigure\fP command, the distributed tables are no longer used.
|
to the \fBconfigure\fP command, the distributed tables are no longer used.
|
||||||
Instead, a program called \fBdftables\fP is compiled and run. This outputs the
|
Instead, a program called \fBdftables\fP is compiled and run. This outputs the
|
||||||
source for a new set of tables, created in the default locale of your C run-time
|
source for a new set of tables, created in the default locale of your C run-time
|
||||||
system. (This method of replacing the tables does not work if you are cross
|
system. This method of replacing the tables does not work if you are cross
|
||||||
compiling, because \fBdftables\fP is run on the local host. If you need to
|
compiling, because \fBdftables\fP is run on the local host. If you need to
|
||||||
create alternative tables when cross compiling, you will have to do so "by
|
create alternative tables when cross compiling, you will have to do so "by
|
||||||
hand".)
|
hand".
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "USING EBCDIC CODE"
|
.SH "USING EBCDIC CODE"
|
||||||
|
@ -529,13 +503,29 @@ contains a single function called LLVMFuzzerTestOneInput() whose arguments are
|
||||||
a pointer to a string and the length of the string. When called, this function
|
a pointer to a string and the length of the string. When called, this function
|
||||||
tries to compile the string as a pattern, and if that succeeds, to match it.
|
tries to compile the string as a pattern, and if that succeeds, to match it.
|
||||||
This is done both with no options and with some random option bits that are
|
This is done both with no options and with some random option bits that are
|
||||||
generated from the string. Setting --enable-fuzz-support also causes a binary
|
generated from the string.
|
||||||
called \fBpcre2fuzzcheck\fP to be created. This is normally run under valgrind
|
.P
|
||||||
or used when PCRE2 is compiled with address sanitizing enabled. It calls the
|
Setting --enable-fuzz-support also causes a binary called \fBpcre2fuzzcheck\fP
|
||||||
fuzzing function and outputs information about it is doing. The input strings
|
to be created. This is normally run under valgrind or used when PCRE2 is
|
||||||
are specified by arguments: if an argument starts with "=" the rest of it is a
|
compiled with address sanitizing enabled. It calls the fuzzing function and
|
||||||
literal input string. Otherwise, it is assumed to be a file name, and the
|
outputs information about what it is doing. The input strings are specified by
|
||||||
contents of the file are the test string.
|
arguments: if an argument starts with "=" the rest of it is a literal input
|
||||||
|
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||||
|
file are the test string.
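
The argument convention described above can be sketched as follows. This is not the pcre2fuzzcheck source; the function name is hypothetical and only illustrates the rule that a leading "=" marks a literal input string, while anything else is treated as a file name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if arg starts with '=', return a pointer to the
   literal input string that follows it. Otherwise return NULL, meaning
   that the caller should treat arg as the name of a file to read. */
const char *literal_input(const char *arg)
{
    assert(arg != NULL);
    return (arg[0] == '=') ? arg + 1 : NULL;
}
```

With this convention, an argument consisting of "=" alone yields an empty literal string rather than a file name.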
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "OBSOLETE OPTION"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
In versions of PCRE2 prior to 10.30, there were two ways of handling
|
||||||
|
backtracking in the \fBpcre2_match()\fP function. The default was to use the
|
||||||
|
system stack, but if
|
||||||
|
.sp
|
||||||
|
--disable-stack-for-recursion
|
||||||
|
.sp
|
||||||
|
was set, memory on the heap was used. From release 10.30 onwards this has
|
||||||
|
changed (the stack is no longer used) and this option now does nothing except
|
||||||
|
give a warning.
|
||||||
.
|
.
|
||||||
.SH "SEE ALSO"
|
.SH "SEE ALSO"
|
||||||
.rs
|
.rs
|
||||||
|
@ -557,6 +547,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 01 November 2016
|
Last updated: 29 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2CALLOUT 3 "29 September 2016" "PCRE2 10.23"
|
.TH PCRE2CALLOUT 3 "29 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -40,8 +40,8 @@ two callout points:
|
||||||
.sp
|
.sp
|
||||||
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
||||||
automatically inserts callouts, all with number 255, before each item in the
|
automatically inserts callouts, all with number 255, before each item in the
|
||||||
pattern except for immediately before or after a callout item in the pattern.
|
pattern except for immediately before or after an explicit callout. For
|
||||||
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||||
.sp
|
.sp
|
||||||
A(?C3)B
|
A(?C3)B
|
||||||
.sp
|
.sp
|
||||||
|
@ -55,7 +55,7 @@ Here is a more complicated example:
|
||||||
.sp
|
.sp
|
||||||
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
||||||
.sp
|
.sp
|
||||||
(?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
(?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
||||||
.sp
|
.sp
|
||||||
Notice that there is a callout before and after each parenthesis and
|
Notice that there is a callout before and after each parenthesis and
|
||||||
alternation bar. If the pattern contains a conditional group whose condition is
|
alternation bar. If the pattern contains a conditional group whose condition is
|
||||||
|
@ -124,10 +124,13 @@ By default, an optimization is applied when .* is the first significant item in
|
||||||
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
||||||
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
||||||
start only after an internal newline or at the beginning of the subject, and
|
start only after an internal newline or at the beginning of the subject, and
|
||||||
\fBpcre2_compile()\fP remembers this. This optimization is disabled, however,
|
\fBpcre2_compile()\fP remembers this. If a pattern has more than one top-level
|
||||||
if .* is in an atomic group or if there is a back reference to the capturing
|
branch, automatic anchoring occurs if all branches are anchorable.
|
||||||
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
|
.P
|
||||||
or (*SKIP). However, the presence of callouts does not affect it.
|
This optimization is disabled, however, if .* is in an atomic group or if there
|
||||||
|
is a back reference to the capturing group in which it appears. It is also
|
||||||
|
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
|
||||||
|
callouts does not affect it.
|
||||||
.P
|
.P
|
||||||
For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and
|
For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and
|
||||||
applied to the string "aa", the \fBpcre2test\fP output is:
|
applied to the string "aa", the \fBpcre2test\fP output is:
|
||||||
|
@ -157,9 +160,6 @@ pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
|
||||||
This shows more match attempts, starting at the second subject character.
|
This shows more match attempts, starting at the second subject character.
|
||||||
Another optimization, described in the next section, means that there is no
|
Another optimization, described in the next section, means that there is no
|
||||||
subsequent attempt to match with an empty subject.
|
subsequent attempt to match with an empty subject.
|
||||||
.P
|
|
||||||
If a pattern has more than one top-level branch, automatic anchoring occurs if
|
|
||||||
all branches are anchorable.
|
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Other optimizations"
|
.SS "Other optimizations"
|
||||||
|
@ -175,9 +175,10 @@ subject string is "abyz", the lack of "d" means that matching doesn't ever
|
||||||
start, and the callout is never reached. However, with "abyd", though the
|
start, and the callout is never reached. However, with "abyd", though the
|
||||||
result is still no match, the callout is obeyed.
|
result is still no match, the callout is obeyed.
|
||||||
.P
|
.P
|
||||||
PCRE2 also knows the minimum length of a matching string, and will immediately
|
For most patterns PCRE2 also knows the minimum length of a matching string, and
|
||||||
give a "no match" return without actually running a match if the subject is not
|
will immediately give a "no match" return without actually running a match if
|
||||||
long enough, or, for unanchored patterns, if it has been scanned far enough.
|
the subject is not long enough, or, for unanchored patterns, if it has been
|
||||||
|
scanned far enough.
|
||||||
.P
|
.P
|
||||||
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
|
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
|
||||||
option to \fBpcre2_compile()\fP, or by starting the pattern with
|
option to \fBpcre2_compile()\fP, or by starting the pattern with
|
||||||
|
@ -259,12 +260,37 @@ need to report errors in the callout string within the pattern.
|
||||||
The remaining fields in the callout block are the same for both kinds of
|
The remaining fields in the callout block are the same for both kinds of
|
||||||
callout.
|
callout.
|
||||||
.P
|
.P
|
||||||
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
|
The \fIoffset_vector\fP field is a pointer to a vector of capturing offsets
|
||||||
(the "ovector") that was passed to the matching function in the match data
|
(the "ovector"). You may read certain elements in this vector, but you must not
|
||||||
block. When \fBpcre2_match()\fP is used, the contents can be inspected in
|
change any of them.
|
||||||
|
.P
|
||||||
|
For calls to \fBpcre2_match()\fP, the \fIoffset_vector\fP field is not (since
|
||||||
|
release 10.30) a pointer to the actual ovector that was passed to the matching
|
||||||
|
function in the match data block. Instead it points to an internal ovector of a
|
||||||
|
size large enough to hold all possible captured substrings in the pattern. Note
|
||||||
|
that whenever a recursion or subroutine call within a pattern completes, the
|
||||||
|
capturing state is reset to what it was before.
|
||||||
|
.P
|
||||||
|
The \fIcapture_last\fP field contains the number of the most recently captured
|
||||||
|
substring, and the \fIcapture_top\fP field contains one more than the number of
|
||||||
|
the highest numbered captured substring so far. If no substrings have yet been
|
||||||
|
captured, the value of \fIcapture_last\fP is 0 and the value of
|
||||||
|
\fIcapture_top\fP is 1. The values of these fields do not always differ by one;
|
||||||
|
for example, when the callout in the pattern ((a)(b))(?C2) is taken,
|
||||||
|
\fIcapture_last\fP is 1 but \fIcapture_top\fP is 4.
|
||||||
|
.P
|
||||||
|
The contents of ovector[2] to ovector[<capture_top>*2-1] can be inspected in
|
||||||
order to extract substrings that have been matched so far, in the same way as
|
order to extract substrings that have been matched so far, in the same way as
|
||||||
for extracting substrings after a match has completed. For the DFA matching
|
extracting substrings after a match has completed. The values in ovector[0] and
|
||||||
function, this field is not useful.
|
ovector[1] are undefined and should not be used in any way. Substrings that
|
||||||
|
have not been captured (but whose numbers are less than \fIcapture_top\fP) have
|
||||||
|
both of their ovector slots set to PCRE2_UNSET.
|
||||||
|
.P
|
||||||
|
For DFA matching, the \fIoffset_vector\fP field points to the ovector that was
|
||||||
|
passed to the matching function in the match data block, but it holds no useful
|
||||||
|
information at callout time because \fBpcre2_dfa_match()\fP does not support
|
||||||
|
substring capturing. The value of \fIcapture_top\fP is always 1 and the value
|
||||||
|
of \fIcapture_last\fP is always 0 for DFA matching.
|
||||||
.P
|
.P
|
||||||
The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values
|
The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values
|
||||||
that were passed to the matching function.
|
that were passed to the matching function.
|
||||||
|
@ -279,18 +305,6 @@ in the subject.
|
||||||
The \fIcurrent_position\fP field contains the offset within the subject of the
|
The \fIcurrent_position\fP field contains the offset within the subject of the
|
||||||
current match pointer.
|
current match pointer.
|
||||||
.P
|
.P
|
||||||
When the \fBpcre2_match()\fP is used, the \fIcapture_top\fP field contains one
|
|
||||||
more than the number of the highest numbered captured substring so far. If no
|
|
||||||
substrings have been captured, the value of \fIcapture_top\fP is one. This is
|
|
||||||
always the case when the DFA functions are used, because they do not support
|
|
||||||
captured substrings.
|
|
||||||
.P
|
|
||||||
The \fIcapture_last\fP field contains the number of the most recently captured
|
|
||||||
substring. However, when a recursion exits, the value reverts to what it was
|
|
||||||
outside the recursion, as do the values of all captured substrings. If no
|
|
||||||
substrings have been captured, the value of \fIcapture_last\fP is 0. This is
|
|
||||||
always the case for the DFA matching functions.
|
|
||||||
.P
|
|
||||||
The \fIpattern_position\fP field contains the offset in the pattern string to
|
The \fIpattern_position\fP field contains the offset in the pattern string to
|
||||||
the next item to be matched.
|
the next item to be matched.
|
||||||
.P
|
.P
|
||||||
|
@ -396,6 +410,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 29 September 2016
|
Last updated: 29 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|