Documentation update
parent
a073581116
commit
ed9f34b06b
|
@ -103,7 +103,6 @@ dist_html_DATA = \
|
||||||
doc/html/pcre2posix.html \
|
doc/html/pcre2posix.html \
|
||||||
doc/html/pcre2sample.html \
|
doc/html/pcre2sample.html \
|
||||||
doc/html/pcre2serialize.html \
|
doc/html/pcre2serialize.html \
|
||||||
doc/html/pcre2stack.html \
|
|
||||||
doc/html/pcre2syntax.html \
|
doc/html/pcre2syntax.html \
|
||||||
doc/html/pcre2test.html \
|
doc/html/pcre2test.html \
|
||||||
doc/html/pcre2unicode.html
|
doc/html/pcre2unicode.html
|
||||||
|
@ -187,7 +186,6 @@ dist_man_MANS = \
|
||||||
doc/pcre2posix.3 \
|
doc/pcre2posix.3 \
|
||||||
doc/pcre2sample.3 \
|
doc/pcre2sample.3 \
|
||||||
doc/pcre2serialize.3 \
|
doc/pcre2serialize.3 \
|
||||||
doc/pcre2stack.3 \
|
|
||||||
doc/pcre2syntax.3 \
|
doc/pcre2syntax.3 \
|
||||||
doc/pcre2test.1 \
|
doc/pcre2test.1 \
|
||||||
doc/pcre2unicode.3
|
doc/pcre2unicode.3
|
||||||
|
|
|
@ -68,9 +68,6 @@ first.
|
||||||
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
|
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
|
||||||
<td> Serializing functions for saving precompiled patterns</td></tr>
|
<td> Serializing functions for saving precompiled patterns</td></tr>
|
||||||
|
|
||||||
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
|
|
||||||
<td> Discussion of PCRE2's stack usage</td></tr>
|
|
||||||
|
|
||||||
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
|
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
|
||||||
<td> Syntax quick-reference summary</td></tr>
|
<td> Syntax quick-reference summary</td></tr>
|
||||||
|
|
||||||
|
|
|
@ -18,7 +18,8 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
||||||
<P>
|
<P>
|
||||||
This document describes the differences in the ways that PCRE2 and Perl handle
|
This document describes the differences in the ways that PCRE2 and Perl handle
|
||||||
regular expressions. The differences described here are with respect to Perl
|
regular expressions. The differences described here are with respect to Perl
|
||||||
versions 5.10 and above.
|
version 5.24, but as both Perl and PCRE2 are continually changing, the
|
||||||
|
information may sometimes be out of date.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
|
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
|
||||||
|
@ -27,17 +28,18 @@ have are given in the
|
||||||
page.
|
page.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
|
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
|
||||||
do not mean what you might think. For example, (?!a){3} does not assert that
|
they do not mean what you might think. For example, (?!a){3} does not assert
|
||||||
the next three characters are not "a". It just asserts that the next character
|
that the next three characters are not "a". It just asserts that the next
|
||||||
is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
|
character is not "a" three times (in principle: PCRE2 optimizes this to run the
|
||||||
just once). Perl allows repeat quantifiers on other assertions such as \b, but
|
assertion just once). Perl allows some repeat quantifiers on other assertions,
|
||||||
these do not seem to have any use.
|
for example, \b* (but not \b{3}), but these do not seem to have any use.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
3. Capturing subpatterns that occur inside negative lookahead assertions are
|
3. Capturing subpatterns that occur inside negative lookaround assertions are
|
||||||
counted, but their entries in the offsets vector are never set. Perl sometimes
|
counted, but their entries in the offsets vector are set only if the assertion
|
||||||
(but not always) sets its numerical variables from inside negative assertions.
|
is a condition. Perl has changed its behaviour in this regard from time to
|
||||||
|
time.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
||||||
|
@ -50,13 +52,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
|
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
|
||||||
built with Unicode support. The properties that can be tested with \p and \P
|
built with Unicode support (the default). The properties that can be tested
|
||||||
are limited to the general category properties such as Lu and Nd, script names
|
with \p and \P are limited to the general category properties such as Lu and
|
||||||
such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
|
Nd, script names such as Greek or Han, and the derived properties Any and L&.
|
||||||
the Cs (surrogate) property, which Perl does not; the Perl documentation says
|
PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
|
||||||
"Because Perl hides the need for the user to understand the internal
|
documentation says "Because Perl hides the need for the user to understand the
|
||||||
representation of Unicode characters, there is no need to implement the
|
internal representation of Unicode characters, there is no need to implement
|
||||||
somewhat messy concept of surrogates."
|
the somewhat messy concept of surrogates."
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
|
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
|
||||||
|
@ -75,23 +77,15 @@ The \Q...\E sequence is recognized both inside and outside character classes.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
|
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
|
||||||
constructions. However, there is support for recursive patterns. This is not
|
constructions. However, PCRE2 does have a "callout" feature, which
|
||||||
available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
|
allows an external function to be called during pattern matching. See the
|
||||||
feature allows an external function to be called during pattern matching. See
|
|
||||||
the
|
|
||||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||||
documentation for details.
|
documentation for details.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
8. Subroutine calls (whether recursive or not) are treated as atomic groups.
|
8. Subroutine calls (whether recursive or not) were treated as atomic groups up
|
||||||
Atomic recursion is like Python, but unlike Perl. Captured values that are set
|
to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
|
||||||
outside a subroutine call can be referenced from inside in PCRE2, but not in
|
into subroutine calls is now supported, as in Perl.
|
||||||
Perl. There is a discussion that explains these differences in more detail in
|
|
||||||
the
|
|
||||||
<a href="pcre2pattern.html#recursiondifference">section on recursion differences from Perl</a>
|
|
||||||
in the
|
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
|
||||||
page.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
9. If any of the backtracking control verbs are used in a subpattern that is
|
9. If any of the backtracking control verbs are used in a subpattern that is
|
||||||
|
@ -147,14 +141,14 @@ certainly user mistakes.
|
||||||
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
|
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
|
||||||
affected when case-independent matching is specified. For example, \p{Lu}
|
affected when case-independent matching is specified. For example, \p{Lu}
|
||||||
always matches an upper case letter. I think Perl has changed in this respect;
|
always matches an upper case letter. I think Perl has changed in this respect;
|
||||||
in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
|
in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
|
||||||
letters, regardless of case, when case independence is specified.
|
letters, regardless of case, when case independence is specified.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
17. PCRE2 provides some extensions to the Perl regular expression facilities.
|
17. PCRE2 provides some extensions to the Perl regular expression facilities.
|
||||||
Perl 5.10 includes new features that are not in earlier versions of Perl, some
|
Perl 5.10 includes new features that are not in earlier versions of Perl, some
|
||||||
of which (such as named parentheses) have been in PCRE2 for some time. This
|
of which (such as named parentheses) were in PCRE2 for some time before. This
|
||||||
list is with respect to Perl 5.10:
|
list is with respect to Perl 5.24:
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
|
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
|
||||||
|
@ -220,9 +214,9 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 18 October 2016
|
Last updated: 29 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -173,7 +173,7 @@ below for a discussion of JIT stack usage.
|
||||||
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
|
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
|
||||||
a very large pattern tree goes on for too long, as it is in the same
|
a very large pattern tree goes on for too long, as it is in the same
|
||||||
circumstance when JIT is not used, but the details of exactly what is counted
|
circumstance when JIT is not used, but the details of exactly what is counted
|
||||||
are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
|
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
|
||||||
when JIT matching is used.
|
when JIT matching is used.
|
||||||
<a name="stackcontrol"></a></P>
|
<a name="stackcontrol"></a></P>
|
||||||
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
||||||
|
@ -436,9 +436,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 05 June 2016
|
Last updated: 30 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -44,14 +44,6 @@ integer type, usually defined as size_t. Its maximum value (that is
|
||||||
and unset offsets.
|
and unset offsets.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Note that when using the traditional matching function, PCRE2 uses recursion to
|
|
||||||
handle subpatterns and indefinite repetition. This means that the available
|
|
||||||
stack space may limit the size of a subject string that can be processed by
|
|
||||||
certain patterns. For a discussion of stack issues, see the
|
|
||||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
|
||||||
documentation.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
All values in repeating quantifiers must be less than 65536.
|
All values in repeating quantifiers must be less than 65536.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -94,9 +86,9 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 26 October 2016
|
Last updated: 30 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2016 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<ul>
|
<ul>
|
||||||
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
|
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
|
||||||
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
|
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
|
||||||
<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
|
<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
|
||||||
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
|
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
|
||||||
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
||||||
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
||||||
|
@ -29,11 +29,11 @@ of them.
|
||||||
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
|
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
|
||||||
<P>
|
<P>
|
||||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||||
so that most simple patterns do not use much memory. However, there is one case
|
so that most simple patterns do not use much memory for storing the compiled
|
||||||
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
version. However, there is one case where the memory usage of a compiled
|
||||||
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
pattern can be unexpectedly large. If a parenthesized subpattern has a
|
||||||
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
quantifier with a minimum greater than 1 and/or a limited maximum, the whole
|
||||||
example, the pattern
|
subpattern is repeated in the compiled code. For example, the pattern
|
||||||
<pre>
|
<pre>
|
||||||
(abc|def){2,4}
|
(abc|def){2,4}
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -52,13 +52,13 @@ example, the very simple pattern
|
||||||
<pre>
|
<pre>
|
||||||
((ab){1,1000}c){1,3}
|
((ab){1,1000}c){1,3}
|
||||||
</pre>
|
</pre>
|
||||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
|
||||||
with its default internal pointer size of two bytes, the size limit on a
|
compiled with its default internal pointer size of two bytes, the size limit on
|
||||||
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
|
||||||
is reached with the above pattern if the outer repetition is increased from 3
|
this is reached with the above pattern if the outer repetition is increased
|
||||||
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
|
||||||
larger compiled patterns, but it is better to try to rewrite your pattern to
|
handle larger compiled patterns, but it is better to try to rewrite your
|
||||||
use less memory if you can.
|
pattern to use less memory if you can.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
One way of reducing the memory usage for such patterns is to make use of
|
One way of reducing the memory usage for such patterns is to make use of
|
||||||
|
@ -68,25 +68,33 @@ facility. Re-writing the above pattern as
|
||||||
<pre>
|
<pre>
|
||||||
((ab)(?2){0,999}c)(?1){0,2}
|
((ab)(?2){0,999}c)(?1){0,2}
|
||||||
</pre>
|
</pre>
|
||||||
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
reduces the memory requirements to around 16K, and indeed it remains under 20K
|
||||||
with the outer repetition increased to 100. However, this pattern is not
|
even with the outer repetition increased to 100. However, this kind of pattern
|
||||||
exactly equivalent, because the "subroutine" calls are treated as
|
is not always exactly equivalent, because any captures within subroutine calls
|
||||||
<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
|
are lost when the subroutine completes. If this is not a problem, this kind of
|
||||||
into which there can be no backtracking if there is a subsequent matching
|
rewriting will allow you to process patterns that PCRE2 cannot otherwise
|
||||||
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
handle. The matching performance of the two different versions of the pattern
|
||||||
Furthermore, there is a noticeable loss of speed when executing the modified
|
is roughly the same. (This applies from release 10.30 - things were different
|
||||||
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
in earlier releases.)
|
||||||
speed is acceptable, this kind of rewriting will allow you to process patterns
|
|
||||||
that PCRE2 cannot otherwise handle.
|
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
|
<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
|
||||||
<P>
|
<P>
|
||||||
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
|
From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
|
||||||
cause it to use large amounts of the process stack. In some environments the
|
uses very little system stack at run time. In earlier releases recursive
|
||||||
default process stack is quite small, and if it runs out the result is often
|
function calls could use a great deal of stack, and this could cause problems,
|
||||||
SIGSEGV. Rewriting your pattern can often help. The
|
but this usage has been eliminated. Backtracking positions are now explicitly
|
||||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
remembered in memory frames controlled by the code. An initial 10K vector of
|
||||||
documentation discusses this issue in detail.
|
frames is allocated on the system stack (enough for about 50 frames for small
|
||||||
|
patterns), but if this is insufficient, heap memory is used. Rewriting patterns
|
||||||
|
to be time-efficient, as described below, may also reduce the memory
|
||||||
|
requirements.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
|
||||||
|
function calls, but only for processing atomic groups, lookaround assertions,
|
||||||
|
and recursion within the pattern. Too much nested recursion may cause stack
|
||||||
|
issues. The "match depth" parameter can be used to limit the depth of function
|
||||||
|
recursion in <b>pcre2_dfa_match()</b>.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
|
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -175,7 +183,54 @@ appreciable time with strings longer than about 20 characters.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In many cases, the solution to this kind of performance issue is to use an
|
In many cases, the solution to this kind of performance issue is to use an
|
||||||
atomic group or a possessive quantifier.
|
atomic group or a possessive quantifier. This can often reduce memory
|
||||||
|
requirements as well. As another example, consider this pattern:
|
||||||
|
<pre>
|
||||||
|
([^<]|<(?!inet))+
|
||||||
|
</pre>
|
||||||
|
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||||
|
the data, and is the kind of pattern that might be used when processing an XML
|
||||||
|
file. Each iteration of the outer parentheses matches either one character that
|
||||||
|
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||||
|
parenthesis is processed, a backtracking position is passed, so this
|
||||||
|
formulation uses a memory frame for each matched character. For a long string,
|
||||||
|
a lot of memory is required. Consider now this rewritten pattern, which matches
|
||||||
|
exactly the same strings:
|
||||||
|
<pre>
|
||||||
|
([^<]++|<(?!inet))+
|
||||||
|
</pre>
|
||||||
|
This runs much faster, because sequences of characters that do not contain "<"
|
||||||
|
are "swallowed" in one item inside the parentheses, and a possessive quantifier
|
||||||
|
is used to stop any backtracking into the runs of non-"<" characters. This
|
||||||
|
version also uses a lot less memory because entry to a new set of parentheses
|
||||||
|
happens only when a "<" character that is not followed by "inet" is encountered
|
||||||
|
(and we assume this is relatively rare).
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
This example shows that one way of optimizing performance when matching long
|
||||||
|
subject strings is to write repeated parenthesized subpatterns to match more
|
||||||
|
than one character whenever possible.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
SETTING RESOURCE LIMITS
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
You can set limits on the amount of processing that takes place when matching,
|
||||||
|
and on the amount of heap memory that is used. The default values of the limits
|
||||||
|
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
|
||||||
|
built, and they can also be set when <b>pcre2_match()</b> or
|
||||||
|
<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
|
||||||
|
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||||
|
documentation and the section entitled
|
||||||
|
<a href="pcre2api.html#matchcontext">"The match context"</a>
|
||||||
|
in the
|
||||||
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
|
documentation.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
|
||||||
|
applied to a subject line, causes it to find the smallest limits that allow a
|
||||||
|
pattern to match. This is done by repeatedly matching with different limits.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -188,9 +243,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 02 January 2015
|
Last updated: 31 March 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2015 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
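The pcre2perform text added in the diff above says that ([^<]|<(?!inet))+ passes a backtracking position for every matched character, while ([^<]++|<(?!inet))+ enters the parentheses only once per run of non-"<" characters or per lone "<". The sketch below is a counting model of that claim, not a measurement of PCRE2 itself. It assumes Python's re module as a stand-in engine for the first pattern, and the function names (matched_prefix, iterations_single_char, iterations_run_based) are hypothetical helpers invented for this illustration.

```python
import re

# The first pattern quoted in the documentation. Python's re module is used
# here as a stand-in engine; this is an assumption, not PCRE2 itself.
SINGLE_CHAR = re.compile(r"([^<]|<(?!inet))+")


def matched_prefix(subject: str) -> str:
    """Return the text the pattern matches at the start of subject."""
    m = SINGLE_CHAR.match(subject)
    return m.group(0) if m else ""


def iterations_single_char(subject: str) -> int:
    """Outer-group iterations for ([^<]|<(?!inet))+ : one per character."""
    return len(matched_prefix(subject))


def iterations_run_based(subject: str) -> int:
    """Outer-group iterations for ([^<]++|<(?!inet))+ : one per maximal run
    of non-"<" characters, plus one per "<" inside the matched text."""
    return len(re.findall(r"[^<]+|<", matched_prefix(subject)))


if __name__ == "__main__":
    for subject in ("abc<b>def<inet>x", "x" * 1000):
        print(len(subject),
              iterations_single_char(subject),
              iterations_run_based(subject))
```

Both formulations match the same text, so the model derives both counts from a single match. The gap between the two counts grows with the length of the runs that contain no "<", which is the case the documentation assumes is common.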
@ -68,9 +68,6 @@ first.
|
||||||
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
|
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
|
||||||
<td> Serializing functions for saving precompiled patterns</td></tr>
|
<td> Serializing functions for saving precompiled patterns</td></tr>
|
||||||
|
|
||||||
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
|
|
||||||
<td> Discussion of PCRE2's stack usage</td></tr>
|
|
||||||
|
|
||||||
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
|
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
|
||||||
<td> Syntax quick-reference summary</td></tr>
|
<td> Syntax quick-reference summary</td></tr>
|
||||||
|
|
||||||
|
|
311
doc/pcre2.txt
|
@ -4097,45 +4097,46 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
||||||
|
|
||||||
This document describes the differences in the ways that PCRE2 and Perl
|
This document describes the differences in the ways that PCRE2 and Perl
|
||||||
handle regular expressions. The differences described here are with
|
handle regular expressions. The differences described here are with
|
||||||
respect to Perl versions 5.10 and above.
|
respect to Perl version 5.24, but as both Perl and PCRE2 are continually
|
||||||
|
changing, the information may sometimes be out of date.
|
||||||
|
|
||||||
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
|
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
|
||||||
it does have are given in the pcre2unicode page.
|
it does have are given in the pcre2unicode page.
|
||||||
|
|
||||||
2. PCRE2 allows repeat quantifiers only on parenthesized assertions,
|
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
|
||||||
but they do not mean what you might think. For example, (?!a){3} does
|
tions, but they do not mean what you might think. For example, (?!a){3}
|
||||||
not assert that the next three characters are not "a". It just asserts
|
does not assert that the next three characters are not "a". It just
|
||||||
that the next character is not "a" three times (in principle: PCRE2
|
asserts that the next character is not "a" three times (in principle:
|
||||||
optimizes this to run the assertion just once). Perl allows repeat
|
PCRE2 optimizes this to run the assertion just once). Perl allows some
|
||||||
quantifiers on other assertions such as \b, but these do not seem to
|
repeat quantifiers on other assertions, for example, \b* (but not
|
||||||
have any use.
|
\b{3}), but these do not seem to have any use.
|
||||||
|
|
||||||
3. Capturing subpatterns that occur inside negative lookahead asser-
|
3. Capturing subpatterns that occur inside negative lookaround asser-
|
||||||
tions are counted, but their entries in the offsets vector are never
|
tions are counted, but their entries in the offsets vector are set only
|
||||||
set. Perl sometimes (but not always) sets its numerical variables from
|
if the assertion is a condition. Perl has changed its behaviour in this
|
||||||
inside negative assertions.
|
regard from time to time.
|
||||||
|
|
||||||
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
||||||
\U, and \N when followed by a character name or Unicode value. (\N on
|
\U, and \N when followed by a character name or Unicode value. (\N on
|
||||||
its own, matching a non-newline character, is supported.) In fact these
|
its own, matching a non-newline character, is supported.) In fact these
|
||||||
are implemented by Perl's general string-handling and are not part of
|
are implemented by Perl's general string-handling and are not part of
|
||||||
its pattern matching engine. If any of these are encountered by PCRE2,
|
its pattern matching engine. If any of these are encountered by PCRE2,
|
||||||
an error is generated by default. However, if the PCRE2_ALT_BSUX option
|
an error is generated by default. However, if the PCRE2_ALT_BSUX option
|
||||||
is set, \U and \u are interpreted as ECMAScript interprets them.
|
is set, \U and \u are interpreted as ECMAScript interprets them.
|
||||||
|
|
||||||
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
|
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
|
||||||
is built with Unicode support. The properties that can be tested with
|
is built with Unicode support (the default). The properties that can be
|
||||||
\p and \P are limited to the general category properties such as Lu and
|
tested with \p and \P are limited to the general category properties
|
||||||
Nd, script names such as Greek or Han, and the derived properties Any
|
such as Lu and Nd, script names such as Greek or Han, and the derived
|
||||||
and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
|
properties Any and L&. PCRE2 does support the Cs (surrogate) property,
|
||||||
not; the Perl documentation says "Because Perl hides the need for the
|
which Perl does not; the Perl documentation says "Because Perl hides
|
||||||
user to understand the internal representation of Unicode characters,
|
the need for the user to understand the internal representation of Uni-
|
||||||
there is no need to implement the somewhat messy concept of surro-
|
code characters, there is no need to implement the somewhat messy con-
|
||||||
gates."
|
cept of surrogates."
|
||||||
|
|
||||||
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
|
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
|
||||||
acters in between are treated as literals. This is slightly different
|
acters in between are treated as literals. This is slightly different
|
||||||
from Perl in that $ and @ are also handled as literals inside the
|
from Perl in that $ and @ are also handled as literals inside the
|
||||||
quotes. In Perl, they cause variable interpolation (but of course PCRE2
|
quotes. In Perl, they cause variable interpolation (but of course PCRE2
|
||||||
does not have variables). Note the following examples:
|
does not have variables). Note the following examples:
|
||||||
|
|
||||||
|
@ -4146,22 +4147,17 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
||||||
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
||||||
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
||||||
|
|
||||||
The \Q...\E sequence is recognized both inside and outside character
|
The \Q...\E sequence is recognized both inside and outside character
|
||||||
classes.
|
classes.
|
||||||
|
|
||||||
7. Fairly obviously, PCRE2 does not support the (?{code}) and
|
7. Fairly obviously, PCRE2 does not support the (?{code}) and
|
||||||
(??{code}) constructions. However, there is support for recursive pat-
|
(??{code}) constructions. However, PCRE2 does have a "callout"
|
||||||
terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
|
feature, which allows an external function to be called during pattern
|
||||||
the PCRE2 "callout" feature allows an external function to be called
|
matching. See the pcre2callout documentation for details.
|
||||||
during pattern matching. See the pcre2callout documentation for
|
|
||||||
details.
|
|
||||||
|
|
||||||
8. Subroutine calls (whether recursive or not) are treated as atomic
|
8. Subroutine calls (whether recursive or not) were treated as atomic
|
||||||
groups. Atomic recursion is like Python, but unlike Perl. Captured
|
groups up to PCRE2 release 10.23, but from release 10.30 this changed,
|
||||||
values that are set outside a subroutine call can be referenced from
|
and backtracking into subroutine calls is now supported, as in Perl.
|
||||||
inside in PCRE2, but not in Perl. There is a discussion that explains
|
|
||||||
these differences in more detail in the section on recursion differ-
|
|
||||||
ences from Perl in the pcre2pattern page.
|
|
||||||
|
|
||||||
9. If any of the backtracking control verbs are used in a subpattern
|
9. If any of the backtracking control verbs are used in a subpattern
|
||||||
that is called as a subroutine (whether or not recursively), their
|
that is called as a subroutine (whether or not recursively), their
|
||||||
|
@ -4211,14 +4207,14 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
||||||
16. In PCRE2, the upper/lower case character properties Lu and Ll are
|
16. In PCRE2, the upper/lower case character properties Lu and Ll are
|
||||||
not affected when case-independent matching is specified. For example,
|
not affected when case-independent matching is specified. For example,
|
||||||
\p{Lu} always matches an upper case letter. I think Perl has changed in
|
\p{Lu} always matches an upper case letter. I think Perl has changed in
|
||||||
this respect; in the release at the time of writing (5.16), \p{Lu} and
|
this respect; in the release at the time of writing (5.24), \p{Lu} and
|
||||||
\p{Ll} match all letters, regardless of case, when case independence is
|
\p{Ll} match all letters, regardless of case, when case independence is
|
||||||
specified.
|
specified.
|
||||||
|
|
||||||
17. PCRE2 provides some extensions to the Perl regular expression
|
17. PCRE2 provides some extensions to the Perl regular expression
|
||||||
facilities. Perl 5.10 includes new features that are not in earlier
|
facilities. Perl 5.10 includes new features that are not in earlier
|
||||||
versions of Perl, some of which (such as named parentheses) have been
|
versions of Perl, some of which (such as named parentheses) were in
|
||||||
in PCRE2 for some time. This list is with respect to Perl 5.10:
|
PCRE2 for some time before. This list is with respect to Perl 5.24:
|
||||||
|
|
||||||
(a) Although lookbehind assertions in PCRE2 must match fixed length
|
(a) Although lookbehind assertions in PCRE2 must match fixed length
|
||||||
strings, each alternative branch of a lookbehind assertion can match a
|
strings, each alternative branch of a lookbehind assertion can match a
|
||||||
|
@ -4271,8 +4267,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 18 October 2016
|
Last updated: 29 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -4420,8 +4416,8 @@ RETURN VALUES FROM JIT MATCHING
|
||||||
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
|
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
|
||||||
searching a very large pattern tree goes on for too long, as it is in
|
searching a very large pattern tree goes on for too long, as it is in
|
||||||
the same circumstance when JIT is not used, but the details of exactly
|
the same circumstance when JIT is not used, but the details of exactly
|
||||||
what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
|
what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
|
||||||
code is never returned when JIT matching is used.
|
is never returned when JIT matching is used.
|
||||||
|
|
||||||
|
|
||||||
CONTROLLING THE JIT STACK
|
CONTROLLING THE JIT STACK
|
||||||
|
@ -4668,8 +4664,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 05 June 2016
|
Last updated: 30 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -4706,12 +4702,6 @@ SIZE AND OTHER LIMITATIONS
|
||||||
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
|
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
|
||||||
terminated strings and unset offsets.
|
terminated strings and unset offsets.
|
||||||
|
|
||||||
Note that when using the traditional matching function, PCRE2 uses
|
|
||||||
recursion to handle subpatterns and indefinite repetition. This means
|
|
||||||
that the available stack space may limit the size of a subject string
|
|
||||||
that can be processed by certain patterns. For a discussion of stack
|
|
||||||
issues, see the pcre2stack documentation.
|
|
||||||
|
|
||||||
All values in repeating quantifiers must be less than 65536.
|
All values in repeating quantifiers must be less than 65536.
|
||||||
|
|
||||||
The maximum length of a lookbehind assertion is 65535 characters.
|
The maximum length of a lookbehind assertion is 65535 characters.
|
||||||
|
@ -4745,8 +4735,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 26 October 2016
|
Last updated: 30 March 2017
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -8485,11 +8475,12 @@ PCRE2 PERFORMANCE
|
||||||
COMPILED PATTERN MEMORY USAGE
|
COMPILED PATTERN MEMORY USAGE
|
||||||
|
|
||||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive
|
Patterns are compiled by PCRE2 into a reasonably efficient interpretive
|
||||||
code, so that most simple patterns do not use much memory. However,
|
code, so that most simple patterns do not use much memory for storing
|
||||||
there is one case where the memory usage of a compiled pattern can be
|
the compiled version. However, there is one case where the memory usage
|
||||||
unexpectedly large. If a parenthesized subpattern has a quantifier with
|
of a compiled pattern can be unexpectedly large. If a parenthesized
|
||||||
a minimum greater than 1 and/or a limited maximum, the whole subpattern
|
subpattern has a quantifier with a minimum greater than 1 and/or a lim-
|
||||||
is repeated in the compiled code. For example, the pattern
|
ited maximum, the whole subpattern is repeated in the compiled code.
|
||||||
|
For example, the pattern
|
||||||
|
|
||||||
(abc|def){2,4}
|
(abc|def){2,4}
|
||||||
|
|
||||||
|
@ -8497,134 +8488,186 @@ COMPILED PATTERN MEMORY USAGE
|
||||||
|
|
||||||
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
||||||
|
|
||||||
(Technical aside: It is done this way so that backtrack points within
|
(Technical aside: It is done this way so that backtrack points within
|
||||||
each of the repetitions can be independently maintained.)
|
each of the repetitions can be independently maintained.)
|
||||||
|
|
||||||
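The expansion described above can be sketched in a few lines. This is only an illustration of the textual rewriting, not PCRE2's compiler: the function name expand_bounded_repeat is hypothetical, and it assumes the group passed in is already a single parenthesized item.

```python
def expand_bounded_repeat(group: str, minimum: int, maximum: int) -> str:
    """Spell out group{minimum,maximum} as explicit copies of group.

    Assumption: `group` is one parenthesized item such as "(abc|def)".
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    optional = ""
    for _ in range(maximum - minimum):
        # Each extra copy wraps the optional tail built so far, so that
        # every repetition keeps its own backtrack point.
        optional = f"({group}{optional})?" if optional else f"{group}?"
    return group * minimum + optional


print(expand_bounded_repeat("(abc|def)", 2, 4))
```

The length of the result grows with the maximum, and nesting one such repeat inside another multiplies the number of copies, which is why nested large repeats compile to so much code.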
For regular expressions whose quantifiers use only small numbers, this
|
For regular expressions whose quantifiers use only small numbers, this
|
||||||
is not usually a problem. However, if the numbers are large, and par-
|
is not usually a problem. However, if the numbers are large, and par-
|
||||||
ticularly if such repetitions are nested, the memory usage can become
|
ticularly if such repetitions are nested, the memory usage can become
|
||||||
an embarrassment. For example, the very simple pattern
|
an embarrassment. For example, the very simple pattern
|
||||||
|
|
||||||
((ab){1,1000}c){1,3}
|
((ab){1,1000}c){1,3}
|
||||||
|
|
||||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is
|
uses over 50K bytes when compiled using the 8-bit library. When PCRE2
|
||||||
compiled with its default internal pointer size of two bytes, the size
|
is compiled with its default internal pointer size of two bytes, the
|
||||||
limit on a compiled pattern is 64K code units in the 8-bit and 16-bit
|
size limit on a compiled pattern is 64K code units in the 8-bit and
|
||||||
libraries, and this is reached with the above pattern if the outer rep-
|
16-bit libraries, and this is reached with the above pattern if the
|
||||||
etition is increased from 3 to 4. PCRE2 can be compiled to use larger
|
outer repetition is increased from 3 to 4. PCRE2 can be compiled to use
|
||||||
internal pointers and thus handle larger compiled patterns, but it is
|
larger internal pointers and thus handle larger compiled patterns, but
|
||||||
better to try to rewrite your pattern to use less memory if you can.
|
it is better to try to rewrite your pattern to use less memory if you
|
||||||
|
can.
|
||||||
|
|
||||||
One way of reducing the memory usage for such patterns is to make use
|
One way of reducing the memory usage for such patterns is to make use
|
||||||
of PCRE2's "subroutine" facility. Re-writing the above pattern as
|
of PCRE2's "subroutine" facility. Re-writing the above pattern as
|
||||||
|
|
||||||
((ab)(?2){0,999}c)(?1){0,2}
|
((ab)(?2){0,999}c)(?1){0,2}
|
||||||
|
|
||||||
reduces the memory requirements to 18K, and indeed it remains under 20K
|
reduces the memory requirements to around 16K, and indeed it remains
|
||||||
even with the outer repetition increased to 100. However, this pattern
|
under 20K even with the outer repetition increased to 100. However,
|
||||||
is not exactly equivalent, because the "subroutine" calls are treated
|
this kind of pattern is not always exactly equivalent, because any cap-
|
||||||
as atomic groups into which there can be no backtracking if there is a
|
tures within subroutine calls are lost when the subroutine completes.
|
||||||
subsequent matching failure. Therefore, PCRE2 cannot do this kind of
|
If this is not a problem, this kind of rewriting will allow you to
|
||||||
rewriting automatically. Furthermore, there is a noticeable loss of
|
process patterns that PCRE2 cannot otherwise handle. The matching
|
||||||
speed when executing the modified pattern. Nevertheless, if the atomic
|
performance of the two different versions of the pattern is roughly the
|
||||||
grouping is not a problem and the loss of speed is acceptable, this
|
same. (This applies from release 10.30 - things were different in ear-
|
||||||
kind of rewriting will allow you to process patterns that PCRE2 cannot
|
lier releases.)
|
||||||
otherwise handle.
|
|
||||||
|
|
||||||
|
|
||||||
STACK USAGE AT RUN TIME
|
STACK AND HEAP USAGE AT RUN TIME
|
||||||
|
|
||||||
When pcre2_match() is used for matching, certain kinds of pattern can
|
From release 10.30, the interpretive (non-JIT) version of pcre2_match()
|
||||||
cause it to use large amounts of the process stack. In some environ-
|
uses very little system stack at run time. In earlier releases recur-
|
||||||
ments the default process stack is quite small, and if it runs out the
|
sive function calls could use a great deal of stack, and this could
|
||||||
result is often SIGSEGV. Rewriting your pattern can often help. The
|
cause problems, but this usage has been eliminated. Backtracking posi-
|
||||||
pcre2stack documentation discusses this issue in detail.
|
tions are now explicitly remembered in memory frames controlled by the
|
||||||
|
code. An initial 10K vector of frames is allocated on the system stack
|
||||||
|
(enough for about 50 frames for small patterns), but if this is insuf-
|
||||||
|
ficient, heap memory is used. Rewriting patterns to be time-efficient,
|
||||||
|
as described below, may also reduce the memory requirements.
|
||||||
|
|
||||||
|
In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
|
||||||
|
function calls, but only for processing atomic groups, lookaround
|
||||||
|
assertions, and recursion within the pattern. Too much nested recursion
|
||||||
|
may cause stack issues. The "match depth" parameter can be used to
|
||||||
|
limit the depth of function recursion in pcre2_dfa_match().
|
||||||
|
|
||||||
|
|
||||||
PROCESSING TIME
|
PROCESSING TIME
|
||||||
|
|
||||||
Certain items in regular expression patterns are processed more effi-
|
Certain items in regular expression patterns are processed more effi-
|
||||||
ciently than others. It is more efficient to use a character class like
|
ciently than others. It is more efficient to use a character class like
|
||||||
[aeiou] than a set of single-character alternatives such as
|
[aeiou] than a set of single-character alternatives such as
|
||||||
(a|e|i|o|u). In general, the simplest construction that provides the
|
(a|e|i|o|u). In general, the simplest construction that provides the
|
||||||
required behaviour is usually the most efficient. Jeffrey Friedl's book
|
required behaviour is usually the most efficient. Jeffrey Friedl's book
|
||||||
contains a lot of useful general discussion about optimizing regular
|
contains a lot of useful general discussion about optimizing regular
|
||||||
expressions for efficient performance. This document contains a few
|
expressions for efficient performance. This document contains a few
|
||||||
observations about PCRE2.
|
observations about PCRE2.
|
||||||
|
|
||||||
Using Unicode character properties (the \p, \P, and \X escapes) is
|
Using Unicode character properties (the \p, \P, and \X escapes) is
|
||||||
slow, because PCRE2 has to use a multi-stage table lookup whenever it
|
slow, because PCRE2 has to use a multi-stage table lookup whenever it
|
||||||
needs a character's property. If you can find an alternative pattern
|
needs a character's property. If you can find an alternative pattern
|
||||||
that does not use character properties, it will probably be faster.
|
that does not use character properties, it will probably be faster.
|
||||||
|
|
||||||
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
||||||
character classes such as [:alpha:] do not use Unicode properties,
|
character classes such as [:alpha:] do not use Unicode properties,
|
||||||
partly for backwards compatibility, and partly for performance reasons.
|
partly for backwards compatibility, and partly for performance reasons.
|
||||||
However, you can set the PCRE2_UCP option or start the pattern with
|
However, you can set the PCRE2_UCP option or start the pattern with
|
||||||
(*UCP) if you want Unicode character properties to be used. This can
|
(*UCP) if you want Unicode character properties to be used. This can
|
||||||
double the matching time for items such as \d, when matched with
|
double the matching time for items such as \d, when matched with
|
||||||
pcre2_match(); the performance loss is less with a DFA matching func-
|
pcre2_match(); the performance loss is less with a DFA matching func-
|
||||||
tion, and in both cases there is not much difference for \b.
|
tion, and in both cases there is not much difference for \b.
|
||||||
|
|
||||||
When a pattern begins with .* not in atomic parentheses, nor in paren-
|
When a pattern begins with .* not in atomic parentheses, nor in paren-
|
||||||
theses that are the subject of a backreference, and the PCRE2_DOTALL
|
theses that are the subject of a backreference, and the PCRE2_DOTALL
|
||||||
option is set, the pattern is implicitly anchored by PCRE2, since it
|
option is set, the pattern is implicitly anchored by PCRE2, since it
|
||||||
can match only at the start of a subject string. If the pattern has
|
can match only at the start of a subject string. If the pattern has
|
||||||
multiple top-level branches, they must all be anchorable. The optimiza-
|
multiple top-level branches, they must all be anchorable. The optimiza-
|
||||||
tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
|
tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
|
||||||
automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
|
automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
|
||||||
|
|
||||||
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
|
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
|
||||||
because the dot metacharacter does not then match a newline, and if the
|
because the dot metacharacter does not then match a newline, and if the
|
||||||
subject string contains newlines, the pattern may match from the char-
|
subject string contains newlines, the pattern may match from the char-
|
||||||
acter immediately following one of them instead of from the very start.
|
acter immediately following one of them instead of from the very start.
|
||||||
For example, the pattern
|
For example, the pattern
|
||||||
|
|
||||||
.*second
|
.*second
|
||||||
|
|
||||||
matches the subject "first\nand second" (where \n stands for a newline
|
matches the subject "first\nand second" (where \n stands for a newline
|
||||||
character), with the match starting at the seventh character. In order
|
character), with the match starting at the seventh character. In order
|
||||||
to do this, PCRE2 has to retry the match starting after every newline
|
to do this, PCRE2 has to retry the match starting after every newline
|
||||||
in the subject.
|
in the subject.
|
||||||
|
|
||||||
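The effect of the dot not matching a newline can be observed with a small experiment. The sketch below uses Python's re module as a stand-in, on the assumption that its default dot behaviour and its DOTALL flag correspond to PCRE2's default and PCRE2_DOTALL for this simple pattern; the helper name match_start is hypothetical.

```python
import re


def match_start(pattern: str, subject: str, flags: int = 0) -> int:
    """Return the 0-based offset at which the first match starts."""
    found = re.search(pattern, subject, flags)
    if found is None:
        raise ValueError("no match")
    return found.start()


subject = "first\nand second"
# Without DOTALL the match cannot cross the newline, so it starts after it.
print(match_start(r".*second", subject))
# With DOTALL the dot matches the newline and the match starts at offset 0.
print(match_start(r".*second", subject, re.DOTALL))
```

Offsets here are 0-based, so an offset of 6 corresponds to the seventh character mentioned in the text.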
If you are using such a pattern with subject strings that do not con-
|
If you are using such a pattern with subject strings that do not con-
|
||||||
tain newlines, the best performance is obtained by setting
|
tain newlines, the best performance is obtained by setting
|
||||||
PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
|
PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
|
||||||
explicit anchoring. That saves PCRE2 from having to scan along the sub-
|
explicit anchoring. That saves PCRE2 from having to scan along the sub-
|
||||||
ject looking for a newline to restart at.
|
ject looking for a newline to restart at.
|
||||||
|
|
||||||
Beware of patterns that contain nested indefinite repeats. These can
|
Beware of patterns that contain nested indefinite repeats. These can
|
||||||
take a long time to run when applied to a string that does not match.
|
take a long time to run when applied to a string that does not match.
|
||||||
Consider the pattern fragment
|
Consider the pattern fragment
|
||||||
|
|
||||||
^(a+)*
|
^(a+)*
|
||||||
|
|
||||||
This can match "aaaa" in 16 different ways, and this number increases
|
This can match "aaaa" in 16 different ways, and this number increases
|
||||||
very rapidly as the string gets longer. (The * repeat can match 0, 1,
|
very rapidly as the string gets longer. (The * repeat can match 0, 1,
|
||||||
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
|
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
|
||||||
repeats can match different numbers of times.) When the remainder of
|
repeats can match different numbers of times.) When the remainder of
|
||||||
the pattern is such that the entire match is going to fail, PCRE2 has
|
the pattern is such that the entire match is going to fail, PCRE2 has
|
||||||
in principle to try every possible variation, and this can take an
|
in principle to try every possible variation, and this can take an
|
||||||
extremely long time, even for relatively short strings.
|
extremely long time, even for relatively short strings.
|
||||||
|
|
||||||
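One way to arrive at the figure of 16, and to see how quickly it grows, is to count the ways of splitting each leading run of "a" characters into non-empty pieces. The sketch below does this arithmetic only; it does not run a regular expression engine, and the function name count_matches is hypothetical.

```python
def count_matches(subject: str) -> int:
    """Count the distinct ways ^(a+)* can match at the start of subject."""
    # Length of the leading run of "a" characters.
    run = len(subject) - len(subject.lstrip("a"))
    # ways[k] is the number of ways to split the first k characters into
    # consecutive non-empty a+ pieces; ways[0] is the zero-iteration match.
    ways = [1] + [0] * run
    for k in range(1, run + 1):
        # The last piece can start at any earlier split point j.
        ways[k] = sum(ways[j] for j in range(k))
    # The group may stop after any prefix, so add up every prefix length.
    return sum(ways)


print(count_matches("aaaa"))
print(count_matches("a" * 20))
```

Under this way of counting, the total doubles with each additional character, which is the rapid growth the text warns about.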
An optimization catches some of the more simple cases such as
|
An optimization catches some of the more simple cases such as
|
||||||
|
|
||||||
(a+)*b
|
(a+)*b
|
||||||
|
|
||||||
where a literal character follows. Before embarking on the standard
|
where a literal character follows. Before embarking on the standard
|
||||||
matching procedure, PCRE2 checks that there is a "b" later in the sub-
|
matching procedure, PCRE2 checks that there is a "b" later in the sub-
|
||||||
ject string, and if there is not, it fails the match immediately. How-
|
ject string, and if there is not, it fails the match immediately. How-
|
||||||
ever, when there is no following literal this optimization cannot be
|
ever, when there is no following literal this optimization cannot be
|
||||||
used. You can see the difference by comparing the behaviour of
|
used. You can see the difference by comparing the behaviour of
|
||||||
|
|
||||||
(a+)*\d
|
(a+)*\d
|
||||||
|
|
||||||
with the pattern above. The former gives a failure almost instantly
|
with the pattern above. The former gives a failure almost instantly
|
||||||
when applied to a whole line of "a" characters, whereas the latter
|
when applied to a whole line of "a" characters, whereas the latter
|
||||||
takes an appreciable time with strings longer than about 20 characters.
|
takes an appreciable time with strings longer than about 20 characters.
|
||||||
|
|
||||||
In many cases, the solution to this kind of performance issue is to use
|
In many cases, the solution to this kind of performance issue is to use
|
||||||
an atomic group or a possessive quantifier.
|
an atomic group or a possessive quantifier. This can often reduce mem-
|
||||||
|
ory requirements as well. As another example, consider this pattern:
|
||||||
|
|
||||||
|
([^<]|<(?!inet))+
|
||||||
|
|
||||||
|
It matches from wherever it starts until it encounters "<inet" or the
|
||||||
|
end of the data, and is the kind of pattern that might be used when
|
||||||
|
processing an XML file. Each iteration of the outer parentheses matches
|
||||||
|
either one character that is not "<" or a "<" that is not followed by
|
||||||
|
"inet". However, each time a parenthesis is processed, a backtracking
|
||||||
|
position is passed, so this formulation uses a memory frame for each
|
||||||
|
matched character. For a long string, a lot of memory is required. Con-
|
||||||
|
sider now this rewritten pattern, which matches exactly the same
|
||||||
|
strings:
|
||||||
|
|
||||||
|
([^<]++|<(?!inet))+
|
||||||
|
|
||||||
|
This runs much faster, because sequences of characters that do not con-
|
||||||
|
tain "<" are "swallowed" in one item inside the parentheses, and a pos-
|
||||||
|
sessive quantifier is used to stop any backtracking into the runs of
|
||||||
|
non-"<" characters. This version also uses a lot less memory because
|
||||||
|
entry to a new set of parentheses happens only when a "<" character
|
||||||
|
that is not followed by "inet" is encountered (and we assume this is
|
||||||
|
relatively rare).
|
||||||
|
|
||||||
|
This example shows that one way of optimizing performance when matching
|
||||||
|
long subject strings is to write repeated parenthesized subpatterns to
|
||||||
|
match more than one character whenever possible.
|
||||||
|
|
||||||
|
SETTING RESOURCE LIMITS
|
||||||
|
|
||||||
|
You can set limits on the amount of processing that takes place when
|
||||||
|
matching, and on the amount of heap memory that is used. The default
|
||||||
|
values of the limits are very large, and unlikely ever to operate. They
|
||||||
|
can be changed when PCRE2 is built, and they can also be set when
|
||||||
|
pcre2_match() or pcre2_dfa_match() is called. For details of these
|
||||||
|
interfaces, see the pcre2build documentation and the section entitled
|
||||||
|
"The match context" in the pcre2api documentation.
|
||||||
|
|
||||||
|
The pcre2test test program has a modifier called "find_limits" which,
|
||||||
|
if applied to a subject line, causes it to find the smallest limits
|
||||||
|
that allow a pattern to match. This is done by repeatedly matching with
|
||||||
|
different limits.
|
||||||
|
|
||||||
|
|
||||||
AUTHOR
|
AUTHOR
|
||||||
|
@ -8636,8 +8679,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 02 January 2015
|
Last updated: 31 March 2017
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
|
.TH PCRE2PERFORM 3 "31 March 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 PERFORMANCE"
|
.SH "PCRE2 PERFORMANCE"
|
||||||
|
@ -12,11 +12,11 @@ of them.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||||
so that most simple patterns do not use much memory. However, there is one case
|
so that most simple patterns do not use much memory for storing the compiled
|
||||||
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
version. However, there is one case where the memory usage of a compiled
|
||||||
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
pattern can be unexpectedly large. If a parenthesized subpattern has a
|
||||||
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
quantifier with a minimum greater than 1 and/or a limited maximum, the whole
|
||||||
example, the pattern
|
subpattern is repeated in the compiled code. For example, the pattern
|
||||||
.sp
|
.sp
|
||||||
(abc|def){2,4}
|
(abc|def){2,4}
|
||||||
.sp
|
.sp
|
||||||
|
@ -34,13 +34,13 @@ example, the very simple pattern
|
||||||
.sp
|
.sp
|
||||||
((ab){1,1000}c){1,3}
|
((ab){1,1000}c){1,3}
|
||||||
.sp
|
.sp
|
||||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
|
||||||
with its default internal pointer size of two bytes, the size limit on a
|
compiled with its default internal pointer size of two bytes, the size limit on
|
||||||
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
|
||||||
is reached with the above pattern if the outer repetition is increased from 3
|
this is reached with the above pattern if the outer repetition is increased
|
||||||
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
|
||||||
larger compiled patterns, but it is better to try to rewrite your pattern to
|
handle larger compiled patterns, but it is better to try to rewrite your
|
||||||
use less memory if you can.
|
pattern to use less memory if you can.
|
||||||
.P
|
.P
|
||||||
One way of reducing the memory usage for such patterns is to make use of
|
One way of reducing the memory usage for such patterns is to make use of
|
||||||
PCRE2's
|
PCRE2's
|
||||||
|
@ -52,32 +52,34 @@ facility. Re-writing the above pattern as
|
||||||
.sp
|
.sp
|
||||||
((ab)(?2){0,999}c)(?1){0,2}
|
((ab)(?2){0,999}c)(?1){0,2}
|
||||||
.sp
|
.sp
|
||||||
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
reduces the memory requirements to around 16K, and indeed it remains under 20K
|
||||||
with the outer repetition increased to 100. However, this pattern is not
|
even with the outer repetition increased to 100. However, this kind of pattern
|
||||||
exactly equivalent, because the "subroutine" calls are treated as
|
is not always exactly equivalent, because any captures within subroutine calls
|
||||||
.\" HTML <a href="pcre2pattern.html#atomicgroup">
|
are lost when the subroutine completes. If this is not a problem, this kind of
|
||||||
.\" </a>
|
rewriting will allow you to process patterns that PCRE2 cannot otherwise
|
||||||
atomic groups
|
handle. The matching performance of the two different versions of the pattern
|
||||||
.\"
|
is roughly the same. (This applies from release 10.30 - things were different
|
||||||
into which there can be no backtracking if there is a subsequent matching
|
in earlier releases.)
|
||||||
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
|
||||||
Furthermore, there is a noticeable loss of speed when executing the modified
|
|
||||||
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
|
||||||
speed is acceptable, this kind of rewriting will allow you to process patterns
|
|
||||||
that PCRE2 cannot otherwise handle.
|
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "STACK USAGE AT RUN TIME"
|
.SH "STACK AND HEAP USAGE AT RUN TIME"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
|
From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
|
||||||
cause it to use large amounts of the process stack. In some environments the
|
uses very little system stack at run time. In earlier releases recursive
|
||||||
default process stack is quite small, and if it runs out the result is often
|
function calls could use a great deal of stack, and this could cause problems,
|
||||||
SIGSEGV. Rewriting your pattern can often help. The
|
but this usage has been eliminated. Backtracking positions are now explicitly
|
||||||
.\" HREF
|
remembered in memory frames controlled by the code. An initial 10K vector of
|
||||||
\fBpcre2stack\fP
|
frames is allocated on the system stack (enough for about 50 frames for small
|
||||||
.\"
|
patterns), but if this is insufficient, heap memory is used. Rewriting patterns
|
||||||
documentation discusses this issue in detail.
|
to be time-efficient, as described below, may also reduce the memory
|
||||||
|
requirements.
|
||||||
|
.P
|
||||||
|
In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
|
||||||
|
function calls, but only for processing atomic groups, lookaround assertions,
|
||||||
|
and recursion within the pattern. Too much nested recursion may cause stack
|
||||||
|
issues. The "match depth" parameter can be used to limit the depth of function
|
||||||
|
recursion in \fBpcre2_dfa_match()\fP.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "PROCESSING TIME"
|
.SH "PROCESSING TIME"
|
||||||
|
@ -160,7 +162,59 @@ applied to a whole line of "a" characters, whereas the latter takes an
|
||||||
appreciable time with strings longer than about 20 characters.
|
appreciable time with strings longer than about 20 characters.
|
||||||
.P
|
.P
|
||||||
In many cases, the solution to this kind of performance issue is to use an
|
In many cases, the solution to this kind of performance issue is to use an
|
||||||
atomic group or a possessive quantifier.
|
atomic group or a possessive quantifier. This can often reduce memory
|
||||||
|
requirements as well. As another example, consider this pattern:
|
||||||
|
.sp
|
||||||
|
([^<]|<(?!inet))+
|
||||||
|
.sp
|
||||||
|
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||||
|
the data, and is the kind of pattern that might be used when processing an XML
|
||||||
|
file. Each iteration of the outer parentheses matches either one character that
|
||||||
|
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||||
|
parenthesis is processed, a backtracking position is passed, so this
|
||||||
|
formulation uses a memory frame for each matched character. For a long string,
|
||||||
|
a lot of memory is required. Consider now this rewritten pattern, which matches
|
||||||
|
exactly the same strings:
|
||||||
|
.sp
|
||||||
|
([^<]++|<(?!inet))+
|
||||||
|
.sp
|
||||||
|
This runs much faster, because sequences of characters that do not contain "<"
|
||||||
|
are "swallowed" in one item inside the parentheses, and a possessive quantifier
|
||||||
|
is used to stop any backtracking into the runs of non-"<" characters. This
|
||||||
|
version also uses a lot less memory because entry to a new set of parentheses
|
||||||
|
happens only when a "<" character that is not followed by "inet" is encountered
|
||||||
|
(and we assume this is relatively rare).
|
||||||
|
.P
|
||||||
|
This example shows that one way of optimizing performance when matching long
|
||||||
|
subject strings is to write repeated parenthesized subpatterns to match more
|
||||||
|
than one character whenever possible.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "SETTING RESOURCE LIMITS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
You can set limits on the amount of processing that takes place when matching,
|
||||||
|
and on the amount of heap memory that is used. The default values of the limits
|
||||||
|
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
|
||||||
|
built, and they can also be set when \fBpcre2_match()\fP or
|
||||||
|
\fBpcre2_dfa_match()\fP is called. For details of these interfaces, see the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2build\fP
|
||||||
|
.\"
|
||||||
|
documentation and the section entitled
|
||||||
|
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||||
|
.\" </a>
|
||||||
|
"The match context"
|
||||||
|
.\"
|
||||||
|
in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2api\fP
|
||||||
|
.\"
|
||||||
|
documentation.
|
||||||
|
.P
|
||||||
|
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
|
||||||
|
applied to a subject line, causes it to find the smallest limits that allow a
|
||||||
|
pattern to match. This is done by repeatedly matching with different limits.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH AUTHOR
|
.SH AUTHOR
|
||||||
|
@ -177,6 +231,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 02 January 2015
|
Last updated: 31 March 2017
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
212
doc/pcre2stack.3
|
@ -1,212 +0,0 @@
|
||||||
.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
|
|
||||||
.SH NAME
|
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
||||||
.SH "PCRE2 DISCUSSION OF STACK USAGE"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
When you call \fBpcre2_match()\fP, it makes use of an internal function called
|
|
||||||
\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
|
|
||||||
in order to remember the state of the match so that it can back up and try a
|
|
||||||
different alternative after a failure. As matching proceeds deeper and deeper
|
|
||||||
into the tree of possibilities, the recursion depth increases. The
|
|
||||||
\fBmatch()\fP function is also called in other circumstances, for example,
|
|
||||||
whenever a parenthesized sub-pattern is entered, and in certain cases of
|
|
||||||
repetition.
|
|
||||||
.P
|
|
||||||
Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
|
|
||||||
as a* it may be called several times at the same level, after matching
|
|
||||||
different numbers of a's. Furthermore, in a number of cases where the result of
|
|
||||||
the recursive call would immediately be passed back as the result of the
|
|
||||||
current call (a "tail recursion"), the function is just restarted instead.
|
|
||||||
.P
|
|
||||||
Each time the internal \fBmatch()\fP function is called recursively, it uses
|
|
||||||
memory from the process stack. For certain kinds of pattern and data, very
|
|
||||||
large amounts of stack may be needed, despite the recognition of "tail
|
|
||||||
recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
|
|
||||||
of the GCC compiler, the stack requirements are greatly increased.
|
|
||||||
.P
|
|
||||||
The above comments apply when \fBpcre2_match()\fP is run in its normal
|
|
||||||
interpretive manner. If the compiled pattern was processed by
|
|
||||||
\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
|
|
||||||
options passed to \fBpcre2_match()\fP were not incompatible, the matching
|
|
||||||
process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
|
|
||||||
this case, the memory requirements are handled entirely differently. See the
|
|
||||||
.\" HREF
|
|
||||||
\fBpcre2jit\fP
|
|
||||||
.\"
|
|
||||||
documentation for details.
|
|
||||||
.P
|
|
||||||
The \fBpcre2_dfa_match()\fP function operates in a different way to
|
|
||||||
\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
|
|
||||||
recursion or subroutine call in the pattern. This includes the processing of
|
|
||||||
assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
|
||||||
Normally, these are never very deep, and the limit on the complexity of
|
|
||||||
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
|
|
||||||
However, it is possible to write patterns with runaway infinite recursions;
|
|
||||||
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
|
|
||||||
limit is applied (see below).
|
|
||||||
.P
|
|
||||||
The comments in the next three sections do not apply to
|
|
||||||
\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
|
|
||||||
the JIT optimization.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SS "Reducing \fBpcre2_match()\fP's stack usage"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
You can often reduce the amount of recursion, and therefore the
|
|
||||||
amount of stack used, by modifying the pattern that is being matched. Consider,
|
|
||||||
for example, this pattern:
|
|
||||||
.sp
|
|
||||||
([^<]|<(?!inet))+
|
|
||||||
.sp
|
|
||||||
It matches from wherever it starts until it encounters "<inet" or the end of
|
|
||||||
the data, and is the kind of pattern that might be used when processing an XML
|
|
||||||
file. Each iteration of the outer parentheses matches either one character that
|
|
||||||
is not "<" or a "<" that is not followed by "inet". However, each time a
|
|
||||||
parenthesis is processed, a recursion occurs, so this formulation uses a stack
|
|
||||||
frame for each matched character. For a long string, a lot of stack is
|
|
||||||
required. Consider now this rewritten pattern, which matches exactly the same
|
|
||||||
strings:
|
|
||||||
.sp
|
|
||||||
([^<]++|<(?!inet))+
|
|
||||||
.sp
|
|
||||||
This uses very much less stack, because runs of characters that do not contain
|
|
||||||
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
|
|
||||||
when a "<" character that is not followed by "inet" is encountered (and we
|
|
||||||
assume this is relatively rare). A possessive quantifier is used to stop any
|
|
||||||
backtracking into the runs of non-"<" characters, but that is not related to
|
|
||||||
stack usage.
|
|
||||||
.P
|
|
||||||
This example shows that one way of avoiding stack problems when matching long
|
|
||||||
subject strings is to write repeated parenthesized subpatterns to match more
|
|
||||||
than one character whenever possible.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
In environments where stack memory is constrained, you might want to compile
|
|
||||||
PCRE2 to use heap memory instead of stack for remembering back-up points when
|
|
||||||
\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
|
|
||||||
of how to do this are given in the
|
|
||||||
.\" HREF
|
|
||||||
\fBpcre2build\fP
|
|
||||||
.\"
|
|
||||||
documentation. When built in this way, instead of using the stack, PCRE2
|
|
||||||
gets memory for remembering backup points from the heap. By default, the memory
|
|
||||||
is obtained by calling the system \fBmalloc()\fP function, but you can arrange
|
|
||||||
to supply your own memory management function. For details, see the section
|
|
||||||
entitled
|
|
||||||
.\" HTML <a href="pcre2api.html#matchcontext">
|
|
||||||
.\" </a>
|
|
||||||
"The match context"
|
|
||||||
.\"
|
|
||||||
in the
|
|
||||||
.\" HREF
|
|
||||||
\fBpcre2api\fP
|
|
||||||
.\"
|
|
||||||
documentation. Since the block sizes are always the same, it may be possible to
|
|
||||||
implement a customized memory handler that is more efficient than the standard
|
|
||||||
function. The memory blocks obtained for this purpose are retained and re-used
|
|
||||||
if possible while \fBpcre2_match()\fP is running. They are all freed just
|
|
||||||
before it exits.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SS "Limiting \fBpcre2_match()\fP's stack usage"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
You can set limits on the number of times the internal \fBmatch()\fP function
|
|
||||||
is called, both in total and recursively. If a limit is exceeded,
|
|
||||||
\fBpcre2_match()\fP returns an error code. Setting suitable limits should
|
|
||||||
prevent it from running out of stack. The default values of the limits are very
|
|
||||||
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
|
|
||||||
and they can also be set when \fBpcre2_match()\fP is called. For details of
|
|
||||||
these interfaces, see the
|
|
||||||
.\" HREF
|
|
||||||
\fBpcre2build\fP
|
|
||||||
.\"
|
|
||||||
documentation and the section entitled
|
|
||||||
.\" HTML <a href="pcre2api.html#matchcontext">
|
|
||||||
.\" </a>
|
|
||||||
"The match context"
|
|
||||||
.\"
|
|
||||||
in the
|
|
||||||
.\" HREF
|
|
||||||
\fBpcre2api\fP
|
|
||||||
.\"
|
|
||||||
documentation.
|
|
||||||
.P
|
|
||||||
As a very rough rule of thumb, you should reckon on about 500 bytes per
|
|
||||||
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
|
|
||||||
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
|
|
||||||
around 128000 recursions.
|
|
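The rule of thumb above can be sketched as a small C helper. This only illustrates the arithmetic: the function name and its parameters are hypothetical, and the figure of 500 bytes per recursion is the rough estimate quoted in the text, not a measured value.

```c
#include <stddef.h>

/* Hypothetical helper: estimate how many match() recursions fit into a
   stack of stack_bytes, assuming each recursion needs about
   bytes_per_frame bytes (the text suggests roughly 500). */
size_t estimate_recursion_limit(size_t stack_bytes, size_t bytes_per_frame)
{
  if (bytes_per_frame == 0)
    return 0;                        /* avoid division by zero */
  return stack_bytes / bytes_per_frame;
}
```

For an 8Mb stack (8*1024*1024 bytes) and 500 bytes per recursion this gives 16777, which the text rounds down to 16000. For 64Mb it gives 134217, rounded down to 128000. Rounding down leaves some headroom for the rest of the program's stack usage.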
||||||
.P
|
|
||||||
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
|
|
||||||
applied to a subject line, causes it to find the smallest limits that allow a
|
|
||||||
pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
|
|
||||||
different limits.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
The recursion limit, as described above for \fBpcre2_match()\fP, also applies
|
|
||||||
to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
|
|
||||||
recursions in the pattern can lead to runaway stack usage. The non-recursive
|
|
||||||
match limit is not relevant for DFA matching, and is ignored.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SS "Changing stack size in Unix-like systems"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
In Unix-like environments, there is not often a problem with the stack unless
|
|
||||||
very long strings are involved, though the default limit on stack size varies
|
|
||||||
from system to system. Values from 8Mb to 64Mb are common. You can find your
|
|
||||||
default limit by running the command:
|
|
||||||
.sp
|
|
||||||
ulimit -s
|
|
||||||
.sp
|
|
||||||
Unfortunately, the effect of running out of stack is often SIGSEGV, though
|
|
||||||
sometimes a more explicit error message is given. You can normally increase the
|
|
||||||
limit on stack size by code such as this:
|
|
||||||
.sp
|
|
||||||
struct rlimit rlim;
|
|
||||||
getrlimit(RLIMIT_STACK, &rlim);
|
|
||||||
rlim.rlim_cur = 100*1024*1024;
|
|
||||||
setrlimit(RLIMIT_STACK, &rlim);
|
|
||||||
.sp
|
|
||||||
This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
|
|
||||||
attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
|
|
||||||
do this before calling \fBpcre2_match()\fP.
|
|
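A slightly fuller version of the fragment above might look like the following sketch. It assumes a POSIX system with <sys/resource.h>. The function name raise_stack_soft_limit is hypothetical, and the error handling shown is one possible choice, not something PCRE2 prescribes.

```c
#include <sys/resource.h>

/* Hypothetical helper: try to raise the soft stack limit to at least
   `wanted` bytes. Returns 0 on success (or if nothing needs to change)
   and -1 if getrlimit() or setrlimit() fails. */
int raise_stack_soft_limit(rlim_t wanted)
{
  struct rlimit rlim;

  /* Read the current soft and hard limits. */
  if (getrlimit(RLIMIT_STACK, &rlim) != 0)
    return -1;

  /* Nothing to do if the soft limit is unlimited or already big enough. */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
    return 0;

  /* The soft limit cannot exceed the hard limit, so clamp the request. */
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
    wanted = rlim.rlim_max;

  rlim.rlim_cur = wanted;
  return (setrlimit(RLIMIT_STACK, &rlim) == 0) ? 0 : -1;
}
```

A caller would invoke something like raise_stack_soft_limit(100*1024*1024) once, early in the program and before the first call to pcre2_match(). Because the request is clamped to the hard limit, the soft limit that results may be smaller than the value asked for.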
||||||
.
|
|
||||||
.
|
|
||||||
.SS "Changing stack size in Mac OS X"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
|
|
||||||
is also possible to set a stack size when linking a program. There is a
|
|
||||||
discussion about stack sizes in Mac OS X at this web site:
|
|
||||||
.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
|
|
||||||
.\" </a>
|
|
||||||
http://developer.apple.com/qa/qa2005/qa1419.html.
|
|
||||||
.\"
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SH AUTHOR
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
.nf
|
|
||||||
Philip Hazel
|
|
||||||
University Computing Service
|
|
||||||
Cambridge, England.
|
|
||||||
.fi
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SH REVISION
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
.nf
|
|
||||||
Last updated: 23 December 2016
|
|
||||||
Copyright (c) 1997-2016 University of Cambridge.
|
|
||||||
.fi
|
|