Documentation update
parent a073581116
commit ed9f34b06b
|
@ -103,7 +103,6 @@ dist_html_DATA = \
|
|||
doc/html/pcre2posix.html \
|
||||
doc/html/pcre2sample.html \
|
||||
doc/html/pcre2serialize.html \
|
||||
doc/html/pcre2stack.html \
|
||||
doc/html/pcre2syntax.html \
|
||||
doc/html/pcre2test.html \
|
||||
doc/html/pcre2unicode.html
|
||||
|
@ -187,7 +186,6 @@ dist_man_MANS = \
|
|||
doc/pcre2posix.3 \
|
||||
doc/pcre2sample.3 \
|
||||
doc/pcre2serialize.3 \
|
||||
doc/pcre2stack.3 \
|
||||
doc/pcre2syntax.3 \
|
||||
doc/pcre2test.1 \
|
||||
doc/pcre2unicode.3
|
||||
|
|
|
@ -68,9 +68,6 @@ first.
|
|||
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
|
||||
<td> Serializing functions for saving precompiled patterns</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
|
||||
<td> Discussion of PCRE2's stack usage</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
|
||||
<td> Syntax quick-reference summary</td></tr>
|
||||
|
||||
|
|
|
@ -18,7 +18,8 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
|||
<P>
|
||||
This document describes the differences in the ways that PCRE2 and Perl handle
|
||||
regular expressions. The differences described here are with respect to Perl
|
||||
versions 5.10 and above.
|
||||
version 5.24, but as both Perl and PCRE2 are continually changing, the
|
||||
information may sometimes be out of date.
|
||||
</P>
|
||||
<P>
|
||||
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
|
||||
|
@ -27,17 +28,18 @@ have are given in the
|
|||
page.
|
||||
</P>
|
||||
<P>
|
||||
2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
|
||||
do not mean what you might think. For example, (?!a){3} does not assert that
|
||||
the next three characters are not "a". It just asserts that the next character
|
||||
is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
|
||||
just once). Perl allows repeat quantifiers on other assertions such as \b, but
|
||||
these do not seem to have any use.
|
||||
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
|
||||
they do not mean what you might think. For example, (?!a){3} does not assert
|
||||
that the next three characters are not "a". It just asserts that the next
|
||||
character is not "a" three times (in principle: PCRE2 optimizes this to run the
|
||||
assertion just once). Perl allows some repeat quantifiers on other assertions,
|
||||
for example, \b* (but not \b{3}), but these do not seem to have any use.
|
||||
</P>
|
||||
<P>
|
||||
3. Capturing subpatterns that occur inside negative lookahead assertions are
|
||||
counted, but their entries in the offsets vector are never set. Perl sometimes
|
||||
(but not always) sets its numerical variables from inside negative assertions.
|
||||
3. Capturing subpatterns that occur inside negative lookaround assertions are
|
||||
counted, but their entries in the offsets vector are set only if the assertion
|
||||
is a condition. Perl has changed its behaviour in this regard from time to
|
||||
time.
|
||||
</P>
|
||||
<P>
|
||||
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
||||
|
@ -50,13 +52,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
|
|||
</P>
|
||||
<P>
|
||||
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
|
||||
built with Unicode support. The properties that can be tested with \p and \P
|
||||
are limited to the general category properties such as Lu and Nd, script names
|
||||
such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
|
||||
the Cs (surrogate) property, which Perl does not; the Perl documentation says
|
||||
"Because Perl hides the need for the user to understand the internal
|
||||
representation of Unicode characters, there is no need to implement the
|
||||
somewhat messy concept of surrogates."
|
||||
built with Unicode support (the default). The properties that can be tested
|
||||
with \p and \P are limited to the general category properties such as Lu and
|
||||
Nd, script names such as Greek or Han, and the derived properties Any and L&.
|
||||
PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
|
||||
documentation says "Because Perl hides the need for the user to understand the
|
||||
internal representation of Unicode characters, there is no need to implement
|
||||
the somewhat messy concept of surrogates."
|
||||
</P>
|
||||
<P>
|
||||
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
|
||||
|
@ -75,23 +77,15 @@ The \Q...\E sequence is recognized both inside and outside character classes.
|
|||
</P>
|
||||
<P>
|
||||
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
|
||||
constructions. However, there is support for recursive patterns. This is not
|
||||
available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
|
||||
feature allows an external function to be called during pattern matching. See
|
||||
the
|
||||
constructions. However, PCRE2 has a "callout" feature, which
|
||||
allows an external function to be called during pattern matching. See the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation for details.
|
||||
</P>
|
||||
<P>
|
||||
8. Subroutine calls (whether recursive or not) are treated as atomic groups.
|
||||
Atomic recursion is like Python, but unlike Perl. Captured values that are set
|
||||
outside a subroutine call can be referenced from inside in PCRE2, but not in
|
||||
Perl. There is a discussion that explains these differences in more detail in
|
||||
the
|
||||
<a href="pcre2pattern.html#recursiondifference">section on recursion differences from Perl</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page.
|
||||
8. Subroutine calls (whether recursive or not) were treated as atomic groups up
|
||||
to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
|
||||
into subroutine calls is now supported, as in Perl.
|
||||
</P>
|
||||
<P>
|
||||
9. If any of the backtracking control verbs are used in a subpattern that is
|
||||
|
@ -147,14 +141,14 @@ certainly user mistakes.
|
|||
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
|
||||
affected when case-independent matching is specified. For example, \p{Lu}
|
||||
always matches an upper case letter. I think Perl has changed in this respect;
|
||||
in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
|
||||
in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
|
||||
letters, regardless of case, when case independence is specified.
|
||||
</P>
|
||||
<P>
|
||||
17. PCRE2 provides some extensions to the Perl regular expression facilities.
|
||||
Perl 5.10 includes new features that are not in earlier versions of Perl, some
|
||||
of which (such as named parentheses) have been in PCRE2 for some time. This
|
||||
list is with respect to Perl 5.10:
|
||||
of which (such as named parentheses) were in PCRE2 for some time before. This
|
||||
list is with respect to Perl 5.24:
|
||||
<br>
|
||||
<br>
|
||||
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
|
||||
|
@ -220,9 +214,9 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 18 October 2016
|
||||
Last updated: 29 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -173,7 +173,7 @@ below for a discussion of JIT stack usage.
|
|||
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
|
||||
a very large pattern tree goes on for too long, as it is in the same
|
||||
circumstance when JIT is not used, but the details of exactly what is counted
|
||||
are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
|
||||
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
|
||||
when JIT matching is used.
|
||||
<a name="stackcontrol"></a></P>
|
||||
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
||||
|
@ -436,9 +436,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 05 June 2016
|
||||
Last updated: 30 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -44,14 +44,6 @@ integer type, usually defined as size_t. Its maximum value (that is
|
|||
and unset offsets.
|
||||
</P>
|
||||
<P>
|
||||
Note that when using the traditional matching function, PCRE2 uses recursion to
|
||||
handle subpatterns and indefinite repetition. This means that the available
|
||||
stack space may limit the size of a subject string that can be processed by
|
||||
certain patterns. For a discussion of stack issues, see the
|
||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
All values in repeating quantifiers must be less than 65536.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -94,9 +86,9 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 26 October 2016
|
||||
Last updated: 30 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
|
|||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
|
||||
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
|
||||
<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
|
||||
<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
|
||||
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
|
||||
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
||||
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
||||
|
@ -29,11 +29,11 @@ of them.
|
|||
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
|
||||
<P>
|
||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||
so that most simple patterns do not use much memory. However, there is one case
|
||||
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
||||
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
||||
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
||||
example, the pattern
|
||||
so that most simple patterns do not use much memory for storing the compiled
|
||||
version. However, there is one case where the memory usage of a compiled
|
||||
pattern can be unexpectedly large. If a parenthesized subpattern has a
|
||||
quantifier with a minimum greater than 1 and/or a limited maximum, the whole
|
||||
subpattern is repeated in the compiled code. For example, the pattern
|
||||
<pre>
|
||||
(abc|def){2,4}
|
||||
</pre>
|
||||
|
@ -52,13 +52,13 @@ example, the very simple pattern
|
|||
<pre>
|
||||
((ab){1,1000}c){1,3}
|
||||
</pre>
|
||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
||||
with its default internal pointer size of two bytes, the size limit on a
|
||||
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
||||
is reached with the above pattern if the outer repetition is increased from 3
|
||||
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
||||
larger compiled patterns, but it is better to try to rewrite your pattern to
|
||||
use less memory if you can.
|
||||
uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
|
||||
compiled with its default internal pointer size of two bytes, the size limit on
|
||||
a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
|
||||
this is reached with the above pattern if the outer repetition is increased
|
||||
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
|
||||
handle larger compiled patterns, but it is better to try to rewrite your
|
||||
pattern to use less memory if you can.
|
||||
</P>
|
||||
<P>
|
||||
One way of reducing the memory usage for such patterns is to make use of
|
||||
|
@ -68,25 +68,33 @@ facility. Re-writing the above pattern as
|
|||
<pre>
|
||||
((ab)(?2){0,999}c)(?1){0,2}
|
||||
</pre>
|
||||
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
||||
with the outer repetition increased to 100. However, this pattern is not
|
||||
exactly equivalent, because the "subroutine" calls are treated as
|
||||
<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
|
||||
into which there can be no backtracking if there is a subsequent matching
|
||||
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
||||
Furthermore, there is a noticeable loss of speed when executing the modified
|
||||
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
||||
speed is acceptable, this kind of rewriting will allow you to process patterns
|
||||
that PCRE2 cannot otherwise handle.
|
||||
reduces the memory requirements to around 16K, and indeed it remains under 20K
|
||||
even with the outer repetition increased to 100. However, this kind of pattern
|
||||
is not always exactly equivalent, because any captures within subroutine calls
|
||||
are lost when the subroutine completes. If this is not a problem, this kind of
|
||||
rewriting will allow you to process patterns that PCRE2 cannot otherwise
|
||||
handle. The matching performance of the two different versions of the pattern
|
||||
is roughly the same. (This applies from release 10.30 - things were different
|
||||
in earlier releases.)
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
|
||||
<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
|
||||
<P>
|
||||
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
|
||||
cause it to use large amounts of the process stack. In some environments the
|
||||
default process stack is quite small, and if it runs out the result is often
|
||||
SIGSEGV. Rewriting your pattern can often help. The
|
||||
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
||||
documentation discusses this issue in detail.
|
||||
From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
|
||||
uses very little system stack at run time. In earlier releases recursive
|
||||
function calls could use a great deal of stack, and this could cause problems,
|
||||
but this usage has been eliminated. Backtracking positions are now explicitly
|
||||
remembered in memory frames controlled by the code. An initial 10K vector of
|
||||
frames is allocated on the system stack (enough for about 50 frames for small
|
||||
patterns), but if this is insufficient, heap memory is used. Rewriting patterns
|
||||
to be time-efficient, as described below, may also reduce the memory
|
||||
requirements.
|
||||
</P>
|
||||
<P>
|
||||
In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
|
||||
function calls, but only for processing atomic groups, lookaround assertions,
|
||||
and recursion within the pattern. Too much nested recursion may cause stack
|
||||
issues. The "match depth" parameter can be used to limit the depth of function
|
||||
recursion in <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
|
||||
<P>
|
||||
|
@ -175,7 +183,54 @@ appreciable time with strings longer than about 20 characters.
|
|||
</P>
|
||||
<P>
|
||||
In many cases, the solution to this kind of performance issue is to use an
|
||||
atomic group or a possessive quantifier.
|
||||
atomic group or a possessive quantifier. This can often reduce memory
|
||||
requirements as well. As another example, consider this pattern:
|
||||
<pre>
|
||||
([^<]|<(?!inet))+
|
||||
</pre>
|
||||
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||
the data, and is the kind of pattern that might be used when processing an XML
|
||||
file. Each iteration of the outer parentheses matches either one character that
|
||||
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||
parenthesis is processed, a backtracking position is passed, so this
|
||||
formulation uses a memory frame for each matched character. For a long string,
|
||||
a lot of memory is required. Consider now this rewritten pattern, which matches
|
||||
exactly the same strings:
|
||||
<pre>
|
||||
([^<]++|<(?!inet))+
|
||||
</pre>
|
||||
This runs much faster, because sequences of characters that do not contain "<"
|
||||
are "swallowed" in one item inside the parentheses, and a possessive quantifier
|
||||
is used to stop any backtracking into the runs of non-"<" characters. This
|
||||
version also uses a lot less memory because entry to a new set of parentheses
|
||||
happens only when a "<" character that is not followed by "inet" is encountered
|
||||
(and we assume this is relatively rare).
|
||||
</P>
|
||||
<P>
|
||||
This example shows that one way of optimizing performance when matching long
|
||||
subject strings is to write repeated parenthesized subpatterns to match more
|
||||
than one character whenever possible.
|
||||
</P>
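The claim that the two patterns above match exactly the same strings can be checked with a small sketch. It uses Python's `re` module rather than PCRE2, so it says nothing about PCRE2's memory frames or speed; it only compares where each pattern stops. Possessive quantifiers need Python 3.11 or later, so the rewritten pattern is skipped on older interpreters. The names `match_end`, `patterns_available`, `ends_for` and the sample subjects are illustrative assumptions, not part of PCRE2.

```python
import re
import sys

# The two formulations from the text above.
SIMPLE = r"([^<]|<(?!inet))+"
POSSESSIVE = r"([^<]++|<(?!inet))+"  # possessive "++" needs Python 3.11+

# Hypothetical sample subjects, chosen to cover a match that stops at "<inet",
# a match that runs to the end of the data, and a failure at the first character.
SAMPLES = ["abc<b>def<inet>x", "no angle brackets", "<inet", "a<in<inet"]


def match_end(pattern, text):
    """Return the end offset of a match anchored at offset 0, or None."""
    m = re.match(pattern, text)
    return m.end() if m else None


def patterns_available():
    """Return the patterns that this interpreter can compile."""
    patterns = [SIMPLE]
    if sys.version_info >= (3, 11):
        patterns.append(POSSESSIVE)
    return patterns


def ends_for(pattern):
    """Return the match end offsets for every sample subject."""
    return [match_end(pattern, s) for s in SAMPLES]


if __name__ == "__main__":
    for pattern in patterns_available():
        print(pattern, ends_for(pattern))
```

If both patterns are available, the two printed lists should be identical, which is the equivalence the text relies on when recommending the rewrite.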
|
||||
<br><b>
|
||||
SETTING RESOURCE LIMITS
|
||||
</b><br>
|
||||
<P>
|
||||
You can set limits on the amount of processing that takes place when matching,
|
||||
and on the amount of heap memory that is used. The default values of the limits
|
||||
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
|
||||
built, and they can also be set when <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
documentation and the section entitled
|
||||
<a href="pcre2api.html#matchcontext">"The match context"</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
|
||||
applied to a subject line, causes it to find the smallest limits that allow a
|
||||
pattern to match. This is done by repeatedly matching with different limits.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
|
@ -188,9 +243,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 02 January 2015
|
||||
Last updated: 31 March 2017
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -68,9 +68,6 @@ first.
|
|||
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
|
||||
<td> Serializing functions for saving precompiled patterns</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
|
||||
<td> Discussion of PCRE2's stack usage</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
|
||||
<td> Syntax quick-reference summary</td></tr>
|
||||
|
||||
|
|
311
doc/pcre2.txt
|
@ -4097,45 +4097,46 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
|||
|
||||
This document describes the differences in the ways that PCRE2 and Perl
|
||||
handle regular expressions. The differences described here are with
|
||||
respect to Perl versions 5.10 and above.
|
||||
respect to Perl version 5.24, but as both Perl and PCRE2 are continually
|
||||
changing, the information may sometimes be out of date.
|
||||
|
||||
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
|
||||
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
|
||||
it does have are given in the pcre2unicode page.
|
||||
|
||||
2. PCRE2 allows repeat quantifiers only on parenthesized assertions,
|
||||
but they do not mean what you might think. For example, (?!a){3} does
|
||||
not assert that the next three characters are not "a". It just asserts
|
||||
that the next character is not "a" three times (in principle: PCRE2
|
||||
optimizes this to run the assertion just once). Perl allows repeat
|
||||
quantifiers on other assertions such as \b, but these do not seem to
|
||||
have any use.
|
||||
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
|
||||
tions, but they do not mean what you might think. For example, (?!a){3}
|
||||
does not assert that the next three characters are not "a". It just
|
||||
asserts that the next character is not "a" three times (in principle:
|
||||
PCRE2 optimizes this to run the assertion just once). Perl allows some
|
||||
repeat quantifiers on other assertions, for example, \b* (but not
|
||||
\b{3}), but these do not seem to have any use.
|
||||
|
||||
3. Capturing subpatterns that occur inside negative lookahead asser-
|
||||
tions are counted, but their entries in the offsets vector are never
|
||||
set. Perl sometimes (but not always) sets its numerical variables from
|
||||
inside negative assertions.
|
||||
3. Capturing subpatterns that occur inside negative lookaround asser-
|
||||
tions are counted, but their entries in the offsets vector are set only
|
||||
if the assertion is a condition. Perl has changed its behaviour in this
|
||||
regard from time to time.
|
||||
|
||||
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
||||
\U, and \N when followed by a character name or Unicode value. (\N on
|
||||
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
||||
\U, and \N when followed by a character name or Unicode value. (\N on
|
||||
its own, matching a non-newline character, is supported.) In fact these
|
||||
are implemented by Perl's general string-handling and are not part of
|
||||
its pattern matching engine. If any of these are encountered by PCRE2,
|
||||
are implemented by Perl's general string-handling and are not part of
|
||||
its pattern matching engine. If any of these are encountered by PCRE2,
|
||||
an error is generated by default. However, if the PCRE2_ALT_BSUX option
|
||||
is set, \U and \u are interpreted as ECMAScript interprets them.
|
||||
|
||||
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
|
||||
is built with Unicode support. The properties that can be tested with
|
||||
\p and \P are limited to the general category properties such as Lu and
|
||||
Nd, script names such as Greek or Han, and the derived properties Any
|
||||
and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
|
||||
not; the Perl documentation says "Because Perl hides the need for the
|
||||
user to understand the internal representation of Unicode characters,
|
||||
there is no need to implement the somewhat messy concept of surro-
|
||||
gates."
|
||||
is built with Unicode support (the default). The properties that can be
|
||||
tested with \p and \P are limited to the general category properties
|
||||
such as Lu and Nd, script names such as Greek or Han, and the derived
|
||||
properties Any and L&. PCRE2 does support the Cs (surrogate) property,
|
||||
which Perl does not; the Perl documentation says "Because Perl hides
|
||||
the need for the user to understand the internal representation of Uni-
|
||||
code characters, there is no need to implement the somewhat messy con-
|
||||
cept of surrogates."
|
||||
|
||||
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
|
||||
acters in between are treated as literals. This is slightly different
|
||||
from Perl in that $ and @ are also handled as literals inside the
|
||||
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
|
||||
acters in between are treated as literals. This is slightly different
|
||||
from Perl in that $ and @ are also handled as literals inside the
|
||||
quotes. In Perl, they cause variable interpolation (but of course PCRE2
|
||||
does not have variables). Note the following examples:
|
||||
|
||||
|
@ -4146,22 +4147,17 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
|||
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
||||
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
||||
|
||||
The \Q...\E sequence is recognized both inside and outside character
|
||||
The \Q...\E sequence is recognized both inside and outside character
|
||||
classes.
|
||||
|
||||
7. Fairly obviously, PCRE2 does not support the (?{code}) and
|
||||
(??{code}) constructions. However, there is support for recursive pat-
|
||||
terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
|
||||
the PCRE2 "callout" feature allows an external function to be called
|
||||
during pattern matching. See the pcre2callout documentation for
|
||||
details.
|
||||
7. Fairly obviously, PCRE2 does not support the (?{code}) and
|
||||
(??{code}) constructions. However, PCRE2 has a "callout"
|
||||
feature, which allows an external function to be called during pattern
|
||||
matching. See the pcre2callout documentation for details.
|
||||
|
||||
8. Subroutine calls (whether recursive or not) are treated as atomic
|
||||
groups. Atomic recursion is like Python, but unlike Perl. Captured
|
||||
values that are set outside a subroutine call can be referenced from
|
||||
inside in PCRE2, but not in Perl. There is a discussion that explains
|
||||
these differences in more detail in the section on recursion differ-
|
||||
ences from Perl in the pcre2pattern page.
|
||||
8. Subroutine calls (whether recursive or not) were treated as atomic
|
||||
groups up to PCRE2 release 10.23, but from release 10.30 this changed,
|
||||
and backtracking into subroutine calls is now supported, as in Perl.
|
||||
|
||||
9. If any of the backtracking control verbs are used in a subpattern
|
||||
that is called as a subroutine (whether or not recursively), their
|
||||
|
@ -4211,14 +4207,14 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
|
|||
16. In PCRE2, the upper/lower case character properties Lu and Ll are
|
||||
not affected when case-independent matching is specified. For example,
|
||||
\p{Lu} always matches an upper case letter. I think Perl has changed in
|
||||
this respect; in the release at the time of writing (5.16), \p{Lu} and
|
||||
this respect; in the release at the time of writing (5.24), \p{Lu} and
|
||||
\p{Ll} match all letters, regardless of case, when case independence is
|
||||
specified.
|
||||
|
||||
17. PCRE2 provides some extensions to the Perl regular expression
|
||||
facilities. Perl 5.10 includes new features that are not in earlier
|
||||
versions of Perl, some of which (such as named parentheses) have been
|
||||
in PCRE2 for some time. This list is with respect to Perl 5.10:
|
||||
versions of Perl, some of which (such as named parentheses) were in
|
||||
PCRE2 for some time before. This list is with respect to Perl 5.24:
|
||||
|
||||
(a) Although lookbehind assertions in PCRE2 must match fixed length
|
||||
strings, each alternative branch of a lookbehind assertion can match a
|
||||
|
@ -4271,8 +4267,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 18 October 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
Last updated: 29 March 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@ -4420,8 +4416,8 @@ RETURN VALUES FROM JIT MATCHING
|
|||
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
|
||||
searching a very large pattern tree goes on for too long, as it is in
|
||||
the same circumstance when JIT is not used, but the details of exactly
|
||||
what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
|
||||
code is never returned when JIT matching is used.
|
||||
what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
|
||||
is never returned when JIT matching is used.
|
||||
|
||||
|
||||
CONTROLLING THE JIT STACK
|
||||
|
@ -4668,8 +4664,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 05 June 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
Last updated: 30 March 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@ -4706,12 +4702,6 @@ SIZE AND OTHER LIMITATIONS
|
|||
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
|
||||
terminated strings and unset offsets.
|
||||
|
||||
Note that when using the traditional matching function, PCRE2 uses
|
||||
recursion to handle subpatterns and indefinite repetition. This means
|
||||
that the available stack space may limit the size of a subject string
|
||||
that can be processed by certain patterns. For a discussion of stack
|
||||
issues, see the pcre2stack documentation.
|
||||
|
||||
All values in repeating quantifiers must be less than 65536.
|
||||
|
||||
The maximum length of a lookbehind assertion is 65535 characters.
|
||||
|
@ -4745,8 +4735,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 26 October 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
Last updated: 30 March 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@ -8485,11 +8475,12 @@ PCRE2 PERFORMANCE
|
|||
COMPILED PATTERN MEMORY USAGE
|
||||
|
||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive
|
||||
code, so that most simple patterns do not use much memory. However,
|
||||
there is one case where the memory usage of a compiled pattern can be
|
||||
unexpectedly large. If a parenthesized subpattern has a quantifier with
|
||||
a minimum greater than 1 and/or a limited maximum, the whole subpattern
|
||||
is repeated in the compiled code. For example, the pattern
|
||||
code, so that most simple patterns do not use much memory for storing
|
||||
the compiled version. However, there is one case where the memory usage
|
||||
of a compiled pattern can be unexpectedly large. If a parenthesized
|
||||
subpattern has a quantifier with a minimum greater than 1 and/or a lim-
|
||||
ited maximum, the whole subpattern is repeated in the compiled code.
|
||||
For example, the pattern
|
||||
|
||||
(abc|def){2,4}
|
||||
|
||||
|
@ -8497,134 +8488,186 @@ COMPILED PATTERN MEMORY USAGE
|
|||
|
||||
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
||||
|
||||
(Technical aside: It is done this way so that backtrack points within
|
||||
(Technical aside: It is done this way so that backtrack points within
|
||||
each of the repetitions can be independently maintained.)
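The shape of the expansion shown above can be sketched as a purely textual rewrite. This is only an illustration of why the compiled size grows with the repeat counts. It is not how PCRE2's compiler is implemented, and the helper names `expand_bounded` and `same_matches` are hypothetical. Python's `re` module is used to confirm that the expanded form accepts the same subjects as the quantified form.

```python
import re


def expand_bounded(sub, lo, hi):
    """Expand (sub){lo,hi} into lo mandatory copies plus nested optional copies."""
    if not 0 <= lo <= hi:
        raise ValueError("need 0 <= lo <= hi")
    group = f"({sub})"
    optional = ""
    # Build the optional tail from the inside out: X?, then (XX?)?, and so on.
    for _ in range(hi - lo):
        optional = f"({group}{optional})?" if optional else f"{group}?"
    return group * lo + optional


def same_matches(sub, lo, hi, subjects):
    """Check that the quantified and expanded forms accept the same subjects."""
    original = re.compile(f"({sub}){{{lo},{hi}}}")
    expanded = re.compile(expand_bounded(sub, lo, hi))
    return all(
        bool(original.fullmatch(s)) == bool(expanded.fullmatch(s))
        for s in subjects
    )


if __name__ == "__main__":
    print(expand_bounded("abc|def", 2, 4))
    print(same_matches("abc|def", 2, 4, ["abc" * n for n in range(7)]))
```

The length of the expanded string grows with the upper bound, and nesting one bounded repeat inside another multiplies the copies, which is the effect the following paragraphs describe.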
|
||||
|
||||
For regular expressions whose quantifiers use only small numbers, this
|
||||
is not usually a problem. However, if the numbers are large, and par-
|
||||
ticularly if such repetitions are nested, the memory usage can become
|
||||
For regular expressions whose quantifiers use only small numbers, this
|
||||
is not usually a problem. However, if the numbers are large, and par-
|
||||
ticularly if such repetitions are nested, the memory usage can become
|
||||
an embarrassment. For example, the very simple pattern
|
||||
|
||||
((ab){1,1000}c){1,3}
|
||||
|
||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is
|
||||
compiled with its default internal pointer size of two bytes, the size
|
||||
limit on a compiled pattern is 64K code units in the 8-bit and 16-bit
|
||||
libraries, and this is reached with the above pattern if the outer rep-
|
||||
etition is increased from 3 to 4. PCRE2 can be compiled to use larger
|
||||
internal pointers and thus handle larger compiled patterns, but it is
|
||||
better to try to rewrite your pattern to use less memory if you can.
|
||||
uses over 50K bytes when compiled using the 8-bit library. When PCRE2
|
||||
is compiled with its default internal pointer size of two bytes, the
|
||||
size limit on a compiled pattern is 64K code units in the 8-bit and
|
||||
16-bit libraries, and this is reached with the above pattern if the
|
||||
outer repetition is increased from 3 to 4. PCRE2 can be compiled to use
|
||||
larger internal pointers and thus handle larger compiled patterns, but
|
||||
it is better to try to rewrite your pattern to use less memory if you
|
||||
can.
|
||||
|
||||
One way of reducing the memory usage for such patterns is to make use
|
||||
of PCRE2's "subroutine" facility. Re-writing the above pattern as
|
||||
|
||||
((ab)(?2){0,999}c)(?1){0,2}
|
||||
|
||||
reduces the memory requirements to 18K, and indeed it remains under 20K
|
||||
even with the outer repetition increased to 100. However, this pattern
|
||||
is not exactly equivalent, because the "subroutine" calls are treated
|
||||
as atomic groups into which there can be no backtracking if there is a
|
||||
subsequent matching failure. Therefore, PCRE2 cannot do this kind of
|
||||
rewriting automatically. Furthermore, there is a noticeable loss of
|
||||
speed when executing the modified pattern. Nevertheless, if the atomic
|
||||
grouping is not a problem and the loss of speed is acceptable, this
|
||||
kind of rewriting will allow you to process patterns that PCRE2 cannot
|
||||
otherwise handle.
|
||||
reduces the memory requirements to around 16K, and indeed it remains
|
||||
under 20K even with the outer repetition increased to 100. However,
|
||||
this kind of pattern is not always exactly equivalent, because any cap-
|
||||
tures within subroutine calls are lost when the subroutine completes.
|
||||
If this is not a problem, this kind of rewriting will allow you to
|
||||
process patterns that PCRE2 cannot otherwise handle. The matching per-
|
||||
formance of the two different versions of the pattern is roughly the
|
||||
same. (This applies from release 10.30 - things were different in ear-
|
||||
lier releases.)
|
||||
|
||||
|
||||
STACK USAGE AT RUN TIME
|
||||
STACK AND HEAP USAGE AT RUN TIME
|
||||
|
||||
When pcre2_match() is used for matching, certain kinds of pattern can
|
||||
cause it to use large amounts of the process stack. In some environ-
|
||||
ments the default process stack is quite small, and if it runs out the
|
||||
result is often SIGSEGV. Rewriting your pattern can often help. The
|
||||
pcre2stack documentation discusses this issue in detail.
|
||||
From release 10.30, the interpretive (non-JIT) version of pcre2_match()
|
||||
uses very little system stack at run time. In earlier releases recur-
|
||||
sive function calls could use a great deal of stack, and this could
|
||||
cause problems, but this usage has been eliminated. Backtracking posi-
|
||||
tions are now explicitly remembered in memory frames controlled by the
|
||||
code. An initial 10K vector of frames is allocated on the system stack
|
||||
(enough for about 50 frames for small patterns), but if this is insuf-
|
||||
ficient, heap memory is used. Rewriting patterns to be time-efficient,
|
||||
as described below, may also reduce the memory requirements.
|
||||
|
||||
In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
|
||||
function calls, but only for processing atomic groups, lookaround
|
||||
assertions, and recursion within the pattern. Too much nested recursion
|
||||
may cause stack issues. The "match depth" parameter can be used to
|
||||
limit the depth of function recursion in pcre2_dfa_match().
|
||||
|
||||
|
||||
PROCESSING TIME
|
||||
|
||||
Certain items in regular expression patterns are processed more effi-
|
||||
Certain items in regular expression patterns are processed more effi-
|
||||
ciently than others. It is more efficient to use a character class like
|
||||
[aeiou] than a set of single-character alternatives such as
|
||||
(a|e|i|o|u). In general, the simplest construction that provides the
|
||||
[aeiou] than a set of single-character alternatives such as
|
||||
(a|e|i|o|u). In general, the simplest construction that provides the
|
||||
required behaviour is usually the most efficient. Jeffrey Friedl's book
|
||||
contains a lot of useful general discussion about optimizing regular
|
||||
expressions for efficient performance. This document contains a few
|
||||
contains a lot of useful general discussion about optimizing regular
|
||||
expressions for efficient performance. This document contains a few
|
||||
observations about PCRE2.
|
||||
|
||||
Using Unicode character properties (the \p, \P, and \X escapes) is
|
||||
slow, because PCRE2 has to use a multi-stage table lookup whenever it
|
||||
needs a character's property. If you can find an alternative pattern
|
||||
Using Unicode character properties (the \p, \P, and \X escapes) is
|
||||
slow, because PCRE2 has to use a multi-stage table lookup whenever it
|
||||
needs a character's property. If you can find an alternative pattern
|
||||
that does not use character properties, it will probably be faster.
|
||||
|
||||
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
||||
character classes such as [:alpha:] do not use Unicode properties,
|
||||
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
||||
character classes such as [:alpha:] do not use Unicode properties,
|
||||
partly for backwards compatibility, and partly for performance reasons.
|
||||
However, you can set the PCRE2_UCP option or start the pattern with
|
||||
(*UCP) if you want Unicode character properties to be used. This can
|
||||
double the matching time for items such as \d, when matched with
|
||||
pcre2_match(); the performance loss is less with a DFA matching func-
|
||||
However, you can set the PCRE2_UCP option or start the pattern with
|
||||
(*UCP) if you want Unicode character properties to be used. This can
|
||||
double the matching time for items such as \d, when matched with
|
||||
pcre2_match(); the performance loss is less with a DFA matching func-
|
||||
tion, and in both cases there is not much difference for \b.
|
||||
|
||||
When a pattern begins with .* not in atomic parentheses, nor in paren-
|
||||
theses that are the subject of a backreference, and the PCRE2_DOTALL
|
||||
option is set, the pattern is implicitly anchored by PCRE2, since it
|
||||
can match only at the start of a subject string. If the pattern has
|
||||
When a pattern begins with .* not in atomic parentheses, nor in paren-
|
||||
theses that are the subject of a backreference, and the PCRE2_DOTALL
|
||||
option is set, the pattern is implicitly anchored by PCRE2, since it
|
||||
can match only at the start of a subject string. If the pattern has
|
||||
multiple top-level branches, they must all be anchorable. The optimiza-
|
||||
tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
|
||||
tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
|
||||
automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
|
||||
|
||||
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
|
||||
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
|
||||
because the dot metacharacter does not then match a newline, and if the
|
||||
subject string contains newlines, the pattern may match from the char-
|
||||
subject string contains newlines, the pattern may match from the char-
|
||||
acter immediately following one of them instead of from the very start.
|
||||
For example, the pattern
|
||||
|
||||
.*second
|
||||
|
||||
matches the subject "first\nand second" (where \n stands for a newline
|
||||
character), with the match starting at the seventh character. In order
|
||||
to do this, PCRE2 has to retry the match starting after every newline
|
||||
matches the subject "first\nand second" (where \n stands for a newline
|
||||
character), with the match starting at the seventh character. In order
|
||||
to do this, PCRE2 has to retry the match starting after every newline
|
||||
in the subject.
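
The retry behaviour for a pattern of the form .*literal without PCRE2_DOTALL can be modelled with a short C function. This is a simplified sketch, not PCRE2's matcher: the name dotstar_literal_start is hypothetical, and it assumes the text after .* is a plain literal containing no newline.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model (not a PCRE2 function): return the offset at which
   ".*literal" would start matching when dot does not match newline, or
   -1 if there is no match. Candidate start points are the beginning of
   the subject and the character after each newline. */
long dotstar_literal_start(const char *subject, const char *literal)
{
    const char *line = subject;
    size_t litlen = strlen(literal);

    for (;;) {
        size_t linelen = strcspn(line, "\n");   /* length up to newline */
        size_t i;

        /* .* can absorb any prefix of the line, so the literal only
           has to occur somewhere within this line. */
        for (i = 0; i + litlen <= linelen; i++)
            if (memcmp(line + i, literal, litlen) == 0)
                return (long)(line - subject);

        if (line[linelen] == '\0') return -1;   /* no more lines */
        line += linelen + 1;                    /* retry after newline */
    }
}
```

For the subject "first\nand second" and the literal "second" this returns 6, that is, the seventh character, in agreement with the description above.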
|
||||
|
||||
If you are using such a pattern with subject strings that do not con-
|
||||
tain newlines, the best performance is obtained by setting
|
||||
PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
|
||||
If you are using such a pattern with subject strings that do not con-
|
||||
tain newlines, the best performance is obtained by setting
|
||||
PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
|
||||
explicit anchoring. That saves PCRE2 from having to scan along the sub-
|
||||
ject looking for a newline to restart at.
|
||||
|
||||
Beware of patterns that contain nested indefinite repeats. These can
|
||||
take a long time to run when applied to a string that does not match.
|
||||
Beware of patterns that contain nested indefinite repeats. These can
|
||||
take a long time to run when applied to a string that does not match.
|
||||
Consider the pattern fragment
|
||||
|
||||
^(a+)*
|
||||
|
||||
This can match "aaaa" in 16 different ways, and this number increases
|
||||
very rapidly as the string gets longer. (The * repeat can match 0, 1,
|
||||
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
|
||||
repeats can match different numbers of times.) When the remainder of
|
||||
the pattern is such that the entire match is going to fail, PCRE2 has
|
||||
in principle to try every possible variation, and this can take an
|
||||
This can match "aaaa" in 16 different ways, and this number increases
|
||||
very rapidly as the string gets longer. (The * repeat can match 0, 1,
|
||||
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
|
||||
repeats can match different numbers of times.) When the remainder of
|
||||
the pattern is such that the entire match is going to fail, PCRE2 has
|
||||
in principle to try every possible variation, and this can take an
|
||||
extremely long time, even for relatively short strings.
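
The figure of 16 can be checked with a small C function that counts, for a run of n "a" characters, every way of choosing a matched prefix and splitting it into non-empty pieces, one piece per iteration of the group. The function name count_ways and the cap of 63 are assumptions made for this sketch.

```c
/* Largest run length whose count still fits in unsigned long long. */
#define MAX_RUN 63

/* Hypothetical helper (not part of PCRE2): number of distinct ways that
   ^(a+)* can match at the start of a run of n 'a' characters. Returns 0
   if n is too large for the result to be represented. */
unsigned long long count_ways(unsigned n)
{
    /* comp[k] = ways to split exactly k characters into an ordered
       sequence of non-empty pieces; comp[0] is the zero-iteration case. */
    unsigned long long comp[MAX_RUN + 1];
    unsigned long long total = 0;
    unsigned k, j;

    if (n > MAX_RUN) return 0;

    comp[0] = 1;
    for (k = 1; k <= n; k++) {
        comp[k] = 0;
        for (j = 1; j <= k; j++)      /* j = length of the last piece */
            comp[k] += comp[k - j];
    }

    /* The * repeat may stop after any prefix of length 0..n. */
    for (k = 0; k <= n; k++)
        total += comp[k];
    return total;
}
```

count_ways(4) gives 16, and each extra character doubles the total, so a run of 20 characters already allows 1048576 possibilities.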
|
||||
|
||||
An optimization catches some of the more simple cases such as
|
||||
|
||||
(a+)*b
|
||||
|
||||
where a literal character follows. Before embarking on the standard
|
||||
matching procedure, PCRE2 checks that there is a "b" later in the sub-
|
||||
ject string, and if there is not, it fails the match immediately. How-
|
||||
ever, when there is no following literal this optimization cannot be
|
||||
where a literal character follows. Before embarking on the standard
|
||||
matching procedure, PCRE2 checks that there is a "b" later in the sub-
|
||||
ject string, and if there is not, it fails the match immediately. How-
|
||||
ever, when there is no following literal this optimization cannot be
|
||||
used. You can see the difference by comparing the behaviour of
|
||||
|
||||
(a+)*\d
|
||||
|
||||
with the pattern above. The former gives a failure almost instantly
|
||||
when applied to a whole line of "a" characters, whereas the latter
|
||||
with the pattern above. The former gives a failure almost instantly
|
||||
when applied to a whole line of "a" characters, whereas the latter
|
||||
takes an appreciable time with strings longer than about 20 characters.
|
||||
|
||||
In many cases, the solution to this kind of performance issue is to use
|
||||
an atomic group or a possessive quantifier.
|
||||
an atomic group or a possessive quantifier. This can often reduce mem-
|
||||
ory requirements as well. As another example, consider this pattern:
|
||||
|
||||
([^<]|<(?!inet))+
|
||||
|
||||
It matches from wherever it starts until it encounters "<inet" or the
|
||||
end of the data, and is the kind of pattern that might be used when
|
||||
processing an XML file. Each iteration of the outer parentheses matches
|
||||
either one character that is not "<" or a "<" that is not followed by
|
||||
"inet". However, each time a parenthesis is processed, a backtracking
|
||||
position is passed, so this formulation uses a memory frame for each
|
||||
matched character. For a long string, a lot of memory is required. Con-
|
||||
sider now this rewritten pattern, which matches exactly the same
|
||||
strings:
|
||||
|
||||
([^<]++|<(?!inet))+
|
||||
|
||||
This runs much faster, because sequences of characters that do not con-
|
||||
tain "<" are "swallowed" in one item inside the parentheses, and a pos-
|
||||
sessive quantifier is used to stop any backtracking into the runs of
|
||||
non-"<" characters. This version also uses a lot less memory because
|
||||
entry to a new set of parentheses happens only when a "<" character
|
||||
that is not followed by "inet" is encountered (and we assume this is
|
||||
relatively rare).
|
||||
|
||||
This example shows that one way of optimizing performance when matching
|
||||
long subject strings is to write repeated parenthesized subpatterns to
|
||||
match more than one character whenever possible.
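
The difference between the two formulations can be made concrete by counting how many times the outer group is entered for a given subject. The following C sketch is a simplified model only: it does not reproduce PCRE2's actual frame accounting, and the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* True if the text at p begins with "<inet", where both patterns stop. */
static int at_inet(const char *p)
{
    return strncmp(p, "<inet", 5) == 0;
}

/* Model of ([^<]|<(?!inet))+ : one group iteration per character. */
size_t iterations_per_char(const char *s)
{
    size_t n = 0;
    while (*s != '\0' && !at_inet(s)) {
        s++;
        n++;
    }
    return n;
}

/* Model of ([^<]++|<(?!inet))+ : one group iteration per maximal run
   of non-'<' characters, plus one per '<' not followed by "inet". */
size_t iterations_per_run(const char *s)
{
    size_t n = 0;
    while (*s != '\0' && !at_inet(s)) {
        if (*s == '<')
            s++;                      /* a lone '<' */
        else
            s += strcspn(s, "<");     /* swallow the whole run */
        n++;
    }
    return n;
}
```

For the subject "<a>text</a><inet>" the first model counts 11 iterations and the second counts 4. The gap widens as the runs between "<" characters get longer.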
|
||||
|
||||
SETTING RESOURCE LIMITS
|
||||
|
||||
You can set limits on the amount of processing that takes place when
|
||||
matching, and on the amount of heap memory that is used. The default
|
||||
values of the limits are very large, and unlikely ever to operate. They
|
||||
can be changed when PCRE2 is built, and they can also be set when
|
||||
pcre2_match() or pcre2_dfa_match() is called. For details of these
|
||||
interfaces, see the pcre2build documentation and the section entitled
|
||||
"The match context" in the pcre2api documentation.
|
||||
|
||||
The pcre2test test program has a modifier called "find_limits" which,
|
||||
if applied to a subject line, causes it to find the smallest limits
|
||||
that allow a pattern to match. This is done by repeatedly matching with
|
||||
different limits.
|
||||
|
||||
|
||||
AUTHOR
|
||||
|
@ -8636,8 +8679,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 31 March 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
|
||||
.TH PCRE2PERFORM 3 "31 March 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 PERFORMANCE"
|
||||
|
@ -12,11 +12,11 @@ of them.
|
|||
.rs
|
||||
.sp
|
||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||
so that most simple patterns do not use much memory. However, there is one case
|
||||
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
||||
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
||||
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
||||
example, the pattern
|
||||
so that most simple patterns do not use much memory for storing the compiled
|
||||
version. However, there is one case where the memory usage of a compiled
|
||||
pattern can be unexpectedly large. If a parenthesized subpattern has a
|
||||
quantifier with a minimum greater than 1 and/or a limited maximum, the whole
|
||||
subpattern is repeated in the compiled code. For example, the pattern
|
||||
.sp
|
||||
(abc|def){2,4}
|
||||
.sp
|
||||
|
@ -34,13 +34,13 @@ example, the very simple pattern
|
|||
.sp
|
||||
((ab){1,1000}c){1,3}
|
||||
.sp
|
||||
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
||||
with its default internal pointer size of two bytes, the size limit on a
|
||||
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
||||
is reached with the above pattern if the outer repetition is increased from 3
|
||||
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
||||
larger compiled patterns, but it is better to try to rewrite your pattern to
|
||||
use less memory if you can.
|
||||
uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
|
||||
compiled with its default internal pointer size of two bytes, the size limit on
|
||||
a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
|
||||
this is reached with the above pattern if the outer repetition is increased
|
||||
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
|
||||
handle larger compiled patterns, but it is better to try to rewrite your
|
||||
pattern to use less memory if you can.
|
||||
.P
|
||||
One way of reducing the memory usage for such patterns is to make use of
|
||||
PCRE2's
|
||||
|
@ -52,32 +52,34 @@ facility. Re-writing the above pattern as
|
|||
.sp
|
||||
((ab)(?2){0,999}c)(?1){0,2}
|
||||
.sp
|
||||
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
||||
with the outer repetition increased to 100. However, this pattern is not
|
||||
exactly equivalent, because the "subroutine" calls are treated as
|
||||
.\" HTML <a href="pcre2pattern.html#atomicgroup">
|
||||
.\" </a>
|
||||
atomic groups
|
||||
.\"
|
||||
into which there can be no backtracking if there is a subsequent matching
|
||||
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
||||
Furthermore, there is a noticeable loss of speed when executing the modified
|
||||
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
||||
speed is acceptable, this kind of rewriting will allow you to process patterns
|
||||
that PCRE2 cannot otherwise handle.
|
||||
reduces the memory requirements to around 16K, and indeed it remains under 20K
|
||||
even with the outer repetition increased to 100. However, this kind of pattern
|
||||
is not always exactly equivalent, because any captures within subroutine calls
|
||||
are lost when the subroutine completes. If this is not a problem, this kind of
|
||||
rewriting will allow you to process patterns that PCRE2 cannot otherwise
|
||||
handle. The matching performance of the two different versions of the pattern
|
||||
is roughly the same. (This applies from release 10.30 - things were different
|
||||
in earlier releases.)
|
||||
.
|
||||
.
|
||||
.SH "STACK USAGE AT RUN TIME"
|
||||
.SH "STACK AND HEAP USAGE AT RUN TIME"
|
||||
.rs
|
||||
.sp
|
||||
When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
|
||||
cause it to use large amounts of the process stack. In some environments the
|
||||
default process stack is quite small, and if it runs out the result is often
|
||||
SIGSEGV. Rewriting your pattern can often help. The
|
||||
.\" HREF
|
||||
\fBpcre2stack\fP
|
||||
.\"
|
||||
documentation discusses this issue in detail.
|
||||
From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
|
||||
uses very little system stack at run time. In earlier releases recursive
|
||||
function calls could use a great deal of stack, and this could cause problems,
|
||||
but this usage has been eliminated. Backtracking positions are now explicitly
|
||||
remembered in memory frames controlled by the code. An initial 10K vector of
|
||||
frames is allocated on the system stack (enough for about 50 frames for small
|
||||
patterns), but if this is insufficient, heap memory is used. Rewriting patterns
|
||||
to be time-efficient, as described below, may also reduce the memory
|
||||
requirements.
|
||||
.P
|
||||
In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
|
||||
function calls, but only for processing atomic groups, lookaround assertions,
|
||||
and recursion within the pattern. Too much nested recursion may cause stack
|
||||
issues. The "match depth" parameter can be used to limit the depth of function
|
||||
recursion in \fBpcre2_dfa_match()\fP.
|
||||
.
|
||||
.
|
||||
.SH "PROCESSING TIME"
|
||||
|
@ -160,7 +162,59 @@ applied to a whole line of "a" characters, whereas the latter takes an
|
|||
appreciable time with strings longer than about 20 characters.
|
||||
.P
|
||||
In many cases, the solution to this kind of performance issue is to use an
|
||||
atomic group or a possessive quantifier.
|
||||
atomic group or a possessive quantifier. This can often reduce memory
|
||||
requirements as well. As another example, consider this pattern:
|
||||
.sp
|
||||
([^<]|<(?!inet))+
|
||||
.sp
|
||||
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||
the data, and is the kind of pattern that might be used when processing an XML
|
||||
file. Each iteration of the outer parentheses matches either one character that
|
||||
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||
parenthesis is processed, a backtracking position is passed, so this
|
||||
formulation uses a memory frame for each matched character. For a long string,
|
||||
a lot of memory is required. Consider now this rewritten pattern, which matches
|
||||
exactly the same strings:
|
||||
.sp
|
||||
([^<]++|<(?!inet))+
|
||||
.sp
|
||||
This runs much faster, because sequences of characters that do not contain "<"
|
||||
are "swallowed" in one item inside the parentheses, and a possessive quantifier
|
||||
is used to stop any backtracking into the runs of non-"<" characters. This
|
||||
version also uses a lot less memory because entry to a new set of parentheses
|
||||
happens only when a "<" character that is not followed by "inet" is encountered
|
||||
(and we assume this is relatively rare).
|
||||
.P
|
||||
This example shows that one way of optimizing performance when matching long
|
||||
subject strings is to write repeated parenthesized subpatterns to match more
|
||||
than one character whenever possible.
|
||||
.
|
||||
.
|
||||
.SS "SETTING RESOURCE LIMITS"
|
||||
.rs
|
||||
.sp
|
||||
You can set limits on the amount of processing that takes place when matching,
|
||||
and on the amount of heap memory that is used. The default values of the limits
|
||||
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
|
||||
built, and they can also be set when \fBpcre2_match()\fP or
|
||||
\fBpcre2_dfa_match()\fP is called. For details of these interfaces, see the
|
||||
.\" HREF
|
||||
\fBpcre2build\fP
|
||||
.\"
|
||||
documentation and the section entitled
|
||||
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||
.\" </a>
|
||||
"The match context"
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation.
|
||||
.P
|
||||
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
|
||||
applied to a subject line, causes it to find the smallest limits that allow a
|
||||
pattern to match. This is done by repeatedly matching with different limits.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
|
@ -177,6 +231,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 31 March 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
.fi
|
||||
|
|
212
doc/pcre2stack.3
|
@ -1,212 +0,0 @@
|
|||
.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 DISCUSSION OF STACK USAGE"
|
||||
.rs
|
||||
.sp
|
||||
When you call \fBpcre2_match()\fP, it makes use of an internal function called
|
||||
\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
|
||||
in order to remember the state of the match so that it can back up and try a
|
||||
different alternative after a failure. As matching proceeds deeper and deeper
|
||||
into the tree of possibilities, the recursion depth increases. The
|
||||
\fBmatch()\fP function is also called in other circumstances, for example,
|
||||
whenever a parenthesized sub-pattern is entered, and in certain cases of
|
||||
repetition.
|
||||
.P
|
||||
Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
|
||||
as a* it may be called several times at the same level, after matching
|
||||
different numbers of a's. Furthermore, in a number of cases where the result of
|
||||
the recursive call would immediately be passed back as the result of the
|
||||
current call (a "tail recursion"), the function is just restarted instead.
|
||||
.P
|
||||
Each time the internal \fBmatch()\fP function is called recursively, it uses
|
||||
memory from the process stack. For certain kinds of pattern and data, very
|
||||
large amounts of stack may be needed, despite the recognition of "tail
|
||||
recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
|
||||
of the GCC compiler, the stack requirements are greatly increased.
|
||||
.P
|
||||
The above comments apply when \fBpcre2_match()\fP is run in its normal
|
||||
interpretive manner. If the compiled pattern was processed by
|
||||
\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
|
||||
options passed to \fBpcre2_match()\fP were not incompatible, the matching
|
||||
process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
|
||||
this case, the memory requirements are handled entirely differently. See the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
.\"
|
||||
documentation for details.
|
||||
.P
|
||||
The \fBpcre2_dfa_match()\fP function operates in a different way to
|
||||
\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
|
||||
recursion or subroutine call in the pattern. This includes the processing of
|
||||
assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
||||
Normally, these are never very deep, and the limit on the complexity of
|
||||
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
|
||||
However, it is possible to write patterns with runaway infinite recursions;
|
||||
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
|
||||
limit is applied (see below).
|
||||
.P
|
||||
The comments in the next three sections do not apply to
|
||||
\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
|
||||
the JIT optimization.
|
||||
.
|
||||
.
|
||||
.SS "Reducing \fBpcre2_match()\fP's stack usage"
|
||||
.rs
|
||||
.sp
|
||||
You can often reduce the amount of recursion, and therefore the
|
||||
amount of stack used, by modifying the pattern that is being matched. Consider,
|
||||
for example, this pattern:
|
||||
.sp
|
||||
([^<]|<(?!inet))+
|
||||
.sp
|
||||
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||
the data, and is the kind of pattern that might be used when processing an XML
|
||||
file. Each iteration of the outer parentheses matches either one character that
|
||||
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||
parenthesis is processed, a recursion occurs, so this formulation uses a stack
|
||||
frame for each matched character. For a long string, a lot of stack is
|
||||
required. Consider now this rewritten pattern, which matches exactly the same
|
||||
strings:
|
||||
.sp
|
||||
([^<]++|<(?!inet))+
|
||||
.sp
|
||||
This uses very much less stack, because runs of characters that do not contain
|
||||
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
|
||||
when a "<" character that is not followed by "inet" is encountered (and we
|
||||
assume this is relatively rare). A possessive quantifier is used to stop any
|
||||
backtracking into the runs of non-"<" characters, but that is not related to
|
||||
stack usage.
|
||||
.P
|
||||
This example shows that one way of avoiding stack problems when matching long
|
||||
subject strings is to write repeated parenthesized subpatterns to match more
|
||||
than one character whenever possible.
|
||||
.
|
||||
.
|
||||
.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
|
||||
.rs
|
||||
.sp
|
||||
In environments where stack memory is constrained, you might want to compile
|
||||
PCRE2 to use heap memory instead of stack for remembering back-up points when
|
||||
\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
|
||||
of how to do this are given in the
|
||||
.\" HREF
|
||||
\fBpcre2build\fP
|
||||
.\"
|
||||
documentation. When built in this way, instead of using the stack, PCRE2
|
||||
gets memory for remembering backup points from the heap. By default, the memory
|
||||
is obtained by calling the system \fBmalloc()\fP function, but you can arrange
|
||||
to supply your own memory management function. For details, see the section
|
||||
entitled
|
||||
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||
.\" </a>
|
||||
"The match context"
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation. Since the block sizes are always the same, it may be possible to
|
||||
implement a customized memory handler that is more efficient than the standard
|
||||
function. The memory blocks obtained for this purpose are retained and re-used
|
||||
if possible while \fBpcre2_match()\fP is running. They are all freed just
|
||||
before it exits.
|
||||
.
|
||||
.
|
||||
.SS "Limiting \fBpcre2_match()\fP's stack usage"
|
||||
.rs
|
||||
.sp
|
||||
You can set limits on the number of times the internal \fBmatch()\fP function
|
||||
is called, both in total and recursively. If a limit is exceeded,
|
||||
\fBpcre2_match()\fP returns an error code. Setting suitable limits should
|
||||
prevent it from running out of stack. The default values of the limits are very
|
||||
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
|
||||
and they can also be set when \fBpcre2_match()\fP is called. For details of
|
||||
these interfaces, see the
|
||||
.\" HREF
|
||||
\fBpcre2build\fP
|
||||
.\"
|
||||
documentation and the section entitled
|
||||
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||
.\" </a>
|
||||
"The match context"
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation.
|
||||
.P
|
||||
As a very rough rule of thumb, you should reckon on about 500 bytes per
|
||||
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
|
||||
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
|
||||
around 128000 recursions.
|
||||
.P
|
||||
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
|
||||
applied to a subject line, causes it to find the smallest limits that allow a
|
||||
pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
|
||||
different limits.
|
||||
.
|
||||
.
|
||||
.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
|
||||
.rs
|
||||
.sp
|
||||
The recursion limit, as described above for \fBpcre2_match()\fP, also applies
|
||||
to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
|
||||
recursions in the pattern can lead to runaway stack usage. The non-recursive
|
||||
match limit is not relevant for DFA matching, and is ignored.
|
||||
.
|
||||
.
|
||||
.SS "Changing stack size in Unix-like systems"
|
||||
.rs
|
||||
.sp
|
||||
In Unix-like environments, there is not often a problem with the stack unless
|
||||
very long strings are involved, though the default limit on stack size varies
|
||||
from system to system. Values from 8Mb to 64Mb are common. You can find your
|
||||
default limit by running the command:
|
||||
.sp
|
||||
ulimit -s
|
||||
.sp
|
||||
Unfortunately, the effect of running out of stack is often SIGSEGV, though
|
||||
sometimes a more explicit error message is given. You can normally increase the
|
||||
limit on stack size by code such as this:
|
||||
.sp
|
||||
struct rlimit rlim;
|
||||
getrlimit(RLIMIT_STACK, &rlim);
|
||||
rlim.rlim_cur = 100*1024*1024;
|
||||
setrlimit(RLIMIT_STACK, &rlim);
|
||||
.sp
|
||||
This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
|
||||
attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
|
||||
do this before calling \fBpcre2_match()\fP.
|
||||
.
|
||||
.
|
||||
.SS "Changing stack size in Mac OS X"
|
||||
.rs
|
||||
.sp
|
||||
Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
|
||||
is also possible to set a stack size when linking a program. There is a
|
||||
discussion about stack sizes in Mac OS X at this web site:
|
||||
.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
|
||||
.\" </a>
|
||||
http://developer.apple.com/qa/qa2005/qa1419.html.
|
||||
.\"
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Cambridge, England.
|
||||
.fi
|
||||
.
|
||||
.
|
||||
.SH REVISION
|
||||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 December 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|