Documentation update

Philip.Hazel 2017-03-31 16:49:33 +00:00
parent a073581116
commit ed9f34b06b
10 changed files with 391 additions and 473 deletions

@ -103,7 +103,6 @@ dist_html_DATA = \
doc/html/pcre2posix.html \ doc/html/pcre2posix.html \
doc/html/pcre2sample.html \ doc/html/pcre2sample.html \
doc/html/pcre2serialize.html \ doc/html/pcre2serialize.html \
doc/html/pcre2stack.html \
doc/html/pcre2syntax.html \ doc/html/pcre2syntax.html \
doc/html/pcre2test.html \ doc/html/pcre2test.html \
doc/html/pcre2unicode.html doc/html/pcre2unicode.html
@ -187,7 +186,6 @@ dist_man_MANS = \
doc/pcre2posix.3 \ doc/pcre2posix.3 \
doc/pcre2sample.3 \ doc/pcre2sample.3 \
doc/pcre2serialize.3 \ doc/pcre2serialize.3 \
doc/pcre2stack.3 \
doc/pcre2syntax.3 \ doc/pcre2syntax.3 \
doc/pcre2test.1 \ doc/pcre2test.1 \
doc/pcre2unicode.3 doc/pcre2unicode.3

@ -68,9 +68,6 @@ first.
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td> <tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
<td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr> <td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
<td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td> <tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr> <td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>

@ -18,7 +18,8 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
<P> <P>
This document describes the differences in the ways that PCRE2 and Perl handle This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl regular expressions. The differences described here are with respect to Perl
versions 5.10 and above. version 5.24, but as both Perl and PCRE2 are continually changing, the
information may sometimes be out of date.
</P> </P>
<P> <P>
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does 1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
@ -27,17 +28,18 @@ have are given in the
page. page.
</P> </P>
<P> <P>
2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
do not mean what you might think. For example, (?!a){3} does not assert that they do not mean what you might think. For example, (?!a){3} does not assert
the next three characters are not "a". It just asserts that the next character that the next three characters are not "a". It just asserts that the next
is not "a" three times (in principle: PCRE2 optimizes this to run the assertion character is not "a" three times (in principle: PCRE2 optimizes this to run the
just once). Perl allows repeat quantifiers on other assertions such as \b, but assertion just once). Perl allows some repeat quantifiers on other assertions,
these do not seem to have any use. for example, \b* (but not \b{3}), but these do not seem to have any use.
</P> </P>
<P> <P>
3. Capturing subpatterns that occur inside negative lookahead assertions are 3. Capturing subpatterns that occur inside negative lookaround assertions are
counted, but their entries in the offsets vector are never set. Perl sometimes counted, but their entries in the offsets vector are set only if the assertion
(but not always) sets its numerical variables from inside negative assertions. is a condition. Perl has changed its behaviour in this regard from time to
time.
</P> </P>
<P> <P>
4. The following Perl escape sequences are not supported: \l, \u, \L, 4. The following Perl escape sequences are not supported: \l, \u, \L,
@ -50,13 +52,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
</P> </P>
<P> <P>
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
built with Unicode support. The properties that can be tested with \p and \P built with Unicode support (the default). The properties that can be tested
are limited to the general category properties such as Lu and Nd, script names with \p and \P are limited to the general category properties such as Lu and
such as Greek or Han, and the derived properties Any and L&. PCRE2 does support Nd, script names such as Greek or Han, and the derived properties Any and L&.
the Cs (surrogate) property, which Perl does not; the Perl documentation says PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
"Because Perl hides the need for the user to understand the internal documentation says "Because Perl hides the need for the user to understand the
representation of Unicode characters, there is no need to implement the internal representation of Unicode characters, there is no need to implement
somewhat messy concept of surrogates." the somewhat messy concept of surrogates."
</P> </P>
<P> <P>
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters 6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
@ -75,23 +77,15 @@ The \Q...\E sequence is recognized both inside and outside character classes.
</P> </P>
<P> <P>
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code}) 7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not constructions. However, there is support for PCRE2's "callout" feature, which
available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout" allows an external function to be called during pattern matching. See the
feature allows an external function to be called during pattern matching. See
the
<a href="pcre2callout.html"><b>pcre2callout</b></a> <a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details. documentation for details.
</P> </P>
<P> <P>
8. Subroutine calls (whether recursive or not) are treated as atomic groups. 8. Subroutine calls (whether recursive or not) were treated as atomic groups up
Atomic recursion is like Python, but unlike Perl. Captured values that are set to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
outside a subroutine call can be referenced from inside in PCRE2, but not in into subroutine calls is now supported, as in Perl.
Perl. There is a discussion that explains these differences in more detail in
the
<a href="pcre2pattern.html#recursiondifference">section on recursion differences from Perl</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page.
</P> </P>
<P> <P>
9. If any of the backtracking control verbs are used in a subpattern that is 9. If any of the backtracking control verbs are used in a subpattern that is
@ -147,14 +141,14 @@ certainly user mistakes.
16. In PCRE2, the upper/lower case character properties Lu and Ll are not 16. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \p{Lu} affected when case-independent matching is specified. For example, \p{Lu}
always matches an upper case letter. I think Perl has changed in this respect; always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
letters, regardless of case, when case independence is specified. letters, regardless of case, when case independence is specified.
</P> </P>
<P> <P>
17. PCRE2 provides some extensions to the Perl regular expression facilities. 17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some Perl 5.10 includes new features that are not in earlier versions of Perl, some
of which (such as named parentheses) have been in PCRE2 for some time. This of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.10: list is with respect to Perl 5.24:
<br> <br>
<br> <br>
(a) Although lookbehind assertions in PCRE2 must match fixed length strings, (a) Although lookbehind assertions in PCRE2 must match fixed length strings,
@ -220,9 +214,9 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 18 October 2016 Last updated: 29 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -173,7 +173,7 @@ below for a discussion of JIT stack usage.
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted circumstance when JIT is not used, but the details of exactly what is counted
are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
when JIT matching is used. when JIT matching is used.
<a name="stackcontrol"></a></P> <a name="stackcontrol"></a></P>
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br> <br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
@ -436,9 +436,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br> <br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 05 June 2016 Last updated: 30 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -44,14 +44,6 @@ integer type, usually defined as size_t. Its maximum value (that is
and unset offsets. and unset offsets.
</P> </P>
<P> <P>
Note that when using the traditional matching function, PCRE2 uses recursion to
handle subpatterns and indefinite repetition. This means that the available
stack space may limit the size of a subject string that can be processed by
certain patterns. For a discussion of stack issues, see the
<a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation.
</P>
<P>
All values in repeating quantifiers must be less than 65536. All values in repeating quantifiers must be less than 65536.
</P> </P>
<P> <P>
@ -94,9 +86,9 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 26 October 2016 Last updated: 30 March 2017
<br> <br>
Copyright &copy; 1997-2016 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
<ul> <ul>
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a> <li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a> <li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a> <li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a> <li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a> <li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a> <li><a name="TOC6" href="#SEC6">REVISION</a>
@ -29,11 +29,11 @@ of them.
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br> <br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
<P> <P>
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code, Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory. However, there is one case so that most simple patterns do not use much memory for storing the compiled
where the memory usage of a compiled pattern can be unexpectedly large. If a version. However, there is one case where the memory usage of a compiled
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or pattern can be unexpectedly large. If a parenthesized subpattern has a
a limited maximum, the whole subpattern is repeated in the compiled code. For quantifier with a minimum greater than 1 and/or a limited maximum, the whole
example, the pattern subpattern is repeated in the compiled code. For example, the pattern
<pre> <pre>
(abc|def){2,4} (abc|def){2,4}
</pre> </pre>
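
As a rough illustration of why a bounded quantifier enlarges the compiled pattern, the following C sketch expands group{min,max} textually into the replicated form that the documentation gives for this example, (abc|def)(abc|def)((abc|def)(abc|def)?)?. The names expand_bounded and append_text are hypothetical, the group is assumed to be already parenthesized, and this models only the growth in size; it is not how PCRE2's compiler is actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append s to out at *pos; return -1 if it would not fit (including the NUL). */
int append_text(char *out, size_t outsize, size_t *pos, const char *s)
{
    size_t n = strlen(s);
    if (*pos + n + 1 > outsize) return -1;
    memcpy(out + *pos, s, n + 1);   /* copies the terminating NUL as well */
    *pos += n;
    return 0;
}

/* Hypothetical helper: expand group{min,max} (0 <= min <= max, max finite)
   into min mandatory copies followed by (max - min) nested optional copies.
   The group is assumed to be a parenthesized subpattern such as "(abc|def)".
   Returns 0 on success, -1 on bad arguments or if out is too small. */
int expand_bounded(const char *group, int min, int max,
                   char *out, size_t outsize)
{
    size_t pos = 0;
    int optional = max - min;
    int i;

    assert(group != NULL && out != NULL);
    if (min < 0 || max < min || outsize == 0) return -1;
    out[0] = '\0';

    /* The mandatory copies. */
    for (i = 0; i < min; i++)
        if (append_text(out, outsize, &pos, group) != 0) return -1;

    /* The optional copies: every copy except the innermost opens a wrapper
       group, so that each later copy is reachable only if the earlier ones
       matched. */
    for (i = 0; i < optional; i++) {
        if (i < optional - 1 &&
            append_text(out, outsize, &pos, "(") != 0) return -1;
        if (append_text(out, outsize, &pos, group) != 0) return -1;
    }

    /* The innermost copy is made optional, then each wrapper is closed. */
    if (optional > 0 && append_text(out, outsize, &pos, "?") != 0) return -1;
    for (i = 0; i < optional - 1; i++)
        if (append_text(out, outsize, &pos, ")?") != 0) return -1;

    return 0;
}
```

Calling expand_bounded("(abc|def)", 2, 4, buf, sizeof buf) fills buf with (abc|def)(abc|def)((abc|def)(abc|def)?)?, which is four copies of the group for a quantifier whose maximum is 4. The output length grows linearly with the maximum, and nested bounded quantifiers multiply that growth.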
@ -52,13 +52,13 @@ example, the very simple pattern
<pre> <pre>
((ab){1,1000}c){1,3} ((ab){1,1000}c){1,3}
</pre> </pre>
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
with its default internal pointer size of two bytes, the size limit on a compiled with its default internal pointer size of two bytes, the size limit on
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
is reached with the above pattern if the outer repetition is increased from 3 this is reached with the above pattern if the outer repetition is increased
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
larger compiled patterns, but it is better to try to rewrite your pattern to handle larger compiled patterns, but it is better to try to rewrite your
use less memory if you can. pattern to use less memory if you can.
</P> </P>
<P> <P>
One way of reducing the memory usage for such patterns is to make use of One way of reducing the memory usage for such patterns is to make use of
@ -68,25 +68,33 @@ facility. Re-writing the above pattern as
<pre> <pre>
((ab)(?2){0,999}c)(?1){0,2} ((ab)(?2){0,999}c)(?1){0,2}
</pre> </pre>
reduces the memory requirements to 18K, and indeed it remains under 20K even reduces the memory requirements to around 16K, and indeed it remains under 20K
with the outer repetition increased to 100. However, this pattern is not even with the outer repetition increased to 100. However, this kind of pattern
exactly equivalent, because the "subroutine" calls are treated as is not always exactly equivalent, because any captures within subroutine calls
<a href="pcre2pattern.html#atomicgroup">atomic groups</a> are lost when the subroutine completes. If this is not a problem, this kind of
into which there can be no backtracking if there is a subsequent matching rewriting will allow you to process patterns that PCRE2 cannot otherwise
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically. handle. The matching performance of the two different versions of the pattern
Furthermore, there is a noticeable loss of speed when executing the modified is roughly the same. (This applies from release 10.30 - things were different
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of in earlier releases.)
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE2 cannot otherwise handle.
</P> </P>
<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br> <br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
<P> <P>
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
cause it to use large amounts of the process stack. In some environments the uses very little system stack at run time. In earlier releases recursive
default process stack is quite small, and if it runs out the result is often function calls could use a great deal of stack, and this could cause problems,
SIGSEGV. Rewriting your pattern can often help. The but this usage has been eliminated. Backtracking positions are now explicitly
<a href="pcre2stack.html"><b>pcre2stack</b></a> remembered in memory frames controlled by the code. An initial 10K vector of
documentation discusses this issue in detail. frames is allocated on the system stack (enough for about 50 frames for small
patterns), but if this is insufficient, heap memory is used. Rewriting patterns
to be time-efficient, as described below, may also reduce the memory
requirements.
</P>
<P>
In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
function calls, but only for processing atomic groups, lookaround assertions,
and recursion within the pattern. Too much nested recursion may cause stack
issues. The "match depth" parameter can be used to limit the depth of function
recursion in <b>pcre2_dfa_match()</b>.
</P> </P>
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br> <br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
<P> <P>
@ -175,7 +183,54 @@ appreciable time with strings longer than about 20 characters.
</P> </P>
<P> <P>
In many cases, the solution to this kind of performance issue is to use an In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier. atomic group or a possessive quantifier. This can often reduce memory
requirements as well. As another example, consider this pattern:
<pre>
([^&#60;]|&#60;(?!inet))+
</pre>
It matches from wherever it starts until it encounters "&#60;inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
parenthesis is processed, a backtracking position is passed, so this
formulation uses a memory frame for each matched character. For a long string,
a lot of memory is required. Consider now this rewritten pattern, which matches
exactly the same strings:
<pre>
([^&#60;]++|&#60;(?!inet))+
</pre>
This runs much faster, because sequences of characters that do not contain "&#60;"
are "swallowed" in one item inside the parentheses, and a possessive quantifier
is used to stop any backtracking into the runs of non-"&#60;" characters. This
version also uses a lot less memory because entry to a new set of parentheses
happens only when a "&#60;" character that is not followed by "inet" is encountered
(and we assume this is relatively rare).
</P>
<P>
This example shows that one way of optimizing performance when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
</P>
<br><b>
SETTING RESOURCE LIMITS
</b><br>
<P>
You can set limits on the amount of processing that takes place when matching,
and on the amount of heap memory that is used. The default values of the limits
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
built, and they can also be set when <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation and the section entitled
<a href="pcre2api.html#matchcontext">"The match context"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by repeatedly matching with different limits.
</P> </P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br> <br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P> <P>
@ -188,9 +243,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br> <br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 02 January 2015 Last updated: 31 March 2017
<br> <br>
Copyright &copy; 1997-2015 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.
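
The difference between the two "&#60;inet" patterns discussed in the PROCESSING TIME hunk above can be made concrete with a small model. The C sketch below does not call PCRE2. It simply walks a subject string the way a successful greedy match would, and counts how many times the outer parenthesized group is entered for each formulation. The function names are hypothetical, and the counts are only a stand-in for the number of backtracking frames, under the assumption stated in the text that each pass through the parentheses records one backtracking position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of ([^<]|<(?!inet))+ : every matched character costs one iteration
   of the outer group. Illustrative only; this is not PCRE2 code. */
size_t iterations_per_char(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        if (*s == '<' && strncmp(s, "<inet", 5) == 0)
            break;              /* neither alternative can match here */
        s++;                    /* one character consumed per iteration */
        count++;
    }
    return count;
}

/* Model of ([^<]++|<(?!inet))+ : a whole run of non-'<' characters is
   swallowed by a single iteration of the outer group. */
size_t iterations_per_run(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        if (*s == '<') {
            if (strncmp(s, "<inet", 5) == 0)
                break;                  /* stop at "<inet" */
            s++;                        /* a '<' not followed by "inet" */
        } else {
            s += strcspn(s, "<");       /* skip the run up to the next '<' */
        }
        count++;
    }
    return count;
}
```

For the subject abc&#60;b&#62;def&#60;inet, iterations_per_char() returns 9 (one per character before "&#60;inet"), whereas iterations_per_run() returns 3 (the run "abc", the lone "&#60;", and the run "b&#62;def"). The gap widens as the runs between "&#60;" characters get longer, which is the effect the rewritten pattern relies on.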

@ -68,9 +68,6 @@ first.
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td> <tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
<td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr> <td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
<td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td> <tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr> <td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>

@ -4097,45 +4097,46 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
This document describes the differences in the ways that PCRE2 and Perl This document describes the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with handle regular expressions. The differences described here are with
respect to Perl versions 5.10 and above. respect to Perl version 5.24, but as both Perl and PCRE2 are
continually changing, the information may sometimes be out of date.
1. PCRE2 has only a subset of Perl's Unicode support. Details of what 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
it does have are given in the pcre2unicode page. it does have are given in the pcre2unicode page.
2. PCRE2 allows repeat quantifiers only on parenthesized assertions, 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
but they do not mean what you might think. For example, (?!a){3} does tions, but they do not mean what you might think. For example, (?!a){3}
not assert that the next three characters are not "a". It just asserts does not assert that the next three characters are not "a". It just
that the next character is not "a" three times (in principle: PCRE2 asserts that the next character is not "a" three times (in principle:
optimizes this to run the assertion just once). Perl allows repeat PCRE2 optimizes this to run the assertion just once). Perl allows some
quantifiers on other assertions such as \b, but these do not seem to repeat quantifiers on other assertions, for example, \b* (but not
have any use. \b{3}), but these do not seem to have any use.
3. Capturing subpatterns that occur inside negative lookahead asser- 3. Capturing subpatterns that occur inside negative lookaround asser-
tions are counted, but their entries in the offsets vector are never tions are counted, but their entries in the offsets vector are set only
set. Perl sometimes (but not always) sets its numerical variables from if the assertion is a condition. Perl has changed its behaviour in this
inside negative assertions. regard from time to time.
4. The following Perl escape sequences are not supported: \l, \u, \L, 4. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N when followed by a character name or Unicode value. (\N on \U, and \N when followed by a character name or Unicode value. (\N on
its own, matching a non-newline character, is supported.) In fact these its own, matching a non-newline character, is supported.) In fact these
are implemented by Perl's general string-handling and are not part of are implemented by Perl's general string-handling and are not part of
its pattern matching engine. If any of these are encountered by PCRE2, its pattern matching engine. If any of these are encountered by PCRE2,
an error is generated by default. However, if the PCRE2_ALT_BSUX option an error is generated by default. However, if the PCRE2_ALT_BSUX option
is set, \U and \u are interpreted as ECMAScript interprets them. is set, \U and \u are interpreted as ECMAScript interprets them.
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
is built with Unicode support. The properties that can be tested with is built with Unicode support (the default). The properties that can be
\p and \P are limited to the general category properties such as Lu and tested with \p and \P are limited to the general category properties
Nd, script names such as Greek or Han, and the derived properties Any such as Lu and Nd, script names such as Greek or Han, and the derived
and L&. PCRE2 does support the Cs (surrogate) property, which Perl does properties Any and L&. PCRE2 does support the Cs (surrogate) property,
not; the Perl documentation says "Because Perl hides the need for the which Perl does not; the Perl documentation says "Because Perl hides
user to understand the internal representation of Unicode characters, the need for the user to understand the internal representation of Uni-
there is no need to implement the somewhat messy concept of surro- code characters, there is no need to implement the somewhat messy con-
gates." cept of surrogates."
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char- 6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
acters in between are treated as literals. This is slightly different acters in between are treated as literals. This is slightly different
from Perl in that $ and @ are also handled as literals inside the from Perl in that $ and @ are also handled as literals inside the
quotes. In Perl, they cause variable interpolation (but of course PCRE2 quotes. In Perl, they cause variable interpolation (but of course PCRE2
does not have variables). Note the following examples: does not have variables). Note the following examples:
@ -4146,22 +4147,17 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
\Qabc\$xyz\E abc\$xyz abc\$xyz \Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
The \Q...\E sequence is recognized both inside and outside character The \Q...\E sequence is recognized both inside and outside character
classes. classes.
7. Fairly obviously, PCRE2 does not support the (?{code}) and 7. Fairly obviously, PCRE2 does not support the (?{code}) and
(??{code}) constructions. However, there is support for recursive pat- (??{code}) constructions. However, there is support for PCRE2's "callout"
terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also, feature, which allows an external function to be called during pattern
the PCRE2 "callout" feature allows an external function to be called matching. See the pcre2callout documentation for details.
during pattern matching. See the pcre2callout documentation for
details.
8. Subroutine calls (whether recursive or not) are treated as atomic 8. Subroutine calls (whether recursive or not) were treated as atomic
groups. Atomic recursion is like Python, but unlike Perl. Captured groups up to PCRE2 release 10.23, but from release 10.30 this changed,
values that are set outside a subroutine call can be referenced from and backtracking into subroutine calls is now supported, as in Perl.
inside in PCRE2, but not in Perl. There is a discussion that explains
these differences in more detail in the section on recursion differ-
ences from Perl in the pcre2pattern page.
9. If any of the backtracking control verbs are used in a subpattern 9. If any of the backtracking control verbs are used in a subpattern
that is called as a subroutine (whether or not recursively), their that is called as a subroutine (whether or not recursively), their
@ -4211,14 +4207,14 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
16. In PCRE2, the upper/lower case character properties Lu and Ll are 16. In PCRE2, the upper/lower case character properties Lu and Ll are
not affected when case-independent matching is specified. For example, not affected when case-independent matching is specified. For example,
\p{Lu} always matches an upper case letter. I think Perl has changed in \p{Lu} always matches an upper case letter. I think Perl has changed in
this respect; in the release at the time of writing (5.16), \p{Lu} and this respect; in the release at the time of writing (5.24), \p{Lu} and
\p{Ll} match all letters, regardless of case, when case independence is \p{Ll} match all letters, regardless of case, when case independence is
specified. specified.
17. PCRE2 provides some extensions to the Perl regular expression 17. PCRE2 provides some extensions to the Perl regular expression
facilities. Perl 5.10 includes new features that are not in earlier facilities. Perl 5.10 includes new features that are not in earlier
versions of Perl, some of which (such as named parentheses) have been versions of Perl, some of which (such as named parentheses) were in
in PCRE2 for some time. This list is with respect to Perl 5.10: PCRE2 for some time before. This list is with respect to Perl 5.24:
(a) Although lookbehind assertions in PCRE2 must match fixed length (a) Although lookbehind assertions in PCRE2 must match fixed length
strings, each alternative branch of a lookbehind assertion can match a strings, each alternative branch of a lookbehind assertion can match a
@ -4271,8 +4267,8 @@ AUTHOR
REVISION REVISION
Last updated: 18 October 2016 Last updated: 29 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -4420,8 +4416,8 @@ RETURN VALUES FROM JIT MATCHING
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
searching a very large pattern tree goes on for too long, as it is in searching a very large pattern tree goes on for too long, as it is in
the same circumstance when JIT is not used, but the details of exactly the same circumstance when JIT is not used, but the details of exactly
what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
code is never returned when JIT matching is used. is never returned when JIT matching is used.
CONTROLLING THE JIT STACK CONTROLLING THE JIT STACK
@ -4668,8 +4664,8 @@ AUTHOR
REVISION REVISION
Last updated: 05 June 2016 Last updated: 30 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -4706,12 +4702,6 @@ SIZE AND OTHER LIMITATIONS
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero- (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
terminated strings and unset offsets. terminated strings and unset offsets.
Note that when using the traditional matching function, PCRE2 uses
recursion to handle subpatterns and indefinite repetition. This means
that the available stack space may limit the size of a subject string
that can be processed by certain patterns. For a discussion of stack
issues, see the pcre2stack documentation.
All values in repeating quantifiers must be less than 65536. All values in repeating quantifiers must be less than 65536.
The maximum length of a lookbehind assertion is 65535 characters. The maximum length of a lookbehind assertion is 65535 characters.
@ -4745,8 +4735,8 @@ AUTHOR
REVISION REVISION
Last updated: 26 October 2016 Last updated: 30 March 2017
Copyright (c) 1997-2016 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -8485,11 +8475,12 @@ PCRE2 PERFORMANCE
COMPILED PATTERN MEMORY USAGE COMPILED PATTERN MEMORY USAGE
Patterns are compiled by PCRE2 into a reasonably efficient interpretive Patterns are compiled by PCRE2 into a reasonably efficient interpretive
code, so that most simple patterns do not use much memory. However, code, so that most simple patterns do not use much memory for storing
there is one case where the memory usage of a compiled pattern can be the compiled version. However, there is one case where the memory usage
unexpectedly large. If a parenthesized subpattern has a quantifier with of a compiled pattern can be unexpectedly large. If a parenthesized
a minimum greater than 1 and/or a limited maximum, the whole subpattern subpattern has a quantifier with a minimum greater than 1 and/or a lim-
is repeated in the compiled code. For example, the pattern ited maximum, the whole subpattern is repeated in the compiled code.
For example, the pattern
(abc|def){2,4} (abc|def){2,4}
@ -8497,134 +8488,186 @@ COMPILED PATTERN MEMORY USAGE
(abc|def)(abc|def)((abc|def)(abc|def)?)? (abc|def)(abc|def)((abc|def)(abc|def)?)?
(Technical aside: It is done this way so that backtrack points within (Technical aside: It is done this way so that backtrack points within
each of the repetitions can be independently maintained.) each of the repetitions can be independently maintained.)
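As a rough illustration of the expansion described above, the following C sketch builds the expanded form of a quantified group as text. The helper expand_repeat() is hypothetical and is not part of PCRE2, which performs the repetition in its compiled code rather than on pattern strings. The sketch only shows why the size of the result grows with the repeat counts.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: return a newly allocated string
   holding the textual expansion of group{min,max}. "group" must include its
   own parentheses. Returns NULL if max < 1, if min > max, or if memory cannot
   be obtained. The caller frees the result. */
char *expand_repeat(const char *group, unsigned min, unsigned max)
{
  size_t glen = strlen(group);
  unsigned optional, i;
  char *out, *p;

  if (max < 1 || min > max) return NULL;
  optional = max - min;

  /* Room for max copies, at most 3 extra characters per copy, and a NUL. */
  out = malloc((size_t)max * glen + 3 * (size_t)max + 1);
  if (out == NULL) return NULL;
  p = out;

  /* The minimum number of copies is simply repeated. */
  for (i = 0; i < min; i++) { memcpy(p, group, glen); p += glen; }

  if (optional > 0)
    {
    /* Every optional copy except the last opens a nested group. */
    for (i = 1; i < optional; i++)
      { *p++ = '('; memcpy(p, group, glen); p += glen; }

    /* The innermost optional copy. */
    memcpy(p, group, glen); p += glen;
    *p++ = '?';

    /* Close the nested groups, each of which is itself optional. */
    for (i = 1; i < optional; i++) { *p++ = ')'; *p++ = '?'; }
    }

  *p = 0;
  return out;
}
```

Calling expand_repeat("(abc|def)", 2, 4) yields the expanded pattern shown above. The length of the result is proportional to the maximum repeat count, which is the effect that makes large or nested counts expensive.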
For regular expressions whose quantifiers use only small numbers, this For regular expressions whose quantifiers use only small numbers, this
is not usually a problem. However, if the numbers are large, and par- is not usually a problem. However, if the numbers are large, and par-
ticularly if such repetitions are nested, the memory usage can become ticularly if such repetitions are nested, the memory usage can become
an embarrassment. For example, the very simple pattern an embarrassment. For example, the very simple pattern
((ab){1,1000}c){1,3} ((ab){1,1000}c){1,3}
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is uses over 50K bytes when compiled using the 8-bit library. When PCRE2
compiled with its default internal pointer size of two bytes, the size is compiled with its default internal pointer size of two bytes, the
limit on a compiled pattern is 64K code units in the 8-bit and 16-bit size limit on a compiled pattern is 64K code units in the 8-bit and
libraries, and this is reached with the above pattern if the outer rep- 16-bit libraries, and this is reached with the above pattern if the
etition is increased from 3 to 4. PCRE2 can be compiled to use larger outer repetition is increased from 3 to 4. PCRE2 can be compiled to use
internal pointers and thus handle larger compiled patterns, but it is larger internal pointers and thus handle larger compiled patterns, but
better to try to rewrite your pattern to use less memory if you can. it is better to try to rewrite your pattern to use less memory if you
can.
One way of reducing the memory usage for such patterns is to make use One way of reducing the memory usage for such patterns is to make use
of PCRE2's "subroutine" facility. Re-writing the above pattern as of PCRE2's "subroutine" facility. Re-writing the above pattern as
((ab)(?2){0,999}c)(?1){0,2} ((ab)(?2){0,999}c)(?1){0,2}
reduces the memory requirements to 18K, and indeed it remains under 20K reduces the memory requirements to around 16K, and indeed it remains
even with the outer repetition increased to 100. However, this pattern under 20K even with the outer repetition increased to 100. However,
is not exactly equivalent, because the "subroutine" calls are treated this kind of pattern is not always exactly equivalent, because any cap-
as atomic groups into which there can be no backtracking if there is a tures within subroutine calls are lost when the subroutine completes.
subsequent matching failure. Therefore, PCRE2 cannot do this kind of If this is not a problem, this kind of rewriting will allow you to
rewriting automatically. Furthermore, there is a noticeable loss of process patterns that PCRE2 cannot otherwise handle. The matching per-
speed when executing the modified pattern. Nevertheless, if the atomic formance of the two different versions of the pattern are roughly the
grouping is not a problem and the loss of speed is acceptable, this same. (This applies from release 10.30 - things were different in ear-
kind of rewriting will allow you to process patterns that PCRE2 cannot lier releases.)
otherwise handle.
STACK USAGE AT RUN TIME STACK AND HEAP USAGE AT RUN TIME
When pcre2_match() is used for matching, certain kinds of pattern can From release 10.30, the interpretive (non-JIT) version of pcre2_match()
cause it to use large amounts of the process stack. In some environ- uses very little system stack at run time. In earlier releases recur-
ments the default process stack is quite small, and if it runs out the sive function calls could use a great deal of stack, and this could
result is often SIGSEGV. Rewriting your pattern can often help. The cause problems, but this usage has been eliminated. Backtracking posi-
pcre2stack documentation discusses this issue in detail. tions are now explicitly remembered in memory frames controlled by the
code. An initial 10K vector of frames is allocated on the system stack
(enough for about 50 frames for small patterns), but if this is insuf-
ficient, heap memory is used. Rewriting patterns to be time-efficient,
as described below, may also reduce the memory requirements.
In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
function calls, but only for processing atomic groups, lookaround
assertions, and recursion within the pattern. Too much nested recursion
may cause stack issues. The "match depth" parameter can be used to
limit the depth of function recursion in pcre2_dfa_match().
PROCESSING TIME PROCESSING TIME
Certain items in regular expression patterns are processed more effi- Certain items in regular expression patterns are processed more effi-
ciently than others. It is more efficient to use a character class like ciently than others. It is more efficient to use a character class like
[aeiou] than a set of single-character alternatives such as [aeiou] than a set of single-character alternatives such as
(a|e|i|o|u). In general, the simplest construction that provides the (a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book required behaviour is usually the most efficient. Jeffrey Friedl's book
contains a lot of useful general discussion about optimizing regular contains a lot of useful general discussion about optimizing regular
expressions for efficient performance. This document contains a few expressions for efficient performance. This document contains a few
observations about PCRE2. observations about PCRE2.
Using Unicode character properties (the \p, \P, and \X escapes) is Using Unicode character properties (the \p, \P, and \X escapes) is
slow, because PCRE2 has to use a multi-stage table lookup whenever it slow, because PCRE2 has to use a multi-stage table lookup whenever it
needs a character's property. If you can find an alternative pattern needs a character's property. If you can find an alternative pattern
that does not use character properties, it will probably be faster. that does not use character properties, it will probably be faster.
By default, the escape sequences \b, \d, \s, and \w, and the POSIX By default, the escape sequences \b, \d, \s, and \w, and the POSIX
character classes such as [:alpha:] do not use Unicode properties, character classes such as [:alpha:] do not use Unicode properties,
partly for backwards compatibility, and partly for performance reasons. partly for backwards compatibility, and partly for performance reasons.
However, you can set the PCRE2_UCP option or start the pattern with However, you can set the PCRE2_UCP option or start the pattern with
(*UCP) if you want Unicode character properties to be used. This can (*UCP) if you want Unicode character properties to be used. This can
double the matching time for items such as \d, when matched with double the matching time for items such as \d, when matched with
pcre2_match(); the performance loss is less with a DFA matching func- pcre2_match(); the performance loss is less with a DFA matching func-
tion, and in both cases there is not much difference for \b. tion, and in both cases there is not much difference for \b.
When a pattern begins with .* not in atomic parentheses, nor in paren- When a pattern begins with .* not in atomic parentheses, nor in paren-
theses that are the subject of a backreference, and the PCRE2_DOTALL theses that are the subject of a backreference, and the PCRE2_DOTALL
option is set, the pattern is implicitly anchored by PCRE2, since it option is set, the pattern is implicitly anchored by PCRE2, since it
can match only at the start of a subject string. If the pattern has can match only at the start of a subject string. If the pattern has
multiple top-level branches, they must all be anchorable. The optimiza- multiple top-level branches, they must all be anchorable. The optimiza-
tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
automatically disabled if the pattern contains (*PRUNE) or (*SKIP). automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
because the dot metacharacter does not then match a newline, and if the because the dot metacharacter does not then match a newline, and if the
subject string contains newlines, the pattern may match from the char- subject string contains newlines, the pattern may match from the char-
acter immediately following one of them instead of from the very start. acter immediately following one of them instead of from the very start.
For example, the pattern For example, the pattern
.*second .*second
matches the subject "first\nand second" (where \n stands for a newline matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order character), with the match starting at the seventh character. In order
to do this, PCRE2 has to retry the match starting after every newline to do this, PCRE2 has to retry the match starting after every newline
in the subject. in the subject.
If you are using such a pattern with subject strings that do not con- If you are using such a pattern with subject strings that do not con-
tain newlines, the best performance is obtained by setting tain newlines, the best performance is obtained by setting
PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
explicit anchoring. That saves PCRE2 from having to scan along the sub- explicit anchoring. That saves PCRE2 from having to scan along the sub-
ject looking for a newline to restart at. ject looking for a newline to restart at.
Beware of patterns that contain nested indefinite repeats. These can Beware of patterns that contain nested indefinite repeats. These can
take a long time to run when applied to a string that does not match. take a long time to run when applied to a string that does not match.
Consider the pattern fragment Consider the pattern fragment
^(a+)* ^(a+)*
This can match "aaaa" in 16 different ways, and this number increases This can match "aaaa" in 16 different ways, and this number increases
very rapidly as the string gets longer. (The * repeat can match 0, 1, very rapidly as the string gets longer. (The * repeat can match 0, 1,
2, 3, or 4 times, and for each of those cases other than 0 or 4, the + 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
repeats can match different numbers of times.) When the remainder of repeats can match different numbers of times.) When the remainder of
the pattern is such that the entire match is going to fail, PCRE2 has the pattern is such that the entire match is going to fail, PCRE2 has
in principle to try every possible variation, and this can take an in principle to try every possible variation, and this can take an
extremely long time, even for relatively short strings. extremely long time, even for relatively short strings.
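The figure of 16 can be checked with a small counting sketch in C. The function count_match_paths() is a hypothetical name, and it counts distinct ways of dividing a prefix of the subject among the repeats in an abstract model. It does not measure the steps that PCRE2 actually takes, which depend on its optimizations.

```c
#define MAX_RUN 63  /* largest n whose result fits in 64 bits */

/* Hypothetical helper: count the ways in which (a+)* can match at the start
   of a run of n "a" characters. ways[k] is the number of ways of splitting
   exactly k characters into an ordered sequence of non-empty a+ matches; the
   result adds these up for every prefix length from 0 to n. Returns 0 if n is
   too large for the result to be represented. */
unsigned long long count_match_paths(unsigned n)
{
  unsigned long long ways[MAX_RUN + 1];
  unsigned long long total;
  unsigned k, j;

  if (n > MAX_RUN) return 0;

  ways[0] = 1;    /* zero iterations of the group match the empty string */
  total = 1;

  for (k = 1; k <= n; k++)
    {
    ways[k] = 0;
    /* The last a+ takes j characters; the rest split the remaining k-j. */
    for (j = 1; j <= k; j++) ways[k] += ways[k - j];
    total += ways[k];
    }

  return total;
}
```

In this model the count doubles with each additional character: count_match_paths(4) is 16, and count_match_paths(20) is 1048576.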
An optimization catches some of the more simple cases such as An optimization catches some of the more simple cases such as
(a+)*b (a+)*b
where a literal character follows. Before embarking on the standard where a literal character follows. Before embarking on the standard
matching procedure, PCRE2 checks that there is a "b" later in the sub- matching procedure, PCRE2 checks that there is a "b" later in the sub-
ject string, and if there is not, it fails the match immediately. How- ject string, and if there is not, it fails the match immediately. How-
ever, when there is no following literal this optimization cannot be ever, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of used. You can see the difference by comparing the behaviour of
(a+)*\d (a+)*\d
with the pattern above. The former gives a failure almost instantly with the pattern above. The former gives a failure almost instantly
when applied to a whole line of "a" characters, whereas the latter when applied to a whole line of "a" characters, whereas the latter
takes an appreciable time with strings longer than about 20 characters. takes an appreciable time with strings longer than about 20 characters.
In many cases, the solution to this kind of performance issue is to use In many cases, the solution to this kind of performance issue is to use
an atomic group or a possessive quantifier. an atomic group or a possessive quantifier. This can often reduce mem-
ory requirements as well. As another example, consider this pattern:
([^<]|<(?!inet))+
It matches from wherever it starts until it encounters "<inet" or the
end of the data, and is the kind of pattern that might be used when
processing an XML file. Each iteration of the outer parentheses matches
either one character that is not "<" or a "<" that is not followed by
"inet". However, each time a parenthesis is processed, a backtracking
position is passed, so this formulation uses a memory frame for each
matched character. For a long string, a lot of memory is required. Con-
sider now this rewritten pattern, which matches exactly the same
strings:
([^<]++|<(?!inet))+
This runs much faster, because sequences of characters that do not con-
tain "<" are "swallowed" in one item inside the parentheses, and a pos-
sessive quantifier is used to stop any backtracking into the runs of
non-"<" characters. This version also uses a lot less memory because
entry to a new set of parentheses happens only when a "<" character
that is not followed by "inet" is encountered (and we assume this is
relatively rare).
This example shows that one way of optimizing performance when matching
long subject strings is to write repeated parenthesized subpatterns to
match more than one character whenever possible.
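The difference between the two formulations can be sketched by counting how many times the outer parentheses are entered for a given subject. The two C functions below are hypothetical and walk the subject by hand instead of calling PCRE2. They assume, as the text above states, that each entry to the parentheses costs one memory frame; the real accounting inside PCRE2 may differ in detail.

```c
#include <string.h>

/* True if the text at s is "<" followed by "inet", where both patterns stop. */
static int at_stop(const char *s)
{
  return strncmp(s, "<inet", 5) == 0;
}

/* Iterations of the outer group for ([^<]|<(?!inet))+ : one per character. */
size_t iterations_one_char(const char *s)
{
  size_t count = 0;
  while (*s != 0 && !at_stop(s)) { s++; count++; }
  return count;
}

/* Iterations of the outer group for ([^<]++|<(?!inet))+ : one per run of
   non-"<" characters, plus one per "<" that is not followed by "inet". */
size_t iterations_runs(const char *s)
{
  size_t count = 0;
  while (*s != 0 && !at_stop(s))
    {
    if (*s == '<') s++;                        /* a lone "<" */
    else while (*s != 0 && *s != '<') s++;     /* a whole run in one item */
    count++;
    }
  return count;
}
```

For the subject "abc<b>def<inet x" the first function returns 9 and the second returns 3. The gap widens as the runs of non-"<" characters get longer.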
SETTING RESOURCE LIMITS
You can set limits on the amount of processing that takes place when
matching, and on the amount of heap memory that is used. The default
values of the limits are very large, and unlikely ever to operate. They
can be changed when PCRE2 is built, and they can also be set when
pcre2_match() or pcre2_dfa_match() is called. For details of these
interfaces, see the pcre2build documentation and the section entitled
"The match context" in the pcre2api documentation.
The pcre2test test program has a modifier called "find_limits" which,
if applied to a subject line, causes it to find the smallest limits
that allow a pattern to match. This is done by repeatedly matching with
different limits.
AUTHOR AUTHOR
@ -8636,8 +8679,8 @@ AUTHOR
REVISION REVISION
Last updated: 02 January 2015 Last updated: 31 March 2017
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00" .TH PCRE2PERFORM 3 "31 March 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE" .SH "PCRE2 PERFORMANCE"
@ -12,11 +12,11 @@ of them.
.rs .rs
.sp .sp
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code, Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory. However, there is one case so that most simple patterns do not use much memory for storing the compiled
where the memory usage of a compiled pattern can be unexpectedly large. If a version. However, there is one case where the memory usage of a compiled
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or pattern can be unexpectedly large. If a parenthesized subpattern has a
a limited maximum, the whole subpattern is repeated in the compiled code. For quantifier with a minimum greater than 1 and/or a limited maximum, the whole
example, the pattern subpattern is repeated in the compiled code. For example, the pattern
.sp .sp
(abc|def){2,4} (abc|def){2,4}
.sp .sp
@ -34,13 +34,13 @@ example, the very simple pattern
.sp .sp
((ab){1,1000}c){1,3} ((ab){1,1000}c){1,3}
.sp .sp
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
with its default internal pointer size of two bytes, the size limit on a compiled with its default internal pointer size of two bytes, the size limit on
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
is reached with the above pattern if the outer repetition is increased from 3 this is reached with the above pattern if the outer repetition is increased
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
larger compiled patterns, but it is better to try to rewrite your pattern to handle larger compiled patterns, but it is better to try to rewrite your
use less memory if you can. pattern to use less memory if you can.
.P .P
One way of reducing the memory usage for such patterns is to make use of One way of reducing the memory usage for such patterns is to make use of
PCRE2's PCRE2's
@ -52,32 +52,34 @@ facility. Re-writing the above pattern as
.sp .sp
((ab)(?2){0,999}c)(?1){0,2} ((ab)(?2){0,999}c)(?1){0,2}
.sp .sp
reduces the memory requirements to 18K, and indeed it remains under 20K even reduces the memory requirements to around 16K, and indeed it remains under 20K
with the outer repetition increased to 100. However, this pattern is not even with the outer repetition increased to 100. However, this kind of pattern
exactly equivalent, because the "subroutine" calls are treated as is not always exactly equivalent, because any captures within subroutine calls
.\" HTML <a href="pcre2pattern.html#atomicgroup"> are lost when the subroutine completes. If this is not a problem, this kind of
.\" </a> rewriting will allow you to process patterns that PCRE2 cannot otherwise
atomic groups handle. The matching performance of the two different versions of the pattern
.\" are roughly the same. (This applies from release 10.30 - things were different
into which there can be no backtracking if there is a subsequent matching in earlier releases.)
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE2 cannot otherwise handle.
. .
. .
.SH "STACK USAGE AT RUN TIME" .SH "STACK AND HEAP USAGE AT RUN TIME"
.rs .rs
.sp .sp
When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
cause it to use large amounts of the process stack. In some environments the uses very little system stack at run time. In earlier releases recursive
default process stack is quite small, and if it runs out the result is often function calls could use a great deal of stack, and this could cause problems,
SIGSEGV. Rewriting your pattern can often help. The but this usage has been eliminated. Backtracking positions are now explicitly
.\" HREF remembered in memory frames controlled by the code. An initial 10K vector of
\fBpcre2stack\fP frames is allocated on the system stack (enough for about 50 frames for small
.\" patterns), but if this is insufficient, heap memory is used. Rewriting patterns
documentation discusses this issue in detail. to be time-efficient, as described below, may also reduce the memory
requirements.
.P
In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
function calls, but only for processing atomic groups, lookaround assertions,
and recursion within the pattern. Too much nested recursion may cause stack
issues. The "match depth" parameter can be used to limit the depth of function
recursion in \fBpcre2_dfa_match()\fP.
. .
. .
.SH "PROCESSING TIME" .SH "PROCESSING TIME"
@ -160,7 +162,59 @@ applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters. appreciable time with strings longer than about 20 characters.
.P .P
In many cases, the solution to this kind of performance issue is to use an In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier. atomic group or a possessive quantifier. This can often reduce memory
requirements as well. As another example, consider this pattern:
.sp
([^<]|<(?!inet))+
.sp
It matches from wherever it starts until it encounters "<inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "<" or a "<" that is not followed by "inet". However, each time a
parenthesis is processed, a backtracking position is passed, so this
formulation uses a memory frame for each matched character. For a long string,
a lot of memory is required. Consider now this rewritten pattern, which matches
exactly the same strings:
.sp
([^<]++|<(?!inet))+
.sp
This runs much faster, because sequences of characters that do not contain "<"
are "swallowed" in one item inside the parentheses, and a possessive quantifier
is used to stop any backtracking into the runs of non-"<" characters. This
version also uses a lot less memory because entry to a new set of parentheses
happens only when a "<" character that is not followed by "inet" is encountered
(and we assume this is relatively rare).
.P
This example shows that one way of optimizing performance when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
.
.
.SS "SETTING RESOURCE LIMITS"
.rs
.sp
You can set limits on the amount of processing that takes place when matching,
and on the amount of heap memory that is used. The default values of the limits
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
built, and they can also be set when \fBpcre2_match()\fP or
\fBpcre2_dfa_match()\fP is called. For details of these interfaces, see the
.\" HREF
\fBpcre2build\fP
.\"
documentation and the section entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.P
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by repeatedly matching with different limits.
. .
. .
.SH AUTHOR .SH AUTHOR
@ -177,6 +231,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 02 January 2015 Last updated: 31 March 2017
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
.fi .fi

@ -1,212 +0,0 @@
.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 DISCUSSION OF STACK USAGE"
.rs
.sp
When you call \fBpcre2_match()\fP, it makes use of an internal function called
\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
in order to remember the state of the match so that it can back up and try a
different alternative after a failure. As matching proceeds deeper and deeper
into the tree of possibilities, the recursion depth increases. The
\fBmatch()\fP function is also called in other circumstances, for example,
whenever a parenthesized sub-pattern is entered, and in certain cases of
repetition.
.P
Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
as a* it may be called several times at the same level, after matching
different numbers of a's. Furthermore, in a number of cases where the result of
the recursive call would immediately be passed back as the result of the
current call (a "tail recursion"), the function is just restarted instead.
.P
Each time the internal \fBmatch()\fP function is called recursively, it uses
memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
of the GCC compiler, the stack requirements are greatly increased.
.P
The above comments apply when \fBpcre2_match()\fP is run in its normal
interpretive manner. If the compiled pattern was processed by
\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
options passed to \fBpcre2_match()\fP were not incompatible, the matching
process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
this case, the memory requirements are handled entirely differently. See the
.\" HREF
\fBpcre2jit\fP
.\"
documentation for details.
.P
The \fBpcre2_dfa_match()\fP function operates in a different way to
\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
recursion or subroutine call in the pattern. This includes the processing of
assertion and "once-only" subpatterns, which are handled like subroutine calls.
Normally, these are never very deep, and the limit on the complexity of
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
However, it is possible to write patterns with runaway infinite recursions;
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
limit is applied (see below).
.P
The comments in the next three sections do not apply to
\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
the JIT optimization.
.
.
.SS "Reducing \fBpcre2_match()\fP's stack usage"
.rs
.sp
You can often reduce the amount of recursion, and therefore the
amount of stack used, by modifying the pattern that is being matched. Consider,
for example, this pattern:
.sp
([^<]|<(?!inet))+
.sp
It matches from wherever it starts until it encounters "<inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "<" or a "<" that is not followed by "inet". However, each time a
parenthesis is processed, a recursion occurs, so this formulation uses a stack
frame for each matched character. For a long string, a lot of stack is
required. Consider now this rewritten pattern, which matches exactly the same
strings:
.sp
([^<]++|<(?!inet))+
.sp
This uses very much less stack, because runs of characters that do not contain
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
when a "<" character that is not followed by "inet" is encountered (and we
assume this is relatively rare). A possessive quantifier is used to stop any
backtracking into the runs of non-"<" characters, but that is not related to
stack usage.
.P
This example shows that one way of avoiding stack problems when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
.
.
.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
.rs
.sp
In environments where stack memory is constrained, you might want to compile
PCRE2 to use heap memory instead of stack for remembering back-up points when
\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
of how to do this are given in the
.\" HREF
\fBpcre2build\fP
.\"
documentation. When built in this way, instead of using the stack, PCRE2
gets memory for remembering backup points from the heap. By default, the memory
is obtained by calling the system \fBmalloc()\fP function, but you can arrange
to supply your own memory management function. For details, see the section
entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation. Since the block sizes are always the same, it may be possible to
implement a customized memory handler that is more efficient than the standard
function. The memory blocks obtained for this purpose are retained and re-used
if possible while \fBpcre2_match()\fP is running. They are all freed just
before it exits.
.
.
.SS "Limiting \fBpcre2_match()\fP's stack usage"
.rs
.sp
You can set limits on the number of times the internal \fBmatch()\fP function
is called, both in total and recursively. If a limit is exceeded,
\fBpcre2_match()\fP returns an error code. Setting suitable limits should
prevent it from running out of stack. The default values of the limits are very
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
and they can also be set when \fBpcre2_match()\fP is called. For details of
these interfaces, see the
.\" HREF
\fBpcre2build\fP
.\"
documentation and the section entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.P
As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
around 128000 recursions.
.P
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a a
pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
different limits.
.
.
.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
.rs
.sp
The recursion limit, as described above for \fBpcre2_match()\fP, also applies
to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
recursions in the pattern can lead to runaway stack usage. The non-recursive
match limit is not relevant for DFA matching, and is ignored.
.
.
.SS "Changing stack size in Unix-like systems"
.rs
.sp
In Unix-like environments, there is not often a problem with the stack unless
very long strings are involved, though the default limit on stack size varies
from system to system. Values from 8Mb to 64Mb are common. You can find your
default limit by running the command:
.sp
ulimit -s
.sp
Unfortunately, the effect of running out of stack is often SIGSEGV, though
sometimes a more explicit error message is given. You can normally increase the
limit on stack size by code such as this:
.sp
struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);
rlim.rlim_cur = 100*1024*1024;
setrlimit(RLIMIT_STACK, &rlim);
.sp
This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
do this before calling \fBpcre2_match()\fP.
.
.
.SS "Changing stack size in Mac OS X"
.rs
.sp
Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
is also possible to set a stack size when linking a program. There is a
discussion about stack sizes in Mac OS X at this web site:
.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
.\" </a>
http://developer.apple.com/qa/qa2005/qa1419.html.
.\"
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 23 December 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi