Documentation update

Philip.Hazel 2017-03-31 16:49:33 +00:00
parent a073581116
commit ed9f34b06b
10 changed files with 391 additions and 473 deletions

@ -103,7 +103,6 @@ dist_html_DATA = \
doc/html/pcre2posix.html \
doc/html/pcre2sample.html \
doc/html/pcre2serialize.html \
doc/html/pcre2stack.html \
doc/html/pcre2syntax.html \
doc/html/pcre2test.html \
doc/html/pcre2unicode.html
@ -187,7 +186,6 @@ dist_man_MANS = \
doc/pcre2posix.3 \
doc/pcre2sample.3 \
doc/pcre2serialize.3 \
doc/pcre2stack.3 \
doc/pcre2syntax.3 \
doc/pcre2test.1 \
doc/pcre2unicode.3

@ -68,9 +68,6 @@ first.
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
<td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
<td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>

@ -18,7 +18,8 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
<P>
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
versions 5.10 and above.
version 5.24, but as both Perl and PCRE2 are continually changing, the
information may sometimes be out of date.
</P>
<P>
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
@ -27,17 +28,18 @@ have are given in the
page.
</P>
<P>
2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
do not mean what you might think. For example, (?!a){3} does not assert that
the next three characters are not "a". It just asserts that the next character
is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
just once). Perl allows repeat quantifiers on other assertions such as \b, but
these do not seem to have any use.
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle: PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \b* (but not \b{3}), but these do not seem to have any use.
</P>
<P>
3. Capturing subpatterns that occur inside negative lookahead assertions are
counted, but their entries in the offsets vector are never set. Perl sometimes
(but not always) sets its numerical variables from inside negative assertions.
3. Capturing subpatterns that occur inside negative lookaround assertions are
counted, but their entries in the offsets vector are set only if the assertion
is a condition. Perl has changed its behaviour in this regard from time to
time.
</P>
<P>
4. The following Perl escape sequences are not supported: \l, \u, \L,
@ -50,13 +52,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
</P>
<P>
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
built with Unicode support. The properties that can be tested with \p and \P
are limited to the general category properties such as Lu and Nd, script names
such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
the Cs (surrogate) property, which Perl does not; the Perl documentation says
"Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement the
somewhat messy concept of surrogates."
built with Unicode support (the default). The properties that can be tested
with \p and \P are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any and L&.
PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
documentation says "Because Perl hides the need for the user to understand the
internal representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates."
</P>
<P>
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
@ -75,23 +77,15 @@ The \Q...\E sequence is recognized both inside and outside character classes.
</P>
<P>
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not
available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
feature allows an external function to be called during pattern matching. See
the
constructions. However, PCRE2 provides a "callout" feature, which
allows an external function to be called during pattern matching. See the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
</P>
<P>
8. Subroutine calls (whether recursive or not) are treated as atomic groups.
Atomic recursion is like Python, but unlike Perl. Captured values that are set
outside a subroutine call can be referenced from inside in PCRE2, but not in
Perl. There is a discussion that explains these differences in more detail in
the
<a href="pcre2pattern.html#recursiondifference">section on recursion differences from Perl</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page.
8. Subroutine calls (whether recursive or not) were treated as atomic groups up
to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
into subroutine calls is now supported, as in Perl.
</P>
<P>
9. If any of the backtracking control verbs are used in a subpattern that is
@ -147,14 +141,14 @@ certainly user mistakes.
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \p{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
letters, regardless of case, when case independence is specified.
</P>
<P>
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
of which (such as named parentheses) have been in PCRE2 for some time. This
list is with respect to Perl 5.10:
of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.24:
<br>
<br>
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
@ -220,9 +214,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 18 October 2016
Last updated: 29 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -173,7 +173,7 @@ below for a discussion of JIT stack usage.
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted
are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
when JIT matching is used.
<a name="stackcontrol"></a></P>
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
@ -436,9 +436,9 @@ Cambridge, England.
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
Last updated: 05 June 2016
Last updated: 30 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -44,14 +44,6 @@ integer type, usually defined as size_t. Its maximum value (that is
and unset offsets.
</P>
<P>
Note that when using the traditional matching function, PCRE2 uses recursion to
handle subpatterns and indefinite repetition. This means that the available
stack space may limit the size of a subject string that can be processed by
certain patterns. For a discussion of stack issues, see the
<a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation.
</P>
<P>
All values in repeating quantifiers must be less than 65536.
</P>
<P>
@ -94,9 +86,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 26 October 2016
Last updated: 30 March 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
@ -29,11 +29,11 @@ of them.
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
<P>
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory. However, there is one case
where the memory usage of a compiled pattern can be unexpectedly large. If a
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
a limited maximum, the whole subpattern is repeated in the compiled code. For
example, the pattern
so that most simple patterns do not use much memory for storing the compiled
version. However, there is one case where the memory usage of a compiled
pattern can be unexpectedly large. If a parenthesized subpattern has a
quantifier with a minimum greater than 1 and/or a limited maximum, the whole
subpattern is repeated in the compiled code. For example, the pattern
<pre>
(abc|def){2,4}
</pre>
@ -52,13 +52,13 @@ example, the very simple pattern
<pre>
((ab){1,1000}c){1,3}
</pre>
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
with its default internal pointer size of two bytes, the size limit on a
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
is reached with the above pattern if the outer repetition is increased from 3
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
larger compiled patterns, but it is better to try to rewrite your pattern to
use less memory if you can.
uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
compiled with its default internal pointer size of two bytes, the size limit on
a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
this is reached with the above pattern if the outer repetition is increased
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
handle larger compiled patterns, but it is better to try to rewrite your
pattern to use less memory if you can.
</P>
<P>
One way of reducing the memory usage for such patterns is to make use of
@ -68,25 +68,33 @@ facility. Re-writing the above pattern as
<pre>
((ab)(?2){0,999}c)(?1){0,2}
</pre>
reduces the memory requirements to 18K, and indeed it remains under 20K even
with the outer repetition increased to 100. However, this pattern is not
exactly equivalent, because the "subroutine" calls are treated as
<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
into which there can be no backtracking if there is a subsequent matching
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE2 cannot otherwise handle.
reduces the memory requirements to around 16K, and indeed it remains under 20K
even with the outer repetition increased to 100. However, this kind of pattern
is not always exactly equivalent, because any captures within subroutine calls
are lost when the subroutine completes. If this is not a problem, this kind of
rewriting will allow you to process patterns that PCRE2 cannot otherwise
handle. The matching performance of the two different versions of the pattern
is roughly the same. (This applies from release 10.30 - things were different
in earlier releases.)
</P>
<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
<P>
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some environments the
default process stack is quite small, and if it runs out the result is often
SIGSEGV. Rewriting your pattern can often help. The
<a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation discusses this issue in detail.
From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
uses very little system stack at run time. In earlier releases recursive
function calls could use a great deal of stack, and this could cause problems,
but this usage has been eliminated. Backtracking positions are now explicitly
remembered in memory frames controlled by the code. An initial 10K vector of
frames is allocated on the system stack (enough for about 50 frames for small
patterns), but if this is insufficient, heap memory is used. Rewriting patterns
to be time-efficient, as described below, may also reduce the memory
requirements.
</P>
<P>
In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
function calls, but only for processing atomic groups, lookaround assertions,
and recursion within the pattern. Too much nested recursion may cause stack
issues. The "match depth" parameter can be used to limit the depth of function
recursion in <b>pcre2_dfa_match()</b>.
</P>
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
<P>
@ -175,7 +183,54 @@ appreciable time with strings longer than about 20 characters.
</P>
<P>
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier.
atomic group or a possessive quantifier. This can often reduce memory
requirements as well. As another example, consider this pattern:
<pre>
([^&#60;]|&#60;(?!inet))+
</pre>
It matches from wherever it starts until it encounters "&#60;inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
parenthesis is processed, a backtracking position is passed, so this
formulation uses a memory frame for each matched character. For a long string,
a lot of memory is required. Consider now this rewritten pattern, which matches
exactly the same strings:
<pre>
([^&#60;]++|&#60;(?!inet))+
</pre>
This runs much faster, because sequences of characters that do not contain "&#60;"
are "swallowed" in one item inside the parentheses, and a possessive quantifier
is used to stop any backtracking into the runs of non-"&#60;" characters. This
version also uses a lot less memory because entry to a new set of parentheses
happens only when a "&#60;" character that is not followed by "inet" is encountered
(and we assume this is relatively rare).
</P>
<P>
This example shows that one way of optimizing performance when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
</P>
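The difference between the two formulations can be sketched with a small Python model. This is only an illustration under a stated assumption: that one backtracking frame is remembered per iteration of the outer parentheses. The function names are hypothetical, and the code counts group iterations by hand rather than measuring what PCRE2 actually allocates.

```python
def iterations_one_char_per_group(subject: str, stop: str = "<inet") -> int:
    """Model ([^<]|<(?!inet))+ : each matched character is one group iteration."""
    count = 0
    pos = 0
    # Any character is consumed unless the stop sequence begins here.
    while pos < len(subject) and not subject.startswith(stop, pos):
        pos += 1
        count += 1
    return count


def iterations_run_per_group(subject: str, stop: str = "<inet") -> int:
    """Model ([^<]++|<(?!inet))+ : a whole run of non-"<" is one group iteration."""
    count = 0
    pos = 0
    while pos < len(subject) and not subject.startswith(stop, pos):
        if subject[pos] == "<":
            # A "<" that does not start the stop sequence: one iteration.
            pos += 1
        else:
            # Swallow the entire run of non-"<" characters in one iteration.
            while pos < len(subject) and subject[pos] != "<":
                pos += 1
        count += 1
    return count


subject = "a" * 1000 + "<b>" + "c" * 10 + "<inet>"
print(iterations_one_char_per_group(subject))
print(iterations_run_per_group(subject))
```

Under this model the first formulation needs a number of iterations proportional to the length of the matched text, while the second needs one per run of non-"&#60;" characters plus one per stray "&#60;".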
<br><b>
SETTING RESOURCE LIMITS
</b><br>
<P>
You can set limits on the amount of processing that takes place when matching,
and on the amount of heap memory that is used. The default values of the limits
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
built, and they can also be set when <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation and the section entitled
<a href="pcre2api.html#matchcontext">"The match context"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by repeatedly matching with different limits.
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
@ -188,9 +243,9 @@ Cambridge, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 January 2015
Last updated: 31 March 2017
<br>
Copyright &copy; 1997-2015 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -68,9 +68,6 @@ first.
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
<td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>
<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
<td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>

@ -4097,45 +4097,46 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
This document describes the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with
respect to Perl versions 5.10 and above.
respect to Perl version 5.24, but as both Perl and PCRE2 are continually
changing, the information may sometimes be out of date.
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
it does have are given in the pcre2unicode page.
2. PCRE2 allows repeat quantifiers only on parenthesized assertions,
but they do not mean what you might think. For example, (?!a){3} does
not assert that the next three characters are not "a". It just asserts
that the next character is not "a" three times (in principle: PCRE2
optimizes this to run the assertion just once). Perl allows repeat
quantifiers on other assertions such as \b, but these do not seem to
have any use.
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
tions, but they do not mean what you might think. For example, (?!a){3}
does not assert that the next three characters are not "a". It just
asserts that the next character is not "a" three times (in principle:
PCRE2 optimizes this to run the assertion just once). Perl allows some
repeat quantifiers on other assertions, for example, \b* (but not
\b{3}), but these do not seem to have any use.
3. Capturing subpatterns that occur inside negative lookahead asser-
tions are counted, but their entries in the offsets vector are never
set. Perl sometimes (but not always) sets its numerical variables from
inside negative assertions.
3. Capturing subpatterns that occur inside negative lookaround asser-
tions are counted, but their entries in the offsets vector are set only
if the assertion is a condition. Perl has changed its behaviour in this
regard from time to time.
4. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N when followed by a character name or Unicode value. (\N on
4. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N when followed by a character name or Unicode value. (\N on
its own, matching a non-newline character, is supported.) In fact these
are implemented by Perl's general string-handling and are not part of
its pattern matching engine. If any of these are encountered by PCRE2,
are implemented by Perl's general string-handling and are not part of
its pattern matching engine. If any of these are encountered by PCRE2,
an error is generated by default. However, if the PCRE2_ALT_BSUX option
is set, \U and \u are interpreted as ECMAScript interprets them.
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
is built with Unicode support. The properties that can be tested with
\p and \P are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, and the derived properties Any
and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
not; the Perl documentation says "Because Perl hides the need for the
user to understand the internal representation of Unicode characters,
there is no need to implement the somewhat messy concept of surro-
gates."
is built with Unicode support (the default). The properties that can be
tested with \p and \P are limited to the general category properties
such as Lu and Nd, script names such as Greek or Han, and the derived
properties Any and L&. PCRE2 does support the Cs (surrogate) property,
which Perl does not; the Perl documentation says "Because Perl hides
the need for the user to understand the internal representation of Uni-
code characters, there is no need to implement the somewhat messy con-
cept of surrogates."
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
acters in between are treated as literals. This is slightly different
from Perl in that $ and @ are also handled as literals inside the
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
acters in between are treated as literals. This is slightly different
from Perl in that $ and @ are also handled as literals inside the
quotes. In Perl, they cause variable interpolation (but of course PCRE2
does not have variables). Note the following examples:
@ -4146,22 +4147,17 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
The \Q...\E sequence is recognized both inside and outside character
The \Q...\E sequence is recognized both inside and outside character
classes.
7. Fairly obviously, PCRE2 does not support the (?{code}) and
(??{code}) constructions. However, there is support for recursive pat-
terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
the PCRE2 "callout" feature allows an external function to be called
during pattern matching. See the pcre2callout documentation for
details.
7. Fairly obviously, PCRE2 does not support the (?{code}) and
(??{code}) constructions. However, PCRE2 provides a "callout"
feature, which allows an external function to be called during pattern
matching. See the pcre2callout documentation for details.
8. Subroutine calls (whether recursive or not) are treated as atomic
groups. Atomic recursion is like Python, but unlike Perl. Captured
values that are set outside a subroutine call can be referenced from
inside in PCRE2, but not in Perl. There is a discussion that explains
these differences in more detail in the section on recursion differ-
ences from Perl in the pcre2pattern page.
8. Subroutine calls (whether recursive or not) were treated as atomic
groups up to PCRE2 release 10.23, but from release 10.30 this changed,
and backtracking into subroutine calls is now supported, as in Perl.
9. If any of the backtracking control verbs are used in a subpattern
that is called as a subroutine (whether or not recursively), their
@ -4211,14 +4207,14 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
16. In PCRE2, the upper/lower case character properties Lu and Ll are
not affected when case-independent matching is specified. For example,
\p{Lu} always matches an upper case letter. I think Perl has changed in
this respect; in the release at the time of writing (5.16), \p{Lu} and
this respect; in the release at the time of writing (5.24), \p{Lu} and
\p{Ll} match all letters, regardless of case, when case independence is
specified.
17. PCRE2 provides some extensions to the Perl regular expression
facilities. Perl 5.10 includes new features that are not in earlier
versions of Perl, some of which (such as named parentheses) have been
in PCRE2 for some time. This list is with respect to Perl 5.10:
versions of Perl, some of which (such as named parentheses) were in
PCRE2 for some time before. This list is with respect to Perl 5.24:
(a) Although lookbehind assertions in PCRE2 must match fixed length
strings, each alternative branch of a lookbehind assertion can match a
@ -4271,8 +4267,8 @@ AUTHOR
REVISION
Last updated: 18 October 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 29 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@ -4420,8 +4416,8 @@ RETURN VALUES FROM JIT MATCHING
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
searching a very large pattern tree goes on for too long, as it is in
the same circumstance when JIT is not used, but the details of exactly
what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
code is never returned when JIT matching is used.
what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
is never returned when JIT matching is used.
CONTROLLING THE JIT STACK
@ -4668,8 +4664,8 @@ AUTHOR
REVISION
Last updated: 05 June 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 30 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@ -4706,12 +4702,6 @@ SIZE AND OTHER LIMITATIONS
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
terminated strings and unset offsets.
Note that when using the traditional matching function, PCRE2 uses
recursion to handle subpatterns and indefinite repetition. This means
that the available stack space may limit the size of a subject string
that can be processed by certain patterns. For a discussion of stack
issues, see the pcre2stack documentation.
All values in repeating quantifiers must be less than 65536.
The maximum length of a lookbehind assertion is 65535 characters.
@ -4745,8 +4735,8 @@ AUTHOR
REVISION
Last updated: 26 October 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 30 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@ -8485,11 +8475,12 @@ PCRE2 PERFORMANCE
COMPILED PATTERN MEMORY USAGE
Patterns are compiled by PCRE2 into a reasonably efficient interpretive
code, so that most simple patterns do not use much memory. However,
there is one case where the memory usage of a compiled pattern can be
unexpectedly large. If a parenthesized subpattern has a quantifier with
a minimum greater than 1 and/or a limited maximum, the whole subpattern
is repeated in the compiled code. For example, the pattern
code, so that most simple patterns do not use much memory for storing
the compiled version. However, there is one case where the memory usage
of a compiled pattern can be unexpectedly large. If a parenthesized
subpattern has a quantifier with a minimum greater than 1 and/or a lim-
ited maximum, the whole subpattern is repeated in the compiled code.
For example, the pattern
(abc|def){2,4}
@ -8497,134 +8488,186 @@ COMPILED PATTERN MEMORY USAGE
(abc|def)(abc|def)((abc|def)(abc|def)?)?
(Technical aside: It is done this way so that backtrack points within
(Technical aside: It is done this way so that backtrack points within
each of the repetitions can be independently maintained.)
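As a rough illustration of the expansion described above, the following Python sketch writes out a bounded repeat as explicit copies of the subpattern. The helper names expand_repeat and agree are hypothetical. The code only imitates the textual shape of the expansion using Python's re module; it does not reproduce PCRE2's compiled code or its memory figures.

```python
import re


def expand_repeat(group: str, minimum: int, maximum: int) -> str:
    """Return group{minimum,maximum} written out as explicit copies.

    `group` is assumed to be a parenthesized subpattern (or a single
    character), so that appending "?" applies to the whole of it.
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    optional = ""
    for _ in range(maximum - minimum):
        # The innermost optional copy is just "group?"; each further
        # optional copy wraps everything built so far.
        optional = f"{group}?" if not optional else f"({group}{optional})?"
    return group * minimum + optional


def agree(original: str, expanded: str, subjects: list[str]) -> bool:
    """Check that both patterns accept exactly the same test subjects."""
    return all(
        bool(re.fullmatch(original, s)) == bool(re.fullmatch(expanded, s))
        for s in subjects
    )


expanded = expand_repeat("(abc|def)", 2, 4)
print(expanded)
subjects = ["abc" * n for n in range(7)] + ["abcdef", "defabcdef", "abcx"]
print(agree("(abc|def){2,4}", expanded, subjects))
```

The length of the expanded string grows with the repeat counts, which gives a feel for why large or nested counts make a compiled pattern big.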
For regular expressions whose quantifiers use only small numbers, this
is not usually a problem. However, if the numbers are large, and par-
ticularly if such repetitions are nested, the memory usage can become
For regular expressions whose quantifiers use only small numbers, this
is not usually a problem. However, if the numbers are large, and par-
ticularly if such repetitions are nested, the memory usage can become
an embarrassment. For example, the very simple pattern
((ab){1,1000}c){1,3}
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is
compiled with its default internal pointer size of two bytes, the size
limit on a compiled pattern is 64K code units in the 8-bit and 16-bit
libraries, and this is reached with the above pattern if the outer rep-
etition is increased from 3 to 4. PCRE2 can be compiled to use larger
internal pointers and thus handle larger compiled patterns, but it is
better to try to rewrite your pattern to use less memory if you can.
uses over 50K bytes when compiled using the 8-bit library. When PCRE2
is compiled with its default internal pointer size of two bytes, the
size limit on a compiled pattern is 64K code units in the 8-bit and
16-bit libraries, and this is reached with the above pattern if the
outer repetition is increased from 3 to 4. PCRE2 can be compiled to use
larger internal pointers and thus handle larger compiled patterns, but
it is better to try to rewrite your pattern to use less memory if you
can.
One way of reducing the memory usage for such patterns is to make use
of PCRE2's "subroutine" facility. Re-writing the above pattern as
((ab)(?2){0,999}c)(?1){0,2}
reduces the memory requirements to 18K, and indeed it remains under 20K
even with the outer repetition increased to 100. However, this pattern
is not exactly equivalent, because the "subroutine" calls are treated
as atomic groups into which there can be no backtracking if there is a
subsequent matching failure. Therefore, PCRE2 cannot do this kind of
rewriting automatically. Furthermore, there is a noticeable loss of
speed when executing the modified pattern. Nevertheless, if the atomic
grouping is not a problem and the loss of speed is acceptable, this
kind of rewriting will allow you to process patterns that PCRE2 cannot
otherwise handle.
reduces the memory requirements to around 16K, and indeed it remains
under 20K even with the outer repetition increased to 100. However,
this kind of pattern is not always exactly equivalent, because any cap-
tures within subroutine calls are lost when the subroutine completes.
If this is not a problem, this kind of rewriting will allow you to
process patterns that PCRE2 cannot otherwise handle. The matching
performance of the two different versions of the pattern is roughly the
same. (This applies from release 10.30 - things were different in
earlier releases.)
STACK USAGE AT RUN TIME
STACK AND HEAP USAGE AT RUN TIME
When pcre2_match() is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some environ-
ments the default process stack is quite small, and if it runs out the
result is often SIGSEGV. Rewriting your pattern can often help. The
pcre2stack documentation discusses this issue in detail.
From release 10.30, the interpretive (non-JIT) version of pcre2_match()
uses very little system stack at run time. In earlier releases recur-
sive function calls could use a great deal of stack, and this could
cause problems, but this usage has been eliminated. Backtracking posi-
tions are now explicitly remembered in memory frames controlled by the
code. An initial 10K vector of frames is allocated on the system stack
(enough for about 50 frames for small patterns), but if this is insuf-
ficient, heap memory is used. Rewriting patterns to be time-efficient,
as described below, may also reduce the memory requirements.
In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
function calls, but only for processing atomic groups, lookaround
assertions, and recursion within the pattern. Too much nested recursion
may cause stack issues. The "match depth" parameter can be used to
limit the depth of function recursion in pcre2_dfa_match().
PROCESSING TIME
Certain items in regular expression patterns are processed more effi-
Certain items in regular expression patterns are processed more effi-
ciently than others. It is more efficient to use a character class like
[aeiou] than a set of single-character alternatives such as
(a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book
contains a lot of useful general discussion about optimizing regular
expressions for efficient performance. This document contains a few
observations about PCRE2.
Using Unicode character properties (the \p, \P, and \X escapes) is
slow, because PCRE2 has to use a multi-stage table lookup whenever it
needs a character's property. If you can find an alternative pattern
that does not use character properties, it will probably be faster.
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
character classes such as [:alpha:] do not use Unicode properties,
partly for backwards compatibility, and partly for performance reasons.
However, you can set the PCRE2_UCP option or start the pattern with
(*UCP) if you want Unicode character properties to be used. This can
double the matching time for items such as \d, when matched with
pcre2_match(); the performance loss is less with a DFA matching func-
tion, and in both cases there is not much difference for \b.
When a pattern begins with .* not in atomic parentheses, nor in paren-
theses that are the subject of a backreference, and the PCRE2_DOTALL
option is set, the pattern is implicitly anchored by PCRE2, since it
can match only at the start of a subject string. If the pattern has
multiple top-level branches, they must all be anchorable. The optimiza-
tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
because the dot metacharacter does not then match a newline, and if the
subject string contains newlines, the pattern may match from the char-
acter immediately following one of them instead of from the very start.
For example, the pattern
.*second
matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order
to do this, PCRE2 has to retry the match starting after every newline
in the subject.
If you are using such a pattern with subject strings that do not con-
tain newlines, the best performance is obtained by setting
PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
explicit anchoring. That saves PCRE2 from having to scan along the sub-
ject looking for a newline to restart at.
Beware of patterns that contain nested indefinite repeats. These can
take a long time to run when applied to a string that does not match.
Consider the pattern fragment
^(a+)*
This can match "aaaa" in 16 different ways, and this number increases
very rapidly as the string gets longer. (The * repeat can match 0, 1,
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
repeats can match different numbers of times.) When the remainder of
the pattern is such that the entire match is going to fail, PCRE2 has
in principle to try every possible variation, and this can take an
extremely long time, even for relatively short strings.
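The figure of 16 can be checked with a small sketch. The C function below is only an illustrative model, not PCRE2 code, and the name count_ways is hypothetical. At each position the * repeat may stop, or the + repeat may take between 1 and all of the remaining characters before the group is tried again.

```c
#include <assert.h>
#include <stdint.h>

/* Count the distinct ways ^(a+)* can match at the start of a run of
 * n "a" characters (illustrative model only, not PCRE2 internals). */
uint64_t count_ways(unsigned n)
{
    assert(n < 64);            /* the result must fit in 64 bits */
    uint64_t ways = 1;         /* the * repeat stops at this point */
    for (unsigned i = 1; i <= n; i++)
        ways += count_ways(n - i);   /* a+ takes i characters, then repeat */
    return ways;
}
```

In this model count_ways(4) is 16, and a run of n characters gives 2^n ways, which is why the time taken to fail grows so quickly with the length of the subject.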
An optimization catches some of the more simple cases such as
(a+)*b
where a literal character follows. Before embarking on the standard
matching procedure, PCRE2 checks that there is a "b" later in the sub-
ject string, and if there is not, it fails the match immediately. How-
ever, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of
(a+)*\d
with the pattern above. The former gives a failure almost instantly
when applied to a whole line of "a" characters, whereas the latter
takes an appreciable time with strings longer than about 20 characters.
In many cases, the solution to this kind of performance issue is to use
an atomic group or a possessive quantifier.
an atomic group or a possessive quantifier. This can often reduce mem-
ory requirements as well. As another example, consider this pattern:
([^<]|<(?!inet))+
It matches from wherever it starts until it encounters "<inet" or the
end of the data, and is the kind of pattern that might be used when
processing an XML file. Each iteration of the outer parentheses matches
either one character that is not "<" or a "<" that is not followed by
"inet". However, each time a parenthesis is processed, a backtracking
position is passed, so this formulation uses a memory frame for each
matched character. For a long string, a lot of memory is required. Con-
sider now this rewritten pattern, which matches exactly the same
strings:
([^<]++|<(?!inet))+
This runs much faster, because sequences of characters that do not con-
tain "<" are "swallowed" in one item inside the parentheses, and a pos-
sessive quantifier is used to stop any backtracking into the runs of
non-"<" characters. This version also uses a lot less memory because
entry to a new set of parentheses happens only when a "<" character
that is not followed by "inet" is encountered (and we assume this is
relatively rare).
This example shows that one way of optimizing performance when matching
long subject strings is to write repeated parenthesized subpatterns to
match more than one character whenever possible.
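As a rough illustration of the difference, the following C sketch counts how many times the outer parentheses would be entered for each formulation. It is a simplified model that assumes one backtracking frame per iteration of the outer group, as described above. The function names are hypothetical and no PCRE2 functions are called.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Iterations of the outer group for ([^<]|<(?!inet))+ :
 * one per matched character. */
size_t iterations_per_char(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '<' && strncmp(s + i + 1, "inet", 4) == 0)
            break;                      /* "<inet" ends the match */
        n++;
    }
    return n;
}

/* Iterations of the outer group for ([^<]++|<(?!inet))+ :
 * one per run of non-"<" characters, and one per "<" that is
 * not followed by "inet". */
size_t iterations_per_run(const char *s)
{
    assert(s != NULL);
    size_t n = 0, i = 0;
    while (s[i] != '\0') {
        if (s[i] == '<') {
            if (strncmp(s + i + 1, "inet", 4) == 0)
                break;                  /* "<inet" ends the match */
            i++;                        /* a lone "<" is one iteration */
        } else {
            i += strcspn(s + i, "<");   /* swallow the whole run at once */
        }
        n++;
    }
    return n;
}
```

For the subject "<a>text</a><inet>" this model gives 11 iterations for the first pattern and 4 for the second. The gap widens as the runs of non-"<" characters get longer.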
SETTING RESOURCE LIMITS
You can set limits on the amount of processing that takes place when
matching, and on the amount of heap memory that is used. The default
values of the limits are very large, and unlikely ever to operate. They
can be changed when PCRE2 is built, and they can also be set when
pcre2_match() or pcre2_dfa_match() is called. For details of these
interfaces, see the pcre2build documentation and the section entitled
"The match context" in the pcre2api documentation.
The pcre2test test program has a modifier called "find_limits" which,
if applied to a subject line, causes it to find the smallest limits
that allow a pattern to match. This is done by repeatedly matching with
different limits.
AUTHOR
@ -8636,8 +8679,8 @@ AUTHOR
REVISION
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 31 March 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
.TH PCRE2PERFORM 3 "31 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE"
@ -12,11 +12,11 @@ of them.
.rs
.sp
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory. However, there is one case
where the memory usage of a compiled pattern can be unexpectedly large. If a
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
a limited maximum, the whole subpattern is repeated in the compiled code. For
example, the pattern
so that most simple patterns do not use much memory for storing the compiled
version. However, there is one case where the memory usage of a compiled
pattern can be unexpectedly large. If a parenthesized subpattern has a
quantifier with a minimum greater than 1 and/or a limited maximum, the whole
subpattern is repeated in the compiled code. For example, the pattern
.sp
(abc|def){2,4}
.sp
@ -34,13 +34,13 @@ example, the very simple pattern
.sp
((ab){1,1000}c){1,3}
.sp
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
with its default internal pointer size of two bytes, the size limit on a
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
is reached with the above pattern if the outer repetition is increased from 3
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
larger compiled patterns, but it is better to try to rewrite your pattern to
use less memory if you can.
uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
compiled with its default internal pointer size of two bytes, the size limit on
a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
this is reached with the above pattern if the outer repetition is increased
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
handle larger compiled patterns, but it is better to try to rewrite your
pattern to use less memory if you can.
.P
One way of reducing the memory usage for such patterns is to make use of
PCRE2's
@ -52,32 +52,34 @@ facility. Re-writing the above pattern as
.sp
((ab)(?2){0,999}c)(?1){0,2}
.sp
reduces the memory requirements to 18K, and indeed it remains under 20K even
with the outer repetition increased to 100. However, this pattern is not
exactly equivalent, because the "subroutine" calls are treated as
.\" HTML <a href="pcre2pattern.html#atomicgroup">
.\" </a>
atomic groups
.\"
into which there can be no backtracking if there is a subsequent matching
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE2 cannot otherwise handle.
reduces the memory requirements to around 16K, and indeed it remains under 20K
even with the outer repetition increased to 100. However, this kind of pattern
is not always exactly equivalent, because any captures within subroutine calls
are lost when the subroutine completes. If this is not a problem, this kind of
rewriting will allow you to process patterns that PCRE2 cannot otherwise
handle. The matching performance of the two different versions of the pattern
is roughly the same. (This applies from release 10.30 - things were different
in earlier releases.)
.
.
.SH "STACK USAGE AT RUN TIME"
.SH "STACK AND HEAP USAGE AT RUN TIME"
.rs
.sp
When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some environments the
default process stack is quite small, and if it runs out the result is often
SIGSEGV. Rewriting your pattern can often help. The
.\" HREF
\fBpcre2stack\fP
.\"
documentation discusses this issue in detail.
From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
uses very little system stack at run time. In earlier releases recursive
function calls could use a great deal of stack, and this could cause problems,
but this usage has been eliminated. Backtracking positions are now explicitly
remembered in memory frames controlled by the code. An initial 10K vector of
frames is allocated on the system stack (enough for about 50 frames for small
patterns), but if this is insufficient, heap memory is used. Rewriting patterns
to be time-efficient, as described below, may also reduce the memory
requirements.
.P
In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
function calls, but only for processing atomic groups, lookaround assertions,
and recursion within the pattern. Too much nested recursion may cause stack
issues. The "match depth" parameter can be used to limit the depth of function
recursion in \fBpcre2_dfa_match()\fP.
.
.
.SH "PROCESSING TIME"
@ -160,7 +162,59 @@ applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters.
.P
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier.
atomic group or a possessive quantifier. This can often reduce memory
requirements as well. As another example, consider this pattern:
.sp
([^<]|<(?!inet))+
.sp
It matches from wherever it starts until it encounters "<inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "<" or a "<" that is not followed by "inet". However, each time a
parenthesis is processed, a backtracking position is passed, so this
formulation uses a memory frame for each matched character. For a long string,
a lot of memory is required. Consider now this rewritten pattern, which matches
exactly the same strings:
.sp
([^<]++|<(?!inet))+
.sp
This runs much faster, because sequences of characters that do not contain "<"
are "swallowed" in one item inside the parentheses, and a possessive quantifier
is used to stop any backtracking into the runs of non-"<" characters. This
version also uses a lot less memory because entry to a new set of parentheses
happens only when a "<" character that is not followed by "inet" is encountered
(and we assume this is relatively rare).
.P
This example shows that one way of optimizing performance when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
.
.
.SH "SETTING RESOURCE LIMITS"
.rs
.sp
You can set limits on the amount of processing that takes place when matching,
and on the amount of heap memory that is used. The default values of the limits
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
built, and they can also be set when \fBpcre2_match()\fP or
\fBpcre2_dfa_match()\fP is called. For details of these interfaces, see the
.\" HREF
\fBpcre2build\fP
.\"
documentation and the section entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.P
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by repeatedly matching with different limits.
.
.
.SH AUTHOR
@ -177,6 +231,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 31 March 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi

@ -1,212 +0,0 @@
.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 DISCUSSION OF STACK USAGE"
.rs
.sp
When you call \fBpcre2_match()\fP, it makes use of an internal function called
\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
in order to remember the state of the match so that it can back up and try a
different alternative after a failure. As matching proceeds deeper and deeper
into the tree of possibilities, the recursion depth increases. The
\fBmatch()\fP function is also called in other circumstances, for example,
whenever a parenthesized sub-pattern is entered, and in certain cases of
repetition.
.P
Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
as a* it may be called several times at the same level, after matching
different numbers of a's. Furthermore, in a number of cases where the result of
the recursive call would immediately be passed back as the result of the
current call (a "tail recursion"), the function is just restarted instead.
.P
Each time the internal \fBmatch()\fP function is called recursively, it uses
memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
of the GCC compiler, the stack requirements are greatly increased.
.P
The above comments apply when \fBpcre2_match()\fP is run in its normal
interpretive manner. If the compiled pattern was processed by
\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
options passed to \fBpcre2_match()\fP were not incompatible, the matching
process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
this case, the memory requirements are handled entirely differently. See the
.\" HREF
\fBpcre2jit\fP
.\"
documentation for details.
.P
The \fBpcre2_dfa_match()\fP function operates in a different way to
\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
recursion or subroutine call in the pattern. This includes the processing of
assertion and "once-only" subpatterns, which are handled like subroutine calls.
Normally, these are never very deep, and the limit on the complexity of
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
However, it is possible to write patterns with runaway infinite recursions;
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
limit is applied (see below).
.P
The comments in the next three sections do not apply to
\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
the JIT optimization.
.
.
.SS "Reducing \fBpcre2_match()\fP's stack usage"
.rs
.sp
You can often reduce the amount of recursion, and therefore the
amount of stack used, by modifying the pattern that is being matched. Consider,
for example, this pattern:
.sp
([^<]|<(?!inet))+
.sp
It matches from wherever it starts until it encounters "<inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "<" or a "<" that is not followed by "inet". However, each time a
parenthesis is processed, a recursion occurs, so this formulation uses a stack
frame for each matched character. For a long string, a lot of stack is
required. Consider now this rewritten pattern, which matches exactly the same
strings:
.sp
([^<]++|<(?!inet))+
.sp
This uses very much less stack, because runs of characters that do not contain
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
when a "<" character that is not followed by "inet" is encountered (and we
assume this is relatively rare). A possessive quantifier is used to stop any
backtracking into the runs of non-"<" characters, but that is not related to
stack usage.
.P
This example shows that one way of avoiding stack problems when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
.
.
.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
.rs
.sp
In environments where stack memory is constrained, you might want to compile
PCRE2 to use heap memory instead of stack for remembering back-up points when
\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
of how to do this are given in the
.\" HREF
\fBpcre2build\fP
.\"
documentation. When built in this way, instead of using the stack, PCRE2
gets memory for remembering backup points from the heap. By default, the memory
is obtained by calling the system \fBmalloc()\fP function, but you can arrange
to supply your own memory management function. For details, see the section
entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation. Since the block sizes are always the same, it may be possible to
implement a customized memory handler that is more efficient than the standard
function. The memory blocks obtained for this purpose are retained and re-used
if possible while \fBpcre2_match()\fP is running. They are all freed just
before it exits.
.
.
.SS "Limiting \fBpcre2_match()\fP's stack usage"
.rs
.sp
You can set limits on the number of times the internal \fBmatch()\fP function
is called, both in total and recursively. If a limit is exceeded,
\fBpcre2_match()\fP returns an error code. Setting suitable limits should
prevent it from running out of stack. The default values of the limits are very
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
and they can also be set when \fBpcre2_match()\fP is called. For details of
these interfaces, see the
.\" HREF
\fBpcre2build\fP
.\"
documentation and the section entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.P
As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8MB, you should set
the limit at 16000 recursions. A 64MB stack, on the other hand, can support
around 128000 recursions.
.P
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
different limits.
.
.
.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
.rs
.sp
The recursion limit, as described above for \fBpcre2_match()\fP, also applies
to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
recursions in the pattern can lead to runaway stack usage. The non-recursive
match limit is not relevant for DFA matching, and is ignored.
.
.
.SS "Changing stack size in Unix-like systems"
.rs
.sp
In Unix-like environments, there is not often a problem with the stack unless
very long strings are involved, though the default limit on stack size varies
from system to system. Values from 8MB to 64MB are common. You can find your
default limit by running the command:
.sp
ulimit -s
.sp
Unfortunately, the effect of running out of stack is often SIGSEGV, though
sometimes a more explicit error message is given. You can normally increase the
limit on stack size by code such as this:
.sp
struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);
rlim.rlim_cur = 100*1024*1024;
setrlimit(RLIMIT_STACK, &rlim);
.sp
This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
attempts to increase the soft limit to 100MB using \fBsetrlimit()\fP. You must
do this before calling \fBpcre2_match()\fP.
.
.
.SS "Changing stack size in Mac OS X"
.rs
.sp
Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
is also possible to set a stack size when linking a program. There is a
discussion about stack sizes in Mac OS X at this web site:
.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
.\" </a>
http://developer.apple.com/qa/qa2005/qa1419.html.
.\"
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 23 December 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi