Re-factor pcre2_dfa_match() to use the heap instead of the stack for workspace vectors when doing recursive function calls.
Philip.Hazel 2018-04-27 16:48:35 +00:00
parent fb413521fc
commit 75747ebb11
28 changed files with 1221 additions and 871 deletions


@ -50,7 +50,15 @@ offset is set zero for early errors.
(c) Support for non-C99 snprintf() that returns -1 in the overflow case. (c) Support for non-C99 snprintf() that returns -1 in the overflow case.
11. Minor tidy of pcre2_dfa_matgch() code. 11. Minor tidy of pcre2_dfa_match() code.
12. Refactored pcre2_dfa_match() so that the internal recursive calls no longer
use the stack for local workspace and local ovectors. Instead, an initial block
of stack is reserved, but if this is insufficient, heap memory is used. The
heap limit parameter now applies to pcre2_dfa_match().
13. If a "find limits" test of DFA matching in pcre2test resulted in too many
matches for the ovector, no matches were displayed.
Version 10.31 12-February-2018 Version 10.31 12-February-2018
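
ChangeLog item 12 above describes a "reserved stack block first, heap only if needed, subject to the heap limit" scheme. The following is a minimal sketch of that general idea, not the code in pcre2_dfa_match(): the names workspace_pool, pool_init, pool_get, pool_release and the size INITIAL_INTS are all hypothetical, chosen only for illustration. The one detail taken from the text is that the limit is expressed in kilobytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define INITIAL_INTS 64            /* hypothetical size of the reserved block */

typedef struct {
    int    initial[INITIAL_INTS];  /* lives wherever the pool itself lives,
                                      typically the caller's stack frame */
    size_t used;                   /* ints handed out from initial[] */
    size_t heap_bytes;             /* bytes currently taken from the heap */
    size_t heap_limit_kib;         /* heap limit, in kilobytes */
} workspace_pool;

void pool_init(workspace_pool *p, size_t heap_limit_kib)
{
    assert(p != NULL);
    p->used = 0;
    p->heap_bytes = 0;
    p->heap_limit_kib = heap_limit_kib;
}

/* Hand out a vector of n ints. The reserved block is tried first; the heap
   is used only when the block is too small. Returns NULL if the heap limit
   would be exceeded or malloc() fails. *from_heap is set only on success. */
int *pool_get(workspace_pool *p, size_t n, int *from_heap)
{
    assert(p != NULL && from_heap != NULL);

    if (n <= INITIAL_INTS - p->used) {
        int *v = p->initial + p->used;
        p->used += n;
        *from_heap = 0;
        return v;
    }

    if (n > SIZE_MAX / sizeof(int)) return NULL;      /* size would overflow */
    size_t bytes = n * sizeof(int);
    size_t limit_bytes = (p->heap_limit_kib > SIZE_MAX / 1024)
        ? SIZE_MAX : p->heap_limit_kib * 1024;
    if (bytes > limit_bytes - p->heap_bytes) return NULL;  /* over the limit */

    int *v = malloc(bytes);
    if (v == NULL) return NULL;
    p->heap_bytes += bytes;
    *from_heap = 1;
    return v;
}

/* Give a vector back. Vectors from the reserved block must be released in
   reverse order of allocation, as happens when nested calls return. */
void pool_release(workspace_pool *p, int *v, size_t n, int from_heap)
{
    assert(p != NULL);
    if (from_heap) {
        free(v);
        p->heap_bytes -= n * sizeof(int);
    } else {
        p->used -= n;
    }
}
```

With a limit of zero this sketch never touches the heap, so any request that does not fit in the reserved block simply fails.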

12
README

@ -241,9 +241,11 @@ library. They are also documented in the pcre2build man page.
discussion in the pcre2api man page (search for pcre2_set_match_limit). discussion in the pcre2api man page (search for pcre2_set_match_limit).
. There is a separate counter that limits the depth of nested backtracking . There is a separate counter that limits the depth of nested backtracking
during a matching process, which indirectly limits the amount of heap memory (pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
that is used. This also has a default of ten million, which is essentially matching process, which indirectly limits the amount of heap memory that is
"unlimited". You can change the default by setting, for example, used, and in the case of pcre2_dfa_match() the amount of stack as well. This
counter also has a default of ten million, which is essentially "unlimited".
You can change the default by setting, for example,
--with-match-limit-depth=5000 --with-match-limit-depth=5000
@ -251,7 +253,7 @@ library. They are also documented in the pcre2build man page.
pcre2_set_depth_limit). pcre2_set_depth_limit).
. You can also set an explicit limit on the amount of heap memory used by . You can also set an explicit limit on the amount of heap memory used by
the pcre2_match() interpreter: the pcre2_match() and pcre2_dfa_match() interpreters:
--with-heap-limit=500 --with-heap-limit=500
@ -885,4 +887,4 @@ The distribution should contain the files listed below.
Philip Hazel Philip Hazel
Email local part: ph10 Email local part: ph10
Email domain: cam.ac.uk Email domain: cam.ac.uk
Last updated: 25 February 2018 Last updated: 27 April 2018


@ -142,7 +142,7 @@ AC_ARG_ENABLE(jit,
AS_HELP_STRING([--enable-jit], AS_HELP_STRING([--enable-jit],
[enable Just-In-Time compiling support]), [enable Just-In-Time compiling support]),
, enable_jit=no) , enable_jit=no)
# This code enables JIT if the hardware supports it. # This code enables JIT if the hardware supports it.
if test "$enable_jit" = "auto"; then if test "$enable_jit" = "auto"; then
AC_LANG(C) AC_LANG(C)
@ -718,10 +718,11 @@ AC_DEFINE_UNQUOTED([PARENS_NEST_LIMIT], [$with_parens_nest_limit], [
AC_DEFINE_UNQUOTED([MATCH_LIMIT], [$with_match_limit], [ AC_DEFINE_UNQUOTED([MATCH_LIMIT], [$with_match_limit], [
The value of MATCH_LIMIT determines the default number of times the The value of MATCH_LIMIT determines the default number of times the
pcre2_match() function can record a backtrack position during a single pcre2_match() function can record a backtrack position during a single
matching attempt. There is a runtime interface for setting a different limit. matching attempt. The value is also used to limit a loop counter in
The limit exists in order to catch runaway regular expressions that take for pcre2_dfa_match(). There is a runtime interface for setting a different
ever to determine that they do not match. The default is set very large so limit. The limit exists in order to catch runaway regular expressions that
that it does not accidentally catch legitimate cases.]) take for ever to determine that they do not match. The default is set very
large so that it does not accidentally catch legitimate cases.])
# --with-match-limit-recursion is an obsolete synonym for --with-match-limit-depth # --with-match-limit-recursion is an obsolete synonym for --with-match-limit-depth
@ -745,11 +746,15 @@ AC_DEFINE_UNQUOTED([MATCH_LIMIT_DEPTH], [$with_match_limit_depth], [
the maximum amount of heap memory that is used. The value of the maximum amount of heap memory that is used. The value of
MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it must MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it must
be less than the value of MATCH_LIMIT. The default is to use the same value be less than the value of MATCH_LIMIT. The default is to use the same value
as MATCH_LIMIT. There is a runtime method for setting a different limit.]) as MATCH_LIMIT. There is a runtime method for setting a different limit. In
the case of pcre2_dfa_match(), this limit controls the depth of the internal
nested function calls that are used for pattern recursions, lookarounds, and
atomic groups.])
AC_DEFINE_UNQUOTED([HEAP_LIMIT], [$with_heap_limit], [ AC_DEFINE_UNQUOTED([HEAP_LIMIT], [$with_heap_limit], [
This limits the amount of memory that pcre2_match() may use while matching This limits the amount of memory that may be used while matching
a pattern. The value is in kilobytes.]) a pattern. It applies to both pcre2_match() and pcre2_dfa_match(). It does
not apply to JIT matching. The value is in kilobytes.])
AC_DEFINE([MAX_NAME_SIZE], [32], [ AC_DEFINE([MAX_NAME_SIZE], [32], [
This limit is parameterized just in case anybody ever wants to This limit is parameterized just in case anybody ever wants to


@ -10,6 +10,7 @@ This document contains the following sections:
Calling conventions in Windows environments Calling conventions in Windows environments
Comments about Win32 builds Comments about Win32 builds
Building PCRE2 on Windows with CMake Building PCRE2 on Windows with CMake
Building PCRE2 on Windows with Visual Studio
Testing with RunTest.bat Testing with RunTest.bat
Building PCRE2 on native z/OS and z/VM Building PCRE2 on native z/OS and z/VM
@ -328,6 +329,18 @@ cache can be deleted by selecting "File > Delete Cache".
most recent build configuration is targeted by the tests. A summary of most recent build configuration is targeted by the tests. A summary of
test results is presented. Complete test output is subsequently test results is presented. Complete test output is subsequently
available for review in Testing\Temporary under your build dir. available for review in Testing\Temporary under your build dir.
BUILDING PCRE2 ON WINDOWS WITH VISUAL STUDIO
The code currently cannot be compiled without a stdint.h header, which is
available only in relatively recent versions of Visual Studio. However, this
portable and permissively-licensed implementation of the header worked without
issue:
http://www.azillionmonkeys.com/qed/pstdint.h
Just rename it and drop it into the top level of the build tree.
TESTING WITH RUNTEST.BAT TESTING WITH RUNTEST.BAT
@ -382,6 +395,6 @@ Everything in that location, source and executable, is in EBCDIC and native
z/OS file formats. The port provides an API for LE languages such as COBOL and z/OS file formats. The port provides an API for LE languages such as COBOL and
for the z/OS and z/VM versions of the Rexx languages. for the z/OS and z/VM versions of the Rexx languages.
=============================== ===========================
Last Updated: 13 September 2017 Last Updated: 19 April 2018
=============================== ===========================


@ -241,9 +241,11 @@ library. They are also documented in the pcre2build man page.
discussion in the pcre2api man page (search for pcre2_set_match_limit). discussion in the pcre2api man page (search for pcre2_set_match_limit).
. There is a separate counter that limits the depth of nested backtracking . There is a separate counter that limits the depth of nested backtracking
during a matching process, which indirectly limits the amount of heap memory (pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
that is used. This also has a default of ten million, which is essentially matching process, which indirectly limits the amount of heap memory that is
"unlimited". You can change the default by setting, for example, used, and in the case of pcre2_dfa_match() the amount of stack as well. This
counter also has a default of ten million, which is essentially "unlimited".
You can change the default by setting, for example,
--with-match-limit-depth=5000 --with-match-limit-depth=5000
@ -251,7 +253,7 @@ library. They are also documented in the pcre2build man page.
pcre2_set_depth_limit). pcre2_set_depth_limit).
. You can also set an explicit limit on the amount of heap memory used by . You can also set an explicit limit on the amount of heap memory used by
the pcre2_match() interpreter: the pcre2_match() and pcre2_dfa_match() interpreters:
--with-heap-limit=500 --with-heap-limit=500
@ -885,4 +887,4 @@ The distribution should contain the files listed below.
Philip Hazel Philip Hazel
Email local part: ph10 Email local part: ph10
Email domain: cam.ac.uk Email domain: cam.ac.uk
Last updated: 25 February 2018 Last updated: 27 April 2018


@ -46,9 +46,9 @@ just once (except when processing lookaround assertions). This function is
<i>wscount</i> Number of elements in the vector <i>wscount</i> Number of elements in the vector
</pre> </pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
up a callout function or specify the match and/or the recursion depth limits. up a callout function or specify the heap limit or the match or the recursion
The <i>length</i> and <i>startoffset</i> values are code units, not characters. depth limits. The <i>length</i> and <i>startoffset</i> values are code units, not
The options are: characters. The options are:
<pre> <pre>
PCRE2_ANCHORED Match only at the first position PCRE2_ANCHORED Match only at the first position
PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_ENDANCHORED Pattern can match only at end of subject


@ -951,14 +951,15 @@ offset limit. In other words, whichever limit comes first is used.
<br> <br>
The <i>heap_limit</i> parameter specifies, in units of kilobytes, the maximum The <i>heap_limit</i> parameter specifies, in units of kilobytes, the maximum
amount of heap memory that <b>pcre2_match()</b> may use to hold backtracking amount of heap memory that <b>pcre2_match()</b> may use to hold backtracking
information when running an interpretive match. This limit does not apply to information when running an interpretive match. This limit also applies to
matching with the JIT optimization, which has its own memory control <b>pcre2_dfa_match()</b>, which may use the heap when processing patterns with a
arrangements (see the lot of nested pattern recursion or lookarounds or atomic groups. This limit
does not apply to matching with the JIT optimization, which has its own memory
control arrangements (see the
<a href="pcre2jit.html"><b>pcre2jit</b></a> <a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details), nor does it apply to <b>pcre2_dfa_match()</b>. documentation for more details). If the limit is reached, the negative error
If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
returned. The default limit is set when PCRE2 is built; the default default is built; the default default is very large and is essentially "unlimited".
very large and is essentially "unlimited".
</P> </P>
<P> <P>
A value for the heap limit may also be supplied by an item at the start of a A value for the heap limit may also be supplied by an item at the start of a
@ -978,6 +979,12 @@ Heap memory is used only if the initial vector is too small. If the heap limit
is set to a value less than 21 (in particular, zero) no heap memory will be is set to a value less than 21 (in particular, zero) no heap memory will be
used. In this case, only patterns that do not have a lot of nested backtracking used. In this case, only patterns that do not have a lot of nested backtracking
can be successfully processed. can be successfully processed.
</P>
<P>
Similarly, for <b>pcre2_dfa_match()</b>, a vector on the system stack is used
when processing pattern recursions, lookarounds, or atomic groups, and only if
this is not big enough is heap memory used. In this case, too, setting a value
of zero disables the use of the heap.
<br> <br>
<br> <br>
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b> <b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
@ -1035,11 +1042,22 @@ backtracking.
<P> <P>
The depth limit is not relevant, and is ignored, when matching is done using The depth limit is not relevant, and is ignored, when matching is done using
JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which
uses it to limit the depth of internal recursive function calls that implement uses it to limit the depth of nested internal recursive function calls that
atomic groups, lookaround assertions, and pattern recursions. This is, implement atomic groups, lookaround assertions, and pattern recursions. This
therefore, an indirect limit on the amount of system stack that is used. A limits, indirectly, the amount of system stack this is used. It was more useful
recursive pattern such as /(.)(?1)/, when matched to a very long string using in versions before 10.32, when stack memory was used for local workspace
<b>pcre2_dfa_match()</b>, can use a great deal of stack. vectors for recursive function calls. From version 10.32, only local variables
are allocated on the stack and as each call uses only a few hundred bytes, even
a small stack can support quite a lot of recursion.
</P>
<P>
If the depth of internal recursive function calls is great enough, local
workspace vectors are allocated on the heap from version 10.32 onwards, so the
depth limit also indirectly limits the amount of heap memory that is used. A
recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string
using <b>pcre2_dfa_match()</b>, can use a great deal of memory. However, it is
probably better to limit heap usage directly by calling
<b>pcre2_set_heap_limit()</b>.
</P> </P>
<P> <P>
The default value for the depth limit can be set when PCRE2 is built; the The default value for the depth limit can be set when PCRE2 is built; the
@ -1096,15 +1114,16 @@ and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
PCRE2_CONFIG_DEPTHLIMIT PCRE2_CONFIG_DEPTHLIMIT
</pre> </pre>
The output is a uint32_t integer that gives the default limit for the depth of The output is a uint32_t integer that gives the default limit for the depth of
nested backtracking in <b>pcre2_match()</b> or the depth of nested recursions nested backtracking in <b>pcre2_match()</b> or the depth of nested recursions,
and lookarounds in <b>pcre2_dfa_match()</b>. Further details are given with lookarounds, and atomic groups in <b>pcre2_dfa_match()</b>. Further details are
<b>pcre2_set_depth_limit()</b> above. given with <b>pcre2_set_depth_limit()</b> above.
<pre> <pre>
PCRE2_CONFIG_HEAPLIMIT PCRE2_CONFIG_HEAPLIMIT
</pre> </pre>
The output is a uint32_t integer that gives, in kilobytes, the default limit The output is a uint32_t integer that gives, in kilobytes, the default limit
for the amount of heap memory used by <b>pcre2_match()</b>. Further details are for the amount of heap memory used by <b>pcre2_match()</b> or
given with <b>pcre2_set_heap_limit()</b> above. <b>pcre2_dfa_match()</b>. Further details are given with
<b>pcre2_set_heap_limit()</b> above.
<pre> <pre>
PCRE2_CONFIG_JIT PCRE2_CONFIG_JIT
</pre> </pre>
@ -3510,17 +3529,7 @@ capture.
Calls to the convenience functions that extract substrings by name Calls to the convenience functions that extract substrings by name
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
DFA match. The convenience functions that extract substrings by number never DFA match. The convenience functions that extract substrings by number never
return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are return PCRE2_ERROR_NOSUBSTRING.
slightly different:
<pre>
PCRE2_ERROR_UNAVAILABLE
</pre>
The ovector is not big enough to include a slot for the given substring number.
<pre>
PCRE2_ERROR_UNSET
</pre>
There is a slot in the ovector for this substring, but there were insufficient
matches to fill it.
</P> </P>
<P> <P>
The matched strings are stored in the ovector in reverse order of length; that The matched strings are stored in the ovector in reverse order of length; that
@ -3594,9 +3603,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 31 December 2017 Last updated: 27 April 2018
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -295,9 +295,10 @@ change this by a setting such as
--with-heap-limit=500 --with-heap-limit=500
</pre> </pre>
which limits the amount of heap to 500 kilobytes. This limit applies only to which limits the amount of heap to 500 kilobytes. This limit applies only to
interpretive matching in pcre2_match(). It does not apply when JIT (which has interpretive matching in <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, which
its own memory arrangements) is used, nor does it apply to may also use the heap for internal workspace when processing complicated
<b>pcre2_dfa_match()</b>. patterns. This limit does not apply when JIT (which has its own memory
arrangements) is used.
</P> </P>
<P> <P>
You can also explicitly limit the depth of nested backtracking in the You can also explicitly limit the depth of nested backtracking in the
@ -573,7 +574,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC25" href="#TOC1">REVISION</a><br> <br><a name="SEC25" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 25 February 2018 Last updated: 26 April 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>


@ -310,10 +310,12 @@ PCRE2_UNSET.
</P> </P>
<P> <P>
For DFA matching, the <i>offset_vector</i> field points to the ovector that was For DFA matching, the <i>offset_vector</i> field points to the ovector that was
passed to the matching function in the match data block, but it holds no useful passed to the matching function in the match data block for callouts at the top
information at callout time because <b>pcre2_dfa_match()</b> does not support level, but to an internal ovector during the processing of pattern recursions,
substring capturing. The value of <i>capture_top</i> is always 1 and the value lookarounds, and atomic groups. However, these ovectors hold no useful
of <i>capture_last</i> is always 0 for DFA matching. information because <b>pcre2_dfa_match()</b> does not support substring
capturing. The value of <i>capture_top</i> is always 1 and the value of
<i>capture_last</i> is always 0 for DFA matching.
</P> </P>
<P> <P>
The <i>subject</i> and <i>subject_length</i> fields contain copies of the values The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
@ -461,9 +463,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br> <br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 22 December 2017 Last updated: 26 April 2018
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -173,12 +173,12 @@ the application to apply the JIT optimization by calling
Setting match resource limits Setting match resource limits
</b><br> </b><br>
<P> <P>
The pcre2_match() function contains a counter that is incremented every time it The <b>pcre2_match()</b> function contains a counter that is incremented every
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on time it goes round its main loop. The caller of <b>pcre2_match()</b> can set a
this counter, which therefore limits the amount of computing resource used for limit on this counter, which therefore limits the amount of computing resource
a match. The maximum depth of nested backtracking can also be limited; this used for a match. The maximum depth of nested backtracking can also be limited;
indirectly restricts the amount of heap memory that is used, but there is also this indirectly restricts the amount of heap memory that is used, but there is
an explicit memory limit that can be set. also an explicit memory limit that can be set.
</P> </P>
<P> <P>
These facilities are provided to catch runaway matches that are provoked by These facilities are provided to catch runaway matches that are provoked by
@ -195,20 +195,22 @@ where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b> be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. setting of one of these limits, the lower value is used. The heap limit is
specified in kilobytes.
</P> </P>
<P> <P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility. still recognized for backwards compatibility.
</P> </P>
<P> <P>
The heap limit applies only when the <b>pcre2_match()</b> interpreter is used The heap limit applies only when the <b>pcre2_match()</b> or
for matching. It does not apply to JIT or DFA matching. The match limit is used <b>pcre2_dfa_match()</b> interpreters are used for matching. It does not apply
(but in a different way) when JIT is being used, or when to JIT. The match limit is used (but in a different way) when JIT is being
<b>pcre2_dfa_match()</b> is called, to limit computing resource usage by those used, or when <b>pcre2_dfa_match()</b> is called, to limit computing resource
matching functions. The depth limit is ignored by JIT but is relevant for DFA usage by those matching functions. The depth limit is ignored by JIT but is
matching, which uses function recursion for recursions within the pattern. In relevant for DFA matching, which uses function recursion for recursions within
this case, the depth limit controls the amount of system stack that is used. the pattern and for lookaround assertions and atomic groups. In this case, the
depth limit controls the depth of such recursion.
<a name="newlines"></a></P> <a name="newlines"></a></P>
<br><b> <br><b>
Newline conventions Newline conventions
@ -2818,11 +2820,6 @@ matched at the top level, its final captured value is unset, even if it was
(temporarily) set at a deeper level during the matching process. (temporarily) set at a deeper level during the matching process.
</P> </P>
<P> <P>
If there are more than 15 capturing parentheses in a pattern, PCRE2 has to
obtain extra memory from the heap to store data during a recursion. If no
memory can be obtained, the match fails with the PCRE2_ERROR_NOMEMORY error.
</P>
<P>
Do not confuse the (?R) item with the condition (R), which tests for recursion. Do not confuse the (?R) item with the condition (R), which tests for recursion.
Consider this pattern, which matches text in angle brackets, allowing for Consider this pattern, which matches text in angle brackets, allowing for
arbitrary nesting. Only digits are allowed in nested brackets (that is, when arbitrary nesting. Only digits are allowed in nested brackets (that is, when
@ -3479,9 +3476,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 12 September 2017 Last updated: 25 April 2018
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.
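
The pcre2pattern text above says that a limit set inside a pattern can lower the limit set by the caller but never raise it, and that when a limit is set more than once the lower value is used. A small sketch of that rule follows; effective_limit is a hypothetical helper written for illustration and is not a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine the caller's limit with any number of in-pattern settings.
   The result is the smallest value seen, so an in-pattern setting larger
   than the caller's limit has no effect. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_limits, size_t count)
{
    assert(pattern_limits != NULL || count == 0);
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        if (pattern_limits[i] < limit) limit = pattern_limits[i];
    }
    return limit;
}
```

The same rule applies to each of the three limits independently; for the heap limit the values being compared are in kilobytes.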


@ -93,9 +93,17 @@ may also reduce the memory requirements.
<P> <P>
In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
function calls, but only for processing atomic groups, lookaround assertions, function calls, but only for processing atomic groups, lookaround assertions,
and recursion within the pattern. Too much nested recursion may cause stack and recursion within the pattern. The original version of the code used to
issues. The "match depth" parameter can be used to limit the depth of function allocate quite large internal workspace vectors on the stack, which caused some
recursion in <b>pcre2_dfa_match()</b>. problems for some patterns in environments with small stacks. From release
10.32 the code for <b>pcre2_dfa_match()</b> has been re-factored to use heap
memory when necessary for internal workspace when recursing, though recursive
function calls are still used.
</P>
<P>
The "match depth" parameter can be used to limit the depth of function
recursion, and the "match heap" parameter to limit heap memory in
<b>pcre2_dfa_match()</b>.
</P> </P>
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br> <br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
<P> <P>
@ -244,9 +252,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br> <br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 08 April 2017 Last updated: 25 April 2018
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -1199,7 +1199,7 @@ pattern.
get=&#60;number or name&#62; extract captured substring get=&#60;number or name&#62; extract captured substring
getall extract all captured substrings getall extract all captured substrings
/g global global matching /g global global matching
heap_limit=&#60;n&#62; set a limit on heap memory heap_limit=&#60;n&#62; set a limit on heap memory (Kbytes)
jitstack=&#60;n&#62; set size of JIT stack jitstack=&#60;n&#62; set size of JIT stack
mark show mark values mark show mark values
match_limit=&#60;n&#62; set a match limit match_limit=&#60;n&#62; set a match limit
@ -1438,20 +1438,17 @@ Finding minimum limits
<P> <P>
If the <b>find_limits</b> modifier is present on a subject line, <b>pcre2test</b> If the <b>find_limits</b> modifier is present on a subject line, <b>pcre2test</b>
calls the relevant matching function several times, setting different values in calls the relevant matching function several times, setting different values in
the match context via <b>pcre2_set_heap_limit(), \fBpcre2_set_match_limit()</b>, the match context via <b>pcre2_set_heap_limit()</b>,
or <b>pcre2_set_depth_limit()</b> until it finds the minimum values for each <b>pcre2_set_match_limit()</b>, or <b>pcre2_set_depth_limit()</b> until it finds
parameter that allows the match to complete without error. the minimum values for each parameter that allows the match to complete without
error. If JIT is being used, only the match limit is relevant.
</P> </P>
<P> <P>
If JIT is being used, only the match limit is relevant. If DFA matching is When using this modifier, the pattern should not contain any limit settings
being used, only the depth limit is relevant. such as (*LIMIT_MATCH=...) within it. If such a setting is present and is
</P> lower than the minimum matching value, the minimum value cannot be found
<P> because <b>pcre2_set_match_limit()</b> etc. are only able to reduce the value of
The <i>match_limit</i> number is a measure of the amount of backtracking an in-pattern limit; they cannot increase it.
that takes place, and learning the minimum value can be instructive. For most
simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with
increasing length of subject string.
</P> </P>
<P> <P>
For non-DFA matching, the minimum <i>depth_limit</i> number is a measure of how For non-DFA matching, the minimum <i>depth_limit</i> number is a measure of how
@ -1460,6 +1457,22 @@ searched). In the case of DFA matching, <i>depth_limit</i> controls the depth of
recursive calls of the internal function that is used for handling pattern recursive calls of the internal function that is used for handling pattern
recursion, lookaround assertions, and atomic groups. recursion, lookaround assertions, and atomic groups.
</P> </P>
<P>
For non-DFA matching, the <i>match_limit</i> number is a measure of the amount
of backtracking that takes place, and learning the minimum value can be
instructive. For most simple matches, the number is quite small, but for
patterns with very large numbers of matching possibilities, it can become large
very quickly with increasing length of subject string. In the case of DFA
matching, <i>match_limit</i> controls the total number of calls, both recursive
and non-recursive, to the internal matching function, thus controlling the
overall amount of computing resource that is used.
</P>
<P>
For both kinds of matching, the <i>heap_limit</i> number (which is in kilobytes)
limits the amount of heap memory used for matching. A value of zero disables
the use of any heap memory; many simple pattern matches can be done without
using the heap, so this is not an unreasonable setting.
</P>
<br><b> <br><b>
Showing MARK names Showing MARK names
</b><br> </b><br>
@ -1476,13 +1489,14 @@ Showing memory usage
<P> <P>
The <b>memory</b> modifier causes <b>pcre2test</b> to log the sizes of all heap The <b>memory</b> modifier causes <b>pcre2test</b> to log the sizes of all heap
memory allocation and freeing calls that occur during a call to memory allocation and freeing calls that occur during a call to
<b>pcre2_match()</b>. These occur only when a match requires a bigger vector <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>. These occur only when a match
than the default for remembering backtracking points. In many cases there will requires a bigger vector than the default for remembering backtracking points
be no heap memory used and therefore no additional output. No heap memory is (<b>pcre2_match()</b>) or for internal workspace (<b>pcre2_dfa_match()</b>). In
allocated during matching with <b>pcre2_dfa_match</b> or with JIT, so in those many cases there will be no heap memory used and therefore no additional
cases the <b>memory</b> modifier never has any effect. For this modifier to output. No heap memory is allocated during matching with JIT, so in that case
work, the <b>null_context</b> modifier must not be set on both the pattern and the <b>memory</b> modifier never has any effect. For this modifier to work, the
the subject, though it can be set on one or the other. <b>null_context</b> modifier must not be set on both the pattern and the
subject, though it can be set on one or the other.
</P> </P>
<br><b> <br><b>
Setting a starting offset Setting a starting offset
@ -1982,9 +1996,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 21 December 2017 Last updated: 25 April 2018
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.


@ -1,4 +1,4 @@
.TH PCRE2_DFA_MATCH 3 "30 May 2017" "PCRE2 10.30" .TH PCRE2_DFA_MATCH 3 "26 April 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -34,9 +34,9 @@ just once (except when processing lookaround assertions). This function is
\fIwscount\fP Number of elements in the vector \fIwscount\fP Number of elements in the vector
.sp .sp
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
up a callout function or specify the match and/or the recursion depth limits. up a callout function or specify the heap limit or the match or the recursion
The \fIlength\fP and \fIstartoffset\fP values are code units, not characters. depth limits. The \fIlength\fP and \fIstartoffset\fP values are code units, not
The options are: characters. The options are:
.sp .sp
PCRE2_ANCHORED Match only at the first position PCRE2_ANCHORED Match only at the first position
PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_ENDANCHORED Pattern can match only at end of subject


@ -1,4 +1,4 @@
.TH PCRE2API 3 "31 December 2017" "PCRE2 10.31" .TH PCRE2API 3 "27 April 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -887,16 +887,17 @@ offset limit. In other words, whichever limit comes first is used.
.sp .sp
The \fIheap_limit\fP parameter specifies, in units of kilobytes, the maximum The \fIheap_limit\fP parameter specifies, in units of kilobytes, the maximum
amount of heap memory that \fBpcre2_match()\fP may use to hold backtracking amount of heap memory that \fBpcre2_match()\fP may use to hold backtracking
information when running an interpretive match. This limit does not apply to information when running an interpretive match. This limit also applies to
matching with the JIT optimization, which has its own memory control \fBpcre2_dfa_match()\fP, which may use the heap when processing patterns with a
arrangements (see the lot of nested pattern recursion or lookarounds or atomic groups. This limit
does not apply to matching with the JIT optimization, which has its own memory
control arrangements (see the
.\" HREF .\" HREF
\fBpcre2jit\fP \fBpcre2jit\fP
.\" .\"
documentation for more details), nor does it apply to \fBpcre2_dfa_match()\fP. documentation for more details). If the limit is reached, the negative error
If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
returned. The default limit is set when PCRE2 is built; the default default is built; the default default is very large and is essentially "unlimited".
very large and is essentially "unlimited".
.P .P
A value for the heap limit may also be supplied by an item at the start of a A value for the heap limit may also be supplied by an item at the start of a
pattern of the form pattern of the form
@ -914,6 +915,11 @@ Heap memory is used only if the initial vector is too small. If the heap limit
is set to a value less than 21 (in particular, zero) no heap memory will be is set to a value less than 21 (in particular, zero) no heap memory will be
used. In this case, only patterns that do not have a lot of nested backtracking used. In this case, only patterns that do not have a lot of nested backtracking
can be successfully processed. can be successfully processed.
.P
Similarly, for \fBpcre2_dfa_match()\fP, a vector on the system stack is used
when processing pattern recursions, lookarounds, or atomic groups, and only if
this is not big enough is heap memory used. In this case, too, setting a value
of zero disables the use of the heap.
.sp .sp
.nf .nf
.B int pcre2_set_match_limit(pcre2_match_context *\fImcontext\fP, .B int pcre2_set_match_limit(pcre2_match_context *\fImcontext\fP,
@ -967,11 +973,21 @@ backtracking.
.P .P
The depth limit is not relevant, and is ignored, when matching is done using The depth limit is not relevant, and is ignored, when matching is done using
JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which
uses it to limit the depth of internal recursive function calls that implement uses it to limit the depth of nested internal recursive function calls that
atomic groups, lookaround assertions, and pattern recursions. This is, implement atomic groups, lookaround assertions, and pattern recursions. This
therefore, an indirect limit on the amount of system stack that is used. A limits, indirectly, the amount of system stack that is used. It was more useful
recursive pattern such as /(.)(?1)/, when matched to a very long string using in versions before 10.32, when stack memory was used for local workspace
\fBpcre2_dfa_match()\fP, can use a great deal of stack. vectors for recursive function calls. From version 10.32, only local variables
are allocated on the stack and as each call uses only a few hundred bytes, even
a small stack can support quite a lot of recursion.
.P
If the depth of internal recursive function calls is great enough, local
workspace vectors are allocated on the heap from version 10.32 onwards, so the
depth limit also indirectly limits the amount of heap memory that is used. A
recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string
using \fBpcre2_dfa_match()\fP, can use a great deal of memory. However, it is
probably better to limit heap usage directly by calling
\fBpcre2_set_heap_limit()\fP.
.P .P
The default value for the depth limit can be set when PCRE2 is built; the The default value for the depth limit can be set when PCRE2 is built; the
default default is the same value as the default for the match limit. If the default default is the same value as the default for the match limit. If the
@ -1028,15 +1044,16 @@ and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
PCRE2_CONFIG_DEPTHLIMIT PCRE2_CONFIG_DEPTHLIMIT
.sp .sp
The output is a uint32_t integer that gives the default limit for the depth of The output is a uint32_t integer that gives the default limit for the depth of
nested backtracking in \fBpcre2_match()\fP or the depth of nested recursions nested backtracking in \fBpcre2_match()\fP or the depth of nested recursions,
and lookarounds in \fBpcre2_dfa_match()\fP. Further details are given with lookarounds, and atomic groups in \fBpcre2_dfa_match()\fP. Further details are
\fBpcre2_set_depth_limit()\fP above. given with \fBpcre2_set_depth_limit()\fP above.
.sp .sp
PCRE2_CONFIG_HEAPLIMIT PCRE2_CONFIG_HEAPLIMIT
.sp .sp
The output is a uint32_t integer that gives, in kilobytes, the default limit The output is a uint32_t integer that gives, in kilobytes, the default limit
for the amount of heap memory used by \fBpcre2_match()\fP. Further details are for the amount of heap memory used by \fBpcre2_match()\fP or
given with \fBpcre2_set_heap_limit()\fP above. \fBpcre2_dfa_match()\fP. Further details are given with
\fBpcre2_set_heap_limit()\fP above.
.sp .sp
PCRE2_CONFIG_JIT PCRE2_CONFIG_JIT
.sp .sp
@ -3514,17 +3531,7 @@ capture.
Calls to the convenience functions that extract substrings by name Calls to the convenience functions that extract substrings by name
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
DFA match. The convenience functions that extract substrings by number never DFA match. The convenience functions that extract substrings by number never
return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are return PCRE2_ERROR_NOSUBSTRING.
slightly different:
.sp
PCRE2_ERROR_UNAVAILABLE
.sp
The ovector is not big enough to include a slot for the given substring number.
.sp
PCRE2_ERROR_UNSET
.sp
There is a slot in the ovector for this substring, but there were insufficient
matches to fill it.
.P .P
The matched strings are stored in the ovector in reverse order of length; that The matched strings are stored in the ovector in reverse order of length; that
is, the longest matching string is first. If there were too many matches to fit is, the longest matching string is first. If there were too many matches to fit
@ -3605,6 +3612,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 31 December 2017 Last updated: 27 April 2018
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2BUILD 3 "25 February 2018" "PCRE2 10.32" .TH PCRE2BUILD 3 "26 April 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
. .
@ -292,9 +292,10 @@ change this by a setting such as
--with-heap-limit=500 --with-heap-limit=500
.sp .sp
which limits the amount of heap to 500 kilobytes. This limit applies only to which limits the amount of heap to 500 kilobytes. This limit applies only to
interpretive matching in pcre2_match(). It does not apply when JIT (which has interpretive matching in \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, which
its own memory arrangements) is used, nor does it apply to may also use the heap for internal workspace when processing complicated
\fBpcre2_dfa_match()\fP. patterns. This limit does not apply when JIT (which has its own memory
arrangements) is used.
.P .P
You can also explicitly limit the depth of nested backtracking in the You can also explicitly limit the depth of nested backtracking in the
\fBpcre2_match()\fP interpreter. This limit defaults to the value that is set \fBpcre2_match()\fP interpreter. This limit defaults to the value that is set
@ -590,6 +591,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 25 February 2018 Last updated: 26 April 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "22 December 2017" "PCRE2 10.31" .TH PCRE2CALLOUT 3 "26 April 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -291,10 +291,12 @@ than \fIcapture_top\fP also have both of their ovector slots set to
PCRE2_UNSET. PCRE2_UNSET.
.P .P
For DFA matching, the \fIoffset_vector\fP field points to the ovector that was For DFA matching, the \fIoffset_vector\fP field points to the ovector that was
passed to the matching function in the match data block, but it holds no useful passed to the matching function in the match data block for callouts at the top
information at callout time because \fBpcre2_dfa_match()\fP does not support level, but to an internal ovector during the processing of pattern recursions,
substring capturing. The value of \fIcapture_top\fP is always 1 and the value lookarounds, and atomic groups. However, these ovectors hold no useful
of \fIcapture_last\fP is always 0 for DFA matching. information because \fBpcre2_dfa_match()\fP does not support substring
capturing. The value of \fIcapture_top\fP is always 1 and the value of
\fIcapture_last\fP is always 0 for DFA matching.
.P .P
The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values
that were passed to the matching function. that were passed to the matching function.
@ -441,6 +443,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 22 December 2017 Last updated: 26 April 2018
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "12 September 2017" "PCRE2 10.31" .TH PCRE2PATTERN 3 "25 April 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -141,12 +141,12 @@ the application to apply the JIT optimization by calling
.SS "Setting match resource limits" .SS "Setting match resource limits"
.rs .rs
.sp .sp
The pcre2_match() function contains a counter that is incremented every time it The \fBpcre2_match()\fP function contains a counter that is incremented every
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a
this counter, which therefore limits the amount of computing resource used for limit on this counter, which therefore limits the amount of computing resource
a match. The maximum depth of nested backtracking can also be limited; this used for a match. The maximum depth of nested backtracking can also be limited;
indirectly restricts the amount of heap memory that is used, but there is also this indirectly restricts the amount of heap memory that is used, but there is
an explicit memory limit that can be set. also an explicit memory limit that can be set.
.P .P
These facilities are provided to catch runaway matches that are provoked by These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees (a typical example is a pattern with nested patterns with huge matching trees (a typical example is a pattern with nested
@ -162,18 +162,20 @@ where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. setting of one of these limits, the lower value is used. The heap limit is
specified in kilobytes.
.P .P
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility. still recognized for backwards compatibility.
.P .P
The heap limit applies only when the \fBpcre2_match()\fP interpreter is used The heap limit applies only when the \fBpcre2_match()\fP or
for matching. It does not apply to JIT or DFA matching. The match limit is used \fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
(but in a different way) when JIT is being used, or when to JIT. The match limit is used (but in a different way) when JIT is being
\fBpcre2_dfa_match()\fP is called, to limit computing resource usage by those used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource
matching functions. The depth limit is ignored by JIT but is relevant for DFA usage by those matching functions. The depth limit is ignored by JIT but is
matching, which uses function recursion for recursions within the pattern. In relevant for DFA matching, which uses function recursion for recursions within
this case, the depth limit controls the amount of system stack that is used. the pattern and for lookaround assertions and atomic groups. In this case, the
depth limit controls the depth of such recursion.
. .
. .
.\" HTML <a name="newlines"></a> .\" HTML <a name="newlines"></a>
@ -2838,10 +2840,6 @@ the last value taken on at the top level. If a capturing subpattern is not
matched at the top level, its final captured value is unset, even if it was matched at the top level, its final captured value is unset, even if it was
(temporarily) set at a deeper level during the matching process. (temporarily) set at a deeper level during the matching process.
.P .P
If there are more than 15 capturing parentheses in a pattern, PCRE2 has to
obtain extra memory from the heap to store data during a recursion. If no
memory can be obtained, the match fails with the PCRE2_ERROR_NOMEMORY error.
.P
Do not confuse the (?R) item with the condition (R), which tests for recursion. Do not confuse the (?R) item with the condition (R), which tests for recursion.
Consider this pattern, which matches text in angle brackets, allowing for Consider this pattern, which matches text in angle brackets, allowing for
arbitrary nesting. Only digits are allowed in nested brackets (that is, when arbitrary nesting. Only digits are allowed in nested brackets (that is, when
@ -3505,6 +3503,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 12 September 2017 Last updated: 25 April 2018
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PERFORM 3 "08 April 2017" "PCRE2 10.30" .TH PCRE2PERFORM 3 "25 April 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE" .SH "PCRE2 PERFORMANCE"
@ -78,9 +78,16 @@ may also reduce the memory requirements.
.P .P
In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
function calls, but only for processing atomic groups, lookaround assertions, function calls, but only for processing atomic groups, lookaround assertions,
and recursion within the pattern. Too much nested recursion may cause stack and recursion within the pattern. The original version of the code used to
issues. The "match depth" parameter can be used to limit the depth of function allocate quite large internal workspace vectors on the stack, which caused some
recursion in \fBpcre2_dfa_match()\fP. problems for some patterns in environments with small stacks. From release
10.32 the code for \fBpcre2_dfa_match()\fP has been re-factored to use heap
memory when necessary for internal workspace when recursing, though recursive
function calls are still used.
.P
The "match depth" parameter can be used to limit the depth of function
recursion, and the "match heap" parameter to limit heap memory in
\fBpcre2_dfa_match()\fP.
. .
. .
.SH "PROCESSING TIME" .SH "PROCESSING TIME"
@ -232,6 +239,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 08 April 2017 Last updated: 25 April 2018
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "21 Decbmber 2017" "PCRE 10.31" .TH PCRE2TEST 1 "25 April 2018" "PCRE 10.32"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -1168,7 +1168,7 @@ pattern.
get=<number or name> extract captured substring get=<number or name> extract captured substring
getall extract all captured substrings getall extract all captured substrings
/g global global matching /g global global matching
heap_limit=<n> set a limit on heap memory heap_limit=<n> set a limit on heap memory (Kbytes)
jitstack=<n> set size of JIT stack jitstack=<n> set size of JIT stack
mark show mark values mark show mark values
match_limit=<n> set a match limit match_limit=<n> set a match limit
@ -1401,24 +1401,36 @@ the appropriate limits in the match context. These values are ignored when the
.sp .sp
If the \fBfind_limits\fP modifier is present on a subject line, \fBpcre2test\fP If the \fBfind_limits\fP modifier is present on a subject line, \fBpcre2test\fP
calls the relevant matching function several times, setting different values in calls the relevant matching function several times, setting different values in
the match context via \fBpcre2_set_heap_limit(), \fBpcre2_set_match_limit()\fP, the match context via \fBpcre2_set_heap_limit()\fP,
or \fBpcre2_set_depth_limit()\fP until it finds the minimum values for each \fBpcre2_set_match_limit()\fP, or \fBpcre2_set_depth_limit()\fP until it finds
parameter that allows the match to complete without error. the minimum values for each parameter that allows the match to complete without
error. If JIT is being used, only the match limit is relevant.
.P .P
If JIT is being used, only the match limit is relevant. If DFA matching is When using this modifier, the pattern should not contain any limit settings
being used, only the depth limit is relevant. such as (*LIMIT_MATCH=...) within it. If such a setting is present and is
.P lower than the minimum matching value, the minimum value cannot be found
The \fImatch_limit\fP number is a measure of the amount of backtracking because \fBpcre2_set_match_limit()\fP etc. are only able to reduce the value of
that takes place, and learning the minimum value can be instructive. For most an in-pattern limit; they cannot increase it.
simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with
increasing length of subject string.
.P .P
For non-DFA matching, the minimum \fIdepth_limit\fP number is a measure of how For non-DFA matching, the minimum \fIdepth_limit\fP number is a measure of how
much nested backtracking happens (that is, how deeply the pattern's tree is much nested backtracking happens (that is, how deeply the pattern's tree is
searched). In the case of DFA matching, \fIdepth_limit\fP controls the depth of searched). In the case of DFA matching, \fIdepth_limit\fP controls the depth of
recursive calls of the internal function that is used for handling pattern recursive calls of the internal function that is used for handling pattern
recursion, lookaround assertions, and atomic groups. recursion, lookaround assertions, and atomic groups.
.P
For non-DFA matching, the \fImatch_limit\fP number is a measure of the amount
of backtracking that takes place, and learning the minimum value can be
instructive. For most simple matches, the number is quite small, but for
patterns with very large numbers of matching possibilities, it can become large
very quickly with increasing length of subject string. In the case of DFA
matching, \fImatch_limit\fP controls the total number of calls, both recursive
and non-recursive, to the internal matching function, thus controlling the
overall amount of computing resource that is used.
.P
For both kinds of matching, the \fIheap_limit\fP number (which is in kilobytes)
limits the amount of heap memory used for matching. A value of zero disables
the use of any heap memory; many simple pattern matches can be done without
using the heap, so this is not an unreasonable setting.
. .
. .
.SS "Showing MARK names" .SS "Showing MARK names"
@ -1437,13 +1449,14 @@ is added to the non-match message.
.sp .sp
The \fBmemory\fP modifier causes \fBpcre2test\fP to log the sizes of all heap The \fBmemory\fP modifier causes \fBpcre2test\fP to log the sizes of all heap
memory allocation and freeing calls that occur during a call to memory allocation and freeing calls that occur during a call to
\fBpcre2_match()\fP. These occur only when a match requires a bigger vector \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP. These occur only when a match
than the default for remembering backtracking points. In many cases there will requires a bigger vector than the default for remembering backtracking points
be no heap memory used and therefore no additional output. No heap memory is (\fBpcre2_match()\fP) or for internal workspace (\fBpcre2_dfa_match()\fP). In
allocated during matching with \fBpcre2_dfa_match\fP or with JIT, so in those many cases there will be no heap memory used and therefore no additional
cases the \fBmemory\fP modifier never has any effect. For this modifier to output. No heap memory is allocated during matching with JIT, so in that case
work, the \fBnull_context\fP modifier must not be set on both the pattern and the \fBmemory\fP modifier never has any effect. For this modifier to work, the
the subject, though it can be set on one or the other. \fBnull_context\fP modifier must not be set on both the pattern and the
subject, though it can be set on one or the other.
. .
. .
.SS "Setting a starting offset" .SS "Setting a starting offset"
@ -1962,6 +1975,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 21 December 2017 Last updated: 25 April 2018
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1071,7 +1071,7 @@ SUBJECT MODIFIERS
get=<number or name> extract captured substring get=<number or name> extract captured substring
getall extract all captured substrings getall extract all captured substrings
/g global global matching /g global global matching
heap_limit=<n> set a limit on heap memory heap_limit=<n> set a limit on heap memory (Kbytes)
jitstack=<n> set size of JIT stack jitstack=<n> set size of JIT stack
mark show mark values mark show mark values
match_limit=<n> set a match limit match_limit=<n> set a match limit
@ -1291,126 +1291,139 @@ SUBJECT MODIFIERS
values in the match context via pcre2_set_heap_limit(), values in the match context via pcre2_set_heap_limit(),
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
minimum values for each parameter that allows the match to complete minimum values for each parameter that allows the match to complete
without error. without error. If JIT is being used, only the match limit is relevant.
If JIT is being used, only the match limit is relevant. If DFA matching When using this modifier, the pattern should not contain any limit set-
is being used, only the depth limit is relevant. tings such as (*LIMIT_MATCH=...) within it. If such a setting is
present and is lower than the minimum matching value, the minimum value
cannot be found because pcre2_set_match_limit() etc. are only able to
reduce the value of an in-pattern limit; they cannot increase it.
The match_limit number is a measure of the amount of backtracking that For non-DFA matching, the minimum depth_limit number is a measure of
takes place, and learning the minimum value can be instructive. For
most simple matches, the number is quite small, but for patterns with
very large numbers of matching possibilities, it can become large very
quickly with increasing length of subject string.
For non-DFA matching, the minimum depth_limit number is a measure of
how much nested backtracking happens (that is, how deeply the pattern's how much nested backtracking happens (that is, how deeply the pattern's
tree is searched). In the case of DFA matching, depth_limit controls tree is searched). In the case of DFA matching, depth_limit controls
the depth of recursive calls of the internal function that is used for the depth of recursive calls of the internal function that is used for
handling pattern recursion, lookaround assertions, and atomic groups. handling pattern recursion, lookaround assertions, and atomic groups.
For non-DFA matching, the match_limit number is a measure of the amount
of backtracking that takes place, and learning the minimum value can be
instructive. For most simple matches, the number is quite small, but
for patterns with very large numbers of matching possibilities, it can
become large very quickly with increasing length of subject string. In
the case of DFA matching, match_limit controls the total number of
calls, both recursive and non-recursive, to the internal matching func-
tion, thus controlling the overall amount of computing resource that is
used.
For both kinds of matching, the heap_limit number (which is in kilo-
bytes) limits the amount of heap memory used for matching. A value of
zero disables the use of any heap memory; many simple pattern matches
can be done without using the heap, so this is not an unreasonable set-
ting.
Showing MARK names Showing MARK names
The mark modifier causes the names from backtracking control verbs that The mark modifier causes the names from backtracking control verbs that
are returned from calls to pcre2_match() to be displayed. If a mark is are returned from calls to pcre2_match() to be displayed. If a mark is
returned for a match, non-match, or partial match, pcre2test shows it. returned for a match, non-match, or partial match, pcre2test shows it.
For a match, it is on a line by itself, tagged with "MK:". Otherwise, For a match, it is on a line by itself, tagged with "MK:". Otherwise,
it is added to the non-match message. it is added to the non-match message.
Showing memory usage Showing memory usage
The memory modifier causes pcre2test to log the sizes of all heap mem- The memory modifier causes pcre2test to log the sizes of all heap mem-
ory allocation and freeing calls that occur during a call to ory allocation and freeing calls that occur during a call to
pcre2_match(). These occur only when a match requires a bigger vector pcre2_match() or pcre2_dfa_match(). These occur only when a match
than the default for remembering backtracking points. In many cases requires a bigger vector than the default for remembering backtracking
there will be no heap memory used and therefore no additional output. points (pcre2_match()) or for internal workspace (pcre2_dfa_match()).
No heap memory is allocated during matching with pcre2_dfa_match or In many cases there will be no heap memory used and therefore no addi-
with JIT, so in those cases the memory modifier never has any effect. tional output. No heap memory is allocated during matching with JIT, so
For this modifier to work, the null_context modifier must not be set on in that case the memory modifier never has any effect. For this modi-
both the pattern and the subject, though it can be set on one or the fier to work, the null_context modifier must not be set on both the
other. pattern and the subject, though it can be set on one or the other.
Setting a starting offset Setting a starting offset
The offset modifier sets an offset in the subject string at which The offset modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters. matching starts. Its value is a number of code units, not characters.
Setting an offset limit Setting an offset limit
The offset_limit modifier sets a limit for unanchored matches. If a The offset_limit modifier sets a limit for unanchored matches. If a
match cannot be found starting at or before this offset in the subject, match cannot be found starting at or before this offset in the subject,
a "no match" return is given. The data value is a number of code units, a "no match" return is given. The data value is a number of code units,
not characters. When this modifier is used, the use_offset_limit modi- not characters. When this modifier is used, the use_offset_limit modi-
fier must have been set for the pattern; if not, an error is generated. fier must have been set for the pattern; if not, an error is generated.
Setting the size of the output vector Setting the size of the output vector
The ovector modifier applies only to the subject line in which it The ovector modifier applies only to the subject line in which it
appears, though of course it can also be used to set a default in a appears, though of course it can also be used to set a default in a
#subject command. It specifies the number of pairs of offsets that are #subject command. It specifies the number of pairs of offsets that are
available for storing matching information. The default is 15. available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
regexec() to be called with a NULL capture vector. When not testing the regexec() to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause pcre2_match_data_cre- POSIX API, a value of zero is used to cause pcre2_match_data_cre-
ate_from_pattern() to be called, in order to create a match block of ate_from_pattern() to be called, in order to create a match block of
exactly the right size for the pattern. (It is not possible to create a exactly the right size for the pattern. (It is not possible to create a
match block with a zero-length ovector; there is always at least one match block with a zero-length ovector; there is always at least one
pair of offsets.) pair of offsets.)
Passing the subject as zero-terminated Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching func- By default, the subject string is passed to a native API matching func-
tion with its correct length. In order to test the facility for passing tion with its correct length. In order to test the facility for passing
a zero-terminated string, the zero_terminate modifier is provided. It a zero-terminated string, the zero_terminate modifier is provided. It
causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
via the POSIX interface, this modifier is ignored, with a warning. via the POSIX interface, this modifier is ignored, with a warning.
When testing pcre2_substitute(), this modifier also has the effect of When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
Passing a NULL context Passing a NULL context
Normally, pcre2test passes a context block to pcre2_match(), Normally, pcre2test passes a context block to pcre2_match(),
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
set, however, NULL is passed. This is for testing that the matching set, however, NULL is passed. This is for testing that the matching
functions behave correctly in this case (they use default values). This functions behave correctly in this case (they use default values). This
modifier cannot be used with the find_limits modifier or when testing modifier cannot be used with the find_limits modifier or when testing
the substitution function. the substitution function.
THE ALTERNATIVE MATCHING FUNCTION THE ALTERNATIVE MATCHING FUNCTION
By default, pcre2test uses the standard PCRE2 matching function, By default, pcre2test uses the standard PCRE2 matching function,
pcre2_match() to match each subject line. PCRE2 also supports an alter- pcre2_match() to match each subject line. PCRE2 also supports an alter-
native matching function, pcre2_dfa_match(), which operates in a dif- native matching function, pcre2_dfa_match(), which operates in a dif-
ferent way, and has some restrictions. The differences between the two ferent way, and has some restrictions. The differences between the two
functions are described in the pcre2matching documentation. functions are described in the pcre2matching documentation.
If the dfa modifier is set, the alternative matching function is used. If the dfa modifier is set, the alternative matching function is used.
This function finds all possible matches at a given point in the sub- This function finds all possible matches at a given point in the sub-
ject. If, however, the dfa_shortest modifier is set, processing stops ject. If, however, the dfa_shortest modifier is set, processing stops
after the first match is found. This is always the shortest possible after the first match is found. This is always the shortest possible
match. match.
DEFAULT OUTPUT FROM pcre2test DEFAULT OUTPUT FROM pcre2test
This section describes the output when the normal matching function, This section describes the output when the normal matching function,
pcre2_match(), is being used. pcre2_match(), is being used.
When a match succeeds, pcre2test outputs the list of captured sub- When a match succeeds, pcre2test outputs the list of captured sub-
strings, starting with number 0 for the string that matched the whole strings, starting with number 0 for the string that matched the whole
pattern. Otherwise, it outputs "No match" when the return is pattern. Otherwise, it outputs "No match" when the return is
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
this is the entire substring that was inspected during the partial this is the entire substring that was inspected during the partial
match; it may include characters before the actual match start if a match; it may include characters before the actual match start if a
lookbehind assertion, \K, \b, or \B was involved.) lookbehind assertion, \K, \b, or \B was involved.)
For any other return, pcre2test outputs the PCRE2 negative error number For any other return, pcre2test outputs the PCRE2 negative error number
and a short descriptive phrase. If the error is a failed UTF string and a short descriptive phrase. If the error is a failed UTF string
check, the code unit offset of the start of the failing character is check, the code unit offset of the start of the failing character is
also output. Here is an example of an interactive pcre2test run. also output. Here is an example of an interactive pcre2test run.
$ pcre2test $ pcre2test
@ -1426,8 +1439,8 @@ DEFAULT OUTPUT FROM pcre2test
Unset capturing substrings that are not followed by one that is set are Unset capturing substrings that are not followed by one that is set are
not shown by pcre2test unless the allcaptures modifier is specified. In not shown by pcre2test unless the allcaptures modifier is specified. In
the following example, there are two capturing substrings, but when the the following example, there are two capturing substrings, but when the
first data line is matched, the second, unset substring is not shown. first data line is matched, the second, unset substring is not shown.
An "internal" unset substring is shown as "<unset>", as for the second An "internal" unset substring is shown as "<unset>", as for the second
data line. data line.
re> /(a)|(b)/ re> /(a)|(b)/
@ -1439,11 +1452,11 @@ DEFAULT OUTPUT FROM pcre2test
1: <unset> 1: <unset>
2: b 2: b
If the strings contain any non-printing characters, they are output as If the strings contain any non-printing characters, they are output as
\xhh escapes if the value is less than 256 and UTF mode is not set. \xhh escapes if the value is less than 256 and UTF mode is not set.
Otherwise they are output as \x{hh...} escapes. See below for the defi- Otherwise they are output as \x{hh...} escapes. See below for the defi-
nition of non-printing characters. If the aftertext modifier is set, nition of non-printing characters. If the aftertext modifier is set,
the output for substring 0 is followed by the rest of the subject the output for substring 0 is followed by the rest of the subject
string, identified by "0+" like this: string, identified by "0+" like this:
re> /cat/aftertext re> /cat/aftertext
@ -1451,7 +1464,7 @@ DEFAULT OUTPUT FROM pcre2test
0: cat 0: cat
0+ aract 0+ aract
If global matching is requested, the results of successive matching If global matching is requested, the results of successive matching
attempts are output in sequence, like this: attempts are output in sequence, like this:
re> /\Bi(\w\w)/g re> /\Bi(\w\w)/g
@ -1463,8 +1476,8 @@ DEFAULT OUTPUT FROM pcre2test
0: ipp 0: ipp
1: pp 1: pp
"No match" is output only if the first match attempt fails. Here is an "No match" is output only if the first match attempt fails. Here is an
example of a failure message (the offset 4 that is specified by the example of a failure message (the offset 4 that is specified by the
offset modifier is past the end of the subject string): offset modifier is past the end of the subject string):
re> /xyz/ re> /xyz/
@ -1472,7 +1485,7 @@ DEFAULT OUTPUT FROM pcre2test
Error -24 (bad offset value) Error -24 (bad offset value)
Note that whereas patterns can be continued over several lines (a plain Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), subject lines may not. However ">" prompt is used for continuations), subject lines may not. However
newlines can be included in a subject by means of the \n escape (or \r, newlines can be included in a subject by means of the \n escape (or \r,
\r\n, etc., depending on the newline sequence setting). \r\n, etc., depending on the newline sequence setting).
@ -1480,7 +1493,7 @@ DEFAULT OUTPUT FROM pcre2test
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
When the alternative matching function, pcre2_dfa_match(), is used, the When the alternative matching function, pcre2_dfa_match(), is used, the
output consists of a list of all the matches that start at the first output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example: point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/ re> /(tang|tangerine|tan)/
@ -1489,11 +1502,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang 1: tang
2: tan 2: tan
Using the normal matching function on this data finds only "tang". The Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero). longest matching string is always given first (and numbered zero).
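To make the "all matches at the first matching point, longest first" behaviour concrete, here is a toy C sketch. It is not how pcre2_dfa_match() works internally: it only handles literal alternatives, and the function name all_matches_at_first_point and the subject string used in the usage note are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order strings by decreasing length, so the longest
   match comes first. */
static int longer_first(const void *a, const void *b)
{
  size_t la = strlen(*(const char *const *)a);
  size_t lb = strlen(*(const char *const *)b);
  return (la < lb) - (la > lb);
}

/* Toy model (hypothetical helper): "alts" are literal alternatives. Find the
   first offset in subject at which at least one alternative matches, store
   every alternative that matches there in out[] (longest first), and return
   how many were stored. out[] must have room for nalts entries. */
static size_t all_matches_at_first_point(const char *subject,
  const char *const *alts, size_t nalts, const char **out, size_t *offset)
{
  assert(subject != NULL && alts != NULL && out != NULL && offset != NULL);
  size_t slen = strlen(subject);
  for (size_t start = 0; start <= slen; start++)
    {
    size_t count = 0;
    for (size_t i = 0; i < nalts; i++)
      {
      size_t alen = strlen(alts[i]);
      if (alen <= slen - start &&
          memcmp(subject + start, alts[i], alen) == 0)
        out[count++] = alts[i];
      }
    if (count > 0)
      {
      qsort(out, count, sizeof(out[0]), longer_first);
      *offset = start;
      return count;
      }
    }
  return 0;  /* No alternative matches anywhere in the subject */
}
```

With the alternatives tang, tangerine and tan and an assumed subject of "yellow tangerine", the sketch reports three matches at offset 7, ordered tangerine, tang, tan.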
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
followed by the partially matching substring. Note that this is the followed by the partially matching substring. Note that this is the
entire substring that was inspected during the partial match; it may entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser- include characters before the actual match start if a lookbehind asser-
tion, \b, or \B was involved. (\K is not supported for DFA matching.) tion, \b, or \B was involved. (\K is not supported for DFA matching.)
@ -1509,16 +1522,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan 1: tan
0: tan 0: tan
The alternative matching function does not support substring capture, The alternative matching function does not support substring capture,
so the modifiers that are concerned with captured substrings are not so the modifiers that are concerned with captured substrings are not
relevant. relevant.
RESTARTING AFTER A PARTIAL MATCH RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE2_ERROR_PAR- When the alternative matching function has given the PCRE2_ERROR_PAR-
TIAL return, indicating that the subject partially matched the pattern, TIAL return, indicating that the subject partially matched the pattern,
you can restart the match with additional subject data by means of the you can restart the match with additional subject data by means of the
dfa_restart modifier. For example: dfa_restart modifier. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@ -1527,37 +1540,37 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\=dfa,dfa_restart data> n05\=dfa,dfa_restart
0: n05 0: n05
For further information about partial matching, see the pcre2partial For further information about partial matching, see the pcre2partial
documentation. documentation.
CALLOUTS CALLOUTS
If the pattern contains any callout requests, pcre2test's callout func- If the pattern contains any callout requests, pcre2test's callout func-
tion is called during matching unless callout_none is specified. This tion is called during matching unless callout_none is specified. This
works with both matching functions, and with JIT, though there are some works with both matching functions, and with JIT, though there are some
differences in behaviour. The output for callouts with numerical argu- differences in behaviour. The output for callouts with numerical argu-
ments and those with string arguments is slightly different. ments and those with string arguments is slightly different.
Callouts with numerical arguments Callouts with numerical arguments
By default, the callout function displays the callout number, the start By default, the callout function displays the callout number, the start
and current positions in the subject text at the callout time, and the and current positions in the subject text at the callout time, and the
next pattern item to be tested. For example: next pattern item to be tested. For example:
--->pqrabcdef --->pqrabcdef
0 ^ ^ \d 0 ^ ^ \d
This output indicates that callout number 0 occurred for a match This output indicates that callout number 0 occurred for a match
attempt starting at the fourth character of the subject string, when attempt starting at the fourth character of the subject string, when
the pointer was at the seventh character, and when the next pattern the pointer was at the seventh character, and when the next pattern
item was \d. Just one circumflex is output if the start and current item was \d. Just one circumflex is output if the start and current
positions are the same, or if the current position precedes the start positions are the same, or if the current position precedes the start
position, which can happen if the callout is in a lookbehind assertion. position, which can happen if the callout is in a lookbehind assertion.
Callouts numbered 255 are assumed to be automatic callouts, inserted as Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the auto_callout pattern modifier. In this case, instead of a result of the auto_callout pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example: plus, is output. For example:
re> /\d?[A-E]\*/auto_callout re> /\d?[A-E]\*/auto_callout
@ -1570,7 +1583,7 @@ CALLOUTS
0: E* 0: E*
If a pattern contains (*MARK) items, an additional line is output when- If a pattern contains (*MARK) items, an additional line is output when-
ever a change of latest mark is passed to the callout function. For ever a change of latest mark is passed to the callout function. For
example: example:
re> /a(*MARK:X)bc/auto_callout re> /a(*MARK:X)bc/auto_callout
@ -1584,17 +1597,17 @@ CALLOUTS
+12 ^ ^ +12 ^ ^
0: abc 0: abc
The mark changes between matching "a" and "b", but stays the same for The mark changes between matching "a" and "b", but stays the same for
the rest of the match, so nothing more is output. If, as a result of the rest of the match, so nothing more is output. If, as a result of
backtracking, the mark reverts to being unset, the text "<unset>" is backtracking, the mark reverts to being unset, the text "<unset>" is
output. output.
Callouts with string arguments Callouts with string arguments
The output for a callout with a string argument is similar, except that The output for a callout with a string argument is similar, except that
instead of outputting a callout number before the position indicators, instead of outputting a callout number before the position indicators,
the callout string and its offset in the pattern string are output the callout string and its offset in the pattern string are output
before the reflection of the subject string, and the subject string is before the reflection of the subject string, and the subject string is
reflected for each callout. For example: reflected for each callout. For example:
re> /^ab(?C'first')cd(?C"second")ef/ re> /^ab(?C'first')cd(?C"second")ef/
@ -1610,26 +1623,26 @@ CALLOUTS
Callout modifiers Callout modifiers
The callout function in pcre2test returns zero (carry on matching) by The callout function in pcre2test returns zero (carry on matching) by
default, but you can use a callout_fail modifier in a subject line to default, but you can use a callout_fail modifier in a subject line to
change this and other parameters of the callout (see below). change this and other parameters of the callout (see below).
If the callout_capture modifier is set, the current captured groups are If the callout_capture modifier is set, the current captured groups are
output when a callout occurs. This is useful only for non-DFA matching, output when a callout occurs. This is useful only for non-DFA matching,
as pcre2_dfa_match() does not support capturing, so no captures are as pcre2_dfa_match() does not support capturing, so no captures are
ever shown. ever shown.
The normal callout output, showing the callout number or pattern offset The normal callout output, showing the callout number or pattern offset
(as described above) is suppressed if the callout_no_where modifier is (as described above) is suppressed if the callout_no_where modifier is
set. set.
When using the interpretive matching function pcre2_match() without When using the interpretive matching function pcre2_match() without
JIT, setting the callout_extra modifier causes additional output from JIT, setting the callout_extra modifier causes additional output from
pcre2test's callout function to be generated. For the first callout in pcre2test's callout function to be generated. For the first callout in
a match attempt at a new starting position in the subject, "New match a match attempt at a new starting position in the subject, "New match
attempt" is output. If there has been a backtrack since the last call- attempt" is output. If there has been a backtrack since the last call-
out (or start of matching if this is the first callout), "Backtrack" is out (or start of matching if this is the first callout), "Backtrack" is
output, followed by "No other matching paths" if the backtrack ended output, followed by "No other matching paths" if the backtrack ended
the previous match attempt. For example: the previous match attempt. For example:
re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
@ -1666,82 +1679,82 @@ CALLOUTS
+1 ^ a+ +1 ^ a+
No match No match
Notice that various optimizations must be turned off if you want all Notice that various optimizations must be turned off if you want all
possible matching paths to be scanned. If no_start_optimize is not possible matching paths to be scanned. If no_start_optimize is not
used, there is an immediate "no match", without any callouts, because used, there is an immediate "no match", without any callouts, because
the starting optimization fails to find "b" in the subject, which it the starting optimization fails to find "b" in the subject, which it
knows must be present for any match. If no_auto_possess is not used, knows must be present for any match. If no_auto_possess is not used,
the "a+" item is turned into "a++", which reduces the number of back- the "a+" item is turned into "a++", which reduces the number of back-
tracks. tracks.
The callout_extra modifier has no effect if used with the DFA matching The callout_extra modifier has no effect if used with the DFA matching
function, or with JIT. function, or with JIT.
Return values from callouts Return values from callouts
The default return from the callout function is zero, which allows The default return from the callout function is zero, which allows
matching to continue. The callout_fail modifier can be given one or two matching to continue. The callout_fail modifier can be given one or two
numbers. If there is only one number, 1 is returned instead of 0 (caus- numbers. If there is only one number, 1 is returned instead of 0 (caus-
ing matching to backtrack) when a callout of that number is reached. If ing matching to backtrack) when a callout of that number is reached. If
two numbers (<n>:<m>) are given, 1 is returned when callout <n> is two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
reached and there have been at least <m> callouts. The callout_error reached and there have been at least <m> callouts. The callout_error
modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus- modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus-
ing the entire matching process to be aborted. If both these modifiers ing the entire matching process to be aborted. If both these modifiers
are set for the same callout number, callout_error takes precedence. are set for the same callout number, callout_error takes precedence.
Note that callouts with string arguments are always given the number Note that callouts with string arguments are always given the number
zero. zero.
The callout_data modifier can be given an unsigned or a negative num- The callout_data modifier can be given an unsigned or a negative num-
ber. This is set as the "user data" that is passed to the matching ber. This is set as the "user data" that is passed to the matching
function, and passed back when the callout function is invoked. Any function, and passed back when the callout function is invoked. Any
value other than zero is used as a return from pcre2test's callout value other than zero is used as a return from pcre2test's callout
function. function.
Inserting callouts can be helpful when using pcre2test to check compli- Inserting callouts can be helpful when using pcre2test to check compli-
cated regular expressions. For further information about callouts, see cated regular expressions. For further information about callouts, see
the pcre2callout documentation. the pcre2callout documentation.
NON-PRINTING CHARACTERS NON-PRINTING CHARACTERS
When pcre2test is outputting text in the compiled version of a pattern, When pcre2test is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes. and are therefore shown as hex escapes.
When pcre2test is outputting text that is a matched part of a subject When pcre2test is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been string, it behaves in the same way, unless a different locale has been
set for the pattern (using the locale modifier). In this case, the set for the pattern (using the locale modifier). In this case, the
isprint() function is used to distinguish printing and non-printing isprint() function is used to distinguish printing and non-printing
characters. characters.
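As a rough illustration of the default rule (bytes outside 32-126 shown as hex escapes), here is a small C sketch. The helper name escape_bytes is hypothetical, the lower-case two-digit \xhh form is an assumption about the exact output style, and the locale/isprint() case and UTF \x{hh...} form are deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy len bytes of src into dst, writing any byte
   outside the range 32-126 as a four-character \xhh escape. Returns the
   number of characters written (excluding the terminating NUL), or -1 if
   dst is too small. */
static int escape_bytes(const unsigned char *src, size_t len,
  char *dst, size_t dstsize)
{
  assert(src != NULL && dst != NULL);
  if (dstsize == 0) return -1;
  size_t used = 0;
  for (size_t i = 0; i < len; i++)
    {
    unsigned int c = src[i];
    if (c >= 32 && c <= 126)
      {
      if (used + 2 > dstsize) return -1;   /* Character plus NUL */
      dst[used++] = (char)c;
      }
    else
      {
      if (used + 5 > dstsize) return -1;   /* Four characters plus NUL */
      snprintf(dst + used, 5, "\\x%02x", c);
      used += 4;
      }
    }
  dst[used] = '\0';
  return (int)used;
}
```

For example, the three bytes 'a', newline, 'b' come out as the six characters a\x0ab.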
SAVING AND RESTORING COMPILED PATTERNS SAVING AND RESTORING COMPILED PATTERNS
It is possible to save compiled patterns on disc or elsewhere, and It is possible to save compiled patterns on disc or elsewhere, and
reload them later, subject to a number of restrictions. JIT data cannot reload them later, subject to a number of restrictions. JIT data cannot
be saved. The host on which the patterns are reloaded must be running be saved. The host on which the patterns are reloaded must be running
the same version of PCRE2, with the same code unit width, and must also the same version of PCRE2, with the same code unit width, and must also
have the same endianness, pointer width and PCRE2_SIZE type. Before have the same endianness, pointer width and PCRE2_SIZE type. Before
compiled patterns can be saved they must be serialized, that is, con- compiled patterns can be saved they must be serialized, that is, con-
verted to a stream of bytes. A single byte stream may contain any num- verted to a stream of bytes. A single byte stream may contain any num-
ber of compiled patterns, but they must all use the same character ber of compiled patterns, but they must all use the same character
tables. A single copy of the tables is included in the byte stream (its tables. A single copy of the tables is included in the byte stream (its
size is 1088 bytes). size is 1088 bytes).
The functions whose names begin with pcre2_serialize_ are used for The functions whose names begin with pcre2_serialize_ are used for
serializing and de-serializing. They are described in the pcre2serial- serializing and de-serializing. They are described in the pcre2serial-
ize documentation. In this section we describe the features of ize documentation. In this section we describe the features of
pcre2test that can be used to test these functions. pcre2test that can be used to test these functions.
When a pattern with push modifier is successfully compiled, it is When a pattern with push modifier is successfully compiled, it is
pushed onto a stack of compiled patterns, and pcre2test expects the pushed onto a stack of compiled patterns, and pcre2test expects the
next line to contain a new pattern (or command) instead of a subject next line to contain a new pattern (or command) instead of a subject
line. By contrast, the pushcopy modifier causes a copy of the compiled line. By contrast, the pushcopy modifier causes a copy of the compiled
pattern to be stacked, leaving the original available for immediate pattern to be stacked, leaving the original available for immediate
matching. By using push and/or pushcopy, a number of patterns can be matching. By using push and/or pushcopy, a number of patterns can be
compiled and retained. These modifiers are incompatible with posix, and compiled and retained. These modifiers are incompatible with posix, and
control modifiers that act at match time are ignored (with a message) control modifiers that act at match time are ignored (with a message)
for the stacked patterns. The jitverify modifier applies only at com- for the stacked patterns. The jitverify modifier applies only at com-
pile time. pile time.
The command The command
@ -1749,21 +1762,21 @@ SAVING AND RESTORING COMPILED PATTERNS
#save <filename> #save <filename>
causes all the stacked patterns to be serialized and the result written causes all the stacked patterns to be serialized and the result written
to the named file. Afterwards, all the stacked patterns are freed. The to the named file. Afterwards, all the stacked patterns are freed. The
command command
#load <filename> #load <filename>
reads the data in the file, and then arranges for it to be de-serial- reads the data in the file, and then arranges for it to be de-serial-
ized, with the resulting compiled patterns added to the pattern stack. ized, with the resulting compiled patterns added to the pattern stack.
The pattern on the top of the stack can be retrieved by the #pop com- The pattern on the top of the stack can be retrieved by the #pop com-
mand, which must be followed by lines of subjects that are to be mand, which must be followed by lines of subjects that are to be
matched with the pattern, terminated as usual by an empty line or end matched with the pattern, terminated as usual by an empty line or end
of file. This command may be followed by a modifier list containing of file. This command may be followed by a modifier list containing
only control modifiers that act after a pattern has been compiled. In only control modifiers that act after a pattern has been compiled. In
particular, hex, posix, posix_nosub, push, and pushcopy are not particular, hex, posix, posix_nosub, push, and pushcopy are not
allowed, nor are any option-setting modifiers. The JIT modifiers are, allowed, nor are any option-setting modifiers. The JIT modifiers are,
however, permitted. Here is an example that saves and reloads two however, permitted. Here is an example that saves and reloads two
patterns. patterns.
/abc/push /abc/push
@ -1776,10 +1789,10 @@ SAVING AND RESTORING COMPILED PATTERNS
#pop jit,bincode #pop jit,bincode
abc abc
If jitverify is used with #pop, it does not automatically imply jit, If jitverify is used with #pop, it does not automatically imply jit,
which is different behaviour from when it is used on a pattern. which is different behaviour from when it is used on a pattern.
The #popcopy command is analogous to the pushcopy modifier in that it The #popcopy command is analogous to the pushcopy modifier in that it
makes current a copy of the topmost stack pattern, leaving the original makes current a copy of the topmost stack pattern, leaving the original
still on the stack. still on the stack.
@ -1799,5 +1812,5 @@ AUTHOR
REVISION REVISION
Last updated: 21 December 2017 Last updated: 25 April 2018
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.


@ -132,8 +132,9 @@ sure both macros are undefined; an emulation function will then be used. */
/* Define to 1 if you have the <zlib.h> header file. */ /* Define to 1 if you have the <zlib.h> header file. */
#undef HAVE_ZLIB_H #undef HAVE_ZLIB_H
/* This limits the amount of memory that pcre2_match() may use while matching /* This limits the amount of memory that may be used while matching a pattern.
a pattern. The value is in kilobytes. */ It applies to both pcre2_match() and pcre2_dfa_match(). It does not apply
to JIT matching. The value is in kilobytes. */
#undef HEAP_LIMIT #undef HEAP_LIMIT
/* The value of LINK_SIZE determines the number of bytes used to store links /* The value of LINK_SIZE determines the number of bytes used to store links
@ -148,7 +149,8 @@ sure both macros are undefined; an emulation function will then be used. */
/* The value of MATCH_LIMIT determines the default number of times the /* The value of MATCH_LIMIT determines the default number of times the
pcre2_match() function can record a backtrack position during a single pcre2_match() function can record a backtrack position during a single
matching attempt. There is a runtime interface for setting a different matching attempt. The value is also used to limit a loop counter in
pcre2_dfa_match(). There is a runtime interface for setting a different
limit. The limit exists in order to catch runaway regular expressions that limit. The limit exists in order to catch runaway regular expressions that
take for ever to determine that they do not match. The default is set very take for ever to determine that they do not match. The default is set very
large so that it does not accidentally catch legitimate cases. */ large so that it does not accidentally catch legitimate cases. */
@ -161,7 +163,9 @@ sure both macros are undefined; an emulation function will then be used. */
MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it
must be less than the value of MATCH_LIMIT. The default is to use the same must be less than the value of MATCH_LIMIT. The default is to use the same
value as MATCH_LIMIT. There is a runtime method for setting a different value as MATCH_LIMIT. There is a runtime method for setting a different
limit. */ limit. In the case of pcre2_dfa_match(), this limit controls the depth of
the internal nested function calls that are used for pattern recursions,
lookarounds, and atomic groups. */
#undef MATCH_LIMIT_DEPTH #undef MATCH_LIMIT_DEPTH
/* This limit is parameterized just in case anybody ever wants to change it. /* This limit is parameterized just in case anybody ever wants to change it.


@ -292,6 +292,35 @@ typedef struct stateblock {
#define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int)) #define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int))
/* Before version 10.32 the recursive calls of internal_dfa_match() were passed
local working space and output vectors that were created on the stack. This has
caused issues for some patterns, especially in small-stack environments such as
Windows. A new scheme is now in use which sets up a vector on the stack, but if
this is too small, heap memory is used, up to the heap_limit. The main
parameters are all numbers of ints because the workspace is a vector of ints.
The size of the starting stack vector, DFA_START_RWS_SIZE, is in bytes, and is
defined in pcre2_internal.h so as to be available to pcre2test when it is
finding the minimum heap requirement for a match. */
#define OVEC_UNIT (sizeof(PCRE2_SIZE)/sizeof(int))
#define RWS_BASE_SIZE (DFA_START_RWS_SIZE/sizeof(int)) /* Stack vector */
#define RWS_RSIZE 1000 /* Work size for recursion */
#define RWS_OVEC_RSIZE (1000*OVEC_UNIT) /* Ovector for recursion */
#define RWS_OVEC_OSIZE (2*OVEC_UNIT) /* Ovector in other cases */
/* This structure is at the start of each workspace block. */
typedef struct RWS_anchor {
struct RWS_anchor *next;
unsigned int size; /* Number of ints */
unsigned int free; /* Number of ints */
} RWS_anchor;
#define RWS_ANCHOR_SIZE (sizeof(RWS_anchor)/sizeof(int))
/************************************************* /*************************************************
* Process a callout * * Process a callout *
@ -353,6 +382,61 @@ return (mb->callout)(cb, mb->callout_data);
/*************************************************
* Expand local workspace memory *
*************************************************/
/* This function is called when internal_dfa_match() is about to be called
recursively and there is insufficient working space left in the current
workspace block. If there's an existing next block, use it; otherwise get a new
block unless the heap limit is reached.
Arguments:
rwsptr pointer to block pointer (updated)
ovecsize space needed for an ovector
mb the match block
Returns: 0 rwsptr has been updated
!0 an error code
*/
static int
more_workspace(RWS_anchor **rwsptr, unsigned int ovecsize, dfa_match_block *mb)
{
RWS_anchor *rws = *rwsptr;
RWS_anchor *new;
if (rws->next != NULL)
{
new = rws->next;
}
/* All sizes are in units of sizeof(int), except for mb->heaplimit, which is in
kilobytes. */
else
{
unsigned int newsize = rws->size * 2;
unsigned int heapleft = (unsigned int)
(((1024/sizeof(int))*mb->heap_limit - mb->heap_used));
if (newsize > heapleft) newsize = heapleft;
if (newsize < RWS_RSIZE + ovecsize + RWS_ANCHOR_SIZE)
return PCRE2_ERROR_HEAPLIMIT;
new = mb->memctl.malloc(newsize*sizeof(int), mb->memctl.memory_data);
if (new == NULL) return PCRE2_ERROR_NOMEMORY;
mb->heap_used += newsize;
new->next = NULL;
new->size = newsize;
rws->next = new;
}
new->free = new->size - RWS_ANCHOR_SIZE;
*rwsptr = new;
return 0;
}
/************************************************* /*************************************************
* Match a Regular Expression - DFA engine * * Match a Regular Expression - DFA engine *
*************************************************/ *************************************************/
@ -431,7 +515,8 @@ internal_dfa_match(
uint32_t offsetcount, uint32_t offsetcount,
int *workspace, int *workspace,
int wscount, int wscount,
uint32_t rlevel) uint32_t rlevel,
int *RWS)
{ {
stateblock *active_states, *new_states, *temp_states; stateblock *active_states, *new_states, *temp_states;
stateblock *next_active_state, *next_new_state; stateblock *next_active_state, *next_new_state;
@ -2587,10 +2672,22 @@ for (;;)
case OP_ASSERTBACK: case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT: case OP_ASSERTBACK_NOT:
{ {
PCRE2_SPTR endasscode = code + GET(code, 1);
PCRE2_SIZE local_offsets[2];
int rc; int rc;
int local_workspace[1000]; int *local_workspace;
PCRE2_SIZE *local_offsets;
PCRE2_SPTR endasscode = code + GET(code, 1);
RWS_anchor *rws = (RWS_anchor *)RWS;
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
{
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
if (rc != 0) return rc;
RWS = (int *)rws;
}
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1); while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
@ -2600,10 +2697,13 @@ for (;;)
ptr, /* where we currently are */ ptr, /* where we currently are */
(PCRE2_SIZE)(ptr - start_subject), /* start offset */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */
local_offsets, /* offset vector */ local_offsets, /* offset vector */
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
local_workspace, /* workspace vector */ local_workspace, /* workspace vector */
sizeof(local_workspace)/sizeof(int), /* size of same */ RWS_RSIZE, /* size of same */
rlevel); /* function recursion level */ rlevel, /* function recursion level */
RWS); /* recursion workspace */
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc; if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc;
if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK)) if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
@ -2670,11 +2770,23 @@ for (;;)
else else
{ {
PCRE2_SIZE local_offsets[2];
int local_workspace[1000];
int rc; int rc;
int *local_workspace;
PCRE2_SIZE *local_offsets;
PCRE2_SPTR asscode = code + LINK_SIZE + 1; PCRE2_SPTR asscode = code + LINK_SIZE + 1;
PCRE2_SPTR endasscode = asscode + GET(asscode, 1); PCRE2_SPTR endasscode = asscode + GET(asscode, 1);
RWS_anchor *rws = (RWS_anchor *)RWS;
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
{
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
if (rc != 0) return rc;
RWS = (int *)rws;
}
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1); while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
@ -2684,10 +2796,13 @@ for (;;)
ptr, /* where we currently are */ ptr, /* where we currently are */
(PCRE2_SIZE)(ptr - start_subject), /* start offset */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */
local_offsets, /* offset vector */ local_offsets, /* offset vector */
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
local_workspace, /* workspace vector */ local_workspace, /* workspace vector */
sizeof(local_workspace)/sizeof(int), /* size of same */ RWS_RSIZE, /* size of same */
rlevel); /* function recursion level */ rlevel, /* function recursion level */
RWS); /* recursion work space */
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc; if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc;
if ((rc >= 0) == if ((rc >= 0) ==
@ -2702,13 +2817,25 @@ for (;;)
/*-----------------------------------------------------------------*/ /*-----------------------------------------------------------------*/
case OP_RECURSE: case OP_RECURSE:
{ {
int rc;
int *local_workspace;
PCRE2_SIZE *local_offsets;
RWS_anchor *rws = (RWS_anchor *)RWS;
dfa_recursion_info *ri; dfa_recursion_info *ri;
PCRE2_SIZE local_offsets[1000];
int local_workspace[1000];
PCRE2_SPTR callpat = start_code + GET(code, 1); PCRE2_SPTR callpat = start_code + GET(code, 1);
uint32_t recno = (callpat == mb->start_code)? 0 : uint32_t recno = (callpat == mb->start_code)? 0 :
GET2(callpat, 1 + LINK_SIZE); GET2(callpat, 1 + LINK_SIZE);
int rc;
if (rws->free < RWS_RSIZE + RWS_OVEC_RSIZE)
{
rc = more_workspace(&rws, RWS_OVEC_RSIZE, mb);
if (rc != 0) return rc;
RWS = (int *)rws;
}
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
local_workspace = ((int *)local_offsets) + RWS_OVEC_RSIZE;
rws->free -= RWS_RSIZE + RWS_OVEC_RSIZE;
/* Check for repeating a recursion without advancing the subject /* Check for repeating a recursion without advancing the subject
pointer. This should catch convoluted mutual recursions. (Some simple pointer. This should catch convoluted mutual recursions. (Some simple
@ -2732,11 +2859,13 @@ for (;;)
ptr, /* where we currently are */ ptr, /* where we currently are */
(PCRE2_SIZE)(ptr - start_subject), /* start offset */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */
local_offsets, /* offset vector */ local_offsets, /* offset vector */
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ RWS_OVEC_RSIZE/OVEC_UNIT, /* size of same */
local_workspace, /* workspace vector */ local_workspace, /* workspace vector */
sizeof(local_workspace)/sizeof(int), /* size of same */ RWS_RSIZE, /* size of same */
rlevel); /* function recursion level */ rlevel, /* function recursion level */
RWS); /* recursion workspace */
rws->free += RWS_RSIZE + RWS_OVEC_RSIZE;
mb->recursive = new_recursive.prevrec; /* Done this recursion */ mb->recursive = new_recursive.prevrec; /* Done this recursion */
/* Ran out of internal offsets */ /* Ran out of internal offsets */
@ -2782,10 +2911,25 @@ for (;;)
case OP_SCBRAPOS: case OP_SCBRAPOS:
case OP_BRAPOSZERO: case OP_BRAPOSZERO:
{ {
int rc;
int *local_workspace;
PCRE2_SIZE *local_offsets;
PCRE2_SIZE charcount, matched_count; PCRE2_SIZE charcount, matched_count;
PCRE2_SPTR local_ptr = ptr; PCRE2_SPTR local_ptr = ptr;
RWS_anchor *rws = (RWS_anchor *)RWS;
BOOL allow_zero; BOOL allow_zero;
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
{
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
if (rc != 0) return rc;
RWS = (int *)rws;
}
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
if (codevalue == OP_BRAPOSZERO) if (codevalue == OP_BRAPOSZERO)
{ {
allow_zero = TRUE; allow_zero = TRUE;
@ -2798,19 +2942,17 @@ for (;;)
for (matched_count = 0;; matched_count++) for (matched_count = 0;; matched_count++)
{ {
PCRE2_SIZE local_offsets[2]; rc = internal_dfa_match(
int local_workspace[1000];
int rc = internal_dfa_match(
mb, /* fixed match data */ mb, /* fixed match data */
code, /* this subexpression's code */ code, /* this subexpression's code */
local_ptr, /* where we currently are */ local_ptr, /* where we currently are */
(PCRE2_SIZE)(ptr - start_subject), /* start offset */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */
local_offsets, /* offset vector */ local_offsets, /* offset vector */
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
local_workspace, /* workspace vector */ local_workspace, /* workspace vector */
sizeof(local_workspace)/sizeof(int), /* size of same */ RWS_RSIZE, /* size of same */
rlevel); /* function recursion level */ rlevel, /* function recursion level */
RWS); /* recursion workspace */
/* Failed to match */ /* Failed to match */
@ -2827,6 +2969,8 @@ for (;;)
local_ptr += charcount; /* Advance temporary position ptr */ local_ptr += charcount; /* Advance temporary position ptr */
} }
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
/* At this point we have matched the subpattern matched_count /* At this point we have matched the subpattern matched_count
times, and local_ptr is pointing to the character after the end of the times, and local_ptr is pointing to the character after the end of the
last match. */ last match. */
@ -2869,19 +3013,35 @@ for (;;)
/*-----------------------------------------------------------------*/ /*-----------------------------------------------------------------*/
case OP_ONCE: case OP_ONCE:
{ {
PCRE2_SIZE local_offsets[2]; int rc;
int local_workspace[1000]; int *local_workspace;
PCRE2_SIZE *local_offsets;
RWS_anchor *rws = (RWS_anchor *)RWS;
int rc = internal_dfa_match( if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
{
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
if (rc != 0) return rc;
RWS = (int *)rws;
}
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
rc = internal_dfa_match(
mb, /* fixed match data */ mb, /* fixed match data */
code, /* this subexpression's code */ code, /* this subexpression's code */
ptr, /* where we currently are */ ptr, /* where we currently are */
(PCRE2_SIZE)(ptr - start_subject), /* start offset */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */
local_offsets, /* offset vector */ local_offsets, /* offset vector */
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
local_workspace, /* workspace vector */ local_workspace, /* workspace vector */
sizeof(local_workspace)/sizeof(int), /* size of same */ RWS_RSIZE, /* size of same */
rlevel); /* function recursion level */ rlevel, /* function recursion level */
RWS); /* recursion workspace */
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
if (rc >= 0) if (rc >= 0)
{ {
@ -3063,6 +3223,7 @@ pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length,
PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data, PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data,
pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount) pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount)
{ {
int rc;
const pcre2_real_code *re = (const pcre2_real_code *)code; const pcre2_real_code *re = (const pcre2_real_code *)code;
PCRE2_SPTR start_match; PCRE2_SPTR start_match;
@ -3071,9 +3232,9 @@ PCRE2_SPTR bumpalong_limit;
PCRE2_SPTR req_cu_ptr; PCRE2_SPTR req_cu_ptr;
BOOL utf, anchored, startline, firstline; BOOL utf, anchored, startline, firstline;
BOOL has_first_cu = FALSE; BOOL has_first_cu = FALSE;
BOOL has_req_cu = FALSE; BOOL has_req_cu = FALSE;
PCRE2_UCHAR first_cu = 0; PCRE2_UCHAR first_cu = 0;
PCRE2_UCHAR first_cu2 = 0; PCRE2_UCHAR first_cu2 = 0;
PCRE2_UCHAR req_cu = 0; PCRE2_UCHAR req_cu = 0;
@ -3088,6 +3249,17 @@ pcre2_callout_block cb;
dfa_match_block actual_match_block; dfa_match_block actual_match_block;
dfa_match_block *mb = &actual_match_block; dfa_match_block *mb = &actual_match_block;
/* Set up a starting block of memory for use during recursive calls to
internal_dfa_match(). By putting this on the stack, it minimizes resource use
in the case when it is not needed. If this is too small, more memory is
obtained from the heap. At the start of each block is an anchor structure. */
int base_recursion_workspace[RWS_BASE_SIZE];
RWS_anchor *rws = (RWS_anchor *)base_recursion_workspace;
rws->next = NULL;
rws->size = RWS_BASE_SIZE;
rws->free = RWS_BASE_SIZE - RWS_ANCHOR_SIZE;
/* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated /* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated
subject string. */ subject string. */
@ -3184,6 +3356,7 @@ if (mcontext == NULL)
mb->memctl = re->memctl; mb->memctl = re->memctl;
mb->match_limit = PRIV(default_match_context).match_limit; mb->match_limit = PRIV(default_match_context).match_limit;
mb->match_limit_depth = PRIV(default_match_context).depth_limit; mb->match_limit_depth = PRIV(default_match_context).depth_limit;
mb->heap_limit = PRIV(default_match_context).heap_limit;
} }
else else
{ {
@ -3198,6 +3371,7 @@ else
mb->memctl = mcontext->memctl; mb->memctl = mcontext->memctl;
mb->match_limit = mcontext->match_limit; mb->match_limit = mcontext->match_limit;
mb->match_limit_depth = mcontext->depth_limit; mb->match_limit_depth = mcontext->depth_limit;
mb->heap_limit = mcontext->heap_limit;
} }
if (mb->match_limit > re->limit_match) if (mb->match_limit > re->limit_match)
@ -3206,6 +3380,9 @@ if (mb->match_limit > re->limit_match)
if (mb->match_limit_depth > re->limit_depth) if (mb->match_limit_depth > re->limit_depth)
mb->match_limit_depth = re->limit_depth; mb->match_limit_depth = re->limit_depth;
if (mb->heap_limit > re->limit_heap)
mb->heap_limit = re->limit_heap;
mb->start_code = (PCRE2_UCHAR *)((uint8_t *)re + sizeof(pcre2_real_code)) + mb->start_code = (PCRE2_UCHAR *)((uint8_t *)re + sizeof(pcre2_real_code)) +
re->name_count * re->name_entry_size; re->name_count * re->name_entry_size;
mb->tables = re->tables; mb->tables = re->tables;
@ -3215,6 +3392,7 @@ mb->start_offset = start_offset;
mb->moptions = options; mb->moptions = options;
mb->poptions = re->overall_options; mb->poptions = re->overall_options;
mb->match_call_count = 0; mb->match_call_count = 0;
mb->heap_used = 0;
/* Process the \R and newline settings. */ /* Process the \R and newline settings. */
@ -3351,8 +3529,6 @@ a match. */
for (;;) for (;;)
{ {
int rc;
/* ----------------- Start of match optimizations ---------------- */ /* ----------------- Start of match optimizations ---------------- */
/* There are some optimizations that avoid running the match if a known /* There are some optimizations that avoid running the match if a known
@ -3544,7 +3720,7 @@ for (;;)
in characters, we treat it as code units to avoid spending too much time in characters, we treat it as code units to avoid spending too much time
in this optimization. */ in this optimization. */
if (end_subject - start_match < re->minlength) return PCRE2_ERROR_NOMATCH; if (end_subject - start_match < re->minlength) goto NOMATCH_EXIT;
/* If req_cu is set, we know that that code unit must appear in the /* If req_cu is set, we know that that code unit must appear in the
subject for the match to succeed. If the first code unit is set, req_cu subject for the match to succeed. If the first code unit is set, req_cu
@ -3621,7 +3797,8 @@ for (;;)
(uint32_t)match_data->oveccount * 2, /* actual size of same */ (uint32_t)match_data->oveccount * 2, /* actual size of same */
workspace, /* workspace vector */ workspace, /* workspace vector */
(int)wscount, /* size of same */ (int)wscount, /* size of same */
0); /* function recurse level */ 0, /* function recurse level */
base_recursion_workspace); /* initial workspace for recursion */
/* Anything other than "no match" means we are done, always; otherwise, carry /* Anything other than "no match" means we are done, always; otherwise, carry
on only if not anchored. */ on only if not anchored. */
@ -3637,7 +3814,7 @@ for (;;)
match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject); match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject);
match_data->startchar = (PCRE2_SIZE)(start_match - subject); match_data->startchar = (PCRE2_SIZE)(start_match - subject);
match_data->rc = rc; match_data->rc = rc;
return rc; goto EXIT;
} }
/* Advance to the next subject character unless we are at the end of a line /* Advance to the next subject character unless we are at the end of a line
@ -3668,8 +3845,18 @@ for (;;)
} /* "Bumpalong" loop */ } /* "Bumpalong" loop */
NOMATCH_EXIT:
rc = PCRE2_ERROR_NOMATCH;
return PCRE2_ERROR_NOMATCH; EXIT:
while (rws->next != NULL)
{
RWS_anchor *next = rws->next;
rws->next = next->next;
mb->memctl.free(next, mb->memctl.memory_data);
}
return rc;
} }
/* End of pcre2_dfa_match.c */ /* End of pcre2_dfa_match.c */
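
The scheme introduced in this file (a starting block on the stack, doubling heap blocks chained through an anchor, and a heap limit given in kilobytes) can be tried out in isolation. The following is a minimal sketch under stated assumptions, not the PCRE2 code: the names ws_anchor, ws_ctx, ws_more, ws_nest and ws_demo, the 2048-int base block and the single 1000-int frame size are all invented for illustration, and plain malloc()/free() stand in for the memctl functions. A union is used for the base block so that the anchor is suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FRAME_INTS 1000u    /* ints used by one nested call (assumed) */
#define BASE_INTS  2048u    /* ints in the on-stack starting block (assumed) */

typedef struct ws_anchor {
  struct ws_anchor *next;   /* next block in the chain, or NULL */
  unsigned int size;        /* ints in this block, anchor included */
  unsigned int free;        /* unused ints at the end of the block */
} ws_anchor;

#define ANCHOR_INTS ((unsigned int)(sizeof(ws_anchor)/sizeof(int)))

typedef struct ws_ctx {
  size_t heap_limit_kb;     /* heap limit, in kilobytes */
  size_t heap_used;         /* ints obtained from the heap so far */
} ws_ctx;

/* Step to the next block, allocating it if the chain ends here. Returns 0 on
success, -1 when the heap limit is reached, -2 when malloc() fails. */
static int ws_more(ws_anchor **p, ws_ctx *cx)
{
  ws_anchor *cur = *p, *nb = cur->next;
  if (nb == NULL)
    {
    size_t limit = (1024/sizeof(int)) * cx->heap_limit_kb;   /* in ints */
    size_t left = (limit > cx->heap_used)? limit - cx->heap_used : 0;
    size_t want = (size_t)cur->size * 2;
    if (want > left) want = left;
    if (want < FRAME_INTS + ANCHOR_INTS) return -1;
    nb = malloc(want * sizeof(int));
    if (nb == NULL) return -2;
    cx->heap_used += want;
    nb->next = NULL;
    nb->size = (unsigned int)want;
    cur->next = nb;
    }
  nb->free = nb->size - ANCHOR_INTS;   /* a reused block starts out empty */
  *p = nb;
  return 0;
}

/* Simulate "depth" nested calls, each carving one frame from the chain and
handing it back after the inner call returns. */
static int ws_nest(unsigned int depth, ws_anchor *blk, ws_ctx *cx)
{
  int rc, *frame;
  if (depth == 0) return 0;
  if (blk->free < FRAME_INTS && (rc = ws_more(&blk, cx)) != 0) return rc;
  frame = (int *)blk + blk->size - blk->free;
  blk->free -= FRAME_INTS;
  frame[0] = frame[FRAME_INTS - 1] = (int)depth;   /* touch both ends */
  rc = ws_nest(depth - 1, blk, cx);
  blk->free += FRAME_INTS;
  return rc;
}

/* Set up the on-stack base block, run the nesting, then free heap blocks. */
int ws_demo(unsigned int depth, size_t heap_limit_kb)
{
  union { ws_anchor a; int ints[BASE_INTS]; } base;
  ws_ctx cx = { heap_limit_kb, 0 };
  ws_anchor *rws = &base.a;
  int rc;
  rws->next = NULL;
  rws->size = BASE_INTS;
  rws->free = BASE_INTS - ANCHOR_INTS;
  rc = ws_nest(depth, rws, &cx);
  while (rws->next != NULL)
    {
    ws_anchor *next = rws->next;
    rws->next = next->next;
    free(next);
    }
  return rc;
}
```

With these assumed sizes the base block holds two frames, so ws_demo(2, 0) succeeds with a heap limit of zero while ws_demo(3, 0) reports the limit. Heap blocks are released only once, at the end of the outermost call, which mirrors the EXIT path added to pcre2_dfa_match() above.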


@ -253,6 +253,11 @@ maximum size of this can be limited. */
#define START_FRAMES_SIZE 20480 #define START_FRAMES_SIZE 20480
/* Similarly, for DFA matching, an initial internal workspace vector is
allocated on the stack. */
#define DFA_START_RWS_SIZE 30720
/* Define the default BSR convention. */ /* Define the default BSR convention. */
#ifdef BSR_ANYCRLF #ifdef BSR_ANYCRLF


@ -896,6 +896,8 @@ typedef struct dfa_match_block {
PCRE2_SPTR last_used_ptr; /* Latest consulted character */ PCRE2_SPTR last_used_ptr; /* Latest consulted character */
const uint8_t *tables; /* Character tables */ const uint8_t *tables; /* Character tables */
PCRE2_SIZE start_offset; /* The start offset value */ PCRE2_SIZE start_offset; /* The start offset value */
PCRE2_SIZE heap_limit; /* As it says */
PCRE2_SIZE heap_used; /* As it says */
uint32_t match_limit; /* As it says */ uint32_t match_limit; /* As it says */
uint32_t match_limit_depth; /* As it says */ uint32_t match_limit_depth; /* As it says */
uint32_t match_call_count; /* Number of calls of internal function */ uint32_t match_call_count; /* Number of calls of internal function */


@ -5760,6 +5760,8 @@ PCRE2_SET_HEAP_LIMIT(dat_context, max);
for (;;) for (;;)
{ {
uint32_t stack_start = 0;
if (errnumber == PCRE2_ERROR_HEAPLIMIT) if (errnumber == PCRE2_ERROR_HEAPLIMIT)
{ {
PCRE2_SET_HEAP_LIMIT(dat_context, mid); PCRE2_SET_HEAP_LIMIT(dat_context, mid);
@ -5775,6 +5777,7 @@ for (;;)
if ((dat_datctl.control & CTL_DFA) != 0) if ((dat_datctl.control & CTL_DFA) != 0)
{ {
stack_start = DFA_START_RWS_SIZE/1024;
if (dfa_workspace == NULL) if (dfa_workspace == NULL)
dfa_workspace = (int *)malloc(DFA_WS_DIMENSION*sizeof(int)); dfa_workspace = (int *)malloc(DFA_WS_DIMENSION*sizeof(int));
if (dfa_matched++ == 0) if (dfa_matched++ == 0)
@ -5789,11 +5792,21 @@ for (;;)
dat_datctl.options, match_data, PTR(dat_context)); dat_datctl.options, match_data, PTR(dat_context));
else else
{
stack_start = START_FRAMES_SIZE/1024;
PCRE2_MATCH(capcount, compiled_code, pp, ulen, dat_datctl.offset, PCRE2_MATCH(capcount, compiled_code, pp, ulen, dat_datctl.offset,
dat_datctl.options, match_data, PTR(dat_context)); dat_datctl.options, match_data, PTR(dat_context));
}
if (capcount == errnumber) if (capcount == errnumber)
{ {
if ((mid & 0x80000000u) != 0)
{
fprintf(outfile, "Can't find minimum %s limit: check pattern for "
"restriction\n", msg);
break;
}
min = mid; min = mid;
mid = (mid == max - 1)? max : (max != UINT32_MAX)? (min + max)/2 : mid*2; mid = (mid == max - 1)? max : (max != UINT32_MAX)? (min + max)/2 : mid*2;
} }
@ -5802,11 +5815,12 @@ for (;;)
capcount == PCRE2_ERROR_PARTIAL) capcount == PCRE2_ERROR_PARTIAL)
{ {
/* If we've not hit the error with a heap limit less than the size of the /* If we've not hit the error with a heap limit less than the size of the
initial stack frame vector, the heap is not being used, so the minimum initial stack frame vector (for pcre2_match()) or the initial stack
limit is zero; there's no need to go on. The other limits are always workspace vector (for pcre2_dfa_match()), the heap is not being used, so
greater than zero. */ the minimum limit is zero; there's no need to go on. The other limits are
always greater than zero. */
if (errnumber == PCRE2_ERROR_HEAPLIMIT && mid < START_FRAMES_SIZE/1024) if (errnumber == PCRE2_ERROR_HEAPLIMIT && mid < stack_start)
{ {
fprintf(outfile, "Minimum %s limit = 0\n", msg); fprintf(outfile, "Minimum %s limit = 0\n", msg);
break; break;
@ -6771,7 +6785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
PCRE2_SIZE end = pmatch[i].rm_eo; PCRE2_SIZE end = pmatch[i].rm_eo;
for (j = last_printed + 1; j < i; j++) for (j = last_printed + 1; j < i; j++)
fprintf(outfile, "%2d: <unset>\n", (int)j); fprintf(outfile, "%2d: <unset>\n", (int)j);
last_printed = i; last_printed = i;
if (start > end) if (start > end)
{ {
start = pmatch[i].rm_eo; start = pmatch[i].rm_eo;
@ -7139,18 +7153,16 @@ else for (gmatched = 0;; gmatched++)
(double)CLOCKS_PER_SEC); (double)CLOCKS_PER_SEC);
} }
/* Find the heap, match and depth limits if requested. The match and heap /* Find the heap, match and depth limits if requested. The depth and heap
limits are not relevant for DFA matching and the depth and heap limits are limits are not relevant for JIT. The return from check_match_limit() is the
not relevant for JIT. The return from check_match_limit() is the return from return from the final call to pcre2_match() or pcre2_dfa_match(). */
the final call to pcre2_match() or pcre2_dfa_match(). */
if ((dat_datctl.control & CTL_FINDLIMITS) != 0) if ((dat_datctl.control & CTL_FINDLIMITS) != 0)
{ {
capcount = 0; /* This stops compiler warnings */ capcount = 0; /* This stops compiler warnings */
if ((dat_datctl.control & CTL_DFA) == 0 && if (FLD(compiled_code, executable_jit) == NULL ||
(FLD(compiled_code, executable_jit) == NULL || (dat_datctl.options & PCRE2_NO_JIT) != 0)
(dat_datctl.options & PCRE2_NO_JIT) != 0))
{ {
(void)check_match_limit(pp, arg_ulen, PCRE2_ERROR_HEAPLIMIT, "heap"); (void)check_match_limit(pp, arg_ulen, PCRE2_ERROR_HEAPLIMIT, "heap");
} }
@ -7165,6 +7177,12 @@ else for (gmatched = 0;; gmatched++)
capcount = check_match_limit(pp, arg_ulen, PCRE2_ERROR_DEPTHLIMIT, capcount = check_match_limit(pp, arg_ulen, PCRE2_ERROR_DEPTHLIMIT,
"depth"); "depth");
} }
if (capcount == 0)
{
fprintf(outfile, "Matched, but offsets vector is too small to show all matches\n");
capcount = dat_datctl.oveccount;
}
} }
/* Otherwise just run a single match, setting up a callout if required (the /* Otherwise just run a single match, setting up a callout if required (the
@ -7877,7 +7895,7 @@ else
(void)PCRE2_CONFIG(PCRE2_CONFIG_NEWLINE, &optval); (void)PCRE2_CONFIG(PCRE2_CONFIG_NEWLINE, &optval);
print_newline_config(optval, FALSE); print_newline_config(optval, FALSE);
(void)PCRE2_CONFIG(PCRE2_CONFIG_BSR, &optval); (void)PCRE2_CONFIG(PCRE2_CONFIG_BSR, &optval);
printf(" \\R matches %s\n", printf(" \\R matches %s\n",
(optval == PCRE2_BSR_ANYCRLF)? "CR, LF, or CRLF only" : (optval == PCRE2_BSR_ANYCRLF)? "CR, LF, or CRLF only" :
"all Unicode newlines"); "all Unicode newlines");
(void)PCRE2_CONFIG(PCRE2_CONFIG_NEVER_BACKSLASH_C, &optval); (void)PCRE2_CONFIG(PCRE2_CONFIG_NEVER_BACKSLASH_C, &optval);
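
The search that check_match_limit() performs is only partly visible in the hunks above, so the following is a simplified reconstruction rather than the pcre2test code. It assumes the loop starts at 64 and doubles until a match succeeds, then bisects; limit_error() is a hypothetical stand-in for running the match with a given limit, SEARCH_FAILED is a sentinel invented here, and the mid == 0 test is an extra guard added for the sketch. The top-bit check corresponds to the new "Can't find minimum" message, and stack_start to the early "limit = 0" exit.

```c
#include <assert.h>
#include <stdint.h>

#define SEARCH_FAILED UINT32_MAX   /* sentinel invented for this sketch */

/* Stand-in for a match run: the limit error is reported whenever the limit
is below what the match really needs. */
static int limit_error(uint32_t limit, uint32_t needed)
{
return limit < needed;
}

/* Return the smallest limit at which the match no longer hits the error, 0
if the limit is not in play at all, or SEARCH_FAILED if doubling reaches the
top bit while the error is still being reported. */
uint32_t find_min_limit(uint32_t needed, uint32_t stack_start)
{
uint32_t min = 0, mid = 64, max = UINT32_MAX;
for (;;)
  {
  if (limit_error(mid, needed))
    {
    if ((mid & 0x80000000u) != 0) return SEARCH_FAILED;
    min = mid;
    mid = (mid == max - 1)? max : (max != UINT32_MAX)? (min + max)/2 : mid*2;
    }
  else
    {
    if (mid < stack_start || mid == 0) return 0;   /* limit never reached */
    if (mid == min + 1) return mid;                /* boundary found */
    max = mid;
    mid = (min + max)/2;
    }
  }
}
```

Without the top-bit check, a pattern that forces the error at every limit (for example one containing its own (*LIMIT_HEAP=0) setting) would keep doubling mid until it overflowed. For a DFA heap search, stack_start would be DFA_START_RWS_SIZE/1024, which is 30.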

8
testdata/testinput6 vendored

@ -4874,6 +4874,14 @@
\= Expect depth limit exceeded \= Expect depth limit exceeded
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
/(*LIMIT_HEAP=0)^((.)(?1)|.)$/
\= Expect heap limit exceeded
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
/(*LIMIT_HEAP=50000)^((.)(?1)|.)$/
\= Expect success
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
/(02-)?[0-9]{3}-[0-9]{3}/ /(02-)?[0-9]{3}-[0-9]{3}/
02-123-123 02-123-123

11
testdata/testoutput6 vendored

@ -7667,12 +7667,23 @@ No match
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
Failed: error -53: matching depth limit exceeded Failed: error -53: matching depth limit exceeded
/(*LIMIT_HEAP=0)^((.)(?1)|.)$/
\= Expect heap limit exceeded
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
Failed: error -63: heap limit exceeded
/(*LIMIT_HEAP=50000)^((.)(?1)|.)$/
\= Expect success
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
0: a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
/(02-)?[0-9]{3}-[0-9]{3}/ /(02-)?[0-9]{3}-[0-9]{3}/
02-123-123 02-123-123
0: 02-123-123 0: 02-123-123
/^(a(?2))(b)(?1)/ /^(a(?2))(b)(?1)/
abbab\=find_limits abbab\=find_limits
Minimum heap limit = 0
Minimum match limit = 4 Minimum match limit = 4
Minimum depth limit = 2 Minimum depth limit = 2
0: abbab 0: abbab