Re-factor pcre2_dfa_match() to use the heap instead of the stack for workspace
vectors when doing recursive function calls.
parent fb413521fc
commit 75747ebb11
ChangeLog (10 lines changed)
@@ -50,7 +50,15 @@ offset is set zero for early errors.

 (c) Support for non-C99 snprintf() that returns -1 in the overflow case.

-11. Minor tidy of pcre2_dfa_matgch() code.
+11. Minor tidy of pcre2_dfa_match() code.
+
+12. Refactored pcre2_dfa_match() so that the internal recursive calls no longer
+use the stack for local workspace and local ovectors. Instead, an initial block
+of stack is reserved, but if this is insufficient, heap memory is used. The
+heap limit parameter now applies to pcre2_dfa_match().
+
+13. If a "find limits" test of DFA matching in pcre2test resulted in too many
+matches for the ovector, no matches were displayed.


 Version 10.31 12-February-2018
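ChangeLog item 12 describes a scheme in which an initial block is reserved on the stack, the heap is used only when that block runs out, and heap use is charged against the heap limit. The standalone C sketch below illustrates that idea only. It is not the PCRE2 source: every name in it (workspace_ctl, nested_call, run_sketch, the SKETCH_ constants, the frame and block sizes) is hypothetical, and it assumes a kilobyte means 1024 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is an illustration, not PCRE2 code. */
enum { SKETCH_OK = 0, SKETCH_ERR_HEAPLIMIT = -1, SKETCH_ERR_NOMEMORY = -2 };

#define FRAME_INTS 64          /* workspace ints needed by each nested call */
#define STACK_BLOCK_INTS 256   /* size of the block reserved on the stack */

typedef struct {
    int *stack_block;       /* block reserved once by the top-level caller */
    size_t stack_used;      /* ints already handed out from stack_block */
    size_t heap_used;       /* bytes currently taken from the heap */
    size_t heap_limit_kib;  /* heap limit, in kilobytes (assumed 1024 bytes) */
    size_t heap_peak;       /* largest heap_used seen, for inspection */
} workspace_ctl;

static int nested_call(workspace_ctl *ctl, int depth_remaining)
{
    const size_t bytes = FRAME_INTS * sizeof(int);
    int *frame;
    int from_heap = 0;
    int rc = SKETCH_OK;

    assert(depth_remaining >= 0);

    if (ctl->stack_used + FRAME_INTS <= STACK_BLOCK_INTS) {
        /* Enough room left in the reserved stack block. */
        frame = ctl->stack_block + ctl->stack_used;
        ctl->stack_used += FRAME_INTS;
    } else {
        /* Stack block exhausted: use the heap, subject to the limit. */
        if (ctl->heap_used + bytes > ctl->heap_limit_kib * 1024)
            return SKETCH_ERR_HEAPLIMIT;
        frame = malloc(bytes);
        if (frame == NULL) return SKETCH_ERR_NOMEMORY;
        ctl->heap_used += bytes;
        if (ctl->heap_used > ctl->heap_peak) ctl->heap_peak = ctl->heap_used;
        from_heap = 1;
    }

    frame[0] = depth_remaining;            /* stand-in for real matching work */
    if (depth_remaining > 0)
        rc = nested_call(ctl, depth_remaining - 1);

    /* Release this call's workspace on the way out, even after an error. */
    if (from_heap) {
        free(frame);
        ctl->heap_used -= bytes;
    } else {
        ctl->stack_used -= FRAME_INTS;
    }
    return rc;
}

/* Runs `nesting` nested calls below the top-level one. */
int run_sketch(int nesting, size_t heap_limit_kib, size_t *heap_peak)
{
    int block[STACK_BLOCK_INTS];
    workspace_ctl ctl = { block, 0, 0, heap_limit_kib, 0 };
    int rc = nested_call(&ctl, nesting);
    if (heap_peak != NULL) *heap_peak = ctl.heap_peak;
    return rc;
}
```

With these sizes the stack block holds four frames, so run_sketch(3, 0, NULL) completes without touching the heap, while one more level of nesting with a heap limit of zero returns SKETCH_ERR_HEAPLIMIT.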
README (12 lines changed)
@@ -241,9 +241,11 @@ library. They are also documented in the pcre2build man page.
 discussion in the pcre2api man page (search for pcre2_set_match_limit).

 . There is a separate counter that limits the depth of nested backtracking
-during a matching process, which indirectly limits the amount of heap memory
-that is used. This also has a default of ten million, which is essentially
-"unlimited". You can change the default by setting, for example,
+(pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
+matching process, which indirectly limits the amount of heap memory that is
+used, and in the case of pcre2_dfa_match() the amount of stack as well. This
+counter also has a default of ten million, which is essentially "unlimited".
+You can change the default by setting, for example,

 --with-match-limit-depth=5000

@@ -251,7 +253,7 @@ library. They are also documented in the pcre2build man page.
 pcre2_set_depth_limit).

 . You can also set an explicit limit on the amount of heap memory used by
-the pcre2_match() interpreter:
+the pcre2_match() and pcre2_dfa_match() interpreters:

 --with-heap-limit=500

@@ -885,4 +887,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 25 February 2018
+Last updated: 27 April 2018
configure.ac (21 lines changed)
@@ -142,7 +142,7 @@ AC_ARG_ENABLE(jit,
 AS_HELP_STRING([--enable-jit],
 [enable Just-In-Time compiling support]),
 , enable_jit=no)
-
+
 # This code enables JIT if the hardware supports it.
 if test "$enable_jit" = "auto"; then
 AC_LANG(C)
@@ -718,10 +718,11 @@ AC_DEFINE_UNQUOTED([PARENS_NEST_LIMIT], [$with_parens_nest_limit], [
 AC_DEFINE_UNQUOTED([MATCH_LIMIT], [$with_match_limit], [
 The value of MATCH_LIMIT determines the default number of times the
 pcre2_match() function can record a backtrack position during a single
-matching attempt. There is a runtime interface for setting a different limit.
-The limit exists in order to catch runaway regular expressions that take for
-ever to determine that they do not match. The default is set very large so
-that it does not accidentally catch legitimate cases.])
+matching attempt. The value is also used to limit a loop counter in
+pcre2_dfa_match(). There is a runtime interface for setting a different
+limit. The limit exists in order to catch runaway regular expressions that
+take for ever to determine that they do not match. The default is set very
+large so that it does not accidentally catch legitimate cases.])

 # --with-match-limit-recursion is an obsolete synonym for --with-match-limit-depth

@@ -745,11 +746,15 @@ AC_DEFINE_UNQUOTED([MATCH_LIMIT_DEPTH], [$with_match_limit_depth], [
 the maximum amount of heap memory that is used. The value of
 MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it must
 be less than the value of MATCH_LIMIT. The default is to use the same value
-as MATCH_LIMIT. There is a runtime method for setting a different limit.])
+as MATCH_LIMIT. There is a runtime method for setting a different limit. In
+the case of pcre2_dfa_match(), this limit controls the depth of the internal
+nested function calls that are used for pattern recursions, lookarounds, and
+atomic groups.])

 AC_DEFINE_UNQUOTED([HEAP_LIMIT], [$with_heap_limit], [
-This limits the amount of memory that pcre2_match() may use while matching
-a pattern. The value is in kilobytes.])
+This limits the amount of memory that may be used while matching
+a pattern. It applies to both pcre2_match() and pcre2_dfa_match(). It does
+not apply to JIT matching. The value is in kilobytes.])

 AC_DEFINE([MAX_NAME_SIZE], [32], [
 This limit is parameterized just in case anybody ever wants to
@@ -10,6 +10,7 @@ This document contains the following sections:
 Calling conventions in Windows environments
 Comments about Win32 builds
 Building PCRE2 on Windows with CMake
+Building PCRE2 on Windows with Visual Studio
 Testing with RunTest.bat
 Building PCRE2 on native z/OS and z/VM

@@ -328,6 +329,18 @@ cache can be deleted by selecting "File > Delete Cache".
 most recent build configuration is targeted by the tests. A summary of
 test results is presented. Complete test output is subsequently
 available for review in Testing\Temporary under your build dir.


+BUILDING PCRE2 ON WINDOWS WITH VISUAL STUDIO
+
+The code currently cannot be compiled without a stdint.h header, which is
+available only in relatively recent versions of Visual Studio. However, this
+portable and permissively-licensed implementation of the header worked without
+issue:
+
+http://www.azillionmonkeys.com/qed/pstdint.h
+
+Just rename it and drop it into the top level of the build tree.
+
+
 TESTING WITH RUNTEST.BAT
@@ -382,6 +395,6 @@ Everything in that location, source and executable, is in EBCDIC and native
 z/OS file formats. The port provides an API for LE languages such as COBOL and
 for the z/OS and z/VM versions of the Rexx languages.

-===============================
-Last Updated: 13 September 2017
-===============================
+===========================
+Last Updated: 19 April 2018
+===========================
@@ -241,9 +241,11 @@ library. They are also documented in the pcre2build man page.
 discussion in the pcre2api man page (search for pcre2_set_match_limit).

 . There is a separate counter that limits the depth of nested backtracking
-during a matching process, which indirectly limits the amount of heap memory
-that is used. This also has a default of ten million, which is essentially
-"unlimited". You can change the default by setting, for example,
+(pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
+matching process, which indirectly limits the amount of heap memory that is
+used, and in the case of pcre2_dfa_match() the amount of stack as well. This
+counter also has a default of ten million, which is essentially "unlimited".
+You can change the default by setting, for example,

 --with-match-limit-depth=5000

@@ -251,7 +253,7 @@ library. They are also documented in the pcre2build man page.
 pcre2_set_depth_limit).

 . You can also set an explicit limit on the amount of heap memory used by
-the pcre2_match() interpreter:
+the pcre2_match() and pcre2_dfa_match() interpreters:

 --with-heap-limit=500

@@ -885,4 +887,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 25 February 2018
+Last updated: 27 April 2018
@@ -46,9 +46,9 @@ just once (except when processing lookaround assertions). This function is
 <i>wscount</i> Number of elements in the vector
 </pre>
 For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
-up a callout function or specify the match and/or the recursion depth limits.
-The <i>length</i> and <i>startoffset</i> values are code units, not characters.
-The options are:
+up a callout function or specify the heap limit or the match or the recursion
+depth limits. The <i>length</i> and <i>startoffset</i> values are code units, not
+characters. The options are:
 <pre>
 PCRE2_ANCHORED Match only at the first position
 PCRE2_ENDANCHORED Pattern can match only at end of subject
@@ -951,14 +951,15 @@ offset limit. In other words, whichever limit comes first is used.
 <br>
 The <i>heap_limit</i> parameter specifies, in units of kilobytes, the maximum
 amount of heap memory that <b>pcre2_match()</b> may use to hold backtracking
-information when running an interpretive match. This limit does not apply to
-matching with the JIT optimization, which has its own memory control
-arrangements (see the
+information when running an interpretive match. This limit also applies to
+<b>pcre2_dfa_match()</b>, which may use the heap when processing patterns with a
+lot of nested pattern recursion or lookarounds or atomic groups. This limit
+does not apply to matching with the JIT optimization, which has its own memory
+control arrangements (see the
 <a href="pcre2jit.html"><b>pcre2jit</b></a>
-documentation for more details), nor does it apply to <b>pcre2_dfa_match()</b>.
-If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is
-returned. The default limit is set when PCRE2 is built; the default default is
-very large and is essentially "unlimited".
+documentation for more details). If the limit is reached, the negative error
+code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
+built; the default default is very large and is essentially "unlimited".
 </P>
 <P>
 A value for the heap limit may also be supplied by an item at the start of a
@@ -978,6 +979,12 @@ Heap memory is used only if the initial vector is too small. If the heap limit
 is set to a value less than 21 (in particular, zero) no heap memory will be
 used. In this case, only patterns that do not have a lot of nested backtracking
 can be successfully processed.
+</P>
+<P>
+Similarly, for <b>pcre2_dfa_match()</b>, a vector on the system stack is used
+when processing pattern recursions, lookarounds, or atomic groups, and only if
+this is not big enough is heap memory used. In this case, too, setting a value
+of zero disables the use of the heap.
 <br>
 <br>
 <b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
@@ -1035,11 +1042,22 @@ backtracking.
 <P>
 The depth limit is not relevant, and is ignored, when matching is done using
 JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which
-uses it to limit the depth of internal recursive function calls that implement
-atomic groups, lookaround assertions, and pattern recursions. This is,
-therefore, an indirect limit on the amount of system stack that is used. A
-recursive pattern such as /(.)(?1)/, when matched to a very long string using
-<b>pcre2_dfa_match()</b>, can use a great deal of stack.
+uses it to limit the depth of nested internal recursive function calls that
+implement atomic groups, lookaround assertions, and pattern recursions. This
+limits, indirectly, the amount of system stack this is used. It was more useful
+in versions before 10.32, when stack memory was used for local workspace
+vectors for recursive function calls. From version 10.32, only local variables
+are allocated on the stack and as each call uses only a few hundred bytes, even
+a small stack can support quite a lot of recursion.
+</P>
+<P>
+If the depth of internal recursive function calls is great enough, local
+workspace vectors are allocated on the heap from version 10.32 onwards, so the
+depth limit also indirectly limits the amount of heap memory that is used. A
+recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string
+using <b>pcre2_dfa_match()</b>, can use a great deal of memory. However, it is
+probably better to limit heap usage directly by calling
+<b>pcre2_set_heap_limit()</b>.
 </P>
 <P>
 The default value for the depth limit can be set when PCRE2 is built; the
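As a rough illustration of the depth limit described in the hunk above, the standalone C sketch below counts nested calls and gives up once the count exceeds the limit. It is not PCRE2 code: nested_group, match_with_depth_limit and the DEPTH_ constants are hypothetical stand-ins, and the choice of a strict "greater than" comparison is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical return codes; the real library uses its own error numbers. */
enum { DEPTH_OK = 0, DEPTH_ERR_LIMIT = -1 };

/* Stand-in for an internal matcher that calls itself once per nested
   recursion, lookaround, or atomic group still to be processed. */
static int nested_group(uint32_t groups_left, uint32_t depth,
                        uint32_t depth_limit)
{
    if (depth > depth_limit) return DEPTH_ERR_LIMIT;  /* assumed comparison */
    if (groups_left == 0) return DEPTH_OK;
    return nested_group(groups_left - 1, depth + 1, depth_limit);
}

/* The top-level call runs at depth 0; each nested group adds one level. */
int match_with_depth_limit(uint32_t nesting, uint32_t depth_limit)
{
    assert(nesting < 100000);   /* keep the sketch's own recursion modest */
    return nested_group(nesting, 0, depth_limit);
}
```

Here match_with_depth_limit(5, 5) succeeds and match_with_depth_limit(6, 5) reports the limit, which is the sense in which the depth limit bounds nested calls rather than memory directly.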
@@ -1096,15 +1114,16 @@ and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
 PCRE2_CONFIG_DEPTHLIMIT
 </pre>
 The output is a uint32_t integer that gives the default limit for the depth of
-nested backtracking in <b>pcre2_match()</b> or the depth of nested recursions
-and lookarounds in <b>pcre2_dfa_match()</b>. Further details are given with
-<b>pcre2_set_depth_limit()</b> above.
+nested backtracking in <b>pcre2_match()</b> or the depth of nested recursions,
+lookarounds, and atomic groups in <b>pcre2_dfa_match()</b>. Further details are
+given with <b>pcre2_set_depth_limit()</b> above.
 <pre>
 PCRE2_CONFIG_HEAPLIMIT
 </pre>
 The output is a uint32_t integer that gives, in kilobytes, the default limit
-for the amount of heap memory used by <b>pcre2_match()</b>. Further details are
-given with <b>pcre2_set_heap_limit()</b> above.
+for the amount of heap memory used by <b>pcre2_match()</b> or
+<b>pcre2_dfa_match()</b>. Further details are given with
+<b>pcre2_set_heap_limit()</b> above.
 <pre>
 PCRE2_CONFIG_JIT
 </pre>
@@ -3510,17 +3529,7 @@ capture.
 Calls to the convenience functions that extract substrings by name
 return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
 DFA match. The convenience functions that extract substrings by number never
-return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are
-slightly different:
-<pre>
-PCRE2_ERROR_UNAVAILABLE
-</pre>
-The ovector is not big enough to include a slot for the given substring number.
-<pre>
-PCRE2_ERROR_UNSET
-</pre>
-There is a slot in the ovector for this substring, but there were insufficient
-matches to fill it.
+return PCRE2_ERROR_NOSUBSTRING.
 </P>
 <P>
 The matched strings are stored in the ovector in reverse order of length; that
@@ -3594,9 +3603,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 31 December 2017
+Last updated: 27 April 2018
 <br>
-Copyright © 1997-2017 University of Cambridge.
+Copyright © 1997-2018 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
@@ -295,9 +295,10 @@ change this by a setting such as
 --with-heap-limit=500
 </pre>
 which limits the amount of heap to 500 kilobytes. This limit applies only to
-interpretive matching in pcre2_match(). It does not apply when JIT (which has
-its own memory arrangements) is used, nor does it apply to
-<b>pcre2_dfa_match()</b>.
+interpretive matching in <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, which
+may also use the heap for internal workspace when processing complicated
+patterns. This limit does not apply when JIT (which has its own memory
+arrangements) is used.
 </P>
 <P>
 You can also explicitly limit the depth of nested backtracking in the
@@ -573,7 +574,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC25" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 25 February 2018
+Last updated: 26 April 2018
 <br>
 Copyright © 1997-2018 University of Cambridge.
 <br>
@@ -310,10 +310,12 @@ PCRE2_UNSET.
 </P>
 <P>
 For DFA matching, the <i>offset_vector</i> field points to the ovector that was
-passed to the matching function in the match data block, but it holds no useful
-information at callout time because <b>pcre2_dfa_match()</b> does not support
-substring capturing. The value of <i>capture_top</i> is always 1 and the value
-of <i>capture_last</i> is always 0 for DFA matching.
+passed to the matching function in the match data block for callouts at the top
+level, but to an internal ovector during the processing of pattern recursions,
+lookarounds, and atomic groups. However, these ovectors hold no useful
+information because <b>pcre2_dfa_match()</b> does not support substring
+capturing. The value of <i>capture_top</i> is always 1 and the value of
+<i>capture_last</i> is always 0 for DFA matching.
 </P>
 <P>
 The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
@@ -461,9 +463,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 22 December 2017
+Last updated: 26 April 2018
 <br>
-Copyright © 1997-2017 University of Cambridge.
+Copyright © 1997-2018 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
@@ -173,12 +173,12 @@ the application to apply the JIT optimization by calling
 Setting match resource limits
 </b><br>
 <P>
-The pcre2_match() function contains a counter that is incremented every time it
-goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
-this counter, which therefore limits the amount of computing resource used for
-a match. The maximum depth of nested backtracking can also be limited; this
-indirectly restricts the amount of heap memory that is used, but there is also
-an explicit memory limit that can be set.
+The <b>pcre2_match()</b> function contains a counter that is incremented every
+time it goes round its main loop. The caller of <b>pcre2_match()</b> can set a
+limit on this counter, which therefore limits the amount of computing resource
+used for a match. The maximum depth of nested backtracking can also be limited;
+this indirectly restricts the amount of heap memory that is used, but there is
+also an explicit memory limit that can be set.
 </P>
 <P>
 These facilities are provided to catch runaway matches that are provoked by
@@ -195,20 +195,22 @@ where d is any number of decimal digits. However, the value of the setting must
 be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
 for it to have any effect. In other words, the pattern writer can lower the
 limits set by the programmer, but not raise them. If there is more than one
-setting of one of these limits, the lower value is used.
+setting of one of these limits, the lower value is used. The heap limit is
+specified in kilobytes.
 </P>
 <P>
 Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 still recognized for backwards compatibility.
 </P>
 <P>
-The heap limit applies only when the <b>pcre2_match()</b> interpreter is used
-for matching. It does not apply to JIT or DFA matching. The match limit is used
-(but in a different way) when JIT is being used, or when
-<b>pcre2_dfa_match()</b> is called, to limit computing resource usage by those
-matching functions. The depth limit is ignored by JIT but is relevant for DFA
-matching, which uses function recursion for recursions within the pattern. In
-this case, the depth limit controls the amount of system stack that is used.
+The heap limit applies only when the <b>pcre2_match()</b> or
+<b>pcre2_dfa_match()</b> interpreters are used for matching. It does not apply
+to JIT. The match limit is used (but in a different way) when JIT is being
+used, or when <b>pcre2_dfa_match()</b> is called, to limit computing resource
+usage by those matching functions. The depth limit is ignored by JIT but is
+relevant for DFA matching, which uses function recursion for recursions within
+the pattern and for lookaround assertions and atomic groups. In this case, the
+depth limit controls the depth of such recursion.
 <a name="newlines"></a></P>
 <br><b>
 Newline conventions
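The rule stated in the hunk above (a limit set inside the pattern has an effect only if it is lower than the caller's limit, and the lowest of several settings wins) amounts to taking a minimum. A minimal C sketch follows; effective_limit is a hypothetical helper name used for illustration and is not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combines the caller's limit with any number of
   in-pattern settings. A pattern setting can lower the limit, never raise it. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_limits, size_t count)
{
    uint32_t limit = caller_limit;
    assert(pattern_limits != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];   /* lower value wins */
    }
    return limit;
}
```

For example, with a caller limit of 1000 and a single in-pattern setting of 20000000, the result stays at 1000, because the pattern writer cannot raise the programmer's limit.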
@@ -2818,11 +2820,6 @@ matched at the top level, its final captured value is unset, even if it was
 (temporarily) set at a deeper level during the matching process.
 </P>
 <P>
-If there are more than 15 capturing parentheses in a pattern, PCRE2 has to
-obtain extra memory from the heap to store data during a recursion. If no
-memory can be obtained, the match fails with the PCRE2_ERROR_NOMEMORY error.
-</P>
-<P>
 Do not confuse the (?R) item with the condition (R), which tests for recursion.
 Consider this pattern, which matches text in angle brackets, allowing for
 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
@@ -3479,9 +3476,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 12 September 2017
+Last updated: 25 April 2018
 <br>
-Copyright © 1997-2017 University of Cambridge.
+Copyright © 1997-2018 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
@@ -93,9 +93,17 @@ may also reduce the memory requirements.
 <P>
 In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
 function calls, but only for processing atomic groups, lookaround assertions,
-and recursion within the pattern. Too much nested recursion may cause stack
-issues. The "match depth" parameter can be used to limit the depth of function
-recursion in <b>pcre2_dfa_match()</b>.
+and recursion within the pattern. The original version of the code used to
+allocate quite large internal workspace vectors on the stack, which caused some
+problems for some patterns in environments with small stacks. From release
+10.32 the code for <b>pcre2_dfa_match()</b> has been re-factored to use heap
+memory when necessary for internal workspace when recursing, though recursive
+function calls are still used.
+</P>
+<P>
+The "match depth" parameter can be used to limit the depth of function
+recursion, and the "match heap" parameter to limit heap memory in
+<b>pcre2_dfa_match()</b>.
 </P>
 <br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
 <P>
@@ -244,9 +252,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC6" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 08 April 2017
+Last updated: 25 April 2018
 <br>
-Copyright © 1997-2017 University of Cambridge.
+Copyright © 1997-2018 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
@@ -1199,7 +1199,7 @@ pattern.
 get=<number or name> extract captured substring
 getall extract all captured substrings
 /g global global matching
-heap_limit=<n> set a limit on heap memory
+heap_limit=<n> set a limit on heap memory (Kbytes)
 jitstack=<n> set size of JIT stack
 mark show mark values
 match_limit=<n> set a match limit
@@ -1438,20 +1438,17 @@ Finding minimum limits
 <P>
 If the <b>find_limits</b> modifier is present on a subject line, <b>pcre2test</b>
 calls the relevant matching function several times, setting different values in
-the match context via <b>pcre2_set_heap_limit(), \fBpcre2_set_match_limit()</b>,
-or <b>pcre2_set_depth_limit()</b> until it finds the minimum values for each
-parameter that allows the match to complete without error.
+the match context via <b>pcre2_set_heap_limit()</b>,
+<b>pcre2_set_match_limit()</b>, or <b>pcre2_set_depth_limit()</b> until it finds
+the minimum values for each parameter that allows the match to complete without
+error. If JIT is being used, only the match limit is relevant.
 </P>
 <P>
-If JIT is being used, only the match limit is relevant. If DFA matching is
-being used, only the depth limit is relevant.
-</P>
-<P>
-The <i>match_limit</i> number is a measure of the amount of backtracking
-that takes place, and learning the minimum value can be instructive. For most
-simple matches, the number is quite small, but for patterns with very large
-numbers of matching possibilities, it can become large very quickly with
-increasing length of subject string.
+When using this modifier, the pattern should not contain any limit settings
+such as (*LIMIT_MATCH=...) within it. If such a setting is present and is
+lower than the minimum matching value, the minimum value cannot be found
+because <b>pcre2_set_match_limit()</b> etc. are only able to reduce the value of
+an in-pattern limit; they cannot increase it.
 </P>
 <P>
 For non-DFA matching, the minimum <i>depth_limit</i> number is a measure of how
@@ -1460,6 +1457,22 @@ searched). In the case of DFA matching, <i>depth_limit</i> controls the depth of
 recursive calls of the internal function that is used for handling pattern
 recursion, lookaround assertions, and atomic groups.
 </P>
+<P>
+For non-DFA matching, the <i>match_limit</i> number is a measure of the amount
+of backtracking that takes place, and learning the minimum value can be
+instructive. For most simple matches, the number is quite small, but for
+patterns with very large numbers of matching possibilities, it can become large
+very quickly with increasing length of subject string. In the case of DFA
+matching, <i>match_limit</i> controls the total number of calls, both recursive
+and non-recursive, to the internal matching function, thus controlling the
+overall amount of computing resource that is used.
+</P>
+<P>
+For both kinds of matching, the <i>heap_limit</i> number (which is in kilobytes)
+limits the amount of heap memory used for matching. A value of zero disables
+the use of any heap memory; many simple pattern matches can be done without
+using the heap, so this is not an unreasonable setting.
+</P>
 <br><b>
 Showing MARK names
 </b><br>
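The find_limits text in the hunks above says only that the matching function is called repeatedly with different limit values until the minimum workable value is found. One way to do that, assuming a match succeeds for every value at or above some threshold, is to double the limit until the match succeeds and then bisect. The sketch below is standalone C with hypothetical names (try_match_fn, needs_at_least, find_min_limit); it is not taken from pcre2test, whose actual search strategy is not shown in this diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback type: returns 0 if the match completes under `limit`. */
typedef int (*try_match_fn)(uint32_t limit, void *arg);

/* Stand-in for a real match attempt: succeeds once limit >= *(uint32_t *)arg. */
int needs_at_least(uint32_t limit, void *arg)
{
    return (limit >= *(const uint32_t *)arg) ? 0 : -1;
}

/* Returns the smallest limit in [0, max_limit] for which try_match succeeds,
   or UINT32_MAX if even max_limit fails. Assumes that success is monotonic:
   if one limit works, every larger limit works too. */
uint32_t find_min_limit(try_match_fn try_match, void *arg, uint32_t max_limit)
{
    uint32_t lo = 0;   /* largest value known to fail */
    uint32_t hi = 1;   /* candidate being tried */

    assert(max_limit >= 1);
    if (try_match(0, arg) == 0) return 0;

    /* Phase 1: double the candidate until it succeeds or max_limit fails. */
    while (try_match(hi, arg) != 0) {
        if (hi >= max_limit) return UINT32_MAX;
        lo = hi;
        hi = (hi > max_limit / 2) ? max_limit : hi * 2;
    }

    /* Phase 2: lo fails and hi succeeds; bisect until they are adjacent. */
    while (hi - lo > 1) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (try_match(mid, arg) == 0) hi = mid; else lo = mid;
    }
    return hi;
}
```

In real use the callback would set the limit in a match context and run the match; the stand-in needs_at_least simply compares against a threshold so that the search itself can be exercised in isolation.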
@@ -1476,13 +1489,14 @@ Showing memory usage
 <P>
 The <b>memory</b> modifier causes <b>pcre2test</b> to log the sizes of all heap
 memory allocation and freeing calls that occur during a call to
-<b>pcre2_match()</b>. These occur only when a match requires a bigger vector
-than the default for remembering backtracking points. In many cases there will
-be no heap memory used and therefore no additional output. No heap memory is
-allocated during matching with <b>pcre2_dfa_match</b> or with JIT, so in those
-cases the <b>memory</b> modifier never has any effect. For this modifier to
-work, the <b>null_context</b> modifier must not be set on both the pattern and
-the subject, though it can be set on one or the other.
+<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>. These occur only when a match
+requires a bigger vector than the default for remembering backtracking points
+(<b>pcre2_match()</b>) or for internal workspace (<b>pcre2_dfa_match()</b>). In
+many cases there will be no heap memory used and therefore no additional
+output. No heap memory is allocated during matching with JIT, so in that case
+the <b>memory</b> modifier never has any effect. For this modifier to work, the
+<b>null_context</b> modifier must not be set on both the pattern and the
+subject, though it can be set on one or the other.
 </P>
 <br><b>
 Setting a starting offset
@@ -1982,9 +1996,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 21 December 2017
+Last updated: 25 April 2018
 <br>
-Copyright © 1997-2017 University of Cambridge.
+Copyright © 1997-2018 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
doc/pcre2.txt (909 lines changed)
File diff suppressed because it is too large
@@ -1,4 +1,4 @@
-.TH PCRE2_DFA_MATCH 3 "30 May 2017" "PCRE2 10.30"
+.TH PCRE2_DFA_MATCH 3 "26 April 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH SYNOPSIS
@@ -34,9 +34,9 @@ just once (except when processing lookaround assertions). This function is
 \fIwscount\fP Number of elements in the vector
 .sp
 For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
-up a callout function or specify the match and/or the recursion depth limits.
-The \fIlength\fP and \fIstartoffset\fP values are code units, not characters.
-The options are:
+up a callout function or specify the heap limit or the match or the recursion
+depth limits. The \fIlength\fP and \fIstartoffset\fP values are code units, not
+characters. The options are:
 .sp
 PCRE2_ANCHORED Match only at the first position
 PCRE2_ENDANCHORED Pattern can match only at end of subject
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "31 December 2017" "PCRE2 10.31"
+.TH PCRE2API 3 "27 April 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -887,16 +887,17 @@ offset limit. In other words, whichever limit comes first is used.
 .sp
 The \fIheap_limit\fP parameter specifies, in units of kilobytes, the maximum
 amount of heap memory that \fBpcre2_match()\fP may use to hold backtracking
-information when running an interpretive match. This limit does not apply to
-matching with the JIT optimization, which has its own memory control
-arrangements (see the
+information when running an interpretive match. This limit also applies to
+\fBpcre2_dfa_match()\fP, which may use the heap when processing patterns with a
+lot of nested pattern recursion or lookarounds or atomic groups. This limit
+does not apply to matching with the JIT optimization, which has its own memory
+control arrangements (see the
 .\" HREF
 \fBpcre2jit\fP
 .\"
-documentation for more details), nor does it apply to \fBpcre2_dfa_match()\fP.
-If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is
-returned. The default limit is set when PCRE2 is built; the default default is
-very large and is essentially "unlimited".
+documentation for more details). If the limit is reached, the negative error
+code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
+built; the default default is very large and is essentially "unlimited".
 .P
 A value for the heap limit may also be supplied by an item at the start of a
 pattern of the form
@@ -914,6 +915,11 @@ Heap memory is used only if the initial vector is too small. If the heap limit
 is set to a value less than 21 (in particular, zero) no heap memory will be
 used. In this case, only patterns that do not have a lot of nested backtracking
 can be successfully processed.
+.P
+Similarly, for \fBpcre2_dfa_match()\fP, a vector on the system stack is used
+when processing pattern recursions, lookarounds, or atomic groups, and only if
+this is not big enough is heap memory used. In this case, too, setting a value
+of zero disables the use of the heap.
 .sp
 .nf
 .B int pcre2_set_match_limit(pcre2_match_context *\fImcontext\fP,
@@ -967,11 +973,21 @@ backtracking.
 .P
 The depth limit is not relevant, and is ignored, when matching is done using
 JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which
-uses it to limit the depth of internal recursive function calls that implement
-atomic groups, lookaround assertions, and pattern recursions. This is,
-therefore, an indirect limit on the amount of system stack that is used. A
-recursive pattern such as /(.)(?1)/, when matched to a very long string using
-\fBpcre2_dfa_match()\fP, can use a great deal of stack.
+uses it to limit the depth of nested internal recursive function calls that
|
||||
implement atomic groups, lookaround assertions, and pattern recursions. This
|
||||
limits, indirectly, the amount of system stack that is used. It was more useful
|
||||
in versions before 10.32, when stack memory was used for local workspace
|
||||
vectors for recursive function calls. From version 10.32, only local variables
|
||||
are allocated on the stack and as each call uses only a few hundred bytes, even
|
||||
a small stack can support quite a lot of recursion.
|
||||
.P
|
||||
If the depth of internal recursive function calls is great enough, local
|
||||
workspace vectors are allocated on the heap from version 10.32 onwards, so the
|
||||
depth limit also indirectly limits the amount of heap memory that is used. A
|
||||
recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string
|
||||
using \fBpcre2_dfa_match()\fP, can use a great deal of memory. However, it is
|
||||
probably better to limit heap usage directly by calling
|
||||
\fBpcre2_set_heap_limit()\fP.
|
||||
.P
|
||||
The default value for the depth limit can be set when PCRE2 is built; the
|
||||
default default is the same value as the default for the match limit. If the
|
||||
|
@ -1028,15 +1044,16 @@ and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
|
|||
PCRE2_CONFIG_DEPTHLIMIT
|
||||
.sp
|
||||
The output is a uint32_t integer that gives the default limit for the depth of
|
||||
nested backtracking in \fBpcre2_match()\fP or the depth of nested recursions
|
||||
and lookarounds in \fBpcre2_dfa_match()\fP. Further details are given with
|
||||
\fBpcre2_set_depth_limit()\fP above.
|
||||
nested backtracking in \fBpcre2_match()\fP or the depth of nested recursions,
|
||||
lookarounds, and atomic groups in \fBpcre2_dfa_match()\fP. Further details are
|
||||
given with \fBpcre2_set_depth_limit()\fP above.
|
||||
.sp
|
||||
PCRE2_CONFIG_HEAPLIMIT
|
||||
.sp
|
||||
The output is a uint32_t integer that gives, in kilobytes, the default limit
|
||||
for the amount of heap memory used by \fBpcre2_match()\fP. Further details are
|
||||
given with \fBpcre2_set_heap_limit()\fP above.
|
||||
for the amount of heap memory used by \fBpcre2_match()\fP or
|
||||
\fBpcre2_dfa_match()\fP. Further details are given with
|
||||
\fBpcre2_set_heap_limit()\fP above.
|
||||
.sp
|
||||
PCRE2_CONFIG_JIT
|
||||
.sp
|
||||
|
@ -3514,17 +3531,7 @@ capture.
|
|||
Calls to the convenience functions that extract substrings by name
|
||||
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
|
||||
DFA match. The convenience functions that extract substrings by number never
|
||||
return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are
|
||||
slightly different:
|
||||
.sp
|
||||
PCRE2_ERROR_UNAVAILABLE
|
||||
.sp
|
||||
The ovector is not big enough to include a slot for the given substring number.
|
||||
.sp
|
||||
PCRE2_ERROR_UNSET
|
||||
.sp
|
||||
There is a slot in the ovector for this substring, but there were insufficient
|
||||
matches to fill it.
|
||||
return PCRE2_ERROR_NOSUBSTRING.
|
||||
.P
|
||||
The matched strings are stored in the ovector in reverse order of length; that
|
||||
is, the longest matching string is first. If there were too many matches to fit
|
||||
|
@ -3605,6 +3612,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 31 December 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
Last updated: 27 April 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
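Note on the pcre2api changes above: the documented behaviour for pcre2_dfa_match() (start with a fixed vector on the system stack, move to the heap only when that vector is too small, and fail once a heap limit given in kilobytes would be exceeded) can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the code from this commit: STACK_SLOTS, get_workspace() and release_workspace() are hypothetical names, and where this sketch returns NULL the real function is documented to return PCRE2_ERROR_HEAPLIMIT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical size (in ints) of the initial block reserved on the stack. */
#define STACK_SLOTS 64

/* Hypothetical helper: hand out a workspace of `needed` ints.
 * The caller's stack block is used when it is big enough. Otherwise the
 * workspace comes from the heap, but only if the running total of heap
 * bytes stays within heap_limit_kib kilobytes. Returns NULL when the limit
 * would be exceeded or malloc() fails. */
static int *get_workspace(int *stack_block, size_t stack_slots, size_t needed,
                          size_t heap_limit_kib, size_t *heap_used,
                          int *on_heap)
{
  assert(heap_used != NULL && on_heap != NULL);
  *on_heap = 0;

  /* Small requests never touch the heap. */
  if (needed <= stack_slots) return stack_block;

  size_t bytes = needed * sizeof(int);

  /* A limit of zero therefore disables heap use altogether. */
  if (*heap_used + bytes > heap_limit_kib * 1024) return NULL;

  int *p = malloc(bytes);
  if (p == NULL) return NULL;

  *heap_used += bytes;
  *on_heap = 1;
  return p;
}

/* Hypothetical helper: give back a workspace obtained above. */
static void release_workspace(int *workspace, int on_heap, size_t needed,
                              size_t *heap_used)
{
  if (!on_heap) return;          /* stack block: nothing to free */
  free(workspace);
  *heap_used -= needed * sizeof(int);
}
```

With a limit of 0, any request larger than the stack block fails, which mirrors the documented rule that a heap limit of zero disables the use of the heap. Each nested call would pass the same heap_used counter so that the limit applies to the total, not to one allocation.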
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2BUILD 3 "25 February 2018" "PCRE2 10.32"
|
||||
.TH PCRE2BUILD 3 "26 April 2018" "PCRE2 10.32"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.
|
||||
|
@ -292,9 +292,10 @@ change this by a setting such as
|
|||
--with-heap-limit=500
|
||||
.sp
|
||||
which limits the amount of heap to 500 kilobytes. This limit applies only to
|
||||
interpretive matching in pcre2_match(). It does not apply when JIT (which has
|
||||
its own memory arrangements) is used, nor does it apply to
|
||||
\fBpcre2_dfa_match()\fP.
|
||||
interpretive matching in \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, which
|
||||
may also use the heap for internal workspace when processing complicated
|
||||
patterns. This limit does not apply when JIT (which has its own memory
|
||||
arrangements) is used.
|
||||
.P
|
||||
You can also explicitly limit the depth of nested backtracking in the
|
||||
\fBpcre2_match()\fP interpreter. This limit defaults to the value that is set
|
||||
|
@ -590,6 +591,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 25 February 2018
|
||||
Last updated: 26 April 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2CALLOUT 3 "22 December 2017" "PCRE2 10.31"
|
||||
.TH PCRE2CALLOUT 3 "26 April 2018" "PCRE2 10.32"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -291,10 +291,12 @@ than \fIcapture_top\fP also have both of their ovector slots set to
|
|||
PCRE2_UNSET.
|
||||
.P
|
||||
For DFA matching, the \fIoffset_vector\fP field points to the ovector that was
|
||||
passed to the matching function in the match data block, but it holds no useful
|
||||
information at callout time because \fBpcre2_dfa_match()\fP does not support
|
||||
substring capturing. The value of \fIcapture_top\fP is always 1 and the value
|
||||
of \fIcapture_last\fP is always 0 for DFA matching.
|
||||
passed to the matching function in the match data block for callouts at the top
|
||||
level, but to an internal ovector during the processing of pattern recursions,
|
||||
lookarounds, and atomic groups. However, these ovectors hold no useful
|
||||
information because \fBpcre2_dfa_match()\fP does not support substring
|
||||
capturing. The value of \fIcapture_top\fP is always 1 and the value of
|
||||
\fIcapture_last\fP is always 0 for DFA matching.
|
||||
.P
|
||||
The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values
|
||||
that were passed to the matching function.
|
||||
|
@ -441,6 +443,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 22 December 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
Last updated: 26 April 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "12 September 2017" "PCRE2 10.31"
|
||||
.TH PCRE2PATTERN 3 "25 April 2018" "PCRE2 10.32"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -141,12 +141,12 @@ the application to apply the JIT optimization by calling
|
|||
.SS "Setting match resource limits"
|
||||
.rs
|
||||
.sp
|
||||
The pcre2_match() function contains a counter that is incremented every time it
|
||||
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
|
||||
this counter, which therefore limits the amount of computing resource used for
|
||||
a match. The maximum depth of nested backtracking can also be limited; this
|
||||
indirectly restricts the amount of heap memory that is used, but there is also
|
||||
an explicit memory limit that can be set.
|
||||
The \fBpcre2_match()\fP function contains a counter that is incremented every
|
||||
time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a
|
||||
limit on this counter, which therefore limits the amount of computing resource
|
||||
used for a match. The maximum depth of nested backtracking can also be limited;
|
||||
this indirectly restricts the amount of heap memory that is used, but there is
|
||||
also an explicit memory limit that can be set.
|
||||
.P
|
||||
These facilities are provided to catch runaway matches that are provoked by
|
||||
patterns with huge matching trees (a typical example is a pattern with nested
|
||||
|
@ -162,18 +162,20 @@ where d is any number of decimal digits. However, the value of the setting must
|
|||
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
setting of one of these limits, the lower value is used. The heap limit is
|
||||
specified in kilobytes.
|
||||
.P
|
||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||
still recognized for backwards compatibility.
|
||||
.P
|
||||
The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
|
||||
for matching. It does not apply to JIT or DFA matching. The match limit is used
|
||||
(but in a different way) when JIT is being used, or when
|
||||
\fBpcre2_dfa_match()\fP is called, to limit computing resource usage by those
|
||||
matching functions. The depth limit is ignored by JIT but is relevant for DFA
|
||||
matching, which uses function recursion for recursions within the pattern. In
|
||||
this case, the depth limit controls the amount of system stack that is used.
|
||||
The heap limit applies only when the \fBpcre2_match()\fP or
|
||||
\fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
|
||||
to JIT. The match limit is used (but in a different way) when JIT is being
|
||||
used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource
|
||||
usage by those matching functions. The depth limit is ignored by JIT but is
|
||||
relevant for DFA matching, which uses function recursion for recursions within
|
||||
the pattern and for lookaround assertions and atomic groups. In this case, the
|
||||
depth limit controls the depth of such recursion.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="newlines"></a>
|
||||
|
@ -2838,10 +2840,6 @@ the last value taken on at the top level. If a capturing subpattern is not
|
|||
matched at the top level, its final captured value is unset, even if it was
|
||||
(temporarily) set at a deeper level during the matching process.
|
||||
.P
|
||||
If there are more than 15 capturing parentheses in a pattern, PCRE2 has to
|
||||
obtain extra memory from the heap to store data during a recursion. If no
|
||||
memory can be obtained, the match fails with the PCRE2_ERROR_NOMEMORY error.
|
||||
.P
|
||||
Do not confuse the (?R) item with the condition (R), which tests for recursion.
|
||||
Consider this pattern, which matches text in angle brackets, allowing for
|
||||
arbitrary nesting. Only digits are allowed in nested brackets (that is, when
|
||||
|
@ -3505,6 +3503,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 12 September 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
Last updated: 25 April 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PERFORM 3 "08 April 2017" "PCRE2 10.30"
|
||||
.TH PCRE2PERFORM 3 "25 April 2018" "PCRE2 10.32"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 PERFORMANCE"
|
||||
|
@ -78,9 +78,16 @@ may also reduce the memory requirements.
|
|||
.P
|
||||
In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
|
||||
function calls, but only for processing atomic groups, lookaround assertions,
|
||||
and recursion within the pattern. Too much nested recursion may cause stack
|
||||
issues. The "match depth" parameter can be used to limit the depth of function
|
||||
recursion in \fBpcre2_dfa_match()\fP.
|
||||
and recursion within the pattern. The original version of the code used to
|
||||
allocate quite large internal workspace vectors on the stack, which caused some
|
||||
problems for some patterns in environments with small stacks. From release
|
||||
10.32 the code for \fBpcre2_dfa_match()\fP has been re-factored to use heap
|
||||
memory when necessary for internal workspace when recursing, though recursive
|
||||
function calls are still used.
|
||||
.P
|
||||
The "match depth" parameter can be used to limit the depth of function
|
||||
recursion, and the "match heap" parameter to limit heap memory in
|
||||
\fBpcre2_dfa_match()\fP.
|
||||
.
|
||||
.
|
||||
.SH "PROCESSING TIME"
|
||||
|
@ -232,6 +239,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 08 April 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
Last updated: 25 April 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2TEST 1 "21 Decbmber 2017" "PCRE 10.31"
|
||||
.TH PCRE2TEST 1 "25 April 2018" "PCRE 10.32"
|
||||
.SH NAME
|
||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
|
@ -1168,7 +1168,7 @@ pattern.
|
|||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
heap_limit=<n> set a limit on heap memory
|
||||
heap_limit=<n> set a limit on heap memory (Kbytes)
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
|
@ -1401,24 +1401,36 @@ the appropriate limits in the match context. These values are ignored when the
|
|||
.sp
|
||||
If the \fBfind_limits\fP modifier is present on a subject line, \fBpcre2test\fP
|
||||
calls the relevant matching function several times, setting different values in
|
||||
the match context via \fBpcre2_set_heap_limit(), \fBpcre2_set_match_limit()\fP,
|
||||
or \fBpcre2_set_depth_limit()\fP until it finds the minimum values for each
|
||||
parameter that allows the match to complete without error.
|
||||
the match context via \fBpcre2_set_heap_limit()\fP,
|
||||
\fBpcre2_set_match_limit()\fP, or \fBpcre2_set_depth_limit()\fP until it finds
|
||||
the minimum values for each parameter that allows the match to complete without
|
||||
error. If JIT is being used, only the match limit is relevant.
|
||||
.P
|
||||
If JIT is being used, only the match limit is relevant. If DFA matching is
|
||||
being used, only the depth limit is relevant.
|
||||
.P
|
||||
The \fImatch_limit\fP number is a measure of the amount of backtracking
|
||||
that takes place, and learning the minimum value can be instructive. For most
|
||||
simple matches, the number is quite small, but for patterns with very large
|
||||
numbers of matching possibilities, it can become large very quickly with
|
||||
increasing length of subject string.
|
||||
When using this modifier, the pattern should not contain any limit settings
|
||||
such as (*LIMIT_MATCH=...) within it. If such a setting is present and is
|
||||
lower than the minimum matching value, the minimum value cannot be found
|
||||
because \fBpcre2_set_match_limit()\fP etc. are only able to reduce the value of
|
||||
an in-pattern limit; they cannot increase it.
|
||||
.P
|
||||
For non-DFA matching, the minimum \fIdepth_limit\fP number is a measure of how
|
||||
much nested backtracking happens (that is, how deeply the pattern's tree is
|
||||
searched). In the case of DFA matching, \fIdepth_limit\fP controls the depth of
|
||||
recursive calls of the internal function that is used for handling pattern
|
||||
recursion, lookaround assertions, and atomic groups.
|
||||
.P
|
||||
For non-DFA matching, the \fImatch_limit\fP number is a measure of the amount
|
||||
of backtracking that takes place, and learning the minimum value can be
|
||||
instructive. For most simple matches, the number is quite small, but for
|
||||
patterns with very large numbers of matching possibilities, it can become large
|
||||
very quickly with increasing length of subject string. In the case of DFA
|
||||
matching, \fImatch_limit\fP controls the total number of calls, both recursive
|
||||
and non-recursive, to the internal matching function, thus controlling the
|
||||
overall amount of computing resource that is used.
|
||||
.P
|
||||
For both kinds of matching, the \fIheap_limit\fP number (which is in kilobytes)
|
||||
limits the amount of heap memory used for matching. A value of zero disables
|
||||
the use of any heap memory; many simple pattern matches can be done without
|
||||
using the heap, so this is not an unreasonable setting.
|
||||
.
|
||||
.
|
||||
.SS "Showing MARK names"
|
||||
|
@ -1437,13 +1449,14 @@ is added to the non-match message.
|
|||
.sp
|
||||
The \fBmemory\fP modifier causes \fBpcre2test\fP to log the sizes of all heap
|
||||
memory allocation and freeing calls that occur during a call to
|
||||
\fBpcre2_match()\fP. These occur only when a match requires a bigger vector
|
||||
than the default for remembering backtracking points. In many cases there will
|
||||
be no heap memory used and therefore no additional output. No heap memory is
|
||||
allocated during matching with \fBpcre2_dfa_match\fP or with JIT, so in those
|
||||
cases the \fBmemory\fP modifier never has any effect. For this modifier to
|
||||
work, the \fBnull_context\fP modifier must not be set on both the pattern and
|
||||
the subject, though it can be set on one or the other.
|
||||
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP. These occur only when a match
|
||||
requires a bigger vector than the default for remembering backtracking points
|
||||
(\fBpcre2_match()\fP) or for internal workspace (\fBpcre2_dfa_match()\fP). In
|
||||
many cases there will be no heap memory used and therefore no additional
|
||||
output. No heap memory is allocated during matching with JIT, so in that case
|
||||
the \fBmemory\fP modifier never has any effect. For this modifier to work, the
|
||||
\fBnull_context\fP modifier must not be set on both the pattern and the
|
||||
subject, though it can be set on one or the other.
|
||||
.
|
||||
.
|
||||
.SS "Setting a starting offset"
|
||||
|
@ -1962,6 +1975,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 21 December 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
Last updated: 25 April 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1071,7 +1071,7 @@ SUBJECT MODIFIERS
|
|||
get=<number or name> extract captured substring
|
||||
getall extract all captured substrings
|
||||
/g global global matching
|
||||
heap_limit=<n> set a limit on heap memory
|
||||
heap_limit=<n> set a limit on heap memory (Kbytes)
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
match_limit=<n> set a match limit
|
||||
|
@ -1291,126 +1291,139 @@ SUBJECT MODIFIERS
|
|||
values in the match context via pcre2_set_heap_limit(),
|
||||
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
|
||||
minimum values for each parameter that allows the match to complete
|
||||
without error.
|
||||
without error. If JIT is being used, only the match limit is relevant.
|
||||
|
||||
If JIT is being used, only the match limit is relevant. If DFA matching
|
||||
is being used, only the depth limit is relevant.
|
||||
When using this modifier, the pattern should not contain any limit set-
|
||||
tings such as (*LIMIT_MATCH=...) within it. If such a setting is
|
||||
present and is lower than the minimum matching value, the minimum value
|
||||
cannot be found because pcre2_set_match_limit() etc. are only able to
|
||||
reduce the value of an in-pattern limit; they cannot increase it.
|
||||
|
||||
The match_limit number is a measure of the amount of backtracking that
|
||||
takes place, and learning the minimum value can be instructive. For
|
||||
most simple matches, the number is quite small, but for patterns with
|
||||
very large numbers of matching possibilities, it can become large very
|
||||
quickly with increasing length of subject string.
|
||||
|
||||
For non-DFA matching, the minimum depth_limit number is a measure of
|
||||
For non-DFA matching, the minimum depth_limit number is a measure of
|
||||
how much nested backtracking happens (that is, how deeply the pattern's
|
||||
tree is searched). In the case of DFA matching, depth_limit controls
|
||||
the depth of recursive calls of the internal function that is used for
|
||||
tree is searched). In the case of DFA matching, depth_limit controls
|
||||
the depth of recursive calls of the internal function that is used for
|
||||
handling pattern recursion, lookaround assertions, and atomic groups.
|
||||
|
||||
For non-DFA matching, the match_limit number is a measure of the amount
|
||||
of backtracking that takes place, and learning the minimum value can be
|
||||
instructive. For most simple matches, the number is quite small, but
|
||||
for patterns with very large numbers of matching possibilities, it can
|
||||
become large very quickly with increasing length of subject string. In
|
||||
the case of DFA matching, match_limit controls the total number of
|
||||
calls, both recursive and non-recursive, to the internal matching func-
|
||||
tion, thus controlling the overall amount of computing resource that is
|
||||
used.
|
||||
|
||||
For both kinds of matching, the heap_limit number (which is in kilo-
|
||||
bytes) limits the amount of heap memory used for matching. A value of
|
||||
zero disables the use of any heap memory; many simple pattern matches
|
||||
can be done without using the heap, so this is not an unreasonable set-
|
||||
ting.
|
||||
|
||||
Showing MARK names
|
||||
|
||||
|
||||
The mark modifier causes the names from backtracking control verbs that
|
||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||
it is added to the non-match message.
|
||||
|
||||
Showing memory usage
|
||||
|
||||
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
||||
ory allocation and freeing calls that occur during a call to
|
||||
pcre2_match(). These occur only when a match requires a bigger vector
|
||||
than the default for remembering backtracking points. In many cases
|
||||
there will be no heap memory used and therefore no additional output.
|
||||
No heap memory is allocated during matching with pcre2_dfa_match or
|
||||
with JIT, so in those cases the memory modifier never has any effect.
|
||||
For this modifier to work, the null_context modifier must not be set on
|
||||
both the pattern and the subject, though it can be set on one or the
|
||||
other.
|
||||
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
||||
ory allocation and freeing calls that occur during a call to
|
||||
pcre2_match() or pcre2_dfa_match(). These occur only when a match
|
||||
requires a bigger vector than the default for remembering backtracking
|
||||
points (pcre2_match()) or for internal workspace (pcre2_dfa_match()).
|
||||
In many cases there will be no heap memory used and therefore no addi-
|
||||
tional output. No heap memory is allocated during matching with JIT, so
|
||||
in that case the memory modifier never has any effect. For this modi-
|
||||
fier to work, the null_context modifier must not be set on both the
|
||||
pattern and the subject, though it can be set on one or the other.
|
||||
|
||||
Setting a starting offset
|
||||
|
||||
The offset modifier sets an offset in the subject string at which
|
||||
The offset modifier sets an offset in the subject string at which
|
||||
matching starts. Its value is a number of code units, not characters.
|
||||
|
||||
Setting an offset limit
|
||||
|
||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||
match cannot be found starting at or before this offset in the subject,
|
||||
a "no match" return is given. The data value is a number of code units,
|
||||
not characters. When this modifier is used, the use_offset_limit modi-
|
||||
not characters. When this modifier is used, the use_offset_limit modi-
|
||||
fier must have been set for the pattern; if not, an error is generated.
|
||||
|
||||
Setting the size of the output vector
|
||||
|
||||
The ovector modifier applies only to the subject line in which it
|
||||
appears, though of course it can also be used to set a default in a
|
||||
#subject command. It specifies the number of pairs of offsets that are
|
||||
The ovector modifier applies only to the subject line in which it
|
||||
appears, though of course it can also be used to set a default in a
|
||||
#subject command. It specifies the number of pairs of offsets that are
|
||||
available for storing matching information. The default is 15.
|
||||
|
||||
A value of zero is useful when testing the POSIX API because it causes
|
||||
A value of zero is useful when testing the POSIX API because it causes
|
||||
regexec() to be called with a NULL capture vector. When not testing the
|
||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||
ate_from_pattern() to be called, in order to create a match block of
|
||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||
ate_from_pattern() to be called, in order to create a match block of
|
||||
exactly the right size for the pattern. (It is not possible to create a
|
||||
match block with a zero-length ovector; there is always at least one
|
||||
match block with a zero-length ovector; there is always at least one
|
||||
pair of offsets.)
|
||||
|
||||
Passing the subject as zero-terminated
|
||||
|
||||
By default, the subject string is passed to a native API matching func-
|
||||
tion with its correct length. In order to test the facility for passing
|
||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
|
||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
|
||||
via the POSIX interface, this modifier is ignored, with a warning.
|
||||
|
||||
When testing pcre2_substitute(), this modifier also has the effect of
|
||||
When testing pcre2_substitute(), this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
|
||||
Passing a NULL context
|
||||
|
||||
Normally, pcre2test passes a context block to pcre2_match(),
|
||||
Normally, pcre2test passes a context block to pcre2_match(),
|
||||
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
|
||||
set, however, NULL is passed. This is for testing that the matching
|
||||
set, however, NULL is passed. This is for testing that the matching
|
||||
functions behave correctly in this case (they use default values). This
|
||||
modifier cannot be used with the find_limits modifier or when testing
|
||||
modifier cannot be used with the find_limits modifier or when testing
|
||||
the substitution function.
|
||||
|
||||
|
||||
THE ALTERNATIVE MATCHING FUNCTION
|
||||
|
||||
By default, pcre2test uses the standard PCRE2 matching function,
|
||||
By default, pcre2test uses the standard PCRE2 matching function,
|
||||
pcre2_match() to match each subject line. PCRE2 also supports an alter-
|
||||
native matching function, pcre2_dfa_match(), which operates in a dif-
|
||||
ferent way, and has some restrictions. The differences between the two
|
||||
native matching function, pcre2_dfa_match(), which operates in a dif-
|
||||
ferent way, and has some restrictions. The differences between the two
|
||||
functions are described in the pcre2matching documentation.
|
||||
|
||||
If the dfa modifier is set, the alternative matching function is used.
|
||||
This function finds all possible matches at a given point in the sub-
|
||||
ject. If, however, the dfa_shortest modifier is set, processing stops
|
||||
after the first match is found. This is always the shortest possible
|
||||
If the dfa modifier is set, the alternative matching function is used.
|
||||
This function finds all possible matches at a given point in the sub-
|
||||
ject. If, however, the dfa_shortest modifier is set, processing stops
|
||||
after the first match is found. This is always the shortest possible
|
||||
match.
|
||||
|
||||
|
||||
DEFAULT OUTPUT FROM pcre2test
|
||||
|
||||
This section describes the output when the normal matching function,
|
||||
This section describes the output when the normal matching function,
|
||||
pcre2_match(), is being used.
|
||||
|
||||
When a match succeeds, pcre2test outputs the list of captured sub-
|
||||
strings, starting with number 0 for the string that matched the whole
|
||||
pattern. Otherwise, it outputs "No match" when the return is
|
||||
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
||||
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
||||
this is the entire substring that was inspected during the partial
|
||||
match; it may include characters before the actual match start if a
|
||||
When a match succeeds, pcre2test outputs the list of captured sub-
|
||||
strings, starting with number 0 for the string that matched the whole
|
||||
pattern. Otherwise, it outputs "No match" when the return is
|
||||
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
|
||||
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
|
||||
this is the entire substring that was inspected during the partial
|
||||
match; it may include characters before the actual match start if a
|
||||
lookbehind assertion, \K, \b, or \B was involved.)
|
||||
|
||||
For any other return, pcre2test outputs the PCRE2 negative error number
|
||||
and a short descriptive phrase. If the error is a failed UTF string
|
||||
check, the code unit offset of the start of the failing character is
|
||||
and a short descriptive phrase. If the error is a failed UTF string
|
||||
check, the code unit offset of the start of the failing character is
|
||||
also output. Here is an example of an interactive pcre2test run.
|
||||
|
||||
$ pcre2test
|
||||
|
@ -1426,8 +1439,8 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
Unset capturing substrings that are not followed by one that is set are
|
||||
not shown by pcre2test unless the allcaptures modifier is specified. In
|
||||
the following example, there are two capturing substrings, but when the
|
||||
first data line is matched, the second, unset substring is not shown.
|
||||
An "internal" unset substring is shown as "<unset>", as for the second
|
||||
first data line is matched, the second, unset substring is not shown.
|
||||
An "internal" unset substring is shown as "<unset>", as for the second
|
||||
data line.
|
||||
|
||||
re> /(a)|(b)/
|
||||
|
@ -1439,11 +1452,11 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
1: <unset>
|
||||
2: b
|
||||
|
||||
If the strings contain any non-printing characters, they are output as
|
||||
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
||||
If the strings contain any non-printing characters, they are output as
|
||||
\xhh escapes if the value is less than 256 and UTF mode is not set.
|
||||
Otherwise they are output as \x{hh...} escapes. See below for the defi-
|
||||
nition of non-printing characters. If the aftertext modifier is set,
|
||||
the output for substring 0 is followed by the rest of the subject
|
||||
nition of non-printing characters. If the aftertext modifier is set,
|
||||
the output for substring 0 is followed by the rest of the subject
|
||||
string, identified by "0+" like this:
|
||||
|
||||
re> /cat/aftertext
|
||||
|
@ -1451,7 +1464,7 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
0: cat
|
||||
0+ aract
|
||||
|
||||
If global matching is requested, the results of successive matching
|
||||
If global matching is requested, the results of successive matching
|
||||
attempts are output in sequence, like this:
|
||||
|
||||
re> /\Bi(\w\w)/g
|
||||
|
@ -1463,8 +1476,8 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
0: ipp
|
||||
1: pp
|
||||
|
||||
"No match" is output only if the first match attempt fails. Here is an
|
||||
example of a failure message (the offset 4 that is specified by the
|
||||
"No match" is output only if the first match attempt fails. Here is an
|
||||
example of a failure message (the offset 4 that is specified by the
|
||||
offset modifier is past the end of the subject string):
|
||||
|
||||
re> /xyz/
|
||||
|
@ -1472,7 +1485,7 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
Error -24 (bad offset value)
|
||||
|
||||
Note that whereas patterns can be continued over several lines (a plain
|
||||
">" prompt is used for continuations), subject lines may not. However
|
||||
">" prompt is used for continuations), subject lines may not. However
|
||||
newlines can be included in a subject by means of the \n escape (or \r,
|
||||
\r\n, etc., depending on the newline sequence setting).
|
||||
|
||||
|
@ -1480,7 +1493,7 @@ DEFAULT OUTPUT FROM pcre2test
|
|||
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
||||
|
||||
When the alternative matching function, pcre2_dfa_match(), is used, the
|
||||
output consists of a list of all the matches that start at the first
|
||||
output consists of a list of all the matches that start at the first
|
||||
point in the subject where there is at least one match. For example:
|
||||
|
||||
re> /(tang|tangerine|tan)/
|
||||
|
@ -1489,11 +1502,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
|||
1: tang
|
||||
2: tan
|
||||
|
||||
Using the normal matching function on this data finds only "tang". The
|
||||
longest matching string is always given first (and numbered zero).
|
||||
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
||||
followed by the partially matching substring. Note that this is the
|
||||
entire substring that was inspected during the partial match; it may
|
||||
Using the normal matching function on this data finds only "tang". The
|
||||
longest matching string is always given first (and numbered zero).
|
||||
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
|
||||
followed by the partially matching substring. Note that this is the
|
||||
entire substring that was inspected during the partial match; it may
|
||||
include characters before the actual match start if a lookbehind asser-
|
||||
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
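
The ordering described above (every match at the first matching point, longest first) can be illustrated without PCRE2 itself. The following is a minimal sketch in C, assuming the pattern is just a list of literal alternatives; it is not how pcre2_dfa_match() works internally. The names list_matches_at_first_point, print_matches and MAX_ALTS are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ALTS 16   /* hypothetical cap; lengths[] must have this many slots */

/* Find the first position in subject where at least one literal alternative
matches. Store the lengths of all matches at that position in lengths[],
longest first, set *start to the position, and return the number of matches.
Returns 0 if no alternative matches anywhere. */

static int
list_matches_at_first_point(const char *subject, const char **alts, int nalts,
  size_t *start, size_t *lengths)
{
size_t slen = strlen(subject);
for (size_t pos = 0; pos < slen; pos++)
  {
  int count = 0;
  for (int i = 0; i < nalts && count < MAX_ALTS; i++)
    {
    size_t alen = strlen(alts[i]);
    if (alen <= slen - pos && strncmp(subject + pos, alts[i], alen) == 0)
      {
      /* Insertion sort keeps lengths[] in descending order. */
      int j = count++;
      while (j > 0 && lengths[j - 1] < alen)
        {
        lengths[j] = lengths[j - 1];
        j--;
        }
      lengths[j] = alen;
      }
    }
  if (count > 0)
    {
    *start = pos;
    return count;
    }
  }
return 0;
}

/* Print the matches in an "N: text" form similar to pcre2test output. */

static void
print_matches(const char *subject, size_t start, const size_t *lengths,
  int count)
{
for (int i = 0; i < count; i++)
  printf("%2d: %.*s\n", i, (int)lengths[i], subject + start);
}
```

Called with the alternatives "tang", "tangerine" and "tan" and an illustrative subject such as "yellow tangerine", the function reports three matches starting at offset 7, with lengths 9, 4 and 3, which print_matches() lists in that order.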
|
||||
|
||||
|
@ -1509,16 +1522,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
|
|||
1: tan
|
||||
0: tan
|
||||
|
||||
The alternative matching function does not support substring capture,
|
||||
so the modifiers that are concerned with captured substrings are not
|
||||
The alternative matching function does not support substring capture,
|
||||
so the modifiers that are concerned with captured substrings are not
|
||||
relevant.
|
||||
|
||||
|
||||
RESTARTING AFTER A PARTIAL MATCH
|
||||
|
||||
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
||||
When the alternative matching function has given the PCRE2_ERROR_PAR-
|
||||
TIAL return, indicating that the subject partially matched the pattern,
|
||||
you can restart the match with additional subject data by means of the
|
||||
you can restart the match with additional subject data by means of the
|
||||
dfa_restart modifier. For example:
|
||||
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
|
@ -1527,37 +1540,37 @@ RESTARTING AFTER A PARTIAL MATCH
|
|||
data> n05\=dfa,dfa_restart
|
||||
0: n05
|
||||
|
||||
For further information about partial matching, see the pcre2partial
|
||||
For further information about partial matching, see the pcre2partial
|
||||
documentation.
|
||||
|
||||
|
||||
CALLOUTS
|
||||
|
||||
If the pattern contains any callout requests, pcre2test's callout func-
|
||||
tion is called during matching unless callout_none is specified. This
|
||||
tion is called during matching unless callout_none is specified. This
|
||||
works with both matching functions, and with JIT, though there are some
|
||||
differences in behaviour. The output for callouts with numerical argu-
|
||||
differences in behaviour. The output for callouts with numerical argu-
|
||||
ments and those with string arguments is slightly different.
|
||||
|
||||
Callouts with numerical arguments
|
||||
|
||||
By default, the callout function displays the callout number, the start
|
||||
and current positions in the subject text at the callout time, and the
|
||||
and current positions in the subject text at the callout time, and the
|
||||
next pattern item to be tested. For example:
|
||||
|
||||
--->pqrabcdef
|
||||
0 ^ ^ \d
|
||||
|
||||
This output indicates that callout number 0 occurred for a match
|
||||
attempt starting at the fourth character of the subject string, when
|
||||
the pointer was at the seventh character, and when the next pattern
|
||||
item was \d. Just one circumflex is output if the start and current
|
||||
positions are the same, or if the current position precedes the start
|
||||
This output indicates that callout number 0 occurred for a match
|
||||
attempt starting at the fourth character of the subject string, when
|
||||
the pointer was at the seventh character, and when the next pattern
|
||||
item was \d. Just one circumflex is output if the start and current
|
||||
positions are the same, or if the current position precedes the start
|
||||
position, which can happen if the callout is in a lookbehind assertion.
|
||||
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
||||
a result of the auto_callout pattern modifier. In this case, instead of
|
||||
showing the callout number, the offset in the pattern, preceded by a
|
||||
showing the callout number, the offset in the pattern, preceded by a
|
||||
plus, is output. For example:
|
||||
|
||||
re> /\d?[A-E]\*/auto_callout
|
||||
|
@ -1570,7 +1583,7 @@ CALLOUTS
|
|||
0: E*
|
||||
|
||||
If a pattern contains (*MARK) items, an additional line is output when-
|
||||
ever a change of latest mark is passed to the callout function. For
|
||||
ever a change of latest mark is passed to the callout function. For
|
||||
example:
|
||||
|
||||
re> /a(*MARK:X)bc/auto_callout
|
||||
|
@ -1584,17 +1597,17 @@ CALLOUTS
|
|||
+12 ^ ^
|
||||
0: abc
|
||||
|
||||
The mark changes between matching "a" and "b", but stays the same for
|
||||
the rest of the match, so nothing more is output. If, as a result of
|
||||
backtracking, the mark reverts to being unset, the text "<unset>" is
|
||||
The mark changes between matching "a" and "b", but stays the same for
|
||||
the rest of the match, so nothing more is output. If, as a result of
|
||||
backtracking, the mark reverts to being unset, the text "<unset>" is
|
||||
output.
|
||||
|
||||
Callouts with string arguments
|
||||
|
||||
The output for a callout with a string argument is similar, except that
|
||||
instead of outputting a callout number before the position indicators,
|
||||
the callout string and its offset in the pattern string are output
|
||||
before the reflection of the subject string, and the subject string is
|
||||
instead of outputting a callout number before the position indicators,
|
||||
the callout string and its offset in the pattern string are output
|
||||
before the reflection of the subject string, and the subject string is
|
||||
reflected for each callout. For example:
|
||||
|
||||
re> /^ab(?C'first')cd(?C"second")ef/
|
||||
|
@ -1610,26 +1623,26 @@ CALLOUTS
|
|||
|
||||
Callout modifiers
|
||||
|
||||
The callout function in pcre2test returns zero (carry on matching) by
|
||||
default, but you can use a callout_fail modifier in a subject line to
|
||||
The callout function in pcre2test returns zero (carry on matching) by
|
||||
default, but you can use a callout_fail modifier in a subject line to
|
||||
change this and other parameters of the callout (see below).
|
||||
|
||||
If the callout_capture modifier is set, the current captured groups are
|
||||
output when a callout occurs. This is useful only for non-DFA matching,
|
||||
as pcre2_dfa_match() does not support capturing, so no captures are
|
||||
as pcre2_dfa_match() does not support capturing, so no captures are
|
||||
ever shown.
|
||||
|
||||
The normal callout output, showing the callout number or pattern offset
|
||||
(as described above) is suppressed if the callout_no_where modifier is
|
||||
(as described above) is suppressed if the callout_no_where modifier is
|
||||
set.
|
||||
|
||||
When using the interpretive matching function pcre2_match() without
|
||||
JIT, setting the callout_extra modifier causes additional output from
|
||||
pcre2test's callout function to be generated. For the first callout in
|
||||
a match attempt at a new starting position in the subject, "New match
|
||||
attempt" is output. If there has been a backtrack since the last call-
|
||||
When using the interpretive matching function pcre2_match() without
|
||||
JIT, setting the callout_extra modifier causes additional output from
|
||||
pcre2test's callout function to be generated. For the first callout in
|
||||
a match attempt at a new starting position in the subject, "New match
|
||||
attempt" is output. If there has been a backtrack since the last call-
|
||||
out (or start of matching if this is the first callout), "Backtrack" is
|
||||
output, followed by "No other matching paths" if the backtrack ended
|
||||
output, followed by "No other matching paths" if the backtrack ended
|
||||
the previous match attempt. For example:
|
||||
|
||||
re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
|
||||
|
@ -1666,82 +1679,82 @@ CALLOUTS
|
|||
+1 ^ a+
|
||||
No match
|
||||
|
||||
Notice that various optimizations must be turned off if you want all
|
||||
possible matching paths to be scanned. If no_start_optimize is not
|
||||
used, there is an immediate "no match", without any callouts, because
|
||||
the starting optimization fails to find "b" in the subject, which it
|
||||
knows must be present for any match. If no_auto_possess is not used,
|
||||
the "a+" item is turned into "a++", which reduces the number of back-
|
||||
Notice that various optimizations must be turned off if you want all
|
||||
possible matching paths to be scanned. If no_start_optimize is not
|
||||
used, there is an immediate "no match", without any callouts, because
|
||||
the starting optimization fails to find "b" in the subject, which it
|
||||
knows must be present for any match. If no_auto_possess is not used,
|
||||
the "a+" item is turned into "a++", which reduces the number of back-
|
||||
tracks.
|
||||
|
||||
The callout_extra modifier has no effect if used with the DFA matching
|
||||
The callout_extra modifier has no effect if used with the DFA matching
|
||||
function, or with JIT.
|
||||
|
||||
Return values from callouts
|
||||
|
||||
The default return from the callout function is zero, which allows
|
||||
The default return from the callout function is zero, which allows
|
||||
matching to continue. The callout_fail modifier can be given one or two
|
||||
numbers. If there is only one number, 1 is returned instead of 0 (caus-
|
||||
ing matching to backtrack) when a callout of that number is reached. If
|
||||
two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
|
||||
reached and there have been at least <m> callouts. The callout_error
|
||||
two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
|
||||
reached and there have been at least <m> callouts. The callout_error
|
||||
modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus-
|
||||
ing the entire matching process to be aborted. If both these modifiers
|
||||
are set for the same callout number, callout_error takes precedence.
|
||||
Note that callouts with string arguments are always given the number
|
||||
ing the entire matching process to be aborted. If both these modifiers
|
||||
are set for the same callout number, callout_error takes precedence.
|
||||
Note that callouts with string arguments are always given the number
|
||||
zero.
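
The rule for callout_fail and callout_error can be written as a small decision function. This is only a sketch of the behaviour described above, not pcre2test's code: the type callout_trigger, the functions trigger_fires and callout_return, and the constant SKETCH_ERROR_CALLOUT are hypothetical, and it assumes that a single-number modifier behaves like <n>:0 and that the callout count includes the current callout.

```c
#include <stdint.h>

/* Hypothetical stand-in for PCRE2_ERROR_CALLOUT. Real code should use the
value from pcre2.h rather than this placeholder. */
#define SKETCH_ERROR_CALLOUT (-1000)

/* A trigger as given by callout_fail=<n>:<m> or callout_error=<n>:<m>. */
typedef struct {
  int enabled;          /* non-zero if the modifier was given */
  uint32_t number;      /* <n>: the callout number to act on */
  uint32_t min_count;   /* <m>: act only after this many callouts */
} callout_trigger;

static int
trigger_fires(const callout_trigger *t, uint32_t number, uint32_t count)
{
return t->enabled && number == t->number && count >= t->min_count;
}

/* Decide the callout return: the error trigger takes precedence over the
fail trigger; otherwise matching carries on (zero). */

static int
callout_return(const callout_trigger *fail, const callout_trigger *error,
  uint32_t number, uint32_t count)
{
if (trigger_fires(error, number, count)) return SKETCH_ERROR_CALLOUT;
if (trigger_fires(fail, number, count)) return 1;
return 0;
}
```

Checking the error trigger first is what gives callout_error precedence when both modifiers name the same callout number.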
|
||||
|
||||
The callout_data modifier can be given an unsigned or a negative num-
|
||||
ber. This is set as the "user data" that is passed to the matching
|
||||
function, and passed back when the callout function is invoked. Any
|
||||
value other than zero is used as a return from pcre2test's callout
|
||||
The callout_data modifier can be given an unsigned or a negative num-
|
||||
ber. This is set as the "user data" that is passed to the matching
|
||||
function, and passed back when the callout function is invoked. Any
|
||||
value other than zero is used as a return from pcre2test's callout
|
||||
function.
|
||||
|
||||
Inserting callouts can be helpful when using pcre2test to check compli-
|
||||
cated regular expressions. For further information about callouts, see
|
||||
cated regular expressions. For further information about callouts, see
|
||||
the pcre2callout documentation.
|
||||
|
||||
|
||||
NON-PRINTING CHARACTERS
|
||||
|
||||
When pcre2test is outputting text in the compiled version of a pattern,
|
||||
bytes other than 32-126 are always treated as non-printing characters
|
||||
bytes other than 32-126 are always treated as non-printing characters
|
||||
and are therefore shown as hex escapes.
|
||||
|
||||
When pcre2test is outputting text that is a matched part of a subject
|
||||
string, it behaves in the same way, unless a different locale has been
|
||||
set for the pattern (using the locale modifier). In this case, the
|
||||
isprint() function is used to distinguish printing and non-printing
|
||||
When pcre2test is outputting text that is a matched part of a subject
|
||||
string, it behaves in the same way, unless a different locale has been
|
||||
set for the pattern (using the locale modifier). In this case, the
|
||||
isprint() function is used to distinguish printing and non-printing
|
||||
characters.
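
As an illustration of the default case (no locale set), the escaping rule can be sketched as a single C function. The name format_char is hypothetical, and the two-digit minimum width inside the braces is an assumption of this sketch.

```c
#include <stdio.h>
#include <stdint.h>

/* Format one character value for output. Values 32-126 are printed as
themselves. Other values become \xhh when below 256 and not in UTF mode, and
\x{hh...} otherwise. Returns the number of characters written to buf, not
counting the terminating zero. */

static int
format_char(uint32_t c, int utf, char *buf, size_t bufsize)
{
if (c >= 32 && c <= 126)
  return snprintf(buf, bufsize, "%c", (int)c);
if (c < 256 && !utf)
  return snprintf(buf, bufsize, "\\x%02x", (unsigned int)c);
return snprintf(buf, bufsize, "\\x{%02x}", (unsigned int)c);
}
```

With a locale set, the range test on the first line would be replaced by a call to isprint(), as the text above describes.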
|
||||
|
||||
|
||||
SAVING AND RESTORING COMPILED PATTERNS
|
||||
|
||||
It is possible to save compiled patterns on disc or elsewhere, and
|
||||
It is possible to save compiled patterns on disc or elsewhere, and
|
||||
reload them later, subject to a number of restrictions. JIT data cannot
|
||||
be saved. The host on which the patterns are reloaded must be running
|
||||
be saved. The host on which the patterns are reloaded must be running
|
||||
the same version of PCRE2, with the same code unit width, and must also
|
||||
have the same endianness, pointer width and PCRE2_SIZE type. Before
|
||||
compiled patterns can be saved they must be serialized, that is, con-
|
||||
verted to a stream of bytes. A single byte stream may contain any num-
|
||||
ber of compiled patterns, but they must all use the same character
|
||||
have the same endianness, pointer width and PCRE2_SIZE type. Before
|
||||
compiled patterns can be saved they must be serialized, that is, con-
|
||||
verted to a stream of bytes. A single byte stream may contain any num-
|
||||
ber of compiled patterns, but they must all use the same character
|
||||
tables. A single copy of the tables is included in the byte stream (its
|
||||
size is 1088 bytes).
|
||||
|
||||
The functions whose names begin with pcre2_serialize_ are used for
|
||||
serializing and de-serializing. They are described in the pcre2serial-
|
||||
The functions whose names begin with pcre2_serialize_ are used for
|
||||
serializing and de-serializing. They are described in the pcre2serial-
|
||||
ize documentation. In this section we describe the features of
|
||||
pcre2test that can be used to test these functions.
|
||||
|
||||
When a pattern with push modifier is successfully compiled, it is
|
||||
pushed onto a stack of compiled patterns, and pcre2test expects the
|
||||
next line to contain a new pattern (or command) instead of a subject
|
||||
line. By contrast, the pushcopy modifier causes a copy of the compiled
|
||||
pattern to be stacked, leaving the original available for immediate
|
||||
matching. By using push and/or pushcopy, a number of patterns can be
|
||||
When a pattern with push modifier is successfully compiled, it is
|
||||
pushed onto a stack of compiled patterns, and pcre2test expects the
|
||||
next line to contain a new pattern (or command) instead of a subject
|
||||
line. By contrast, the pushcopy modifier causes a copy of the compiled
|
||||
pattern to be stacked, leaving the original available for immediate
|
||||
matching. By using push and/or pushcopy, a number of patterns can be
|
||||
compiled and retained. These modifiers are incompatible with posix, and
|
||||
control modifiers that act at match time are ignored (with a message)
|
||||
for the stacked patterns. The jitverify modifier applies only at com-
|
||||
control modifiers that act at match time are ignored (with a message)
|
||||
for the stacked patterns. The jitverify modifier applies only at com-
|
||||
pile time.
|
||||
|
||||
The command
|
||||
|
@ -1749,21 +1762,21 @@ SAVING AND RESTORING COMPILED PATTERNS
|
|||
#save <filename>
|
||||
|
||||
causes all the stacked patterns to be serialized and the result written
|
||||
to the named file. Afterwards, all the stacked patterns are freed. The
|
||||
to the named file. Afterwards, all the stacked patterns are freed. The
|
||||
command
|
||||
|
||||
#load <filename>
|
||||
|
||||
reads the data in the file, and then arranges for it to be de-serial-
|
||||
ized, with the resulting compiled patterns added to the pattern stack.
|
||||
The pattern on the top of the stack can be retrieved by the #pop com-
|
||||
mand, which must be followed by lines of subjects that are to be
|
||||
matched with the pattern, terminated as usual by an empty line or end
|
||||
of file. This command may be followed by a modifier list containing
|
||||
only control modifiers that act after a pattern has been compiled. In
|
||||
reads the data in the file, and then arranges for it to be de-serial-
|
||||
ized, with the resulting compiled patterns added to the pattern stack.
|
||||
The pattern on the top of the stack can be retrieved by the #pop com-
|
||||
mand, which must be followed by lines of subjects that are to be
|
||||
matched with the pattern, terminated as usual by an empty line or end
|
||||
of file. This command may be followed by a modifier list containing
|
||||
only control modifiers that act after a pattern has been compiled. In
|
||||
particular, hex, posix, posix_nosub, push, and pushcopy are not
|
||||
allowed, nor are any option-setting modifiers. The JIT modifiers are,
|
||||
however permitted. Here is an example that saves and reloads two pat-
|
||||
allowed, nor are any option-setting modifiers. The JIT modifiers are,
|
||||
however permitted. Here is an example that saves and reloads two pat-
|
||||
terns.
|
||||
|
||||
/abc/push
|
||||
|
@ -1776,10 +1789,10 @@ SAVING AND RESTORING COMPILED PATTERNS
|
|||
#pop jit,bincode
|
||||
abc
|
||||
|
||||
If jitverify is used with #pop, it does not automatically imply jit,
|
||||
If jitverify is used with #pop, it does not automatically imply jit,
|
||||
which is different behaviour from when it is used on a pattern.
|
||||
|
||||
The #popcopy command is analogous to the pushcopy modifier in that it
|
||||
The #popcopy command is analogous to the pushcopy modifier in that it
|
||||
makes current a copy of the topmost stack pattern, leaving the original
|
||||
still on the stack.
|
||||
|
||||
|
@ -1799,5 +1812,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 21 December 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
Last updated: 25 April 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
|
|
|
@ -132,8 +132,9 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
/* Define to 1 if you have the <zlib.h> header file. */
|
||||
#undef HAVE_ZLIB_H
|
||||
|
||||
/* This limits the amount of memory that pcre2_match() may use while matching
|
||||
a pattern. The value is in kilobytes. */
|
||||
/* This limits the amount of memory that may be used while matching a pattern.
|
||||
It applies to both pcre2_match() and pcre2_dfa_match(). It does not apply
|
||||
to JIT matching. The value is in kilobytes. */
|
||||
#undef HEAP_LIMIT
|
||||
|
||||
/* The value of LINK_SIZE determines the number of bytes used to store links
|
||||
|
@ -148,7 +149,8 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
|
||||
/* The value of MATCH_LIMIT determines the default number of times the
|
||||
pcre2_match() function can record a backtrack position during a single
|
||||
matching attempt. There is a runtime interface for setting a different
|
||||
matching attempt. The value is also used to limit a loop counter in
|
||||
pcre2_dfa_match(). There is a runtime interface for setting a different
|
||||
limit. The limit exists in order to catch runaway regular expressions that
|
||||
take for ever to determine that they do not match. The default is set very
|
||||
large so that it does not accidentally catch legitimate cases. */
|
||||
|
@ -161,7 +163,9 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it
|
||||
must be less than the value of MATCH_LIMIT. The default is to use the same
|
||||
value as MATCH_LIMIT. There is a runtime method for setting a different
|
||||
limit. */
|
||||
limit. In the case of pcre2_dfa_match(), this limit controls the depth of
|
||||
the internal nested function calls that are used for pattern recursions,
|
||||
lookarounds, and atomic groups. */
|
||||
#undef MATCH_LIMIT_DEPTH
|
||||
|
||||
/* This limit is parameterized just in case anybody ever wants to change it.
|
||||
|
|
|
@ -292,6 +292,35 @@ typedef struct stateblock {
|
|||
#define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int))
|
||||
|
||||
|
||||
/* Before version 10.32 the recursive calls of internal_dfa_match() were passed
local working space and output vectors that were created on the stack. This has
caused issues for some patterns, especially in small-stack environments such as
Windows. A new scheme is now in use which sets up a vector on the stack, but if
this is too small, heap memory is used, up to the heap_limit. The main
parameters are all numbers of ints because the workspace is a vector of ints.

The size of the starting stack vector, DFA_START_RWS_SIZE, is in bytes, and is
defined in pcre2_internal.h so as to be available to pcre2test when it is
finding the minimum heap requirement for a match. */

#define OVEC_UNIT  (sizeof(PCRE2_SIZE)/sizeof(int))

#define RWS_BASE_SIZE   (DFA_START_RWS_SIZE/sizeof(int))  /* Stack vector */
#define RWS_RSIZE       1000                  /* Work size for recursion */
#define RWS_OVEC_RSIZE  (1000*OVEC_UNIT)      /* Ovector for recursion */
#define RWS_OVEC_OSIZE  (2*OVEC_UNIT)         /* Ovector in other cases */

/* This structure is at the start of each workspace block. */

typedef struct RWS_anchor {
  struct RWS_anchor *next;
  unsigned int size;  /* Number of ints */
  unsigned int free;  /* Number of ints */
} RWS_anchor;

#define RWS_ANCHOR_SIZE (sizeof(RWS_anchor)/sizeof(int))
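
Because every size above is counted in ints, the space taken from a workspace block by one nested call is the working space plus an ovector expressed in int units. The following sketch works that arithmetic through. It uses size_t as a stand-in for PCRE2_SIZE, and the SK_ names, frame_ints and frame_ovec_slots are hypothetical copies made for illustration only.

```c
#include <stddef.h>

typedef size_t sketch_size;   /* stand-in for PCRE2_SIZE */

#define SK_OVEC_UNIT       (sizeof(sketch_size)/sizeof(int))
#define SK_RWS_RSIZE       1000
#define SK_RWS_OVEC_RSIZE  (1000*SK_OVEC_UNIT)
#define SK_RWS_OVEC_OSIZE  (2*SK_OVEC_UNIT)

/* Ints consumed from a workspace block by one nested call: an ovector
followed by the working space. Pattern recursions get the large ovector;
other nested calls (for example assertions) get the small one. */

static size_t
frame_ints(int is_recursion)
{
return SK_RWS_RSIZE + (is_recursion? SK_RWS_OVEC_RSIZE : SK_RWS_OVEC_OSIZE);
}

/* Number of ovector slots (of the PCRE2_SIZE stand-in type) in that frame.
This is the count that is passed on to the nested call. */

static size_t
frame_ovec_slots(int is_recursion)
{
return (is_recursion? SK_RWS_OVEC_RSIZE : SK_RWS_OVEC_OSIZE) / SK_OVEC_UNIT;
}
```

On a platform where size_t is twice the width of int, a recursion frame therefore needs 3000 ints and any other nested call needs 1004.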
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Process a callout *
|
||||
|
@ -353,6 +382,61 @@ return (mb->callout)(cb, mb->callout_data);
|
|||
|
||||
|
||||
|
||||
/*************************************************
*          Expand local workspace memory         *
*************************************************/

/* This function is called when internal_dfa_match() is about to be called
recursively and there is insufficient working space left in the current
workspace block. If there's an existing next block, use it; otherwise get a new
block unless the heap limit is reached.

Arguments:
  rwsptr     pointer to block pointer (updated)
  ovecsize   space needed for an ovector
  mb         the match block

Returns:     0 rwsptr has been updated
            !0 an error code
*/

static int
more_workspace(RWS_anchor **rwsptr, unsigned int ovecsize, dfa_match_block *mb)
{
RWS_anchor *rws = *rwsptr;
RWS_anchor *new;

if (rws->next != NULL)
  {
  new = rws->next;
  }

/* All sizes are in units of sizeof(int), except for mb->heap_limit, which is
in kilobytes. */

else
  {
  unsigned int newsize = rws->size * 2;
  unsigned int heapleft = (unsigned int)
    (((1024/sizeof(int))*mb->heap_limit - mb->heap_used));
  if (newsize > heapleft) newsize = heapleft;
  if (newsize < RWS_RSIZE + ovecsize + RWS_ANCHOR_SIZE)
    return PCRE2_ERROR_HEAPLIMIT;
  new = mb->memctl.malloc(newsize*sizeof(int), mb->memctl.memory_data);
  if (new == NULL) return PCRE2_ERROR_NOMEMORY;
  mb->heap_used += newsize;
  new->next = NULL;
  new->size = newsize;
  rws->next = new;
  }

new->free = new->size - RWS_ANCHOR_SIZE;
*rwsptr = new;
return 0;
}
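
The growth scheme used by more_workspace() (reuse the next block if one is already chained, otherwise allocate a block twice the size of the current one, clipped to what the heap limit still allows) can be tried out in isolation. The following is a simplified sketch, not the PCRE2 code: it uses plain malloc() instead of the match block's memory functions, the names block, budget, BLOCK_HDR and next_block are hypothetical, and the return values -1 and -2 stand in for the real error codes. Unlike the original, it also clamps the remaining heap to zero instead of relying on unsigned arithmetic.

```c
#include <stdlib.h>

typedef struct block {
  struct block *next;
  unsigned int size;   /* total ints in this block, header included */
  unsigned int free;   /* ints still unused */
} block;

#define BLOCK_HDR ((unsigned int)(sizeof(block)/sizeof(int)))

typedef struct {
  unsigned int heap_limit_kb;  /* limit in kilobytes */
  unsigned int heap_used;      /* ints obtained from the heap so far */
} budget;

/* Advance *cur to the next block in the chain, allocating it if necessary.
Returns 0 on success, -1 if the heap limit does not leave room for `need`
ints plus a header, and -2 if malloc() fails. */

static int
next_block(block **cur, unsigned int need, budget *b)
{
block *blk = *cur;
block *nb = blk->next;

if (nb == NULL)
  {
  unsigned int limit = (unsigned int)(1024/sizeof(int)) * b->heap_limit_kb;
  unsigned int heapleft = (limit > b->heap_used)? limit - b->heap_used : 0;
  unsigned int newsize = blk->size * 2;
  if (newsize > heapleft) newsize = heapleft;
  if (newsize < need + BLOCK_HDR) return -1;
  nb = malloc(newsize * sizeof(int));
  if (nb == NULL) return -2;
  b->heap_used += newsize;
  nb->next = NULL;
  nb->size = newsize;
  blk->next = nb;
  }

nb->free = nb->size - BLOCK_HDR;   /* a reused block starts out empty again */
*cur = nb;
return 0;
}
```

Keeping the blocks chained rather than freeing them on return means that a later nested call at the same depth can reuse the memory without another allocation, which is why heap_used only grows when a new block is created.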
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Match a Regular Expression - DFA engine *
|
||||
*************************************************/
|
||||
|
@ -431,7 +515,8 @@ internal_dfa_match(
|
|||
uint32_t offsetcount,
|
||||
int *workspace,
|
||||
int wscount,
|
||||
uint32_t rlevel)
|
||||
uint32_t rlevel,
|
||||
int *RWS)
|
||||
{
|
||||
stateblock *active_states, *new_states, *temp_states;
|
||||
stateblock *next_active_state, *next_new_state;
|
||||
|
@ -2587,10 +2672,22 @@ for (;;)
|
|||
case OP_ASSERTBACK:
|
||||
case OP_ASSERTBACK_NOT:
|
||||
{
|
||||
PCRE2_SPTR endasscode = code + GET(code, 1);
|
||||
PCRE2_SIZE local_offsets[2];
|
||||
int rc;
|
||||
int local_workspace[1000];
|
||||
int *local_workspace;
|
||||
PCRE2_SIZE *local_offsets;
|
||||
PCRE2_SPTR endasscode = code + GET(code, 1);
|
||||
RWS_anchor *rws = (RWS_anchor *)RWS;
|
||||
|
||||
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
|
||||
{
|
||||
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
|
||||
if (rc != 0) return rc;
|
||||
RWS = (int *)rws;
|
||||
}
|
||||
|
||||
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
|
||||
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
|
||||
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
|
||||
|
||||
|
@ -2600,10 +2697,13 @@ for (;;)
|
|||
ptr, /* where we currently are */
|
||||
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
|
||||
local_offsets, /* offset vector */
|
||||
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */
|
||||
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
|
||||
local_workspace, /* workspace vector */
|
||||
sizeof(local_workspace)/sizeof(int), /* size of same */
|
||||
rlevel); /* function recursion level */
|
||||
RWS_RSIZE, /* size of same */
|
||||
rlevel, /* function recursion level */
|
||||
RWS); /* recursion workspace */
|
||||
|
||||
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc;
|
||||
if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
|
||||
|
@ -2670,11 +2770,23 @@ for (;;)
|
|||
|
||||
else
|
||||
{
|
||||
PCRE2_SIZE local_offsets[2];
|
||||
int local_workspace[1000];
|
||||
int rc;
|
||||
int *local_workspace;
|
||||
PCRE2_SIZE *local_offsets;
|
||||
PCRE2_SPTR asscode = code + LINK_SIZE + 1;
|
||||
PCRE2_SPTR endasscode = asscode + GET(asscode, 1);
|
||||
RWS_anchor *rws = (RWS_anchor *)RWS;
|
||||
|
||||
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
|
||||
{
|
||||
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
|
||||
if (rc != 0) return rc;
|
||||
RWS = (int *)rws;
|
||||
}
|
||||
|
||||
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
|
||||
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
|
||||
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
|
||||
|
||||
|
@ -2684,10 +2796,13 @@ for (;;)
|
|||
ptr, /* where we currently are */
|
||||
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
|
||||
local_offsets, /* offset vector */
|
||||
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */
|
||||
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
|
||||
local_workspace, /* workspace vector */
|
||||
sizeof(local_workspace)/sizeof(int), /* size of same */
|
||||
rlevel); /* function recursion level */
|
||||
RWS_RSIZE, /* size of same */
|
||||
rlevel, /* function recursion level */
|
||||
RWS); /* recursion work space */
|
||||
|
||||
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc;
|
||||
if ((rc >= 0) ==
|
||||
|
@ -2702,13 +2817,25 @@ for (;;)
|
|||
/*-----------------------------------------------------------------*/
|
||||
case OP_RECURSE:
|
||||
{
|
||||
int rc;
|
||||
int *local_workspace;
|
||||
PCRE2_SIZE *local_offsets;
|
||||
RWS_anchor *rws = (RWS_anchor *)RWS;
|
||||
dfa_recursion_info *ri;
|
||||
PCRE2_SIZE local_offsets[1000];
|
||||
int local_workspace[1000];
|
||||
PCRE2_SPTR callpat = start_code + GET(code, 1);
|
||||
uint32_t recno = (callpat == mb->start_code)? 0 :
|
||||
GET2(callpat, 1 + LINK_SIZE);
|
||||
int rc;
|
||||
|
||||
if (rws->free < RWS_RSIZE + RWS_OVEC_RSIZE)
|
||||
{
|
||||
rc = more_workspace(&rws, RWS_OVEC_RSIZE, mb);
|
||||
if (rc != 0) return rc;
|
||||
RWS = (int *)rws;
|
||||
}
|
||||
|
||||
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
|
||||
local_workspace = ((int *)local_offsets) + RWS_OVEC_RSIZE;
|
||||
rws->free -= RWS_RSIZE + RWS_OVEC_RSIZE;
|
||||
|
||||
/* Check for repeating a recursion without advancing the subject
|
||||
pointer. This should catch convoluted mutual recursions. (Some simple
|
||||
|
@ -2732,11 +2859,13 @@ for (;;)
|
|||
ptr, /* where we currently are */
|
||||
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
|
||||
local_offsets, /* offset vector */
|
||||
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */
|
||||
RWS_OVEC_RSIZE/OVEC_UNIT, /* size of same */
|
||||
local_workspace, /* workspace vector */
|
||||
sizeof(local_workspace)/sizeof(int), /* size of same */
|
||||
rlevel); /* function recursion level */
|
||||
RWS_RSIZE, /* size of same */
|
||||
rlevel, /* function recursion level */
|
||||
RWS); /* recursion workspace */
|
||||
|
||||
rws->free += RWS_RSIZE + RWS_OVEC_RSIZE;
|
||||
mb->recursive = new_recursive.prevrec; /* Done this recursion */
|
||||
|
||||
/* Ran out of internal offsets */
|
||||
|
@ -2782,10 +2911,25 @@ for (;;)
|
|||
case OP_SCBRAPOS:
|
||||
case OP_BRAPOSZERO:
|
||||
{
|
||||
int rc;
|
||||
int *local_workspace;
|
||||
PCRE2_SIZE *local_offsets;
|
||||
PCRE2_SIZE charcount, matched_count;
|
||||
PCRE2_SPTR local_ptr = ptr;
|
||||
RWS_anchor *rws = (RWS_anchor *)RWS;
|
||||
BOOL allow_zero;
|
||||
|
||||
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
|
||||
{
|
||||
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
|
||||
if (rc != 0) return rc;
|
||||
RWS = (int *)rws;
|
||||
}
|
||||
|
||||
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
|
||||
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
|
||||
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
if (codevalue == OP_BRAPOSZERO)
|
||||
{
|
||||
allow_zero = TRUE;
|
||||
|
@ -2798,19 +2942,17 @@ for (;;)
|
|||
|
||||
for (matched_count = 0;; matched_count++)
|
||||
{
|
||||
PCRE2_SIZE local_offsets[2];
|
||||
int local_workspace[1000];
|
||||
|
||||
int rc = internal_dfa_match(
|
||||
rc = internal_dfa_match(
|
||||
mb, /* fixed match data */
|
||||
code, /* this subexpression's code */
|
||||
local_ptr, /* where we currently are */
|
||||
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
|
||||
local_offsets, /* offset vector */
|
||||
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */
|
||||
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
|
||||
local_workspace, /* workspace vector */
|
||||
sizeof(local_workspace)/sizeof(int), /* size of same */
|
||||
rlevel); /* function recursion level */
|
||||
RWS_RSIZE, /* size of same */
|
||||
rlevel, /* function recursion level */
|
||||
RWS); /* recursion workspace */
|
||||
|
||||
/* Failed to match */
|
||||
|
||||
|
@ -2827,6 +2969,8 @@ for (;;)
|
|||
local_ptr += charcount; /* Advance temporary position ptr */
|
||||
}
|
||||
|
||||
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
/* At this point we have matched the subpattern matched_count
|
||||
times, and local_ptr is pointing to the character after the end of the
|
||||
last match. */
|
||||
|
@ -2869,19 +3013,35 @@ for (;;)
|
|||
/*-----------------------------------------------------------------*/
|
||||
case OP_ONCE:
|
||||
{
|
||||
PCRE2_SIZE local_offsets[2];
|
||||
int local_workspace[1000];
|
||||
int rc;
|
||||
int *local_workspace;
|
||||
PCRE2_SIZE *local_offsets;
|
||||
RWS_anchor *rws = (RWS_anchor *)RWS;
|
||||
|
||||
int rc = internal_dfa_match(
|
||||
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
|
||||
{
|
||||
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
|
||||
if (rc != 0) return rc;
|
||||
RWS = (int *)rws;
|
||||
}
|
||||
|
||||
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
|
||||
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
|
||||
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
rc = internal_dfa_match(
|
||||
mb, /* fixed match data */
|
||||
code, /* this subexpression's code */
|
||||
ptr, /* where we currently are */
|
||||
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
|
||||
local_offsets, /* offset vector */
|
||||
sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */
|
||||
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
|
||||
local_workspace, /* workspace vector */
|
||||
sizeof(local_workspace)/sizeof(int), /* size of same */
|
||||
rlevel); /* function recursion level */
|
||||
RWS_RSIZE, /* size of same */
|
||||
rlevel, /* function recursion level */
|
||||
RWS); /* recursion workspace */
|
||||
|
||||
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
if (rc >= 0)
|
||||
{
|
||||
|
@ -3063,6 +3223,7 @@ pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length,
|
|||
PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data,
|
||||
pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount)
|
||||
{
|
||||
int rc;
|
||||
const pcre2_real_code *re = (const pcre2_real_code *)code;
|
||||
|
||||
PCRE2_SPTR start_match;
|
||||
|
@ -3071,9 +3232,9 @@ PCRE2_SPTR bumpalong_limit;
|
|||
PCRE2_SPTR req_cu_ptr;
|
||||
|
||||
BOOL utf, anchored, startline, firstline;
|
||||
|
||||
BOOL has_first_cu = FALSE;
|
||||
BOOL has_req_cu = FALSE;
|
||||
|
||||
PCRE2_UCHAR first_cu = 0;
|
||||
PCRE2_UCHAR first_cu2 = 0;
|
||||
PCRE2_UCHAR req_cu = 0;
|
||||
|
@ -3088,6 +3249,17 @@ pcre2_callout_block cb;
|
|||
dfa_match_block actual_match_block;
|
||||
dfa_match_block *mb = &actual_match_block;
|
||||
|
||||
/* Set up a starting block of memory for use during recursive calls to
|
||||
internal_dfa_match(). By putting this on the stack, it minimizes resource use
|
||||
in the case when it is not needed. If this is too small, more memory is
|
||||
obtained from the heap. At the start of each block is an anchor structure.*/
|
||||
|
||||
int base_recursion_workspace[RWS_BASE_SIZE];
|
||||
RWS_anchor *rws = (RWS_anchor *)base_recursion_workspace;
|
||||
rws->next = NULL;
|
||||
rws->size = RWS_BASE_SIZE;
|
||||
rws->free = RWS_BASE_SIZE - RWS_ANCHOR_SIZE;
|
||||
|
||||
/* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated
|
||||
subject string. */
|
||||
|
||||
|
@ -3184,6 +3356,7 @@ if (mcontext == NULL)
|
|||
mb->memctl = re->memctl;
|
||||
mb->match_limit = PRIV(default_match_context).match_limit;
|
||||
mb->match_limit_depth = PRIV(default_match_context).depth_limit;
|
||||
mb->heap_limit = PRIV(default_match_context).heap_limit;
|
||||
}
|
||||
else
|
||||
{
|
||||
|
@ -3198,6 +3371,7 @@ else
|
|||
mb->memctl = mcontext->memctl;
|
||||
mb->match_limit = mcontext->match_limit;
|
||||
mb->match_limit_depth = mcontext->depth_limit;
|
||||
mb->heap_limit = mcontext->heap_limit;
|
||||
}
|
||||
|
||||
if (mb->match_limit > re->limit_match)
|
||||
|
@ -3206,6 +3380,9 @@ if (mb->match_limit > re->limit_match)
|
|||
if (mb->match_limit_depth > re->limit_depth)
|
||||
mb->match_limit_depth = re->limit_depth;
|
||||
|
||||
if (mb->heap_limit > re->limit_heap)
|
||||
mb->heap_limit = re->limit_heap;
|
||||
|
||||
mb->start_code = (PCRE2_UCHAR *)((uint8_t *)re + sizeof(pcre2_real_code)) +
|
||||
re->name_count * re->name_entry_size;
|
||||
mb->tables = re->tables;
|
||||
|
@ -3215,6 +3392,7 @@ mb->start_offset = start_offset;
|
|||
mb->moptions = options;
|
||||
mb->poptions = re->overall_options;
|
||||
mb->match_call_count = 0;
|
||||
mb->heap_used = 0;
|
||||
|
||||
/* Process the \R and newline settings. */
|
||||
|
||||
|
@ -3351,8 +3529,6 @@ a match. */
|
|||
|
||||
for (;;)
|
||||
{
|
||||
int rc;
|
||||
|
||||
/* ----------------- Start of match optimizations ---------------- */
|
||||
|
||||
/* There are some optimizations that avoid running the match if a known
|
||||
|
@ -3544,7 +3720,7 @@ for (;;)
|
|||
in characters, we treat it as code units to avoid spending too much time
|
||||
in this optimization. */
|
||||
|
||||
if (end_subject - start_match < re->minlength) return PCRE2_ERROR_NOMATCH;
|
||||
if (end_subject - start_match < re->minlength) goto NOMATCH_EXIT;
|
||||
|
||||
/* If req_cu is set, we know that that code unit must appear in the
|
||||
subject for the match to succeed. If the first code unit is set, req_cu
|
||||
|
@ -3621,7 +3797,8 @@ for (;;)
|
|||
(uint32_t)match_data->oveccount * 2, /* actual size of same */
|
||||
workspace, /* workspace vector */
|
||||
(int)wscount, /* size of same */
|
||||
0); /* function recurse level */
|
||||
0, /* function recurse level */
|
||||
base_recursion_workspace); /* initial workspace for recursion */
|
||||
|
||||
/* Anything other than "no match" means we are done, always; otherwise, carry
|
||||
on only if not anchored. */
|
||||
|
@ -3637,7 +3814,7 @@ for (;;)
|
|||
match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject);
|
||||
match_data->startchar = (PCRE2_SIZE)(start_match - subject);
|
||||
match_data->rc = rc;
|
||||
return rc;
|
||||
goto EXIT;
|
||||
}
|
||||
|
||||
/* Advance to the next subject character unless we are at the end of a line
|
||||
|
@ -3668,8 +3845,18 @@ for (;;)
|
|||
|
||||
} /* "Bumpalong" loop */
|
||||
|
||||
NOMATCH_EXIT:
|
||||
rc = PCRE2_ERROR_NOMATCH;
|
||||
|
||||
return PCRE2_ERROR_NOMATCH;
|
||||
EXIT:
|
||||
while (rws->next != NULL)
|
||||
{
|
||||
RWS_anchor *next = rws->next;
|
||||
rws->next = next->next;
|
||||
mb->memctl.free(next, mb->memctl.memory_data);
|
||||
}
|
||||
|
||||
return rc;
|
||||
}
|
||||
|
||||
/* End of pcre2_dfa_match.c */
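The hunks above replace fixed-size local arrays with slices carved out of a chain of workspace blocks: the first block lives on the stack, and further blocks come from the heap only when a nested call finds too little free space. The following is a minimal, self-contained sketch of that pattern, not the code from this commit. The names `ws_anchor`, `ws_ctx`, `ws_more`, `nest` and `ws_demo`, the sizes, the doubling policy and the error codes are all hypothetical, and the heap limit is counted in ints here purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical anchor at the start of every block; sizes are in ints. */
typedef struct ws_anchor {
  struct ws_anchor *next;   /* next (heap) block in the chain, or NULL */
  unsigned int size;        /* total ints in this block, anchor included */
  unsigned int free;        /* unused ints at the end of this block */
} ws_anchor;

/* Hypothetical accounting for the heap limit, also in ints. */
typedef struct ws_ctx {
  size_t heap_limit;
  size_t heap_used;
} ws_ctx;

#define ANCHOR_INTS \
  ((unsigned int)((sizeof(ws_anchor) + sizeof(int) - 1) / sizeof(int)))
#define BASE_INTS 64u       /* assumed size of the initial stack block */
#define NEED_INTS 40u       /* assumed per-call workspace requirement */
#define ERR_NOMEM (-1)
#define ERR_HEAPLIMIT (-2)

/* Move to the next block, allocating it on first use. */
static int ws_more(ws_anchor **wsp, unsigned int need, ws_ctx *ctx)
{
  ws_anchor *cur = *wsp;
  ws_anchor *blk = cur->next;
  if (blk == NULL) {
    unsigned int size = cur->size * 2;              /* assumed growth policy */
    if (size < need + ANCHOR_INTS) size = need + ANCHOR_INTS;
    if (ctx->heap_used + size > ctx->heap_limit) return ERR_HEAPLIMIT;
    blk = malloc(size * sizeof(int));
    if (blk == NULL) return ERR_NOMEM;
    blk->next = NULL;
    blk->size = size;
    cur->next = blk;
    ctx->heap_used += size;
  }
  blk->free = blk->size - ANCHOR_INTS;              /* block starts empty */
  *wsp = blk;
  return 0;
}

/* Stand-in for a nested matching call that needs NEED_INTS of workspace. */
static int nest(ws_anchor *ws, int depth, ws_ctx *ctx)
{
  int *local;
  int rc;
  if (depth == 0) return 0;
  if (ws->free < NEED_INTS) {
    rc = ws_more(&ws, NEED_INTS, ctx);
    if (rc != 0) return rc;
  }
  local = (int *)ws + ws->size - ws->free;          /* carve a slice */
  ws->free -= NEED_INTS;
  local[0] = depth;                                 /* use the slice */
  rc = nest(ws, depth - 1, ctx);
  ws->free += NEED_INTS;                            /* give the slice back */
  return (rc < 0) ? rc : rc + 1;
}

/* Set up the stack block, run, then free any heap blocks in the chain. */
static int ws_demo(int depth, size_t heap_limit)
{
  union { ws_anchor anchor; int ints[BASE_INTS]; } base;  /* aligned block */
  ws_ctx ctx = { heap_limit, 0 };
  int rc;
  base.anchor.next = NULL;
  base.anchor.size = BASE_INTS;
  base.anchor.free = BASE_INTS - ANCHOR_INTS;
  rc = nest(&base.anchor, depth, &ctx);
  while (base.anchor.next != NULL) {
    ws_anchor *next = base.anchor.next;
    base.anchor.next = next->next;
    free(next);
  }
  return rc;
}
```

In this sketch, `ws_demo(1, 0)` succeeds because one level fits in the stack block, while a deeper nesting with a zero heap limit returns `ERR_HEAPLIMIT`. That mirrors the way the new test cases expect a heap limit error for `LIMIT_HEAP=0` and success for a larger limit.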
|
||||
|
|
|
@ -253,6 +253,11 @@ maximum size of this can be limited. */
|
|||
|
||||
#define START_FRAMES_SIZE 20480
|
||||
|
||||
/* Similarly, for DFA matching, an initial internal workspace vector is
|
||||
allocated on the stack. */
|
||||
|
||||
#define DFA_START_RWS_SIZE 30720
|
||||
|
||||
/* Define the default BSR convention. */
|
||||
|
||||
#ifdef BSR_ANYCRLF
|
||||
|
|
|
@ -896,6 +896,8 @@ typedef struct dfa_match_block {
|
|||
PCRE2_SPTR last_used_ptr; /* Latest consulted character */
|
||||
const uint8_t *tables; /* Character tables */
|
||||
PCRE2_SIZE start_offset; /* The start offset value */
|
||||
PCRE2_SIZE heap_limit; /* As it says */
|
||||
PCRE2_SIZE heap_used; /* As it says */
|
||||
uint32_t match_limit; /* As it says */
|
||||
uint32_t match_limit_depth; /* As it says */
|
||||
uint32_t match_call_count; /* Number of calls of internal function */
|
||||
|
|
|
@ -5760,6 +5760,8 @@ PCRE2_SET_HEAP_LIMIT(dat_context, max);
|
|||
|
||||
for (;;)
|
||||
{
|
||||
uint32_t stack_start = 0;
|
||||
|
||||
if (errnumber == PCRE2_ERROR_HEAPLIMIT)
|
||||
{
|
||||
PCRE2_SET_HEAP_LIMIT(dat_context, mid);
|
||||
|
@ -5775,6 +5777,7 @@ for (;;)
|
|||
|
||||
if ((dat_datctl.control & CTL_DFA) != 0)
|
||||
{
|
||||
stack_start = DFA_START_RWS_SIZE/1024;
|
||||
if (dfa_workspace == NULL)
|
||||
dfa_workspace = (int *)malloc(DFA_WS_DIMENSION*sizeof(int));
|
||||
if (dfa_matched++ == 0)
|
||||
|
@ -5789,11 +5792,21 @@ for (;;)
|
|||
dat_datctl.options, match_data, PTR(dat_context));
|
||||
|
||||
else
|
||||
{
|
||||
stack_start = START_FRAMES_SIZE/1024;
|
||||
PCRE2_MATCH(capcount, compiled_code, pp, ulen, dat_datctl.offset,
|
||||
dat_datctl.options, match_data, PTR(dat_context));
|
||||
}
|
||||
|
||||
if (capcount == errnumber)
|
||||
{
|
||||
if ((mid & 0x80000000u) != 0)
|
||||
{
|
||||
fprintf(outfile, "Can't find minimum %s limit: check pattern for "
|
||||
"restriction\n", msg);
|
||||
break;
|
||||
}
|
||||
|
||||
min = mid;
|
||||
mid = (mid == max - 1)? max : (max != UINT32_MAX)? (min + max)/2 : mid*2;
|
||||
}
|
||||
|
@ -5802,11 +5815,12 @@ for (;;)
|
|||
capcount == PCRE2_ERROR_PARTIAL)
|
||||
{
|
||||
/* If we've not hit the error with a heap limit less than the size of the
|
||||
initial stack frame vector, the heap is not being used, so the minimum
|
||||
limit is zero; there's no need to go on. The other limits are always
|
||||
greater than zero. */
|
||||
initial stack frame vector (for pcre2_match()) or the initial stack
|
||||
workspace vector (for pcre2_dfa_match()), the heap is not being used, so
|
||||
the minimum limit is zero; there's no need to go on. The other limits are
|
||||
always greater than zero. */
|
||||
|
||||
if (errnumber == PCRE2_ERROR_HEAPLIMIT && mid < START_FRAMES_SIZE/1024)
|
||||
if (errnumber == PCRE2_ERROR_HEAPLIMIT && mid < stack_start)
|
||||
{
|
||||
fprintf(outfile, "Minimum %s limit = 0\n", msg);
|
||||
break;
|
||||
|
@ -6771,7 +6785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
|
|||
PCRE2_SIZE end = pmatch[i].rm_eo;
|
||||
for (j = last_printed + 1; j < i; j++)
|
||||
fprintf(outfile, "%2d: <unset>\n", (int)j);
|
||||
last_printed = i;
|
||||
last_printed = i;
|
||||
if (start > end)
|
||||
{
|
||||
start = pmatch[i].rm_eo;
|
||||
|
@ -7139,18 +7153,16 @@ else for (gmatched = 0;; gmatched++)
|
|||
(double)CLOCKS_PER_SEC);
|
||||
}
|
||||
|
||||
/* Find the heap, match and depth limits if requested. The match and heap
|
||||
limits are not relevant for DFA matching and the depth and heap limits are
|
||||
not relevant for JIT. The return from check_match_limit() is the return from
|
||||
the final call to pcre2_match() or pcre2_dfa_match(). */
|
||||
/* Find the heap, match and depth limits if requested. The depth and heap
|
||||
limits are not relevant for JIT. The return from check_match_limit() is the
|
||||
return from the final call to pcre2_match() or pcre2_dfa_match(). */
|
||||
|
||||
if ((dat_datctl.control & CTL_FINDLIMITS) != 0)
|
||||
{
|
||||
capcount = 0; /* This stops compiler warnings */
|
||||
|
||||
if ((dat_datctl.control & CTL_DFA) == 0 &&
|
||||
(FLD(compiled_code, executable_jit) == NULL ||
|
||||
(dat_datctl.options & PCRE2_NO_JIT) != 0))
|
||||
if (FLD(compiled_code, executable_jit) == NULL ||
|
||||
(dat_datctl.options & PCRE2_NO_JIT) != 0)
|
||||
{
|
||||
(void)check_match_limit(pp, arg_ulen, PCRE2_ERROR_HEAPLIMIT, "heap");
|
||||
}
|
||||
|
@ -7165,6 +7177,12 @@ else for (gmatched = 0;; gmatched++)
|
|||
capcount = check_match_limit(pp, arg_ulen, PCRE2_ERROR_DEPTHLIMIT,
|
||||
"depth");
|
||||
}
|
||||
|
||||
if (capcount == 0)
|
||||
{
|
||||
fprintf(outfile, "Matched, but offsets vector is too small to show all matches\n");
|
||||
capcount = dat_datctl.oveccount;
|
||||
}
|
||||
}
|
||||
|
||||
/* Otherwise just run a single match, setting up a callout if required (the
|
||||
|
@ -7877,7 +7895,7 @@ else
|
|||
(void)PCRE2_CONFIG(PCRE2_CONFIG_NEWLINE, &optval);
|
||||
print_newline_config(optval, FALSE);
|
||||
(void)PCRE2_CONFIG(PCRE2_CONFIG_BSR, &optval);
|
||||
printf(" \\R matches %s\n",
|
||||
printf(" \\R matches %s\n",
|
||||
(optval == PCRE2_BSR_ANYCRLF)? "CR, LF, or CRLF only" :
|
||||
"all Unicode newlines");
|
||||
(void)PCRE2_CONFIG(PCRE2_CONFIG_NEVER_BACKSLASH_C, &optval);
|
||||
|
|
|
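The find-limits loop changed above searches for the smallest limit at which a match no longer fails with the limit error. It doubles `mid` while no upper bound is known, then bisects. The sketch below illustrates that search under stated assumptions. The update of `mid` on the error branch, the high-bit check and the `mid < stack_start` shortcut follow the lines shown in the diff. The starting value of 64, the `mid == min + 1` termination test, and the names `hits_limit` and `find_min_limit` are assumptions made for illustration, and a simple threshold stands in for a real call to a matching function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for running a match: the limit error occurs
   whenever the limit is below what the match actually needs. */
static int hits_limit(uint32_t limit, uint32_t needed)
{
  return limit < needed;
}

/* Returns the minimum workable limit, 0 if the limited resource is never
   used below stack_start, or UINT32_MAX if no minimum can be found. */
static uint32_t find_min_limit(uint32_t needed, uint32_t stack_start)
{
  uint32_t min = 0;
  uint32_t mid = 64;              /* assumed starting guess */
  uint32_t max = UINT32_MAX;      /* no upper bound known yet */

  for (;;) {
    if (hits_limit(mid, needed)) {
      /* Limit too small: give up if doubling would overflow. */
      if ((mid & 0x80000000u) != 0) return UINT32_MAX;
      min = mid;
      mid = (mid == max - 1) ? max :
            (max != UINT32_MAX) ? (min + max) / 2 : mid * 2;
    } else {
      /* Succeeded below the size of the initial stack block, so the
         heap was not needed at all. */
      if (mid < stack_start) return 0;
      if (mid == min + 1) return mid;   /* assumed termination test */
      max = mid;
      mid = (min + max) / 2;
    }
  }
}
```

Passing `stack_start` as a parameter reflects the change in the diff, where the threshold is `DFA_START_RWS_SIZE/1024` for DFA matching and `START_FRAMES_SIZE/1024` otherwise, rather than a single hard-coded constant.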
@ -4874,6 +4874,14 @@
|
|||
\= Expect depth limit exceeded
|
||||
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
|
||||
|
||||
/(*LIMIT_HEAP=0)^((.)(?1)|.)$/
|
||||
\= Expect heap limit exceeded
|
||||
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
|
||||
|
||||
/(*LIMIT_HEAP=50000)^((.)(?1)|.)$/
|
||||
\= Expect success
|
||||
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
|
||||
|
||||
/(02-)?[0-9]{3}-[0-9]{3}/
|
||||
02-123-123
|
||||
|
||||
|
|
|
@ -7667,12 +7667,23 @@ No match
|
|||
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
|
||||
Failed: error -53: matching depth limit exceeded
|
||||
|
||||
/(*LIMIT_HEAP=0)^((.)(?1)|.)$/
|
||||
\= Expect heap limit exceeded
|
||||
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
|
||||
Failed: error -63: heap limit exceeded
|
||||
|
||||
/(*LIMIT_HEAP=50000)^((.)(?1)|.)$/
|
||||
\= Expect success
|
||||
a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
|
||||
0: a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]
|
||||
|
||||
/(02-)?[0-9]{3}-[0-9]{3}/
|
||||
02-123-123
|
||||
0: 02-123-123
|
||||
|
||||
/^(a(?2))(b)(?1)/
|
||||
abbab\=find_limits
|
||||
Minimum heap limit = 0
|
||||
Minimum match limit = 4
|
||||
Minimum depth limit = 2
|
||||
0: abbab
|
||||
|
|