diff --git a/ChangeLog b/ChangeLog index cfa9547..2967e44 100644 --- a/ChangeLog +++ b/ChangeLog @@ -50,7 +50,15 @@ offset is set zero for early errors. (c) Support for non-C99 snprintf() that returns -1 in the overflow case. -11. Minor tidy of pcre2_dfa_matgch() code. +11. Minor tidy of pcre2_dfa_match() code. + +12. Refactored pcre2_dfa_match() so that the internal recursive calls no longer +use the stack for local workspace and local ovectors. Instead, an initial block +of stack is reserved, but if this is insufficient, heap memory is used. The +heap limit parameter now applies to pcre2_dfa_match(). + +13. If a "find limits" test of DFA matching in pcre2test resulted in too many +matches for the ovector, no matches were displayed. Version 10.31 12-February-2018 diff --git a/README b/README index 66b756b..e4729ac 100644 --- a/README +++ b/README @@ -241,9 +241,11 @@ library. They are also documented in the pcre2build man page. discussion in the pcre2api man page (search for pcre2_set_match_limit). . There is a separate counter that limits the depth of nested backtracking - during a matching process, which indirectly limits the amount of heap memory - that is used. This also has a default of ten million, which is essentially - "unlimited". You can change the default by setting, for example, + (pcre2_match()) or nested function calls (pcre2_dfa_match()) during a + matching process, which indirectly limits the amount of heap memory that is + used, and in the case of pcre2_dfa_match() the amount of stack as well. This + counter also has a default of ten million, which is essentially "unlimited". + You can change the default by setting, for example, --with-match-limit-depth=5000 @@ -251,7 +253,7 @@ library. They are also documented in the pcre2build man page. pcre2_set_depth_limit). . 
You can also set an explicit limit on the amount of heap memory used by - the pcre2_match() interpreter: + the pcre2_match() and pcre2_dfa_match() interpreters: --with-heap-limit=500 @@ -885,4 +887,4 @@ The distribution should contain the files listed below. Philip Hazel Email local part: ph10 Email domain: cam.ac.uk -Last updated: 25 February 2018 +Last updated: 27 April 2018 diff --git a/configure.ac b/configure.ac index 5349257..e3ca650 100644 --- a/configure.ac +++ b/configure.ac @@ -142,7 +142,7 @@ AC_ARG_ENABLE(jit, AS_HELP_STRING([--enable-jit], [enable Just-In-Time compiling support]), , enable_jit=no) - + # This code enables JIT if the hardware supports it. if test "$enable_jit" = "auto"; then AC_LANG(C) @@ -718,10 +718,11 @@ AC_DEFINE_UNQUOTED([PARENS_NEST_LIMIT], [$with_parens_nest_limit], [ AC_DEFINE_UNQUOTED([MATCH_LIMIT], [$with_match_limit], [ The value of MATCH_LIMIT determines the default number of times the pcre2_match() function can record a backtrack position during a single - matching attempt. There is a runtime interface for setting a different limit. - The limit exists in order to catch runaway regular expressions that take for - ever to determine that they do not match. The default is set very large so - that it does not accidentally catch legitimate cases.]) + matching attempt. The value is also used to limit a loop counter in + pcre2_dfa_match(). There is a runtime interface for setting a different + limit. The limit exists in order to catch runaway regular expressions that + take for ever to determine that they do not match. The default is set very + large so that it does not accidentally catch legitimate cases.]) # --with-match-limit-recursion is an obsolete synonym for --with-match-limit-depth @@ -745,11 +746,15 @@ AC_DEFINE_UNQUOTED([MATCH_LIMIT_DEPTH], [$with_match_limit_depth], [ the maximum amount of heap memory that is used. The value of MATCH_LIMIT_DEPTH provides this facility. 
To have any useful effect, it must be less than the value of MATCH_LIMIT. The default is to use the same value - as MATCH_LIMIT. There is a runtime method for setting a different limit.]) + as MATCH_LIMIT. There is a runtime method for setting a different limit. In + the case of pcre2_dfa_match(), this limit controls the depth of the internal + nested function calls that are used for pattern recursions, lookarounds, and + atomic groups.]) AC_DEFINE_UNQUOTED([HEAP_LIMIT], [$with_heap_limit], [ - This limits the amount of memory that pcre2_match() may use while matching - a pattern. The value is in kilobytes.]) + This limits the amount of memory that may be used while matching + a pattern. It applies to both pcre2_match() and pcre2_dfa_match(). It does + not apply to JIT matching. The value is in kilobytes.]) AC_DEFINE([MAX_NAME_SIZE], [32], [ This limit is parameterized just in case anybody ever wants to diff --git a/doc/html/NON-AUTOTOOLS-BUILD.txt b/doc/html/NON-AUTOTOOLS-BUILD.txt index 0775794..0bf4507 100644 --- a/doc/html/NON-AUTOTOOLS-BUILD.txt +++ b/doc/html/NON-AUTOTOOLS-BUILD.txt @@ -10,6 +10,7 @@ This document contains the following sections: Calling conventions in Windows environments Comments about Win32 builds Building PCRE2 on Windows with CMake + Building PCRE2 on Windows with Visual Studio Testing with RunTest.bat Building PCRE2 on native z/OS and z/VM @@ -328,6 +329,18 @@ cache can be deleted by selecting "File > Delete Cache". most recent build configuration is targeted by the tests. A summary of test results is presented. Complete test output is subsequently available for review in Testing\Temporary under your build dir. + + +BUILDING PCRE2 ON WINDOWS WITH VISUAL STUDIO + +The code currently cannot be compiled without a stdint.h header, which is +available only in relatively recent versions of Visual Studio. 
However, this +portable and permissively-licensed implementation of the header worked without +issue: + + http://www.azillionmonkeys.com/qed/pstdint.h + +Just rename it and drop it into the top level of the build tree. TESTING WITH RUNTEST.BAT @@ -382,6 +395,6 @@ Everything in that location, source and executable, is in EBCDIC and native z/OS file formats. The port provides an API for LE languages such as COBOL and for the z/OS and z/VM versions of the Rexx languages. -=============================== -Last Updated: 13 September 2017 -=============================== +=========================== +Last Updated: 19 April 2018 +=========================== diff --git a/doc/html/README.txt b/doc/html/README.txt index 66b756b..e4729ac 100644 --- a/doc/html/README.txt +++ b/doc/html/README.txt @@ -241,9 +241,11 @@ library. They are also documented in the pcre2build man page. discussion in the pcre2api man page (search for pcre2_set_match_limit). . There is a separate counter that limits the depth of nested backtracking - during a matching process, which indirectly limits the amount of heap memory - that is used. This also has a default of ten million, which is essentially - "unlimited". You can change the default by setting, for example, + (pcre2_match()) or nested function calls (pcre2_dfa_match()) during a + matching process, which indirectly limits the amount of heap memory that is + used, and in the case of pcre2_dfa_match() the amount of stack as well. This + counter also has a default of ten million, which is essentially "unlimited". + You can change the default by setting, for example, --with-match-limit-depth=5000 @@ -251,7 +253,7 @@ library. They are also documented in the pcre2build man page. pcre2_set_depth_limit). . 
You can also set an explicit limit on the amount of heap memory used by - the pcre2_match() interpreter: + the pcre2_match() and pcre2_dfa_match() interpreters: --with-heap-limit=500 @@ -885,4 +887,4 @@ The distribution should contain the files listed below. Philip Hazel Email local part: ph10 Email domain: cam.ac.uk -Last updated: 25 February 2018 +Last updated: 27 April 2018 diff --git a/doc/html/pcre2_dfa_match.html b/doc/html/pcre2_dfa_match.html index 36d7976..8702cca 100644 --- a/doc/html/pcre2_dfa_match.html +++ b/doc/html/pcre2_dfa_match.html @@ -46,9 +46,9 @@ just once (except when processing lookaround assertions). This function is wscount Number of elements in the vector For pcre2_dfa_match(), a match context is needed only if you want to set -up a callout function or specify the match and/or the recursion depth limits. -The length and startoffset values are code units, not characters. -The options are: +up a callout function or specify the heap limit or the match or the recursion +depth limits. The length and startoffset values are code units, not +characters. The options are:
   PCRE2_ANCHORED          Match only at the first position
   PCRE2_ENDANCHORED       Pattern can match only at end of subject
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index ba3b2ca..7498afb 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -951,14 +951,15 @@ offset limit. In other words, whichever limit comes first is used.
 
The heap_limit parameter specifies, in units of kilobytes, the maximum amount of heap memory that pcre2_match() may use to hold backtracking -information when running an interpretive match. This limit does not apply to -matching with the JIT optimization, which has its own memory control -arrangements (see the +information when running an interpretive match. This limit also applies to +pcre2_dfa_match(), which may use the heap when processing patterns with a +lot of nested pattern recursion or lookarounds or atomic groups. This limit +does not apply to matching with the JIT optimization, which has its own memory +control arrangements (see the pcre2jit -documentation for more details), nor does it apply to pcre2_dfa_match(). -If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is -returned. The default limit is set when PCRE2 is built; the default default is -very large and is essentially "unlimited". +documentation for more details). If the limit is reached, the negative error +code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is +built; the default default is very large and is essentially "unlimited".

A value for the heap limit may also be supplied by an item at the start of a @@ -978,6 +979,12 @@ Heap memory is used only if the initial vector is too small. If the heap limit is set to a value less than 21 (in particular, zero) no heap memory will be used. In this case, only patterns that do not have a lot of nested backtracking can be successfully processed. +

+

+Similarly, for pcre2_dfa_match(), a vector on the system stack is used +when processing pattern recursions, lookarounds, or atomic groups, and only if +this is not big enough is heap memory used. In this case, too, setting a value +of zero disables the use of the heap.

int pcre2_set_match_limit(pcre2_match_context *mcontext, @@ -1035,11 +1042,22 @@ backtracking.

The depth limit is not relevant, and is ignored, when matching is done using JIT compiled code. However, it is supported by pcre2_dfa_match(), which -uses it to limit the depth of internal recursive function calls that implement -atomic groups, lookaround assertions, and pattern recursions. This is, -therefore, an indirect limit on the amount of system stack that is used. A -recursive pattern such as /(.)(?1)/, when matched to a very long string using -pcre2_dfa_match(), can use a great deal of stack. +uses it to limit the depth of nested internal recursive function calls that +implement atomic groups, lookaround assertions, and pattern recursions. This +limits, indirectly, the amount of system stack that is used. It was more useful +in versions before 10.32, when stack memory was used for local workspace +vectors for recursive function calls. From version 10.32, only local variables +are allocated on the stack and as each call uses only a few hundred bytes, even +a small stack can support quite a lot of recursion. +

+

+If the depth of internal recursive function calls is great enough, local +workspace vectors are allocated on the heap from version 10.32 onwards, so the +depth limit also indirectly limits the amount of heap memory that is used. A +recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string +using pcre2_dfa_match(), can use a great deal of memory. However, it is +probably better to limit heap usage directly by calling +pcre2_set_heap_limit().

The default value for the depth limit can be set when PCRE2 is built; the @@ -1096,15 +1114,16 @@ and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively. PCRE2_CONFIG_DEPTHLIMIT

The output is a uint32_t integer that gives the default limit for the depth of -nested backtracking in pcre2_match() or the depth of nested recursions -and lookarounds in pcre2_dfa_match(). Further details are given with -pcre2_set_depth_limit() above. +nested backtracking in pcre2_match() or the depth of nested recursions, +lookarounds, and atomic groups in pcre2_dfa_match(). Further details are +given with pcre2_set_depth_limit() above.
   PCRE2_CONFIG_HEAPLIMIT
 
The output is a uint32_t integer that gives, in kilobytes, the default limit -for the amount of heap memory used by pcre2_match(). Further details are -given with pcre2_set_heap_limit() above. +for the amount of heap memory used by pcre2_match() or +pcre2_dfa_match(). Further details are given with +pcre2_set_heap_limit() above.
   PCRE2_CONFIG_JIT
 
@@ -3510,17 +3529,7 @@ capture. Calls to the convenience functions that extract substrings by name return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a DFA match. The convenience functions that extract substrings by number never -return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are -slightly different: -
-  PCRE2_ERROR_UNAVAILABLE
-
-The ovector is not big enough to include a slot for the given substring number. -
-  PCRE2_ERROR_UNSET
-
-There is a slot in the ovector for this substring, but there were insufficient -matches to fill it. +return PCRE2_ERROR_NOSUBSTRING.

The matched strings are stored in the ovector in reverse order of length; that @@ -3594,9 +3603,9 @@ Cambridge, England.


REVISION

-Last updated: 31 December 2017 +Last updated: 27 April 2018
-Copyright © 1997-2017 University of Cambridge. +Copyright © 1997-2018 University of Cambridge.

Return to the PCRE2 index page. diff --git a/doc/html/pcre2build.html b/doc/html/pcre2build.html index edf24e8..c9d9324 100644 --- a/doc/html/pcre2build.html +++ b/doc/html/pcre2build.html @@ -295,9 +295,10 @@ change this by a setting such as --with-heap-limit=500 which limits the amount of heap to 500 kilobytes. This limit applies only to -interpretive matching in pcre2_match(). It does not apply when JIT (which has -its own memory arrangements) is used, nor does it apply to -pcre2_dfa_match(). +interpretive matching in pcre2_match() and pcre2_dfa_match(), which +may also use the heap for internal workspace when processing complicated +patterns. This limit does not apply when JIT (which has its own memory +arrangements) is used.

You can also explicitly limit the depth of nested backtracking in the @@ -573,7 +574,7 @@ Cambridge, England.


REVISION

-Last updated: 25 February 2018 +Last updated: 26 April 2018
Copyright © 1997-2018 University of Cambridge.
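Reviewer note on the heap-limit wording above: the behaviour described for pcre2_dfa_match() (an initial block reserved on the stack, heap used only when that block runs out, and a heap limit given in kilobytes) can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not PCRE2's internal code: every name in it (workspace_pool, pool_get, pool_put, nested, run_nested, INITIAL_SLOTS, the 16-slot frame size) is hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical illustration only; none of these names come from PCRE2. */

#define INITIAL_SLOTS 64   /* size of the stack-reserved block, in ints */
#define FRAME_SLOTS   16   /* workspace needed by each nested call */

typedef struct {
    size_t heap_limit_kib; /* cap on heap use, in kilobytes; 0 means no heap */
    size_t heap_used;      /* bytes currently taken from the heap */
    size_t stack_used;     /* slots handed out from the initial block */
    int   *initial;        /* points at the block reserved on the stack */
} workspace_pool;

/* Hand out workspace from the initial block while it lasts; after that,
   fall back to the heap, refusing once the kilobyte limit would be passed. */
static int *pool_get(workspace_pool *p, size_t slots, int *from_heap)
{
    assert(p != NULL && from_heap != NULL);
    if (p->stack_used + slots <= INITIAL_SLOTS) {
        int *block = p->initial + p->stack_used;
        p->stack_used += slots;
        *from_heap = 0;
        return block;
    }
    size_t bytes = slots * sizeof(int);
    if (p->heap_used + bytes > p->heap_limit_kib * 1024) return NULL;
    int *block = malloc(bytes);
    if (block == NULL) return NULL;
    p->heap_used += bytes;
    *from_heap = 1;
    return block;
}

/* Release workspace in the reverse order of acquisition. */
static void pool_put(workspace_pool *p, int *block, size_t slots, int from_heap)
{
    if (from_heap) {
        free(block);
        p->heap_used -= slots * sizeof(int);
    } else {
        p->stack_used -= slots;
    }
}

/* Stand-in for a nested internal call: returns 0 on success and -1 when
   the heap limit stops it from obtaining workspace. */
static int nested(workspace_pool *p, int depth)
{
    int from_heap;
    int *ws = pool_get(p, FRAME_SLOTS, &from_heap);
    if (ws == NULL) return -1;
    ws[0] = depth;
    int rc = (depth > 1) ? nested(p, depth - 1) : 0;
    pool_put(p, ws, FRAME_SLOTS, from_heap);
    return rc;
}

/* Entry point: reserve the initial block on the stack, then recurse. */
int run_nested(int depth, size_t heap_limit_kib)
{
    int initial[INITIAL_SLOTS];
    workspace_pool p = { heap_limit_kib, 0, 0, initial };
    return nested(&p, depth);
}
```

In this sketch, run_nested(4, 0) fits entirely in the initial block and succeeds with the heap disabled, whereas run_nested(5, 0) fails because the fifth level would need heap memory. That mirrors the documented point that a heap limit of zero still allows matches whose nesting is shallow enough.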
diff --git a/doc/html/pcre2callout.html b/doc/html/pcre2callout.html index 2adf21a..4ff1673 100644 --- a/doc/html/pcre2callout.html +++ b/doc/html/pcre2callout.html @@ -310,10 +310,12 @@ PCRE2_UNSET.

For DFA matching, the offset_vector field points to the ovector that was -passed to the matching function in the match data block, but it holds no useful -information at callout time because pcre2_dfa_match() does not support -substring capturing. The value of capture_top is always 1 and the value -of capture_last is always 0 for DFA matching. +passed to the matching function in the match data block for callouts at the top +level, but to an internal ovector during the processing of pattern recursions, +lookarounds, and atomic groups. However, these ovectors hold no useful +information because pcre2_dfa_match() does not support substring +capturing. The value of capture_top is always 1 and the value of +capture_last is always 0 for DFA matching.

The subject and subject_length fields contain copies of the values @@ -461,9 +463,9 @@ Cambridge, England.


REVISION

-Last updated: 22 December 2017 +Last updated: 26 April 2018
-Copyright © 1997-2017 University of Cambridge. +Copyright © 1997-2018 University of Cambridge.

Return to the PCRE2 index page. diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index c495cba..1131c2a 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -173,12 +173,12 @@ the application to apply the JIT optimization by calling Setting match resource limits

-The pcre2_match() function contains a counter that is incremented every time it -goes round its main loop. The caller of pcre2_match() can set a limit on -this counter, which therefore limits the amount of computing resource used for -a match. The maximum depth of nested backtracking can also be limited; this -indirectly restricts the amount of heap memory that is used, but there is also -an explicit memory limit that can be set. +The pcre2_match() function contains a counter that is incremented every +time it goes round its main loop. The caller of pcre2_match() can set a +limit on this counter, which therefore limits the amount of computing resource +used for a match. The maximum depth of nested backtracking can also be limited; +this indirectly restricts the amount of heap memory that is used, but there is +also an explicit memory limit that can be set.

These facilities are provided to catch runaway matches that are provoked by @@ -195,20 +195,22 @@ where d is any number of decimal digits. However, the value of the setting must be less than the value set (or defaulted) by the caller of pcre2_match() for it to have any effect. In other words, the pattern writer can lower the limits set by the programmer, but not raise them. If there is more than one -setting of one of these limits, the lower value is used. +setting of one of these limits, the lower value is used. The heap limit is +specified in kilobytes.

Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is still recognized for backwards compatibility.

-The heap limit applies only when the pcre2_match() interpreter is used -for matching. It does not apply to JIT or DFA matching. The match limit is used -(but in a different way) when JIT is being used, or when -pcre2_dfa_match() is called, to limit computing resource usage by those -matching functions. The depth limit is ignored by JIT but is relevant for DFA -matching, which uses function recursion for recursions within the pattern. In -this case, the depth limit controls the amount of system stack that is used. +The heap limit applies only when the pcre2_match() or +pcre2_dfa_match() interpreters are used for matching. It does not apply +to JIT. The match limit is used (but in a different way) when JIT is being +used, or when pcre2_dfa_match() is called, to limit computing resource +usage by those matching functions. The depth limit is ignored by JIT but is +relevant for DFA matching, which uses function recursion for recursions within +the pattern and for lookaround assertions and atomic groups. In this case, the +depth limit controls the depth of such recursion.


Newline conventions @@ -2818,11 +2820,6 @@ matched at the top level, its final captured value is unset, even if it was (temporarily) set at a deeper level during the matching process.

-If there are more than 15 capturing parentheses in a pattern, PCRE2 has to -obtain extra memory from the heap to store data during a recursion. If no -memory can be obtained, the match fails with the PCRE2_ERROR_NOMEMORY error. -

-

Do not confuse the (?R) item with the condition (R), which tests for recursion. Consider this pattern, which matches text in angle brackets, allowing for arbitrary nesting. Only digits are allowed in nested brackets (that is, when @@ -3479,9 +3476,9 @@ Cambridge, England.


REVISION

-Last updated: 12 September 2017 +Last updated: 25 April 2018
-Copyright © 1997-2017 University of Cambridge. +Copyright © 1997-2018 University of Cambridge.

Return to the PCRE2 index page. diff --git a/doc/html/pcre2perform.html b/doc/html/pcre2perform.html index 28f4f73..7ff3b87 100644 --- a/doc/html/pcre2perform.html +++ b/doc/html/pcre2perform.html @@ -93,9 +93,17 @@ may also reduce the memory requirements.

In contrast to pcre2_match(), pcre2_dfa_match() does use recursive function calls, but only for processing atomic groups, lookaround assertions, -and recursion within the pattern. Too much nested recursion may cause stack -issues. The "match depth" parameter can be used to limit the depth of function -recursion in pcre2_dfa_match(). +and recursion within the pattern. The original version of the code used to +allocate quite large internal workspace vectors on the stack, which caused some +problems for some patterns in environments with small stacks. From release +10.32 the code for pcre2_dfa_match() has been re-factored to use heap +memory when necessary for internal workspace when recursing, though recursive +function calls are still used. +

+

+The "match depth" parameter can be used to limit the depth of function +recursion, and the "match heap" parameter to limit heap memory in +pcre2_dfa_match().


PROCESSING TIME

@@ -244,9 +252,9 @@ Cambridge, England.


REVISION

-Last updated: 08 April 2017 +Last updated: 25 April 2018
-Copyright © 1997-2017 University of Cambridge. +Copyright © 1997-2018 University of Cambridge.

Return to the PCRE2 index page. diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html index 7d98d90..d6e5345 100644 --- a/doc/html/pcre2test.html +++ b/doc/html/pcre2test.html @@ -1199,7 +1199,7 @@ pattern. get=<number or name> extract captured substring getall extract all captured substrings /g global global matching - heap_limit=<n> set a limit on heap memory + heap_limit=<n> set a limit on heap memory (Kbytes) jitstack=<n> set size of JIT stack mark show mark values match_limit=<n> set a match limit @@ -1438,20 +1438,17 @@ Finding minimum limits

If the find_limits modifier is present on a subject line, pcre2test calls the relevant matching function several times, setting different values in -the match context via pcre2_set_heap_limit(), \fBpcre2_set_match_limit(), -or pcre2_set_depth_limit() until it finds the minimum values for each -parameter that allows the match to complete without error. +the match context via pcre2_set_heap_limit(), +pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds +the minimum values for each parameter that allows the match to complete without +error. If JIT is being used, only the match limit is relevant.

-If JIT is being used, only the match limit is relevant. If DFA matching is -being used, only the depth limit is relevant. -

-

-The match_limit number is a measure of the amount of backtracking -that takes place, and learning the minimum value can be instructive. For most -simple matches, the number is quite small, but for patterns with very large -numbers of matching possibilities, it can become large very quickly with -increasing length of subject string. +When using this modifier, the pattern should not contain any limit settings +such as (*LIMIT_MATCH=...) within it. If such a setting is present and is +lower than the minimum matching value, the minimum value cannot be found +because pcre2_set_match_limit() etc. are only able to reduce the value of +an in-pattern limit; they cannot increase it.

For non-DFA matching, the minimum depth_limit number is a measure of how @@ -1460,6 +1457,22 @@ searched). In the case of DFA matching, depth_limit controls the depth of recursive calls of the internal function that is used for handling pattern recursion, lookaround assertions, and atomic groups.

+

+For non-DFA matching, the match_limit number is a measure of the amount +of backtracking that takes place, and learning the minimum value can be +instructive. For most simple matches, the number is quite small, but for +patterns with very large numbers of matching possibilities, it can become large +very quickly with increasing length of subject string. In the case of DFA +matching, match_limit controls the total number of calls, both recursive +and non-recursive, to the internal matching function, thus controlling the +overall amount of computing resource that is used. +

+

+For both kinds of matching, the heap_limit number (which is in kilobytes) +limits the amount of heap memory used for matching. A value of zero disables +the use of any heap memory; many simple pattern matches can be done without +using the heap, so this is not an unreasonable setting. +


Showing MARK names
@@ -1476,13 +1489,14 @@ Showing memory usage

The memory modifier causes pcre2test to log the sizes of all heap memory allocation and freeing calls that occur during a call to -pcre2_match(). These occur only when a match requires a bigger vector -than the default for remembering backtracking points. In many cases there will -be no heap memory used and therefore no additional output. No heap memory is -allocated during matching with pcre2_dfa_match or with JIT, so in those -cases the memory modifier never has any effect. For this modifier to -work, the null_context modifier must not be set on both the pattern and -the subject, though it can be set on one or the other. +pcre2_match() or pcre2_dfa_match(). These occur only when a match +requires a bigger vector than the default for remembering backtracking points +(pcre2_match()) or for internal workspace (pcre2_dfa_match()). In +many cases there will be no heap memory used and therefore no additional +output. No heap memory is allocated during matching with JIT, so in that case +the memory modifier never has any effect. For this modifier to work, the +null_context modifier must not be set on both the pattern and the +subject, though it can be set on one or the other.


Setting a starting offset @@ -1982,9 +1996,9 @@ Cambridge, England.


REVISION

-Last updated: 21 December 2017 +Last updated: 25 April 2018
-Copyright © 1997-2017 University of Cambridge. +Copyright © 1997-2018 University of Cambridge.

Return to the PCRE2 index page. diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 218ff5a..9c70f06 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -959,13 +959,15 @@ PCRE2 CONTEXTS The heap_limit parameter specifies, in units of kilobytes, the maximum amount of heap memory that pcre2_match() may use to hold backtracking - information when running an interpretive match. This limit does not - apply to matching with the JIT optimization, which has its own memory - control arrangements (see the pcre2jit documentation for more details), - nor does it apply to pcre2_dfa_match(). If the limit is reached, the - negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default - limit is set when PCRE2 is built; the default default is very large and - is essentially "unlimited". + information when running an interpretive match. This limit also applies + to pcre2_dfa_match(), which may use the heap when processing patterns + with a lot of nested pattern recursion or lookarounds or atomic groups. + This limit does not apply to matching with the JIT optimization, which + has its own memory control arrangements (see the pcre2jit documentation + for more details). If the limit is reached, the negative error code + PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 + is built; the default default is very large and is essentially "unlim- + ited". A value for the heap limit may also be supplied by an item at the start of a pattern of the form @@ -984,71 +986,86 @@ PCRE2 CONTEXTS zero) no heap memory will be used. In this case, only patterns that do not have a lot of nested backtracking can be successfully processed. + Similarly, for pcre2_dfa_match(), a vector on the system stack is used + when processing pattern recursions, lookarounds, or atomic groups, and + only if this is not big enough is heap memory used. In this case, too, + setting a value of zero disables the use of the heap. 
+ int pcre2_set_match_limit(pcre2_match_context *mcontext, uint32_t value); - The match_limit parameter provides a means of preventing PCRE2 from + The match_limit parameter provides a means of preventing PCRE2 from using up too many computing resources when processing patterns that are not going to match, but which have a very large number of possibilities - in their search trees. The classic example is a pattern that uses + in their search trees. The classic example is a pattern that uses nested unlimited repeats. - There is an internal counter in pcre2_match() that is incremented each - time round its main matching loop. If this value reaches the match + There is an internal counter in pcre2_match() that is incremented each + time round its main matching loop. If this value reaches the match limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT. - This has the effect of limiting the amount of backtracking that can + This has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from - zero for each position in the subject string. This limit also applies + zero for each position in the subject string. This limit also applies to pcre2_dfa_match(), though the counting is done in a different way. - When pcre2_match() is called with a pattern that was successfully pro- + When pcre2_match() is called with a pattern that was successfully pro- cessed by pcre2_jit_compile(), the way in which matching is executed is - entirely different. However, there is still the possibility of runaway - matching that goes on for a very long time, and so the match_limit - value is also used in this case (but in a different way) to limit how + entirely different. However, there is still the possibility of runaway + matching that goes on for a very long time, and so the match_limit + value is also used in this case (but in a different way) to limit how long the matching can continue. 
- The default value for the limit can be set when PCRE2 is built; the - default default is 10 million, which handles all but the most extreme - cases. A value for the match limit may also be supplied by an item at + The default value for the limit can be set when PCRE2 is built; the + default default is 10 million, which handles all but the most extreme + cases. A value for the match limit may also be supplied by an item at the start of a pattern of the form (*LIMIT_MATCH=ddd) - where ddd is a decimal number. However, such a setting is ignored + where ddd is a decimal number. However, such a setting is ignored unless ddd is less than the limit set by the caller of pcre2_match() or pcre2_dfa_match() or, if no such limit is set, less than the default. int pcre2_set_depth_limit(pcre2_match_context *mcontext, uint32_t value); - This parameter limits the depth of nested backtracking in - pcre2_match(). Each time a nested backtracking point is passed, a new + This parameter limits the depth of nested backtracking in + pcre2_match(). Each time a nested backtracking point is passed, a new memory "frame" is used to remember the state of matching at that point. - Thus, this parameter indirectly limits the amount of memory that is - used in a match. However, because the size of each memory "frame" + Thus, this parameter indirectly limits the amount of memory that is + used in a match. However, because the size of each memory "frame" depends on the number of capturing parentheses, the actual memory limit - varies from pattern to pattern. This limit was more useful in versions + varies from pattern to pattern. This limit was more useful in versions before 10.30, where function recursion was used for backtracking. - The depth limit is not relevant, and is ignored, when matching is done + The depth limit is not relevant, and is ignored, when matching is done using JIT compiled code. 
However, it is supported by pcre2_dfa_match(), - which uses it to limit the depth of internal recursive function calls - that implement atomic groups, lookaround assertions, and pattern recur- - sions. This is, therefore, an indirect limit on the amount of system - stack that is used. A recursive pattern such as /(.)(?1)/, when matched - to a very long string using pcre2_dfa_match(), can use a great deal of - stack. + which uses it to limit the depth of nested internal recursive function + calls that implement atomic groups, lookaround assertions, and pattern + recursions. This limits, indirectly, the amount of system stack that is + used. It was more useful in versions before 10.32, when stack memory + was used for local workspace vectors for recursive function calls. From + version 10.32, only local variables are allocated on the stack and as + each call uses only a few hundred bytes, even a small stack can support + quite a lot of recursion. - The default value for the depth limit can be set when PCRE2 is built; - the default default is the same value as the default for the match - limit. If the limit is exceeded, pcre2_match() or pcre2_dfa_match() + If the depth of internal recursive function calls is great enough, + local workspace vectors are allocated on the heap from version 10.32 + onwards, so the depth limit also indirectly limits the amount of heap + memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when + matched to a very long string using pcre2_dfa_match(), can use a great + deal of memory. However, it is probably better to limit heap usage + directly by calling pcre2_set_heap_limit(). + + The default value for the depth limit can be set when PCRE2 is built; + the default default is the same value as the default for the match + limit. If the limit is exceeded, pcre2_match() or pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT.
A value for the depth limit may also be supplied by an item at the start of a pattern of the form (*LIMIT_DEPTH=ddd) - where ddd is a decimal number. However, such a setting is ignored + where ddd is a decimal number. However, such a setting is ignored unless ddd is less than the limit set by the caller of pcre2_match() or pcre2_dfa_match() or, if no such limit is set, less than the default. @@ -1057,52 +1074,53 @@ CHECKING BUILD-TIME OPTIONS int pcre2_config(uint32_t what, void *where); - The function pcre2_config() makes it possible for a PCRE2 client to - discover which optional features have been compiled into the PCRE2 - library. The pcre2build documentation has more details about these + The function pcre2_config() makes it possible for a PCRE2 client to + discover which optional features have been compiled into the PCRE2 + library. The pcre2build documentation has more details about these optional features. - The first argument for pcre2_config() specifies which information is - required. The second argument is a pointer to memory into which the - information is placed. If NULL is passed, the function returns the - amount of memory that is needed for the requested information. For - calls that return numerical values, the value is in bytes; when - requesting these values, where should point to appropriately aligned - memory. For calls that return strings, the required length is given in + The first argument for pcre2_config() specifies which information is + required. The second argument is a pointer to memory into which the + information is placed. If NULL is passed, the function returns the + amount of memory that is needed for the requested information. For + calls that return numerical values, the value is in bytes; when + requesting these values, where should point to appropriately aligned + memory. For calls that return strings, the required length is given in code units, not counting the terminating zero. 
- When requesting information, the returned value from pcre2_config() is - non-negative on success, or the negative error code PCRE2_ERROR_BADOP- - TION if the value in the first argument is not recognized. The follow- + When requesting information, the returned value from pcre2_config() is + non-negative on success, or the negative error code PCRE2_ERROR_BADOP- + TION if the value in the first argument is not recognized. The follow- ing information is available: PCRE2_CONFIG_BSR - The output is a uint32_t integer whose value indicates what character - sequences the \R escape sequence matches by default. A value of + The output is a uint32_t integer whose value indicates what character + sequences the \R escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \R matches any Unicode line ending - sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, + sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The default can be overridden when a pattern is compiled. PCRE2_CONFIG_COMPILED_WIDTHS - The output is a uint32_t integer whose lower bits indicate which code - unit widths were selected when PCRE2 was built. The 1-bit indicates - 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup- + The output is a uint32_t integer whose lower bits indicate which code + unit widths were selected when PCRE2 was built. The 1-bit indicates + 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup- port, respectively. PCRE2_CONFIG_DEPTHLIMIT - The output is a uint32_t integer that gives the default limit for the - depth of nested backtracking in pcre2_match() or the depth of nested - recursions and lookarounds in pcre2_dfa_match(). Further details are - given with pcre2_set_depth_limit() above. 
+ The output is a uint32_t integer that gives the default limit for the + depth of nested backtracking in pcre2_match() or the depth of nested + recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur- + ther details are given with pcre2_set_depth_limit() above. PCRE2_CONFIG_HEAPLIMIT - The output is a uint32_t integer that gives, in kilobytes, the default - limit for the amount of heap memory used by pcre2_match(). Further - details are given with pcre2_set_heap_limit() above. + The output is a uint32_t integer that gives, in kilobytes, the default + limit for the amount of heap memory used by pcre2_match() or + pcre2_dfa_match(). Further details are given with + pcre2_set_heap_limit() above. PCRE2_CONFIG_JIT @@ -3396,74 +3414,63 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION Calls to the convenience functions that extract substrings by name return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a DFA match. The convenience functions that extract substrings by - number never return PCRE2_ERROR_NOSUBSTRING, and the meanings of some - other errors are slightly different: + number never return PCRE2_ERROR_NOSUBSTRING. - PCRE2_ERROR_UNAVAILABLE - - The ovector is not big enough to include a slot for the given substring - number. - - PCRE2_ERROR_UNSET - - There is a slot in the ovector for this substring, but there were - insufficient matches to fill it. - - The matched strings are stored in the ovector in reverse order of - length; that is, the longest matching string is first. If there were - too many matches to fit into the ovector, the yield of the function is + The matched strings are stored in the ovector in reverse order of + length; that is, the longest matching string is first. If there were + too many matches to fit into the ovector, the yield of the function is zero, and the vector is filled with the longest matches. 
- NOTE: PCRE2's "auto-possessification" optimization usually applies to - character repeats at the end of a pattern (as well as internally). For - example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA - matching, this means that only one possible match is found. If you - really do want multiple matches in such cases, either use an ungreedy - repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when + NOTE: PCRE2's "auto-possessification" optimization usually applies to + character repeats at the end of a pattern (as well as internally). For + example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA + matching, this means that only one possible match is found. If you + really do want multiple matches in such cases, either use an ungreedy + repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when compiling. Error returns from pcre2_dfa_match() The pcre2_dfa_match() function returns a negative number when it fails. - Many of the errors are the same as for pcre2_match(), as described + Many of the errors are the same as for pcre2_match(), as described above. There are in addition the following errors that are specific to pcre2_dfa_match(): PCRE2_ERROR_DFA_UITEM - This return is given if pcre2_dfa_match() encounters an item in the - pattern that it does not support, for instance, the use of \C in a UTF + This return is given if pcre2_dfa_match() encounters an item in the + pattern that it does not support, for instance, the use of \C in a UTF mode or a back reference. PCRE2_ERROR_DFA_UCOND - This return is given if pcre2_dfa_match() encounters a condition item - that uses a back reference for the condition, or a test for recursion + This return is given if pcre2_dfa_match() encounters a condition item + that uses a back reference for the condition, or a test for recursion in a specific group. These are not supported. 
PCRE2_ERROR_DFA_WSSIZE - This return is given if pcre2_dfa_match() runs out of space in the + This return is given if pcre2_dfa_match() runs out of space in the workspace vector. PCRE2_ERROR_DFA_RECURSE - When a recursive subpattern is processed, the matching function calls + When a recursive subpattern is processed, the matching function calls itself recursively, using private memory for the ovector and workspace. - This error is given if the internal ovector is not large enough. This + This error is given if the internal ovector is not large enough. This should be extremely rare, as a vector of size 1000 is used. PCRE2_ERROR_DFA_BADRESTART - When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, - some plausibility checks are made on the contents of the workspace, - which should contain data about the previous partial match. If any of + When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, + some plausibility checks are made on the contents of the workspace, + which should contain data about the previous partial match. If any of these checks fail, this error is given. SEE ALSO - pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), + pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3). @@ -3476,8 +3483,8 @@ AUTHOR REVISION - Last updated: 31 December 2017 - Copyright (c) 1997-2017 University of Cambridge. + Last updated: 27 April 2018 + Copyright (c) 1997-2018 University of Cambridge. ------------------------------------------------------------------------------ @@ -3746,28 +3753,29 @@ LIMITING PCRE2 RESOURCE USAGE --with-heap-limit=500 which limits the amount of heap to 500 kilobytes. This limit applies - only to interpretive matching in pcre2_match(). It does not apply when - JIT (which has its own memory arrangements) is used, nor does it apply - to pcre2_dfa_match(). 
+ only to interpretive matching in pcre2_match() and pcre2_dfa_match(), + which may also use the heap for internal workspace when processing com- + plicated patterns. This limit does not apply when JIT (which has its + own memory arrangements) is used. - You can also explicitly limit the depth of nested backtracking in the + You can also explicitly limit the depth of nested backtracking in the pcre2_match() interpreter. This limit defaults to the value that is set - for --with-match-limit. You can set a lower default limit by adding, + for --with-match-limit. You can set a lower default limit by adding, for example, --with-match-limit_depth=10000 - to the configure command. This value can be overridden at run time. - This depth limit indirectly limits the amount of heap memory that is - used, but because the size of each backtracking "frame" depends on the - number of capturing parentheses in a pattern, the amount of heap that - is used before the limit is reached varies from pattern to pattern. - This limit was more useful in versions before 10.30, where function + to the configure command. This value can be overridden at run time. + This depth limit indirectly limits the amount of heap memory that is + used, but because the size of each backtracking "frame" depends on the + number of capturing parentheses in a pattern, the amount of heap that + is used before the limit is reached varies from pattern to pattern. + This limit was more useful in versions before 10.30, where function recursion was used for backtracking. As well as applying to pcre2_match(), the depth limit also controls the - depth of recursive function calls in pcre2_dfa_match(). These are used - for lookaround assertions, atomic groups, and recursion within pat- + depth of recursive function calls in pcre2_dfa_match(). These are used + for lookaround assertions, atomic groups, and recursion within pat- terns. The limit does not apply to JIT matching. 
@@ -3775,45 +3783,45 @@ CREATING CHARACTER TABLES AT BUILD TIME PCRE2 uses fixed tables for processing characters whose code points are less than 256. By default, PCRE2 is built with a set of tables that are - distributed in the file src/pcre2_chartables.c.dist. These tables are + distributed in the file src/pcre2_chartables.c.dist. These tables are for ASCII codes only. If you add --enable-rebuild-chartables - to the configure command, the distributed tables are no longer used. - Instead, a program called dftables is compiled and run. This outputs + to the configure command, the distributed tables are no longer used. + Instead, a program called dftables is compiled and run. This outputs the source for new set of tables, created in the default locale of your C run-time system. This method of replacing the tables does not work if - you are cross compiling, because dftables is run on the local host. If - you need to create alternative tables when cross compiling, you will + you are cross compiling, because dftables is run on the local host. If + you need to create alternative tables when cross compiling, you will have to do so "by hand". USING EBCDIC CODE - PCRE2 assumes by default that it will run in an environment where the - character code is ASCII or Unicode, which is a superset of ASCII. This + PCRE2 assumes by default that it will run in an environment where the + character code is ASCII or Unicode, which is a superset of ASCII. This is the case for most computer operating systems. PCRE2 can, however, be compiled to run in an 8-bit EBCDIC environment by adding --enable-ebcdic --disable-unicode to the configure command. This setting implies --enable-rebuild-charta- - bles. You should only use it if you know that you are in an EBCDIC + bles. You should only use it if you know that you are in an EBCDIC environment (for example, an IBM mainframe operating system). - It is not possible to support both EBCDIC and UTF-8 codes in the same - version of the library. 
Consequently, --enable-unicode and --enable- + It is not possible to support both EBCDIC and UTF-8 codes in the same + version of the library. Consequently, --enable-unicode and --enable- ebcdic are mutually exclusive. The EBCDIC character that corresponds to an ASCII LF is assumed to have - the value 0x15 by default. However, in some EBCDIC environments, 0x25 + the value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In such an environment you should use --enable-ebcdic-nl25 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR - has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and + has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 0x25 is not chosen as LF is made to correspond to the Unicode NEL char- acter (which, in Unicode, is 0x85). @@ -3826,34 +3834,34 @@ PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS By default, on non-Windows systems, pcre2grep supports the use of call- outs with string arguments within the patterns it is matching, in order - to run external scripts. For details, see the pcre2grep documentation. - This support can be disabled by adding --disable-pcre2grep-callout to + to run external scripts. For details, see the pcre2grep documentation. + This support can be disabled by adding --disable-pcre2grep-callout to the configure command. PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT - By default, pcre2grep reads all files as plain text. You can build it - so that it recognizes files whose names end in .gz or .bz2, and reads + By default, pcre2grep reads all files as plain text. You can build it + so that it recognizes files whose names end in .gz or .bz2, and reads them with libz or libbz2, respectively, by adding one or both of --enable-pcre2grep-libz --enable-pcre2grep-libbz2 to the configure command. These options naturally require that the rel- - evant libraries are installed on your system. Configuration will fail + evant libraries are installed on your system. 
Configuration will fail if they are not. PCRE2GREP BUFFER SIZE - pcre2grep uses an internal buffer to hold a "window" on the file it is + pcre2grep uses an internal buffer to hold a "window" on the file it is scanning, in order to be able to output "before" and "after" lines when - it finds a match. The starting size of the buffer is controlled by a - parameter whose default value is 20K. The buffer itself is three times - this size, but because of the way it is used for holding "before" - lines, the longest line that is guaranteed to be processable is the - parameter size. If a longer line is encountered, pcre2grep automati- + it finds a match. The starting size of the buffer is controlled by a + parameter whose default value is 20K. The buffer itself is three times + this size, but because of the way it is used for holding "before" + lines, the longest line that is guaranteed to be processable is the + parameter size. If a longer line is encountered, pcre2grep automati- cally expands the buffer, up to a specified maximum size, whose default is 1M or the starting size, whichever is the larger. You can change the default parameter values by adding, for example, @@ -3861,8 +3869,8 @@ PCRE2GREP BUFFER SIZE --with-pcre2grep-bufsize=51200 --with-pcre2grep-max-bufsize=2097152 - to the configure command. The caller of pcre2grep can override these - values by using --buffer-size and --max-buffer-size on the command + to the configure command. The caller of pcre2grep can override these + values by using --buffer-size and --max-buffer-size on the command line. @@ -3873,26 +3881,26 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT --enable-pcre2test-libreadline --enable-pcre2test-libedit - to the configure command, pcre2test is linked with the libreadline + to the configure command, pcre2test is linked with the libreadline orlibedit library, respectively, and when its input is from a terminal, - it reads it using the readline() function. 
This provides line-editing - and history facilities. Note that libreadline is GPL-licensed, so if - you distribute a binary of pcre2test linked in this way, there may be + it reads it using the readline() function. This provides line-editing + and history facilities. Note that libreadline is GPL-licensed, so if + you distribute a binary of pcre2test linked in this way, there may be licensing issues. These can be avoided by linking instead with libedit, which has a BSD licence. - Setting --enable-pcre2test-libreadline causes the -lreadline option to - be added to the pcre2test build. In many operating environments with a - sytem-installed readline library this is sufficient. However, in some + Setting --enable-pcre2test-libreadline causes the -lreadline option to + be added to the pcre2test build. In many operating environments with a + system-installed readline library this is sufficient. However, in some environments (e.g. if an unmodified distribution version of readline is - in use), some extra configuration may be necessary. The INSTALL file + in use), some extra configuration may be necessary. The INSTALL file for libreadline says this: "Readline uses the termcap functions, but does not link with the termcap or curses library itself, allowing applications which link with readline the to choose an appropriate library." - If your environment has not been set up so that an appropriate library + If your environment has not been set up so that an appropriate library is automatically included, you may need to add something like LIBS="-ncurses" @@ -3906,7 +3914,7 @@ INCLUDING DEBUGGING CODE --enable-debug - to the configure command, additional debugging code is included in the + to the configure command, additional debugging code is included in the build. This feature is intended for use by the PCRE2 maintainers.
@@ -3916,15 +3924,15 @@ DEBUGGING WITH VALGRIND SUPPORT --enable-valgrind - to the configure command, PCRE2 will use valgrind annotations to mark - certain memory regions as unaddressable. This allows it to detect - invalid memory accesses, and is mostly useful for debugging PCRE2 + to the configure command, PCRE2 will use valgrind annotations to mark + certain memory regions as unaddressable. This allows it to detect + invalid memory accesses, and is mostly useful for debugging PCRE2 itself. CODE COVERAGE REPORTING - If your C compiler is gcc, you can build a version of PCRE2 that can + If your C compiler is gcc, you can build a version of PCRE2 that can generate a code coverage report for its test suite. To enable this, you must install lcov version 1.6 or above. Then specify @@ -3933,20 +3941,20 @@ CODE COVERAGE REPORTING to the configure command and build PCRE2 in the usual way. Note that using ccache (a caching C compiler) is incompatible with code - coverage reporting. If you have configured ccache to run automatically + coverage reporting. If you have configured ccache to run automatically on your system, you must set the environment variable CCACHE_DISABLE=1 before running make to build PCRE2, so that ccache is not used. - When --enable-coverage is used, the following addition targets are + When --enable-coverage is used, the following additional targets are added to the Makefile: make coverage - This creates a fresh coverage report for the PCRE2 test suite. It is - equivalent to running "make coverage-reset", "make coverage-baseline", + This creates a fresh coverage report for the PCRE2 test suite. It is + equivalent to running "make coverage-reset", "make coverage-baseline", "make check", and then "make coverage-report".
make coverage-reset @@ -3963,56 +3971,56 @@ CODE COVERAGE REPORTING make coverage-clean-report - This removes the generated coverage report without cleaning the cover- + This removes the generated coverage report without cleaning the cover- age data itself. make coverage-clean-data - This removes the captured coverage data without removing the coverage + This removes the captured coverage data without removing the coverage files created at compile time (*.gcno). make coverage-clean - This cleans all coverage data including the generated coverage report. - For more information about code coverage, see the gcov and lcov docu- + This cleans all coverage data including the generated coverage report. + For more information about code coverage, see the gcov and lcov docu- mentation. SUPPORT FOR FUZZERS - There is a special option for use by people who want to run fuzzing + There is a special option for use by people who want to run fuzzing tests on PCRE2: --enable-fuzz-support At present this applies only to the 8-bit library. If set, it causes an - extra library called libpcre2-fuzzsupport.a to be built, but not - installed. This contains a single function called LLVMFuzzerTestOneIn- - put() whose arguments are a pointer to a string and the length of the - string. When called, this function tries to compile the string as a - pattern, and if that succeeds, to match it. This is done both with no - options and with some random options bits that are generated from the + extra library called libpcre2-fuzzsupport.a to be built, but not + installed. This contains a single function called LLVMFuzzerTestOneIn- + put() whose arguments are a pointer to a string and the length of the + string. When called, this function tries to compile the string as a + pattern, and if that succeeds, to match it. This is done both with no + options and with some random options bits that are generated from the string. 
- Setting --enable-fuzz-support also causes a binary called pcre2fuz- - zcheck to be created. This is normally run under valgrind or used when + Setting --enable-fuzz-support also causes a binary called pcre2fuz- + zcheck to be created. This is normally run under valgrind or used when PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing - function and outputs information about it is doing. The input strings - are specified by arguments: if an argument starts with "=" the rest of - it is a literal input string. Otherwise, it is assumed to be a file + function and outputs information about what it is doing. The input strings + are specified by arguments: if an argument starts with "=" the rest of + it is a literal input string. Otherwise, it is assumed to be a file name, and the contents of the file are the test string. OBSOLETE OPTION - In versions of PCRE2 prior to 10.30, there were two ways of handling - backtracking in the pcre2_match() function. The default was to use the + In versions of PCRE2 prior to 10.30, there were two ways of handling + backtracking in the pcre2_match() function. The default was to use the system stack, but if --disable-stack-for-recursion - was set, memory on the heap was used. From release 10.30 onwards this - has changed (the stack is no longer used) and this option now does + was set, memory on the heap was used. From release 10.30 onwards this + has changed (the stack is no longer used) and this option now does nothing except give a warning. @@ -4030,7 +4038,7 @@ AUTHOR REVISION - Last updated: 25 February 2018 + Last updated: 26 April 2018 Copyright (c) 1997-2018 University of Cambridge. ------------------------------------------------------------------------------ @@ -4311,10 +4319,12 @@ THE CALLOUT INTERFACE their ovector slots set to PCRE2_UNSET.
For DFA matching, the offset_vector field points to the ovector that - was passed to the matching function in the match data block, but it - holds no useful information at callout time because pcre2_dfa_match() - does not support substring capturing. The value of capture_top is - always 1 and the value of capture_last is always 0 for DFA matching. + was passed to the matching function in the match data block for call- + outs at the top level, but to an internal ovector during the processing + of pattern recursions, lookarounds, and atomic groups. However, these + ovectors hold no useful information because pcre2_dfa_match() does not + support substring capturing. The value of capture_top is always 1 and + the value of capture_last is always 0 for DFA matching. The subject and subject_length fields contain copies of the values that were passed to the matching function. @@ -4454,8 +4464,8 @@ AUTHOR REVISION - Last updated: 22 December 2017 - Copyright (c) 1997-2017 University of Cambridge. + Last updated: 26 April 2018 + Copyright (c) 1997-2018 University of Cambridge. ------------------------------------------------------------------------------ @@ -5919,19 +5929,19 @@ SPECIAL START-OF-PATTERN ITEMS pcre2_match() for it to have any effect. In other words, the pattern writer can lower the limits set by the programmer, but not raise them. If there is more than one setting of one of these limits, the lower - value is used. + value is used. The heap limit is specified in kilobytes. Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is still recognized for backwards compatibility. - The heap limit applies only when the pcre2_match() interpreter is used - for matching. It does not apply to JIT or DFA matching. The match limit - is used (but in a different way) when JIT is being used, or when + The heap limit applies only when the pcre2_match() or pcre2_dfa_match() + interpreters are used for matching. It does not apply to JIT. 
The match + limit is used (but in a different way) when JIT is being used, or when pcre2_dfa_match() is called, to limit computing resource usage by those matching functions. The depth limit is ignored by JIT but is relevant for DFA matching, which uses function recursion for recursions within - the pattern. In this case, the depth limit controls the amount of sys- - tem stack that is used. + the pattern and for lookaround assertions and atomic groups. In this + case, the depth limit controls the depth of such recursion. Newline conventions @@ -8260,86 +8270,81 @@ RECURSIVE PATTERNS unset, even if it was (temporarily) set at a deeper level during the matching process. - If there are more than 15 capturing parentheses in a pattern, PCRE2 has - to obtain extra memory from the heap to store data during a recursion. - If no memory can be obtained, the match fails with the - PCRE2_ERROR_NOMEMORY error. - - Do not confuse the (?R) item with the condition (R), which tests for - recursion. Consider this pattern, which matches text in angle brack- - ets, allowing for arbitrary nesting. Only digits are allowed in nested - brackets (that is, when recursing), whereas any characters are permit- + Do not confuse the (?R) item with the condition (R), which tests for + recursion. Consider this pattern, which matches text in angle brack- + ets, allowing for arbitrary nesting. Only digits are allowed in nested + brackets (that is, when recursing), whereas any characters are permit- ted at the outer level. < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > - In this pattern, (?(R) is the start of a conditional subpattern, with - two different alternatives for the recursive and non-recursive cases. + In this pattern, (?(R) is the start of a conditional subpattern, with + two different alternatives for the recursive and non-recursive cases. The (?R) item is the actual recursive call. 
Differences in recursion processing between PCRE2 and Perl Some former differences between PCRE2 and Perl no longer exist. - Before release 10.30, recursion processing in PCRE2 differed from Perl - in that a recursive subpattern call was always treated as an atomic - group. That is, once it had matched some of the subject string, it was - never re-entered, even if it contained untried alternatives and there - was a subsequent matching failure. (Historical note: PCRE implemented + Before release 10.30, recursion processing in PCRE2 differed from Perl + in that a recursive subpattern call was always treated as an atomic + group. That is, once it had matched some of the subject string, it was + never re-entered, even if it contained untried alternatives and there + was a subsequent matching failure. (Historical note: PCRE implemented recursion before Perl did.) - Starting with release 10.30, recursive subroutine calls are no longer + Starting with release 10.30, recursive subroutine calls are no longer treated as atomic. That is, they can be re-entered to try unused alter- - natives if there is a matching failure later in the pattern. This is - now compatible with the way Perl works. If you want a subroutine call + natives if there is a matching failure later in the pattern. This is + now compatible with the way Perl works. If you want a subroutine call to be atomic, you must explicitly enclose it in an atomic group. - Supporting backtracking into recursions simplifies certain types of + Supporting backtracking into recursions simplifies certain types of recursive pattern. 
For example, this pattern matches palindromic strings: ^((.)(?1)\2|.?)$ - The second branch in the group matches a single central character in - the palindrome when there are an odd number of characters, or nothing - when there are an even number of characters, but in order to work it - has to be able to try the second case when the rest of the pattern + The second branch in the group matches a single central character in + the palindrome when there are an odd number of characters, or nothing + when there are an even number of characters, but in order to work it + has to be able to try the second case when the rest of the pattern match fails. If you want to match typical palindromic phrases, the pat- - tern has to ignore all non-word characters, which can be done like + tern has to ignore all non-word characters, which can be done like this: ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$ - If run with the PCRE2_CASELESS option, this pattern matches phrases - such as "A man, a plan, a canal: Panama!". Note the use of the posses- - sive quantifier *+ to avoid backtracking into sequences of non-word + If run with the PCRE2_CASELESS option, this pattern matches phrases + such as "A man, a plan, a canal: Panama!". Note the use of the posses- + sive quantifier *+ to avoid backtracking into sequences of non-word characters. Without this, PCRE2 takes a great deal longer (ten times or - more) to match typical phrases, and Perl takes so long that you think + more) to match typical phrases, and Perl takes so long that you think it has gone into a loop. - Another way in which PCRE2 and Perl used to differ in their recursion - processing is in the handling of captured values. Formerly in Perl, - when a subpattern was called recursively or as a subpattern (see the - next section), it had no access to any values that were captured out- - side the recursion, whereas in PCRE2 these values can be referenced. 
+ Another way in which PCRE2 and Perl used to differ in their recursion + processing is in the handling of captured values. Formerly in Perl, + when a subpattern was called recursively or as a subpattern (see the + next section), it had no access to any values that were captured out- + side the recursion, whereas in PCRE2 these values can be referenced. Consider this pattern: ^(.)(\1|a(?2)) - This pattern matches "bab". The first capturing parentheses match "b", - then in the second group, when the back reference \1 fails to match - "b", the second alternative matches "a" and then recurses. In the - recursion, \1 does now match "b" and so the whole match succeeds. This - match used to fail in Perl, but in later versions (I tried 5.024) it + This pattern matches "bab". The first capturing parentheses match "b", + then in the second group, when the back reference \1 fails to match + "b", the second alternative matches "a" and then recurses. In the + recursion, \1 does now match "b" and so the whole match succeeds. This + match used to fail in Perl, but in later versions (I tried 5.024) it now works. SUBPATTERNS AS SUBROUTINES - If the syntax for a recursive subpattern call (either by number or by - name) is used outside the parentheses to which it refers, it operates - like a subroutine in a programming language. The called subpattern may - be defined before or after the reference. A numbered reference can be + If the syntax for a recursive subpattern call (either by number or by + name) is used outside the parentheses to which it refers, it operates + like a subroutine in a programming language. The called subpattern may + be defined before or after the reference. A numbered reference can be absolute or relative, as in these examples: (...(absolute)...)...(?2)... 
@@ -8350,102 +8355,102 @@ SUBPATTERNS AS SUBROUTINES (sens|respons)e and \1ibility - matches "sense and sensibility" and "response and responsibility", but + matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern (sens|respons)e and (?1)ibility - is used, it does match "sense and responsibility" as well as the other - two strings. Another example is given in the discussion of DEFINE + is used, it does match "sense and responsibility" as well as the other + two strings. Another example is given in the discussion of DEFINE above. - Like recursions, subroutine calls used to be treated as atomic, but - this changed at PCRE2 release 10.30, so backtracking into subroutine - calls can now occur. However, any capturing parentheses that are set + Like recursions, subroutine calls used to be treated as atomic, but + this changed at PCRE2 release 10.30, so backtracking into subroutine + calls can now occur. However, any capturing parentheses that are set during the subroutine call revert to their previous values afterwards. - Processing options such as case-independence are fixed when a subpat- - tern is defined, so if it is used as a subroutine, such options cannot + Processing options such as case-independence are fixed when a subpat- + tern is defined, so if it is used as a subroutine, such options cannot be changed for different calls. For example, consider this pattern: (abc)(?i:(?-1)) - It matches "abcabc". It does not match "abcABC" because the change of + It matches "abcabc". It does not match "abcABC" because the change of processing option does not affect the called subpattern. 
ONIGURUMA SUBROUTINE SYNTAX - For compatibility with Oniguruma, the non-Perl syntax \g followed by a + For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is - an alternative syntax for referencing a subpattern as a subroutine, - possibly recursively. Here are two of the examples used above, rewrit- + an alternative syntax for referencing a subpattern as a subroutine, + possibly recursively. Here are two of the examples used above, rewrit- ten using this syntax: (? \( ( (?>[^()]+) | \g )* \) ) (sens|respons)e and \g'1'ibility - PCRE2 supports an extension to Oniguruma: if a number is preceded by a + PCRE2 supports an extension to Oniguruma: if a number is preceded by a plus or a minus sign it is taken as a relative reference. For example: (abc)(?i:\g<-1>) - Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not - synonymous. The former is a back reference; the latter is a subroutine + Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not + synonymous. The former is a back reference; the latter is a subroutine call. CALLOUTS Perl has a feature whereby using the sequence (?{...}) causes arbitrary - Perl code to be obeyed in the middle of matching a regular expression. + Perl code to be obeyed in the middle of matching a regular expression. This makes it possible, amongst other things, to extract different sub- strings that match the same pair of parentheses when there is a repeti- tion. - PCRE2 provides a similar feature, but of course it cannot obey arbi- - trary Perl code. The feature is called "callout". The caller of PCRE2 - provides an external function by putting its entry point in a match - context using the function pcre2_set_callout(), and then passing that - context to pcre2_match() or pcre2_dfa_match(). If no match context is + PCRE2 provides a similar feature, but of course it cannot obey arbi- + trary Perl code. 
The feature is called "callout". The caller of PCRE2 + provides an external function by putting its entry point in a match + context using the function pcre2_set_callout(), and then passing that + context to pcre2_match() or pcre2_dfa_match(). If no match context is passed, or if the callout entry point is set to NULL, callouts are dis- abled. - Within a regular expression, (?C) indicates a point at which the - external function is to be called. There are two kinds of callout: - those with a numerical argument and those with a string argument. (?C) - on its own with no argument is treated as (?C0). A numerical argument - allows the application to distinguish between different callouts. - String arguments were added for release 10.20 to make it possible for - script languages that use PCRE2 to embed short scripts within patterns + Within a regular expression, (?C) indicates a point at which the + external function is to be called. There are two kinds of callout: + those with a numerical argument and those with a string argument. (?C) + on its own with no argument is treated as (?C0). A numerical argument + allows the application to distinguish between different callouts. + String arguments were added for release 10.20 to make it possible for + script languages that use PCRE2 to embed short scripts within patterns in a similar way to Perl. During matching, when PCRE2 reaches a callout point, the external func- - tion is called. It is provided with the number or string argument of - the callout, the position in the pattern, and one item of data that is + tion is called. It is provided with the number or string argument of + the callout, the position in the pattern, and one item of data that is also set in the match block. The callout function may cause matching to proceed, to backtrack, or to fail. - By default, PCRE2 implements a number of optimizations at matching - time, and one side-effect is that sometimes callouts are skipped. 
If - you need all possible callouts to happen, you need to set options that - disable the relevant optimizations. More details, including a complete - description of the programming interface to the callout function, are + By default, PCRE2 implements a number of optimizations at matching + time, and one side-effect is that sometimes callouts are skipped. If + you need all possible callouts to happen, you need to set options that + disable the relevant optimizations. More details, including a complete + description of the programming interface to the callout function, are given in the pcre2callout documentation. Callouts with numerical arguments - If you just want to have a means of identifying different callout - points, put a number less than 256 after the letter C. For example, + If you just want to have a means of identifying different callout + points, put a number less than 256 after the letter C. For example, this pattern has two callout points: (?C1)abc(?C2)def - If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical - callouts are automatically installed before each item in the pattern. - They are all numbered 255. If there is a conditional group in the pat- + If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical + callouts are automatically installed before each item in the pattern. + They are all numbered 255. If there is a conditional group in the pat- tern whose condition is an assertion, an additional callout is inserted - just before the condition. An explicit callout may also be set at this + just before the condition. An explicit callout may also be set at this position, as in this example: (?(?C9)(?=a)abc|def) @@ -8455,60 +8460,60 @@ CALLOUTS Callouts with string arguments - A delimited string may be used instead of a number as a callout argu- - ment. The starting delimiter must be one of ` ' " ^ % # $ { and the + A delimited string may be used instead of a number as a callout argu- + ment. 
The starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is the same as the start, except for {, where the end- - ing delimiter is }. If the ending delimiter is needed within the + ing delimiter is }. If the ending delimiter is needed within the string, it must be doubled. For example: (?C'ab ''c'' d')xyz(?C{any text})pqr - The doubling is removed before the string is passed to the callout + The doubling is removed before the string is passed to the callout function. BACKTRACKING CONTROL - There are a number of special "Backtracking Control Verbs" (to use - Perl's terminology) that modify the behaviour of backtracking during - matching. They are generally of the form (*VERB) or (*VERB:NAME). Some - verbs take either form, possibly behaving differently depending on + There are a number of special "Backtracking Control Verbs" (to use + Perl's terminology) that modify the behaviour of backtracking during + matching. They are generally of the form (*VERB) or (*VERB:NAME). Some + verbs take either form, possibly behaving differently depending on whether or not a name is present. - By default, for compatibility with Perl, a name is any sequence of + By default, for compatibility with Perl, a name is any sequence of characters that does not include a closing parenthesis. The name is not - processed in any way, and it is not possible to include a closing - parenthesis in the name. This can be changed by setting the - PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati- + processed in any way, and it is not possible to include a closing + parenthesis in the name. This can be changed by setting the + PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati- ble. - When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to - verb names and only an unescaped closing parenthesis terminates the - name. 
However, the only backslash items that are permitted are \Q, \E, - and sequences such as \x{100} that define character code points. Char- + When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to + verb names and only an unescaped closing parenthesis terminates the + name. However, the only backslash items that are permitted are \Q, \E, + and sequences such as \x{100} that define character code points. Char- acter type escapes such as \d are faulted. A closing parenthesis can be included in a name either as \) or between - \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED - option is also set, unescaped whitespace in verb names is skipped, and - #-comments are recognized, exactly as in the rest of the pattern. + \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED + option is also set, unescaped whitespace in verb names is skipped, and + #-comments are recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not affect verb names unless PCRE2_ALT_VERBNAMES is also set. - The maximum length of a name is 255 in the 8-bit library and 65535 in - the 16-bit and 32-bit libraries. If the name is empty, that is, if the - closing parenthesis immediately follows the colon, the effect is as if + The maximum length of a name is 255 in the 8-bit library and 65535 in + the 16-bit and 32-bit libraries. If the name is empty, that is, if the + closing parenthesis immediately follows the colon, the effect is as if the colon were not there. Any number of these verbs may occur in a pat- tern. - Since these verbs are specifically related to backtracking, most of - them can be used only when the pattern is to be matched using the tra- + Since these verbs are specifically related to backtracking, most of + them can be used only when the pattern is to be matched using the tra- ditional matching function, because that uses a backtracking algorithm. 
- With the exception of (*FAIL), which behaves like a failing negative + With the exception of (*FAIL), which behaves like a failing negative assertion, the backtracking control verbs cause an error if encountered by the DFA matching function. - The behaviour of these verbs in repeated groups, assertions, and in + The behaviour of these verbs in repeated groups, assertions, and in subpatterns called as subroutines (whether or not recursively) is docu- mented below. @@ -8516,71 +8521,71 @@ BACKTRACKING CONTROL PCRE2 contains some optimizations that are used to speed up matching by running some checks at the start of each match attempt. For example, it - may know the minimum length of matching subject, or that a particular + may know the minimum length of matching subject, or that a particular character must be present. When one of these optimizations bypasses the - running of a match, any included backtracking verbs will not, of + running of a match, any included backtracking verbs will not, of course, be processed. You can suppress the start-of-match optimizations - by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com- - pile(), or by starting the pattern with (*NO_START_OPT). There is more + by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com- + pile(), or by starting the pattern with (*NO_START_OPT). There is more discussion of this option in the section entitled "Compiling a pattern" in the pcre2api documentation. - Experiments with Perl suggest that it too has similar optimizations, + Experiments with Perl suggest that it too has similar optimizations, sometimes leading to anomalous results. Verbs that act immediately - The following verbs act as soon as they are encountered. They may not + The following verbs act as soon as they are encountered. They may not be followed by a name. (*ACCEPT) - This verb causes the match to end successfully, skipping the remainder - of the pattern. 
However, when it is inside a subpattern that is called - as a subroutine, only that subpattern is ended successfully. Matching + This verb causes the match to end successfully, skipping the remainder + of the pattern. However, when it is inside a subpattern that is called + as a subroutine, only that subpattern is ended successfully. Matching then continues at the outer level. If (*ACCEPT) in triggered in a posi- - tive assertion, the assertion succeeds; in a negative assertion, the + tive assertion, the assertion succeeds; in a negative assertion, the assertion fails. - If (*ACCEPT) is inside capturing parentheses, the data so far is cap- + If (*ACCEPT) is inside capturing parentheses, the data so far is cap- tured. For example: A((?:A|B(*ACCEPT)|C)D) - This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- + This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- tured by the outer parentheses. (*FAIL) or (*F) - This verb causes a matching failure, forcing backtracking to occur. It - is equivalent to (?!) but easier to read. The Perl documentation notes - that it is probably useful only when combined with (?{}) or (??{}). - Those are, of course, Perl features that are not present in PCRE2. The - nearest equivalent is the callout feature, as for example in this pat- + This verb causes a matching failure, forcing backtracking to occur. It + is equivalent to (?!) but easier to read. The Perl documentation notes + that it is probably useful only when combined with (?{}) or (??{}). + Those are, of course, Perl features that are not present in PCRE2. The + nearest equivalent is the callout feature, as for example in this pat- tern: a+(?C)(*FAIL) - A match with the string "aaaa" always fails, but the callout is taken + A match with the string "aaaa" always fails, but the callout is taken before each backtrack happens (in this example, 10 times). 
Recording which path was taken - There is one verb whose main purpose is to track how a match was - arrived at, though it also has a secondary use in conjunction with + There is one verb whose main purpose is to track how a match was + arrived at, though it also has a secondary use in conjunction with advancing the match starting point (see (*SKIP) below). (*MARK:NAME) or (*:NAME) - A name is always required with this verb. There may be as many - instances of (*MARK) as you like in a pattern, and their names do not + A name is always required with this verb. There may be as many + instances of (*MARK) as you like in a pattern, and their names do not have to be unique. - When a match succeeds, the name of the last-encountered (*MARK:NAME), - (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to - the caller as described in the section entitled "Other information - about the match" in the pcre2api documentation. Here is an example of - pcre2test output, where the "mark" modifier requests the retrieval and + When a match succeeds, the name of the last-encountered (*MARK:NAME), + (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to + the caller as described in the section entitled "Other information + about the match" in the pcre2api documentation. Here is an example of + pcre2test output, where the "mark" modifier requests the retrieval and outputting of (*MARK) data: re> /X(*MARK:A)Y|X(*MARK:B)Z/mark @@ -8592,72 +8597,72 @@ BACKTRACKING CONTROL MK: B The (*MARK) name is tagged with "MK:" in this output, and in this exam- - ple it indicates which of the two alternatives matched. This is a more - efficient way of obtaining this information than putting each alterna- + ple it indicates which of the two alternatives matched. This is a more + efficient way of obtaining this information than putting each alterna- tive in its own capturing parentheses. 
- If a verb with a name is encountered in a positive assertion that is - true, the name is recorded and passed back if it is the last-encoun- + If a verb with a name is encountered in a positive assertion that is + true, the name is recorded and passed back if it is the last-encoun- tered. This does not happen for negative assertions or failing positive assertions. - After a partial match or a failed match, the last encountered name in + After a partial match or a failed match, the last encountered name in the entire match process is returned. For example: re> /X(*MARK:A)Y|X(*MARK:B)Z/mark data> XP No match, mark = B - Note that in this unanchored example the mark is retained from the + Note that in this unanchored example the mark is retained from the match attempt that started at the letter "X" in the subject. Subsequent match attempts starting at "P" and then with an empty string do not get as far as the (*MARK) item, but nevertheless do not reset it. - If you are interested in (*MARK) values after failed matches, you - should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to + If you are interested in (*MARK) values after failed matches, you + should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to ensure that the match is always attempted. Verbs that act after backtracking The following verbs do nothing when they are encountered. Matching con- - tinues with what follows, but if there is no subsequent match, causing - a backtrack to the verb, a failure is forced. That is, backtracking - cannot pass to the left of the verb. However, when one of these verbs - appears inside an atomic group or in an assertion that is true, its - effect is confined to that group, because once the group has been - matched, there is never any backtracking into it. 
In this situation, - backtracking has to jump to the left of the entire atomic group or + tinues with what follows, but if there is no subsequent match, causing + a backtrack to the verb, a failure is forced. That is, backtracking + cannot pass to the left of the verb. However, when one of these verbs + appears inside an atomic group or in an assertion that is true, its + effect is confined to that group, because once the group has been + matched, there is never any backtracking into it. In this situation, + backtracking has to jump to the left of the entire atomic group or assertion. - These verbs differ in exactly what kind of failure occurs when back- - tracking reaches them. The behaviour described below is what happens - when the verb is not in a subroutine or an assertion. Subsequent sec- + These verbs differ in exactly what kind of failure occurs when back- + tracking reaches them. The behaviour described below is what happens + when the verb is not in a subroutine or an assertion. Subsequent sec- tions cover these special cases. (*COMMIT) - This verb, which may not be followed by a name, causes the whole match + This verb, which may not be followed by a name, causes the whole match to fail outright if there is a later matching failure that causes back- - tracking to reach it. Even if the pattern is unanchored, no further + tracking to reach it. Even if the pattern is unanchored, no further attempts to find a match by advancing the starting point take place. If - (*COMMIT) is the only backtracking verb that is encountered, once it - has been passed pcre2_match() is committed to finding a match at the + (*COMMIT) is the only backtracking verb that is encountered, once it + has been passed pcre2_match() is committed to finding a match at the current starting point, or not at all. For example: a+(*COMMIT)b - This matches "xxaab" but not "aacaab". It can be thought of as a kind + This matches "xxaab" but not "aacaab". 
It can be thought of as a kind of dynamic anchor, or "I've started, so I must finish." The name of the - most recently passed (*MARK) in the path is passed back when (*COMMIT) + most recently passed (*MARK) in the path is passed back when (*COMMIT) forces a match failure. - If there is more than one backtracking verb in a pattern, a different - one that follows (*COMMIT) may be triggered first, so merely passing + If there is more than one backtracking verb in a pattern, a different + one that follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a match does not always guarantee that a match must be at this starting point. - Note that (*COMMIT) at the start of a pattern is not the same as an - anchor, unless PCRE2's start-of-match optimizations are turned off, as + Note that (*COMMIT) at the start of a pattern is not the same as an + anchor, unless PCRE2's start-of-match optimizations are turned off, as shown in this output from pcre2test: re> /(*COMMIT)abc/ @@ -8668,213 +8673,213 @@ BACKTRACKING CONTROL data> xyzabc No match - For the first pattern, PCRE2 knows that any match must start with "a", - so the optimization skips along the subject to "a" before applying the - pattern to the first set of data. The match attempt then succeeds. The - second pattern disables the optimization that skips along to the first - character. The pattern is now applied starting at "x", and so the - (*COMMIT) causes the match to fail without trying any other starting + For the first pattern, PCRE2 knows that any match must start with "a", + so the optimization skips along the subject to "a" before applying the + pattern to the first set of data. The match attempt then succeeds. The + second pattern disables the optimization that skips along to the first + character. The pattern is now applied starting at "x", and so the + (*COMMIT) causes the match to fail without trying any other starting points. 
(*PRUNE) or (*PRUNE:NAME) - This verb causes the match to fail at the current starting position in + This verb causes the match to fail at the current starting position in the subject if there is a later matching failure that causes backtrack- - ing to reach it. If the pattern is unanchored, the normal "bumpalong" - advance to the next starting character then happens. Backtracking can - occur as usual to the left of (*PRUNE), before it is reached, or when - matching to the right of (*PRUNE), but if there is no match to the - right, backtracking cannot cross (*PRUNE). In simple cases, the use of - (*PRUNE) is just an alternative to an atomic group or possessive quan- + ing to reach it. If the pattern is unanchored, the normal "bumpalong" + advance to the next starting character then happens. Backtracking can + occur as usual to the left of (*PRUNE), before it is reached, or when + matching to the right of (*PRUNE), but if there is no match to the + right, backtracking cannot cross (*PRUNE). In simple cases, the use of + (*PRUNE) is just an alternative to an atomic group or possessive quan- tifier, but there are some uses of (*PRUNE) that cannot be expressed in - any other way. In an anchored pattern (*PRUNE) has the same effect as + any other way. In an anchored pattern (*PRUNE) has the same effect as (*COMMIT). The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is remembered for passing back - to the caller. However, (*SKIP:NAME) searches only for names set with + to the caller. However, (*SKIP:NAME) searches only for names set with (*MARK), ignoring those set by (*PRUNE) or (*THEN). 
(*SKIP) - This verb, when given without a name, is like (*PRUNE), except that if - the pattern is unanchored, the "bumpalong" advance is not to the next + This verb, when given without a name, is like (*PRUNE), except that if + the pattern is unanchored, the "bumpalong" advance is not to the next character, but to the position in the subject where (*SKIP) was encoun- - tered. (*SKIP) signifies that whatever text was matched leading up to + tered. (*SKIP) signifies that whatever text was matched leading up to it cannot be part of a successful match. Consider: a+(*SKIP)b - If the subject is "aaaac...", after the first match attempt fails - (starting at the first character in the string), the starting point + If the subject is "aaaac...", after the first match attempt fails + (starting at the first character in the string), the starting point skips on to start the next attempt at "c". Note that a possessive quan- - tifer does not have the same effect as this example; although it would - suppress backtracking during the first match attempt, the second - attempt would start at the second character instead of skipping on to + tifer does not have the same effect as this example; although it would + suppress backtracking during the first match attempt, the second + attempt would start at the second character instead of skipping on to "c". (*SKIP:NAME) When (*SKIP) has an associated name, its behaviour is modified. When it is triggered, the previous path through the pattern is searched for the - most recent (*MARK) that has the same name. If one is found, the + most recent (*MARK) that has the same name. If one is found, the "bumpalong" advance is to the subject position that corresponds to that (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a matching name is found, the (*SKIP) is ignored. - Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It + Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). 
It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME). (*THEN) or (*THEN:NAME) - This verb causes a skip to the next innermost alternative when back- - tracking reaches it. That is, it cancels any further backtracking - within the current alternative. Its name comes from the observation + This verb causes a skip to the next innermost alternative when back- + tracking reaches it. That is, it cancels any further backtracking + within the current alternative. Its name comes from the observation that it can be used for a pattern-based if-then-else block: ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... - If the COND1 pattern matches, FOO is tried (and possibly further items - after the end of the group if FOO succeeds); on failure, the matcher - skips to the second alternative and tries COND2, without backtracking - into COND1. If that succeeds and BAR fails, COND3 is tried. If subse- - quently BAZ fails, there are no more alternatives, so there is a back- - track to whatever came before the entire group. If (*THEN) is not + If the COND1 pattern matches, FOO is tried (and possibly further items + after the end of the group if FOO succeeds); on failure, the matcher + skips to the second alternative and tries COND2, without backtracking + into COND1. If that succeeds and BAR fails, COND3 is tried. If subse- + quently BAZ fails, there are no more alternatives, so there is a back- + track to whatever came before the entire group. If (*THEN) is not inside an alternation, it acts like (*PRUNE). - The behaviour of (*THEN:NAME) is the not the same as - (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is - remembered for passing back to the caller. However, (*SKIP:NAME) - searches only for names set with (*MARK), ignoring those set by + The behaviour of (*THEN:NAME) is the not the same as + (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is + remembered for passing back to the caller. 
However, (*SKIP:NAME) + searches only for names set with (*MARK), ignoring those set by (*PRUNE) and (*THEN). - A subpattern that does not contain a | character is just a part of the - enclosing alternative; it is not a nested alternation with only one - alternative. The effect of (*THEN) extends beyond such a subpattern to - the enclosing alternative. Consider this pattern, where A, B, etc. are - complex pattern fragments that do not contain any | characters at this + A subpattern that does not contain a | character is just a part of the + enclosing alternative; it is not a nested alternation with only one + alternative. The effect of (*THEN) extends beyond such a subpattern to + the enclosing alternative. Consider this pattern, where A, B, etc. are + complex pattern fragments that do not contain any | characters at this level: A (B(*THEN)C) | D - If A and B are matched, but there is a failure in C, matching does not + If A and B are matched, but there is a failure in C, matching does not backtrack into A; instead it moves to the next alternative, that is, D. - However, if the subpattern containing (*THEN) is given an alternative, + However, if the subpattern containing (*THEN) is given an alternative, it behaves differently: A (B(*THEN)C | (*FAIL)) | D - The effect of (*THEN) is now confined to the inner subpattern. After a + The effect of (*THEN) is now confined to the inner subpattern. After a failure in C, matching moves to (*FAIL), which causes the whole subpat- - tern to fail because there are no more alternatives to try. In this + tern to fail because there are no more alternatives to try. In this case, matching does now backtrack into A. - Note that a conditional subpattern is not considered as having two - alternatives, because only one is ever used. In other words, the | + Note that a conditional subpattern is not considered as having two + alternatives, because only one is ever used. 
In other words, the | character in a conditional subpattern has a different meaning. Ignoring white space, consider: ^.*? (?(?=a) a | b(*THEN)c ) - If the subject is "ba", this pattern does not match. Because .*? is - ungreedy, it initially matches zero characters. The condition (?=a) - then fails, the character "b" is matched, but "c" is not. At this - point, matching does not backtrack to .*? as might perhaps be expected - from the presence of the | character. The conditional subpattern is + If the subject is "ba", this pattern does not match. Because .*? is + ungreedy, it initially matches zero characters. The condition (?=a) + then fails, the character "b" is matched, but "c" is not. At this + point, matching does not backtrack to .*? as might perhaps be expected + from the presence of the | character. The conditional subpattern is part of the single alternative that comprises the whole pattern, and so - the match fails. (If there was a backtrack into .*?, allowing it to + the match fails. (If there was a backtrack into .*?, allowing it to match "b", the match would succeed.) - The verbs just described provide four different "strengths" of control + The verbs just described provide four different "strengths" of control when subsequent matching fails. (*THEN) is the weakest, carrying on the - match at the next alternative. (*PRUNE) comes next, failing the match - at the current starting position, but allowing an advance to the next - character (for an unanchored pattern). (*SKIP) is similar, except that + match at the next alternative. (*PRUNE) comes next, failing the match + at the current starting position, but allowing an advance to the next + character (for an unanchored pattern). (*SKIP) is similar, except that the advance may be more than one character. (*COMMIT) is the strongest, causing the entire match to fail. More than one backtracking verb - If more than one backtracking verb is present in a pattern, the one - that is backtracked onto first acts. 
For example, consider this pat- + If more than one backtracking verb is present in a pattern, the one + that is backtracked onto first acts. For example, consider this pat- tern, where A, B, etc. are complex pattern fragments: (A(*COMMIT)B(*THEN)C|ABD) - If A matches but B fails, the backtrack to (*COMMIT) causes the entire + If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to fail. However, if A and B match, but C fails, the backtrack to - (*THEN) causes the next alternative (ABD) to be tried. This behaviour - is consistent, but is not always the same as Perl's. It means that if - two or more backtracking verbs appear in succession, all the the last + (*THEN) causes the next alternative (ABD) to be tried. This behaviour + is consistent, but is not always the same as Perl's. It means that if + two or more backtracking verbs appear in succession, all the the last of them has no effect. Consider this example: ...(*COMMIT)(*PRUNE)... If there is a matching failure to the right, backtracking onto (*PRUNE) - causes it to be triggered, and its action is taken. There can never be + causes it to be triggered, and its action is taken. There can never be a backtrack onto (*COMMIT). Backtracking verbs in repeated groups - PCRE2 differs from Perl in its handling of backtracking verbs in + PCRE2 differs from Perl in its handling of backtracking verbs in repeated groups. For example, consider: /(a(*COMMIT)b)+ac/ - If the subject is "abac", Perl matches, but PCRE2 fails because the + If the subject is "abac", Perl matches, but PCRE2 fails because the (*COMMIT) in the second repeat of the group acts. Backtracking verbs in assertions - (*FAIL) in any assertion has its normal effect: it forces an immediate - backtrack. The behaviour of the other backtracking verbs depends on - whether or not the assertion is standalone or acting as the condition + (*FAIL) in any assertion has its normal effect: it forces an immediate + backtrack. 
The behaviour of the other backtracking verbs depends on + whether or not the assertion is standalone or acting as the condition in a conditional subpattern. - (*ACCEPT) in a standalone positive assertion causes the assertion to - succeed without any further processing; captured strings are retained. - In a standalone negative assertion, (*ACCEPT) causes the assertion to + (*ACCEPT) in a standalone positive assertion causes the assertion to + succeed without any further processing; captured strings are retained. + In a standalone negative assertion, (*ACCEPT) causes the assertion to fail without any further processing; captured substrings are discarded. - If the assertion is a condition, (*ACCEPT) causes the condition to be - true for a positive assertion and false for a negative one; captured + If the assertion is a condition, (*ACCEPT) causes the condition to be + true for a positive assertion and false for a negative one; captured substrings are retained in both cases. - The effect of (*THEN) is not allowed to escape beyond an assertion. If - there are no more branches to try, (*THEN) causes a positive assertion + The effect of (*THEN) is not allowed to escape beyond an assertion. If + there are no more branches to try, (*THEN) causes a positive assertion to be false, and a negative assertion to be true. - The other backtracking verbs are not treated specially if they appear - in a standalone positive assertion. In a conditional positive asser- + The other backtracking verbs are not treated specially if they appear + in a standalone positive assertion. In a conditional positive asser- tion, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the con- - dition to be false. However, for both standalone and conditional nega- - tive assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) + dition to be false. 
However, for both standalone and conditional nega- + tive assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be true, without considering any further alter- native branches. Backtracking verbs in subroutines - These behaviours occur whether or not the subpattern is called recur- + These behaviours occur whether or not the subpattern is called recur- sively. Perl's treatment of subroutines is different in some cases. - (*FAIL) in a subpattern called as a subroutine has its normal effect: + (*FAIL) in a subpattern called as a subroutine has its normal effect: it forces an immediate backtrack. - (*ACCEPT) in a subpattern called as a subroutine causes the subroutine - match to succeed without any further processing. Matching then contin- + (*ACCEPT) in a subpattern called as a subroutine causes the subroutine + match to succeed without any further processing. Matching then contin- ues after the subroutine call. (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine cause the subroutine match to fail. - (*THEN) skips to the next alternative in the innermost enclosing group - within the subpattern that has alternatives. If there is no such group + (*THEN) skips to the next alternative in the innermost enclosing group + within the subpattern that has alternatives. If there is no such group within the subpattern, (*THEN) causes the subroutine match to fail. SEE ALSO - pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), + pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3). @@ -8887,8 +8892,8 @@ AUTHOR REVISION - Last updated: 12 September 2017 - Copyright (c) 1997-2017 University of Cambridge. + Last updated: 25 April 2018 + Copyright (c) 1997-2018 University of Cambridge. 
------------------------------------------------------------------------------ @@ -8973,9 +8978,17 @@ STACK AND HEAP USAGE AT RUN TIME In contrast to pcre2_match(), pcre2_dfa_match() does use recursive function calls, but only for processing atomic groups, lookaround - assertions, and recursion within the pattern. Too much nested recursion - may cause stack issues. The "match depth" parameter can be used to - limit the depth of function recursion in pcre2_dfa_match(). + assertions, and recursion within the pattern. The original version of + the code used to allocate quite large internal workspace vectors on the + stack, which caused some problems for some patterns in environments + with small stacks. From release 10.32 the code for pcre2_dfa_match() + has been re-factored to use heap memory when necessary for internal + workspace when recursing, though recursive function calls are still + used. + + The "match depth" parameter can be used to limit the depth of function + recursion, and the "match heap" parameter to limit heap memory in + pcre2_dfa_match(). PROCESSING TIME @@ -9115,8 +9128,8 @@ AUTHOR REVISION - Last updated: 08 April 2017 - Copyright (c) 1997-2017 University of Cambridge. + Last updated: 25 April 2018 + Copyright (c) 1997-2018 University of Cambridge. ------------------------------------------------------------------------------ diff --git a/doc/pcre2_dfa_match.3 b/doc/pcre2_dfa_match.3 index 7839145..dfc3ae6 100644 --- a/doc/pcre2_dfa_match.3 +++ b/doc/pcre2_dfa_match.3 @@ -1,4 +1,4 @@ -.TH PCRE2_DFA_MATCH 3 "30 May 2017" "PCRE2 10.30" +.TH PCRE2_DFA_MATCH 3 "26 April 2018" "PCRE2 10.32" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -34,9 +34,9 @@ just once (except when processing lookaround assertions). 
This function is \fIwscount\fP Number of elements in the vector .sp For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set -up a callout function or specify the match and/or the recursion depth limits. -The \fIlength\fP and \fIstartoffset\fP values are code units, not characters. -The options are: +up a callout function or specify the heap limit or the match or the recursion +depth limits. The \fIlength\fP and \fIstartoffset\fP values are code units, not +characters. The options are: .sp PCRE2_ANCHORED Match only at the first position PCRE2_ENDANCHORED Pattern can match only at end of subject diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index 786b314..ed4b3a0 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1,4 +1,4 @@ -.TH PCRE2API 3 "31 December 2017" "PCRE2 10.31" +.TH PCRE2API 3 "27 April 2018" "PCRE2 10.32" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .sp @@ -887,16 +887,17 @@ offset limit. In other words, whichever limit comes first is used. .sp The \fIheap_limit\fP parameter specifies, in units of kilobytes, the maximum amount of heap memory that \fBpcre2_match()\fP may use to hold backtracking -information when running an interpretive match. This limit does not apply to -matching with the JIT optimization, which has its own memory control -arrangements (see the +information when running an interpretive match. This limit also applies to +\fBpcre2_dfa_match()\fP, which may use the heap when processing patterns with a +lot of nested pattern recursion or lookarounds or atomic groups. This limit +does not apply to matching with the JIT optimization, which has its own memory +control arrangements (see the .\" HREF \fBpcre2jit\fP .\" -documentation for more details), nor does it apply to \fBpcre2_dfa_match()\fP. -If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is -returned. The default limit is set when PCRE2 is built; the default default is -very large and is essentially "unlimited". 
+documentation for more details). If the limit is reached, the negative error +code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is +built; the default default is very large and is essentially "unlimited". .P A value for the heap limit may also be supplied by an item at the start of a pattern of the form @@ -914,6 +915,11 @@ Heap memory is used only if the initial vector is too small. If the heap limit is set to a value less than 21 (in particular, zero) no heap memory will be used. In this case, only patterns that do not have a lot of nested backtracking can be successfully processed. +.P +Similarly, for \fBpcre2_dfa_match()\fP, a vector on the system stack is used +when processing pattern recursions, lookarounds, or atomic groups, and only if +this is not big enough is heap memory used. In this case, too, setting a value +of zero disables the use of the heap. .sp .nf .B int pcre2_set_match_limit(pcre2_match_context *\fImcontext\fP, @@ -967,11 +973,21 @@ backtracking. .P The depth limit is not relevant, and is ignored, when matching is done using JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which -uses it to limit the depth of internal recursive function calls that implement -atomic groups, lookaround assertions, and pattern recursions. This is, -therefore, an indirect limit on the amount of system stack that is used. A -recursive pattern such as /(.)(?1)/, when matched to a very long string using -\fBpcre2_dfa_match()\fP, can use a great deal of stack. +uses it to limit the depth of nested internal recursive function calls that +implement atomic groups, lookaround assertions, and pattern recursions. This +limits, indirectly, the amount of system stack this is used. It was more useful +in versions before 10.32, when stack memory was used for local workspace +vectors for recursive function calls. 
From version 10.32, only local variables +are allocated on the stack and as each call uses only a few hundred bytes, even +a small stack can support quite a lot of recursion. +.P +If the depth of internal recursive function calls is great enough, local +workspace vectors are allocated on the heap from version 10.32 onwards, so the +depth limit also indirectly limits the amount of heap memory that is used. A +recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string +using \fBpcre2_dfa_match()\fP, can use a great deal of memory. However, it is +probably better to limit heap usage directly by calling +\fBpcre2_set_heap_limit()\fP. .P The default value for the depth limit can be set when PCRE2 is built; the default default is the same value as the default for the match limit. If the @@ -1028,15 +1044,16 @@ and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively. PCRE2_CONFIG_DEPTHLIMIT .sp The output is a uint32_t integer that gives the default limit for the depth of -nested backtracking in \fBpcre2_match()\fP or the depth of nested recursions -and lookarounds in \fBpcre2_dfa_match()\fP. Further details are given with -\fBpcre2_set_depth_limit()\fP above. +nested backtracking in \fBpcre2_match()\fP or the depth of nested recursions, +lookarounds, and atomic groups in \fBpcre2_dfa_match()\fP. Further details are +given with \fBpcre2_set_depth_limit()\fP above. .sp PCRE2_CONFIG_HEAPLIMIT .sp The output is a uint32_t integer that gives, in kilobytes, the default limit -for the amount of heap memory used by \fBpcre2_match()\fP. Further details are -given with \fBpcre2_set_heap_limit()\fP above. +for the amount of heap memory used by \fBpcre2_match()\fP or +\fBpcre2_dfa_match()\fP. Further details are given with +\fBpcre2_set_heap_limit()\fP above. .sp PCRE2_CONFIG_JIT .sp @@ -3514,17 +3531,7 @@ capture. 
Calls to the convenience functions that extract substrings by name return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a DFA match. The convenience functions that extract substrings by number never -return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are -slightly different: -.sp - PCRE2_ERROR_UNAVAILABLE -.sp -The ovector is not big enough to include a slot for the given substring number. -.sp - PCRE2_ERROR_UNSET -.sp -There is a slot in the ovector for this substring, but there were insufficient -matches to fill it. +return PCRE2_ERROR_NOSUBSTRING. .P The matched strings are stored in the ovector in reverse order of length; that is, the longest matching string is first. If there were too many matches to fit @@ -3605,6 +3612,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 31 December 2017 -Copyright (c) 1997-2017 University of Cambridge. +Last updated: 27 April 2018 +Copyright (c) 1997-2018 University of Cambridge. .fi diff --git a/doc/pcre2build.3 b/doc/pcre2build.3 index 0d34d23..3b8a956 100644 --- a/doc/pcre2build.3 +++ b/doc/pcre2build.3 @@ -1,4 +1,4 @@ -.TH PCRE2BUILD 3 "25 February 2018" "PCRE2 10.32" +.TH PCRE2BUILD 3 "26 April 2018" "PCRE2 10.32" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) . @@ -292,9 +292,10 @@ change this by a setting such as --with-heap-limit=500 .sp which limits the amount of heap to 500 kilobytes. This limit applies only to -interpretive matching in pcre2_match(). It does not apply when JIT (which has -its own memory arrangements) is used, nor does it apply to -\fBpcre2_dfa_match()\fP. +interpretive matching in \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, which +may also use the heap for internal workspace when processing complicated +patterns. This limit does not apply when JIT (which has its own memory +arrangements) is used. .P You can also explicitly limit the depth of nested backtracking in the \fBpcre2_match()\fP interpreter. 
This limit defaults to the value that is set @@ -590,6 +591,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 25 February 2018 +Last updated: 26 April 2018 Copyright (c) 1997-2018 University of Cambridge. .fi diff --git a/doc/pcre2callout.3 b/doc/pcre2callout.3 index e3fd600..4be2e49 100644 --- a/doc/pcre2callout.3 +++ b/doc/pcre2callout.3 @@ -1,4 +1,4 @@ -.TH PCRE2CALLOUT 3 "22 December 2017" "PCRE2 10.31" +.TH PCRE2CALLOUT 3 "26 April 2018" "PCRE2 10.32" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -291,10 +291,12 @@ than \fIcapture_top\fP also have both of their ovector slots set to PCRE2_UNSET. .P For DFA matching, the \fIoffset_vector\fP field points to the ovector that was -passed to the matching function in the match data block, but it holds no useful -information at callout time because \fBpcre2_dfa_match()\fP does not support -substring capturing. The value of \fIcapture_top\fP is always 1 and the value -of \fIcapture_last\fP is always 0 for DFA matching. +passed to the matching function in the match data block for callouts at the top +level, but to an internal ovector during the processing of pattern recursions, +lookarounds, and atomic groups. However, these ovectors hold no useful +information because \fBpcre2_dfa_match()\fP does not support substring +capturing. The value of \fIcapture_top\fP is always 1 and the value of +\fIcapture_last\fP is always 0 for DFA matching. .P The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values that were passed to the matching function. @@ -441,6 +443,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 22 December 2017 -Copyright (c) 1997-2017 University of Cambridge. +Last updated: 26 April 2018 +Copyright (c) 1997-2018 University of Cambridge. 
.fi diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 5c0daa8..c33f27d 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "12 September 2017" "PCRE2 10.31" +.TH PCRE2PATTERN 3 "25 April 2018" "PCRE2 10.32" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -141,12 +141,12 @@ the application to apply the JIT optimization by calling .SS "Setting match resource limits" .rs .sp -The pcre2_match() function contains a counter that is incremented every time it -goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on -this counter, which therefore limits the amount of computing resource used for -a match. The maximum depth of nested backtracking can also be limited; this -indirectly restricts the amount of heap memory that is used, but there is also -an explicit memory limit that can be set. +The \fBpcre2_match()\fP function contains a counter that is incremented every +time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a +limit on this counter, which therefore limits the amount of computing resource +used for a match. The maximum depth of nested backtracking can also be limited; +this indirectly restricts the amount of heap memory that is used, but there is +also an explicit memory limit that can be set. .P These facilities are provided to catch runaway matches that are provoked by patterns with huge matching trees (a typical example is a pattern with nested @@ -162,18 +162,20 @@ where d is any number of decimal digits. However, the value of the setting must be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP for it to have any effect. In other words, the pattern writer can lower the limits set by the programmer, but not raise them. If there is more than one -setting of one of these limits, the lower value is used. +setting of one of these limits, the lower value is used. 
The heap limit is +specified in kilobytes. .P Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is still recognized for backwards compatibility. .P -The heap limit applies only when the \fBpcre2_match()\fP interpreter is used -for matching. It does not apply to JIT or DFA matching. The match limit is used -(but in a different way) when JIT is being used, or when -\fBpcre2_dfa_match()\fP is called, to limit computing resource usage by those -matching functions. The depth limit is ignored by JIT but is relevant for DFA -matching, which uses function recursion for recursions within the pattern. In -this case, the depth limit controls the amount of system stack that is used. +The heap limit applies only when the \fBpcre2_match()\fP or +\fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply +to JIT. The match limit is used (but in a different way) when JIT is being +used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource +usage by those matching functions. The depth limit is ignored by JIT but is +relevant for DFA matching, which uses function recursion for recursions within +the pattern and for lookaround assertions and atomic groups. In this case, the +depth limit controls the depth of such recursion. . . .\" HTML @@ -2838,10 +2840,6 @@ the last value taken on at the top level. If a capturing subpattern is not matched at the top level, its final captured value is unset, even if it was (temporarily) set at a deeper level during the matching process. .P -If there are more than 15 capturing parentheses in a pattern, PCRE2 has to -obtain extra memory from the heap to store data during a recursion. If no -memory can be obtained, the match fails with the PCRE2_ERROR_NOMEMORY error. -.P Do not confuse the (?R) item with the condition (R), which tests for recursion. Consider this pattern, which matches text in angle brackets, allowing for arbitrary nesting. 
Only digits are allowed in nested brackets (that is, when @@ -3505,6 +3503,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 12 September 2017 -Copyright (c) 1997-2017 University of Cambridge. +Last updated: 25 April 2018 +Copyright (c) 1997-2018 University of Cambridge. .fi diff --git a/doc/pcre2perform.3 b/doc/pcre2perform.3 index 8b49a2a..4ec441a 100644 --- a/doc/pcre2perform.3 +++ b/doc/pcre2perform.3 @@ -1,4 +1,4 @@ -.TH PCRE2PERFORM 3 "08 April 2017" "PCRE2 10.30" +.TH PCRE2PERFORM 3 "25 April 2018" "PCRE2 10.32" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 PERFORMANCE" @@ -78,9 +78,16 @@ may also reduce the memory requirements. .P In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive function calls, but only for processing atomic groups, lookaround assertions, -and recursion within the pattern. Too much nested recursion may cause stack -issues. The "match depth" parameter can be used to limit the depth of function -recursion in \fBpcre2_dfa_match()\fP. +and recursion within the pattern. The original version of the code used to +allocate quite large internal workspace vectors on the stack, which caused some +problems for some patterns in environments with small stacks. From release +10.32 the code for \fBpcre2_dfa_match()\fP has been re-factored to use heap +memory when necessary for internal workspace when recursing, though recursive +function calls are still used. +.P +The "match depth" parameter can be used to limit the depth of function +recursion, and the "match heap" parameter to limit heap memory in +\fBpcre2_dfa_match()\fP. . . .SH "PROCESSING TIME" @@ -232,6 +239,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 08 April 2017 -Copyright (c) 1997-2017 University of Cambridge. +Last updated: 25 April 2018 +Copyright (c) 1997-2018 University of Cambridge. 
.fi diff --git a/doc/pcre2test.1 b/doc/pcre2test.1 index ee78792..a14eab2 100644 --- a/doc/pcre2test.1 +++ b/doc/pcre2test.1 @@ -1,4 +1,4 @@ -.TH PCRE2TEST 1 "21 Decbmber 2017" "PCRE 10.31" +.TH PCRE2TEST 1 "25 April 2018" "PCRE 10.32" .SH NAME pcre2test - a program for testing Perl-compatible regular expressions. .SH SYNOPSIS @@ -1168,7 +1168,7 @@ pattern. get= extract captured substring getall extract all captured substrings /g global global matching - heap_limit= set a limit on heap memory + heap_limit= set a limit on heap memory (Kbytes) jitstack= set size of JIT stack mark show mark values match_limit= set a match limit @@ -1401,24 +1401,36 @@ the appropriate limits in the match context. These values are ignored when the .sp If the \fBfind_limits\fP modifier is present on a subject line, \fBpcre2test\fP calls the relevant matching function several times, setting different values in -the match context via \fBpcre2_set_heap_limit(), \fBpcre2_set_match_limit()\fP, -or \fBpcre2_set_depth_limit()\fP until it finds the minimum values for each -parameter that allows the match to complete without error. +the match context via \fBpcre2_set_heap_limit()\fP, +\fBpcre2_set_match_limit()\fP, or \fBpcre2_set_depth_limit()\fP until it finds +the minimum values for each parameter that allows the match to complete without +error. If JIT is being used, only the match limit is relevant. .P -If JIT is being used, only the match limit is relevant. If DFA matching is -being used, only the depth limit is relevant. -.P -The \fImatch_limit\fP number is a measure of the amount of backtracking -that takes place, and learning the minimum value can be instructive. For most -simple matches, the number is quite small, but for patterns with very large -numbers of matching possibilities, it can become large very quickly with -increasing length of subject string. +When using this modifier, the pattern should not contain any limit settings +such as (*LIMIT_MATCH=...) within it. 
If such a setting is present and is +lower than the minimum matching value, the minimum value cannot be found +because \fBpcre2_set_match_limit()\fP etc. are only able to reduce the value of +an in-pattern limit; they cannot increase it. .P For non-DFA matching, the minimum \fIdepth_limit\fP number is a measure of how much nested backtracking happens (that is, how deeply the pattern's tree is searched). In the case of DFA matching, \fIdepth_limit\fP controls the depth of recursive calls of the internal function that is used for handling pattern recursion, lookaround assertions, and atomic groups. +.P +For non-DFA matching, the \fImatch_limit\fP number is a measure of the amount +of backtracking that takes place, and learning the minimum value can be +instructive. For most simple matches, the number is quite small, but for +patterns with very large numbers of matching possibilities, it can become large +very quickly with increasing length of subject string. In the case of DFA +matching, \fImatch_limit\fP controls the total number of calls, both recursive +and non-recursive, to the internal matching function, thus controlling the +overall amount of computing resource that is used. +.P +For both kinds of matching, the \fIheap_limit\fP number (which is in kilobytes) +limits the amount of heap memory used for matching. A value of zero disables +the use of any heap memory; many simple pattern matches can be done without +using the heap, so this is not an unreasonable setting. . . .SS "Showing MARK names" @@ -1437,13 +1449,14 @@ is added to the non-match message. .sp The \fBmemory\fP modifier causes \fBpcre2test\fP to log the sizes of all heap memory allocation and freeing calls that occur during a call to -\fBpcre2_match()\fP. These occur only when a match requires a bigger vector -than the default for remembering backtracking points. In many cases there will -be no heap memory used and therefore no additional output. 
No heap memory is -allocated during matching with \fBpcre2_dfa_match\fP or with JIT, so in those -cases the \fBmemory\fP modifier never has any effect. For this modifier to -work, the \fBnull_context\fP modifier must not be set on both the pattern and -the subject, though it can be set on one or the other. +\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP. These occur only when a match +requires a bigger vector than the default for remembering backtracking points +(\fBpcre2_match()\fP) or for internal workspace (\fBpcre2_dfa_match()\fP). In +many cases there will be no heap memory used and therefore no additional +output. No heap memory is allocated during matching with JIT, so in that case +the \fBmemory\fP modifier never has any effect. For this modifier to work, the +\fBnull_context\fP modifier must not be set on both the pattern and the +subject, though it can be set on one or the other. . . .SS "Setting a starting offset" @@ -1962,6 +1975,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 21 December 2017 -Copyright (c) 1997-2017 University of Cambridge. +Last updated: 25 April 2018 +Copyright (c) 1997-2018 University of Cambridge. .fi diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt index 93efd24..ef00ef7 100644 --- a/doc/pcre2test.txt +++ b/doc/pcre2test.txt @@ -1071,7 +1071,7 @@ SUBJECT MODIFIERS get= extract captured substring getall extract all captured substrings /g global global matching - heap_limit= set a limit on heap memory + heap_limit= set a limit on heap memory (Kbytes) jitstack= set size of JIT stack mark show mark values match_limit= set a match limit @@ -1291,126 +1291,139 @@ SUBJECT MODIFIERS values in the match context via pcre2_set_heap_limit(), pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the minimum values for each parameter that allows the match to complete - without error. + without error. If JIT is being used, only the match limit is relevant. - If JIT is being used, only the match limit is relevant. 
If DFA matching - is being used, only the depth limit is relevant. + When using this modifier, the pattern should not contain any limit set- + tings such as (*LIMIT_MATCH=...) within it. If such a setting is + present and is lower than the minimum matching value, the minimum value + cannot be found because pcre2_set_match_limit() etc. are only able to + reduce the value of an in-pattern limit; they cannot increase it. - The match_limit number is a measure of the amount of backtracking that - takes place, and learning the minimum value can be instructive. For - most simple matches, the number is quite small, but for patterns with - very large numbers of matching possibilities, it can become large very - quickly with increasing length of subject string. - - For non-DFA matching, the minimum depth_limit number is a measure of + For non-DFA matching, the minimum depth_limit number is a measure of how much nested backtracking happens (that is, how deeply the pattern's - tree is searched). In the case of DFA matching, depth_limit controls - the depth of recursive calls of the internal function that is used for + tree is searched). In the case of DFA matching, depth_limit controls + the depth of recursive calls of the internal function that is used for handling pattern recursion, lookaround assertions, and atomic groups. + For non-DFA matching, the match_limit number is a measure of the amount + of backtracking that takes place, and learning the minimum value can be + instructive. For most simple matches, the number is quite small, but + for patterns with very large numbers of matching possibilities, it can + become large very quickly with increasing length of subject string. In + the case of DFA matching, match_limit controls the total number of + calls, both recursive and non-recursive, to the internal matching func- + tion, thus controlling the overall amount of computing resource that is + used. 
+ + For both kinds of matching, the heap_limit number (which is in kilo- + bytes) limits the amount of heap memory used for matching. A value of + zero disables the use of any heap memory; many simple pattern matches + can be done without using the heap, so this is not an unreasonable set- + ting. + Showing MARK names The mark modifier causes the names from backtracking control verbs that - are returned from calls to pcre2_match() to be displayed. If a mark is - returned for a match, non-match, or partial match, pcre2test shows it. - For a match, it is on a line by itself, tagged with "MK:". Otherwise, + are returned from calls to pcre2_match() to be displayed. If a mark is + returned for a match, non-match, or partial match, pcre2test shows it. + For a match, it is on a line by itself, tagged with "MK:". Otherwise, it is added to the non-match message. Showing memory usage - The memory modifier causes pcre2test to log the sizes of all heap mem- - ory allocation and freeing calls that occur during a call to - pcre2_match(). These occur only when a match requires a bigger vector - than the default for remembering backtracking points. In many cases - there will be no heap memory used and therefore no additional output. - No heap memory is allocated during matching with pcre2_dfa_match or - with JIT, so in those cases the memory modifier never has any effect. - For this modifier to work, the null_context modifier must not be set on - both the pattern and the subject, though it can be set on one or the - other. + The memory modifier causes pcre2test to log the sizes of all heap mem- + ory allocation and freeing calls that occur during a call to + pcre2_match() or pcre2_dfa_match(). These occur only when a match + requires a bigger vector than the default for remembering backtracking + points (pcre2_match()) or for internal workspace (pcre2_dfa_match()). + In many cases there will be no heap memory used and therefore no addi- + tional output. 
No heap memory is allocated during matching with JIT, so + in that case the memory modifier never has any effect. For this modi- + fier to work, the null_context modifier must not be set on both the + pattern and the subject, though it can be set on one or the other. Setting a starting offset - The offset modifier sets an offset in the subject string at which + The offset modifier sets an offset in the subject string at which matching starts. Its value is a number of code units, not characters. Setting an offset limit - The offset_limit modifier sets a limit for unanchored matches. If a + The offset_limit modifier sets a limit for unanchored matches. If a match cannot be found starting at or before this offset in the subject, a "no match" return is given. The data value is a number of code units, - not characters. When this modifier is used, the use_offset_limit modi- + not characters. When this modifier is used, the use_offset_limit modi- fier must have been set for the pattern; if not, an error is generated. Setting the size of the output vector - The ovector modifier applies only to the subject line in which it - appears, though of course it can also be used to set a default in a - #subject command. It specifies the number of pairs of offsets that are + The ovector modifier applies only to the subject line in which it + appears, though of course it can also be used to set a default in a + #subject command. It specifies the number of pairs of offsets that are available for storing matching information. The default is 15. - A value of zero is useful when testing the POSIX API because it causes + A value of zero is useful when testing the POSIX API because it causes regexec() to be called with a NULL capture vector. 
When not testing the - POSIX API, a value of zero is used to cause pcre2_match_data_cre- - ate_from_pattern() to be called, in order to create a match block of + POSIX API, a value of zero is used to cause pcre2_match_data_cre- + ate_from_pattern() to be called, in order to create a match block of exactly the right size for the pattern. (It is not possible to create a - match block with a zero-length ovector; there is always at least one + match block with a zero-length ovector; there is always at least one pair of offsets.) Passing the subject as zero-terminated By default, the subject string is passed to a native API matching func- tion with its correct length. In order to test the facility for passing - a zero-terminated string, the zero_terminate modifier is provided. It - causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching + a zero-terminated string, the zero_terminate modifier is provided. It + causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface, this modifier is ignored, with a warning. - When testing pcre2_substitute(), this modifier also has the effect of + When testing pcre2_substitute(), this modifier also has the effect of passing the replacement string as zero-terminated. Passing a NULL context - Normally, pcre2test passes a context block to pcre2_match(), + Normally, pcre2test passes a context block to pcre2_match(), pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is - set, however, NULL is passed. This is for testing that the matching + set, however, NULL is passed. This is for testing that the matching functions behave correctly in this case (they use default values). This - modifier cannot be used with the find_limits modifier or when testing + modifier cannot be used with the find_limits modifier or when testing the substitution function. 
THE ALTERNATIVE MATCHING FUNCTION - By default, pcre2test uses the standard PCRE2 matching function, + By default, pcre2test uses the standard PCRE2 matching function, pcre2_match() to match each subject line. PCRE2 also supports an alter- - native matching function, pcre2_dfa_match(), which operates in a dif- - ferent way, and has some restrictions. The differences between the two + native matching function, pcre2_dfa_match(), which operates in a dif- + ferent way, and has some restrictions. The differences between the two functions are described in the pcre2matching documentation. - If the dfa modifier is set, the alternative matching function is used. - This function finds all possible matches at a given point in the sub- - ject. If, however, the dfa_shortest modifier is set, processing stops - after the first match is found. This is always the shortest possible + If the dfa modifier is set, the alternative matching function is used. + This function finds all possible matches at a given point in the sub- + ject. If, however, the dfa_shortest modifier is set, processing stops + after the first match is found. This is always the shortest possible match. DEFAULT OUTPUT FROM pcre2test - This section describes the output when the normal matching function, + This section describes the output when the normal matching function, pcre2_match(), is being used. - When a match succeeds, pcre2test outputs the list of captured sub- - strings, starting with number 0 for the string that matched the whole - pattern. Otherwise, it outputs "No match" when the return is - PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially - matching substring when the return is PCRE2_ERROR_PARTIAL. 
(Note that - this is the entire substring that was inspected during the partial - match; it may include characters before the actual match start if a + When a match succeeds, pcre2test outputs the list of captured sub- + strings, starting with number 0 for the string that matched the whole + pattern. Otherwise, it outputs "No match" when the return is + PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially + matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that + this is the entire substring that was inspected during the partial + match; it may include characters before the actual match start if a lookbehind assertion, \K, \b, or \B was involved.) For any other return, pcre2test outputs the PCRE2 negative error number - and a short descriptive phrase. If the error is a failed UTF string - check, the code unit offset of the start of the failing character is + and a short descriptive phrase. If the error is a failed UTF string + check, the code unit offset of the start of the failing character is also output. Here is an example of an interactive pcre2test run. $ pcre2test @@ -1426,8 +1439,8 @@ DEFAULT OUTPUT FROM pcre2test Unset capturing substrings that are not followed by one that is set are not shown by pcre2test unless the allcaptures modifier is specified. In the following example, there are two capturing substrings, but when the - first data line is matched, the second, unset substring is not shown. - An "internal" unset substring is shown as "", as for the second + first data line is matched, the second, unset substring is not shown. + An "internal" unset substring is shown as "", as for the second data line. re> /(a)|(b)/ @@ -1439,11 +1452,11 @@ DEFAULT OUTPUT FROM pcre2test 1: 2: b - If the strings contain any non-printing characters, they are output as - \xhh escapes if the value is less than 256 and UTF mode is not set. 
+ If the strings contain any non-printing characters, they are output as + \xhh escapes if the value is less than 256 and UTF mode is not set. Otherwise they are output as \x{hh...} escapes. See below for the defi- - nition of non-printing characters. If the aftertext modifier is set, - the output for substring 0 is followed by the the rest of the subject + nition of non-printing characters. If the aftertext modifier is set, + the output for substring 0 is followed by the the rest of the subject string, identified by "0+" like this: re> /cat/aftertext @@ -1451,7 +1464,7 @@ DEFAULT OUTPUT FROM pcre2test 0: cat 0+ aract - If global matching is requested, the results of successive matching + If global matching is requested, the results of successive matching attempts are output in sequence, like this: re> /\Bi(\w\w)/g @@ -1463,8 +1476,8 @@ DEFAULT OUTPUT FROM pcre2test 0: ipp 1: pp - "No match" is output only if the first match attempt fails. Here is an - example of a failure message (the offset 4 that is specified by the + "No match" is output only if the first match attempt fails. Here is an + example of a failure message (the offset 4 that is specified by the offset modifier is past the end of the subject string): re> /xyz/ @@ -1472,7 +1485,7 @@ DEFAULT OUTPUT FROM pcre2test Error -24 (bad offset value) Note that whereas patterns can be continued over several lines (a plain - ">" prompt is used for continuations), subject lines may not. However + ">" prompt is used for continuations), subject lines may not. However newlines can be included in a subject by means of the \n escape (or \r, \r\n, etc., depending on the newline sequence setting). 
@@ -1480,7 +1493,7 @@ DEFAULT OUTPUT FROM pcre2test OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION When the alternative matching function, pcre2_dfa_match(), is used, the - output consists of a list of all the matches that start at the first + output consists of a list of all the matches that start at the first point in the subject where there is at least one match. For example: re> /(tang|tangerine|tan)/ @@ -1489,11 +1502,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tang 2: tan - Using the normal matching function on this data finds only "tang". The - longest matching string is always given first (and numbered zero). - After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", - followed by the partially matching substring. Note that this is the - entire substring that was inspected during the partial match; it may + Using the normal matching function on this data finds only "tang". The + longest matching string is always given first (and numbered zero). + After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", + followed by the partially matching substring. Note that this is the + entire substring that was inspected during the partial match; it may include characters before the actual match start if a lookbehind asser- tion, \b, or \B was involved. (\K is not supported for DFA matching.) @@ -1509,16 +1522,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tan 0: tan - The alternative matching function does not support substring capture, - so the modifiers that are concerned with captured substrings are not + The alternative matching function does not support substring capture, + so the modifiers that are concerned with captured substrings are not relevant. 
RESTARTING AFTER A PARTIAL MATCH - When the alternative matching function has given the PCRE2_ERROR_PAR- + When the alternative matching function has given the PCRE2_ERROR_PAR- TIAL return, indicating that the subject partially matched the pattern, - you can restart the match with additional subject data by means of the + you can restart the match with additional subject data by means of the dfa_restart modifier. For example: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ @@ -1527,37 +1540,37 @@ RESTARTING AFTER A PARTIAL MATCH data> n05\=dfa,dfa_restart 0: n05 - For further information about partial matching, see the pcre2partial + For further information about partial matching, see the pcre2partial documentation. CALLOUTS If the pattern contains any callout requests, pcre2test's callout func- - tion is called during matching unless callout_none is specified. This + tion is called during matching unless callout_none is specified. This works with both matching functions, and with JIT, though there are some - differences in behaviour. The output for callouts with numerical argu- + differences in behaviour. The output for callouts with numerical argu- ments and those with string arguments is slightly different. Callouts with numerical arguments By default, the callout function displays the callout number, the start - and current positions in the subject text at the callout time, and the + and current positions in the subject text at the callout time, and the next pattern item to be tested. For example: --->pqrabcdef 0 ^ ^ \d - This output indicates that callout number 0 occurred for a match - attempt starting at the fourth character of the subject string, when - the pointer was at the seventh character, and when the next pattern - item was \d. 
Just one circumflex is output if the start and current - positions are the same, or if the current position precedes the start + This output indicates that callout number 0 occurred for a match + attempt starting at the fourth character of the subject string, when + the pointer was at the seventh character, and when the next pattern + item was \d. Just one circumflex is output if the start and current + positions are the same, or if the current position precedes the start position, which can happen if the callout is in a lookbehind assertion. Callouts numbered 255 are assumed to be automatic callouts, inserted as a result of the auto_callout pattern modifier. In this case, instead of - showing the callout number, the offset in the pattern, preceded by a + showing the callout number, the offset in the pattern, preceded by a plus, is output. For example: re> /\d?[A-E]\*/auto_callout @@ -1570,7 +1583,7 @@ CALLOUTS 0: E* If a pattern contains (*MARK) items, an additional line is output when- - ever a change of latest mark is passed to the callout function. For + ever a change of latest mark is passed to the callout function. For example: re> /a(*MARK:X)bc/auto_callout @@ -1584,17 +1597,17 @@ CALLOUTS +12 ^ ^ 0: abc - The mark changes between matching "a" and "b", but stays the same for - the rest of the match, so nothing more is output. If, as a result of - backtracking, the mark reverts to being unset, the text "" is + The mark changes between matching "a" and "b", but stays the same for + the rest of the match, so nothing more is output. If, as a result of + backtracking, the mark reverts to being unset, the text "" is output. 
Callouts with string arguments The output for a callout with a string argument is similar, except that - instead of outputting a callout number before the position indicators, - the callout string and its offset in the pattern string are output - before the reflection of the subject string, and the subject string is + instead of outputting a callout number before the position indicators, + the callout string and its offset in the pattern string are output + before the reflection of the subject string, and the subject string is reflected for each callout. For example: re> /^ab(?C'first')cd(?C"second")ef/ @@ -1610,26 +1623,26 @@ CALLOUTS Callout modifiers - The callout function in pcre2test returns zero (carry on matching) by - default, but you can use a callout_fail modifier in a subject line to + The callout function in pcre2test returns zero (carry on matching) by + default, but you can use a callout_fail modifier in a subject line to change this and other parameters of the callout (see below). If the callout_capture modifier is set, the current captured groups are output when a callout occurs. This is useful only for non-DFA matching, - as pcre2_dfa_match() does not support capturing, so no captures are + as pcre2_dfa_match() does not support capturing, so no captures are ever shown. The normal callout output, showing the callout number or pattern offset - (as described above) is suppressed if the callout_no_where modifier is + (as described above) is suppressed if the callout_no_where modifier is set. - When using the interpretive matching function pcre2_match() without - JIT, setting the callout_extra modifier causes additional output from - pcre2test's callout function to be generated. For the first callout in - a match attempt at a new starting position in the subject, "New match - attempt" is output. 
If there has been a backtrack since the last call- + When using the interpretive matching function pcre2_match() without + JIT, setting the callout_extra modifier causes additional output from + pcre2test's callout function to be generated. For the first callout in + a match attempt at a new starting position in the subject, "New match + attempt" is output. If there has been a backtrack since the last call- out (or start of matching if this is the first callout), "Backtrack" is - output, followed by "No other matching paths" if the backtrack ended + output, followed by "No other matching paths" if the backtrack ended the previous match attempt. For example: re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess @@ -1666,82 +1679,82 @@ CALLOUTS +1 ^ a+ No match - Notice that various optimizations must be turned off if you want all - possible matching paths to be scanned. If no_start_optimize is not - used, there is an immediate "no match", without any callouts, because - the starting optimization fails to find "b" in the subject, which it - knows must be present for any match. If no_auto_possess is not used, - the "a+" item is turned into "a++", which reduces the number of back- + Notice that various optimizations must be turned off if you want all + possible matching paths to be scanned. If no_start_optimize is not + used, there is an immediate "no match", without any callouts, because + the starting optimization fails to find "b" in the subject, which it + knows must be present for any match. If no_auto_possess is not used, + the "a+" item is turned into "a++", which reduces the number of back- tracks. - The callout_extra modifier has no effect if used with the DFA matching + The callout_extra modifier has no effect if used with the DFA matching function, or with JIT. Return values from callouts - The default return from the callout function is zero, which allows + The default return from the callout function is zero, which allows matching to continue. 
The callout_fail modifier can be given one or two numbers. If there is only one number, 1 is returned instead of 0 (caus- ing matching to backtrack) when a callout of that number is reached. If - two numbers (:) are given, 1 is returned when callout is - reached and there have been at least callouts. The callout_error + two numbers (:) are given, 1 is returned when callout is + reached and there have been at least callouts. The callout_error modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus- - ing the entire matching process to be aborted. If both these modifiers - are set for the same callout number, callout_error takes precedence. - Note that callouts with string arguments are always given the number + ing the entire matching process to be aborted. If both these modifiers + are set for the same callout number, callout_error takes precedence. + Note that callouts with string arguments are always given the number zero. - The callout_data modifier can be given an unsigned or a negative num- - ber. This is set as the "user data" that is passed to the matching - function, and passed back when the callout function is invoked. Any - value other than zero is used as a return from pcre2test's callout + The callout_data modifier can be given an unsigned or a negative num- + ber. This is set as the "user data" that is passed to the matching + function, and passed back when the callout function is invoked. Any + value other than zero is used as a return from pcre2test's callout function. Inserting callouts can be helpful when using pcre2test to check compli- - cated regular expressions. For further information about callouts, see + cated regular expressions. For further information about callouts, see the pcre2callout documentation. 
NON-PRINTING CHARACTERS When pcre2test is outputting text in the compiled version of a pattern, - bytes other than 32-126 are always treated as non-printing characters + bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes. - When pcre2test is outputting text that is a matched part of a subject - string, it behaves in the same way, unless a different locale has been - set for the pattern (using the locale modifier). In this case, the - isprint() function is used to distinguish printing and non-printing + When pcre2test is outputting text that is a matched part of a subject + string, it behaves in the same way, unless a different locale has been + set for the pattern (using the locale modifier). In this case, the + isprint() function is used to distinguish printing and non-printing characters. SAVING AND RESTORING COMPILED PATTERNS - It is possible to save compiled patterns on disc or elsewhere, and + It is possible to save compiled patterns on disc or elsewhere, and reload them later, subject to a number of restrictions. JIT data cannot - be saved. The host on which the patterns are reloaded must be running + be saved. The host on which the patterns are reloaded must be running the same version of PCRE2, with the same code unit width, and must also - have the same endianness, pointer width and PCRE2_SIZE type. Before - compiled patterns can be saved they must be serialized, that is, con- - verted to a stream of bytes. A single byte stream may contain any num- - ber of compiled patterns, but they must all use the same character + have the same endianness, pointer width and PCRE2_SIZE type. Before + compiled patterns can be saved they must be serialized, that is, con- + verted to a stream of bytes. A single byte stream may contain any num- + ber of compiled patterns, but they must all use the same character tables. A single copy of the tables is included in the byte stream (its size is 1088 bytes). 
- The functions whose names begin with pcre2_serialize_ are used for - serializing and de-serializing. They are described in the pcre2serial- + The functions whose names begin with pcre2_serialize_ are used for + serializing and de-serializing. They are described in the pcre2serial- ize documentation. In this section we describe the features of pcre2test that can be used to test these functions. - When a pattern with push modifier is successfully compiled, it is - pushed onto a stack of compiled patterns, and pcre2test expects the - next line to contain a new pattern (or command) instead of a subject - line. By contrast, the pushcopy modifier causes a copy of the compiled - pattern to be stacked, leaving the original available for immediate - matching. By using push and/or pushcopy, a number of patterns can be + When a pattern with push modifier is successfully compiled, it is + pushed onto a stack of compiled patterns, and pcre2test expects the + next line to contain a new pattern (or command) instead of a subject + line. By contrast, the pushcopy modifier causes a copy of the compiled + pattern to be stacked, leaving the original available for immediate + matching. By using push and/or pushcopy, a number of patterns can be compiled and retained. These modifiers are incompatible with posix, and - control modifiers that act at match time are ignored (with a message) - for the stacked patterns. The jitverify modifier applies only at com- + control modifiers that act at match time are ignored (with a message) + for the stacked patterns. The jitverify modifier applies only at com- pile time. The command @@ -1749,21 +1762,21 @@ SAVING AND RESTORING COMPILED PATTERNS #save causes all the stacked patterns to be serialized and the result written - to the named file. Afterwards, all the stacked patterns are freed. The + to the named file. Afterwards, all the stacked patterns are freed. 
The command #load - reads the data in the file, and then arranges for it to be de-serial- - ized, with the resulting compiled patterns added to the pattern stack. - The pattern on the top of the stack can be retrieved by the #pop com- - mand, which must be followed by lines of subjects that are to be - matched with the pattern, terminated as usual by an empty line or end - of file. This command may be followed by a modifier list containing - only control modifiers that act after a pattern has been compiled. In + reads the data in the file, and then arranges for it to be de-serial- + ized, with the resulting compiled patterns added to the pattern stack. + The pattern on the top of the stack can be retrieved by the #pop com- + mand, which must be followed by lines of subjects that are to be + matched with the pattern, terminated as usual by an empty line or end + of file. This command may be followed by a modifier list containing + only control modifiers that act after a pattern has been compiled. In particular, hex, posix, posix_nosub, push, and pushcopy are not - allowed, nor are any option-setting modifiers. The JIT modifiers are, - however permitted. Here is an example that saves and reloads two pat- + allowed, nor are any option-setting modifiers. The JIT modifiers are, + however permitted. Here is an example that saves and reloads two pat- terns. /abc/push @@ -1776,10 +1789,10 @@ SAVING AND RESTORING COMPILED PATTERNS #pop jit,bincode abc - If jitverify is used with #pop, it does not automatically imply jit, + If jitverify is used with #pop, it does not automatically imply jit, which is different behaviour from when it is used on a pattern. - The #popcopy command is analagous to the pushcopy modifier in that it + The #popcopy command is analagous to the pushcopy modifier in that it makes current a copy of the topmost stack pattern, leaving the original still on the stack. 
@@ -1799,5 +1812,5 @@ AUTHOR REVISION - Last updated: 21 December 2017 - Copyright (c) 1997-2017 University of Cambridge. + Last updated: 25 April 2018 + Copyright (c) 1997-2018 University of Cambridge. diff --git a/src/config.h.in b/src/config.h.in index 7a3a861..14b48e9 100644 --- a/src/config.h.in +++ b/src/config.h.in @@ -132,8 +132,9 @@ sure both macros are undefined; an emulation function will then be used. */ /* Define to 1 if you have the header file. */ #undef HAVE_ZLIB_H -/* This limits the amount of memory that pcre2_match() may use while matching - a pattern. The value is in kilobytes. */ +/* This limits the amount of memory that may be used while matching a pattern. + It applies to both pcre2_match() and pcre2_dfa_match(). It does not apply + to JIT matching. The value is in kilobytes. */ #undef HEAP_LIMIT /* The value of LINK_SIZE determines the number of bytes used to store links @@ -148,7 +149,8 @@ sure both macros are undefined; an emulation function will then be used. */ /* The value of MATCH_LIMIT determines the default number of times the pcre2_match() function can record a backtrack position during a single - matching attempt. There is a runtime interface for setting a different + matching attempt. The value is also used to limit a loop counter in + pcre2_dfa_match(). There is a runtime interface for setting a different limit. The limit exists in order to catch runaway regular expressions that take for ever to determine that they do not match. The default is set very large so that it does not accidentally catch legitimate cases. */ @@ -161,7 +163,9 @@ sure both macros are undefined; an emulation function will then be used. */ MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it must be less than the value of MATCH_LIMIT. The default is to use the same value as MATCH_LIMIT. There is a runtime method for setting a different - limit. */ + limit. 
In the case of pcre2_dfa_match(), this limit controls the depth of + the internal nested function calls that are used for pattern recursions, + lookarounds, and atomic groups. */ #undef MATCH_LIMIT_DEPTH /* This limit is parameterized just in case anybody ever wants to change it. diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c index fc04bfc..bc62e6b 100644 --- a/src/pcre2_dfa_match.c +++ b/src/pcre2_dfa_match.c @@ -292,6 +292,35 @@ typedef struct stateblock { #define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int)) +/* Before version 10.32 the recursive calls of internal_dfa_match() were passed +local working space and output vectors that were created on the stack. This has +caused issues for some patterns, especially in small-stack environments such as +Windows. A new scheme is now in use which sets up a vector on the stack, but if +this is too small, heap memory is used, up to the heap_limit. The main +parameters are all numbers of ints because the workspace is a vector of ints. + +The size of the starting stack vector, DFA_START_RWS_SIZE, is in bytes, and is +defined in pcre2_internal.h so as to be available to pcre2test when it is +finding the minimum heap requirement for a match. */ + +#define OVEC_UNIT (sizeof(PCRE2_SIZE)/sizeof(int)) + +#define RWS_BASE_SIZE (DFA_START_RWS_SIZE/sizeof(int)) /* Stack vector */ +#define RWS_RSIZE 1000 /* Work size for recursion */ +#define RWS_OVEC_RSIZE (1000*OVEC_UNIT) /* Ovector for recursion */ +#define RWS_OVEC_OSIZE (2*OVEC_UNIT) /* Ovector in other cases */ + +/* This structure is at the start of each workspace block. 
*/ + +typedef struct RWS_anchor { + struct RWS_anchor *next; + unsigned int size; /* Number of ints */ + unsigned int free; /* Number of ints */ +} RWS_anchor; + +#define RWS_ANCHOR_SIZE (sizeof(RWS_anchor)/sizeof(int)) + + /************************************************* * Process a callout * @@ -353,6 +382,61 @@ return (mb->callout)(cb, mb->callout_data); +/************************************************* +* Expand local workspace memory * +*************************************************/ + +/* This function is called when internal_dfa_match() is about to be called +recursively and there is insufficient working space left in the current work +space block. If there's an existing next block, use it; otherwise get a new +block unless the heap limit is reached. + +Arguments: + rwsptr pointer to block pointer (updated) + ovecsize space needed for an ovector + mb the match block + +Returns: 0 rwsptr has been updated + !0 an error code +*/ + +static int +more_workspace(RWS_anchor **rwsptr, unsigned int ovecsize, dfa_match_block *mb) +{ +RWS_anchor *rws = *rwsptr; +RWS_anchor *new; + +if (rws->next != NULL) + { + new = rws->next; + } + +/* All sizes are in units of sizeof(int), except for mb->heap_limit, which is in +kilobytes. 
*/ + +else + { + unsigned int newsize = rws->size * 2; + unsigned int heapleft = (unsigned int) + (((1024/sizeof(int))*mb->heap_limit - mb->heap_used)); + if (newsize > heapleft) newsize = heapleft; + if (newsize < RWS_RSIZE + ovecsize + RWS_ANCHOR_SIZE) + return PCRE2_ERROR_HEAPLIMIT; + new = mb->memctl.malloc(newsize*sizeof(int), mb->memctl.memory_data); + if (new == NULL) return PCRE2_ERROR_NOMEMORY; + mb->heap_used += newsize; + new->next = NULL; + new->size = newsize; + rws->next = new; + } + +new->free = new->size - RWS_ANCHOR_SIZE; +*rwsptr = new; +return 0; +} + + + /************************************************* * Match a Regular Expression - DFA engine * *************************************************/ @@ -431,7 +515,8 @@ internal_dfa_match( uint32_t offsetcount, int *workspace, int wscount, - uint32_t rlevel) + uint32_t rlevel, + int *RWS) { stateblock *active_states, *new_states, *temp_states; stateblock *next_active_state, *next_new_state; @@ -2587,10 +2672,22 @@ for (;;) case OP_ASSERTBACK: case OP_ASSERTBACK_NOT: { - PCRE2_SPTR endasscode = code + GET(code, 1); - PCRE2_SIZE local_offsets[2]; int rc; - int local_workspace[1000]; + int *local_workspace; + PCRE2_SIZE *local_offsets; + PCRE2_SPTR endasscode = code + GET(code, 1); + RWS_anchor *rws = (RWS_anchor *)RWS; + + if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE) + { + rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb); + if (rc != 0) return rc; + RWS = (int *)rws; + } + + local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free); + local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE; + rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE; while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1); @@ -2600,10 +2697,13 @@ for (;;) ptr, /* where we currently are */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */ local_offsets, /* offset vector */ - sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ + RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */ local_workspace, /* workspace vector */ - 
sizeof(local_workspace)/sizeof(int), /* size of same */ - rlevel); /* function recursion level */ + RWS_RSIZE, /* size of same */ + rlevel, /* function recursion level */ + RWS); /* recursion workspace */ + + rws->free += RWS_RSIZE + RWS_OVEC_OSIZE; if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc; if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK)) @@ -2670,11 +2770,23 @@ for (;;) else { - PCRE2_SIZE local_offsets[2]; - int local_workspace[1000]; int rc; + int *local_workspace; + PCRE2_SIZE *local_offsets; PCRE2_SPTR asscode = code + LINK_SIZE + 1; PCRE2_SPTR endasscode = asscode + GET(asscode, 1); + RWS_anchor *rws = (RWS_anchor *)RWS; + + if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE) + { + rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb); + if (rc != 0) return rc; + RWS = (int *)rws; + } + + local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free); + local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE; + rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE; while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1); @@ -2684,10 +2796,13 @@ for (;;) ptr, /* where we currently are */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */ local_offsets, /* offset vector */ - sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ + RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */ local_workspace, /* workspace vector */ - sizeof(local_workspace)/sizeof(int), /* size of same */ - rlevel); /* function recursion level */ + RWS_RSIZE, /* size of same */ + rlevel, /* function recursion level */ + RWS); /* recursion work space */ + + rws->free += RWS_RSIZE + RWS_OVEC_OSIZE; if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc; if ((rc >= 0) == @@ -2702,13 +2817,25 @@ for (;;) /*-----------------------------------------------------------------*/ case OP_RECURSE: { + int rc; + int *local_workspace; + PCRE2_SIZE *local_offsets; + RWS_anchor *rws = (RWS_anchor *)RWS; dfa_recursion_info *ri; - PCRE2_SIZE local_offsets[1000]; - int local_workspace[1000]; 
PCRE2_SPTR callpat = start_code + GET(code, 1); uint32_t recno = (callpat == mb->start_code)? 0 : GET2(callpat, 1 + LINK_SIZE); - int rc; + + if (rws->free < RWS_RSIZE + RWS_OVEC_RSIZE) + { + rc = more_workspace(&rws, RWS_OVEC_RSIZE, mb); + if (rc != 0) return rc; + RWS = (int *)rws; + } + + local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free); + local_workspace = ((int *)local_offsets) + RWS_OVEC_RSIZE; + rws->free -= RWS_RSIZE + RWS_OVEC_RSIZE; /* Check for repeating a recursion without advancing the subject pointer. This should catch convoluted mutual recursions. (Some simple @@ -2732,11 +2859,13 @@ for (;;) ptr, /* where we currently are */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */ local_offsets, /* offset vector */ - sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ + RWS_OVEC_RSIZE/OVEC_UNIT, /* size of same */ local_workspace, /* workspace vector */ - sizeof(local_workspace)/sizeof(int), /* size of same */ - rlevel); /* function recursion level */ + RWS_RSIZE, /* size of same */ + rlevel, /* function recursion level */ + RWS); /* recursion workspace */ + rws->free += RWS_RSIZE + RWS_OVEC_RSIZE; mb->recursive = new_recursive.prevrec; /* Done this recursion */ /* Ran out of internal offsets */ @@ -2782,10 +2911,25 @@ for (;;) case OP_SCBRAPOS: case OP_BRAPOSZERO: { + int rc; + int *local_workspace; + PCRE2_SIZE *local_offsets; PCRE2_SIZE charcount, matched_count; PCRE2_SPTR local_ptr = ptr; + RWS_anchor *rws = (RWS_anchor *)RWS; BOOL allow_zero; + if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE) + { + rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb); + if (rc != 0) return rc; + RWS = (int *)rws; + } + + local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free); + local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE; + rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE; + if (codevalue == OP_BRAPOSZERO) { allow_zero = TRUE; @@ -2798,19 +2942,17 @@ for (;;) for (matched_count = 0;; matched_count++) { - PCRE2_SIZE local_offsets[2]; - int 
local_workspace[1000]; - - int rc = internal_dfa_match( + rc = internal_dfa_match( mb, /* fixed match data */ code, /* this subexpression's code */ local_ptr, /* where we currently are */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */ local_offsets, /* offset vector */ - sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ + RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */ local_workspace, /* workspace vector */ - sizeof(local_workspace)/sizeof(int), /* size of same */ - rlevel); /* function recursion level */ + RWS_RSIZE, /* size of same */ + rlevel, /* function recursion level */ + RWS); /* recursion workspace */ /* Failed to match */ @@ -2827,6 +2969,8 @@ for (;;) local_ptr += charcount; /* Advance temporary position ptr */ } + rws->free += RWS_RSIZE + RWS_OVEC_OSIZE; + /* At this point we have matched the subpattern matched_count times, and local_ptr is pointing to the character after the end of the last match. */ @@ -2869,19 +3013,35 @@ for (;;) /*-----------------------------------------------------------------*/ case OP_ONCE: { - PCRE2_SIZE local_offsets[2]; - int local_workspace[1000]; + int rc; + int *local_workspace; + PCRE2_SIZE *local_offsets; + RWS_anchor *rws = (RWS_anchor *)RWS; - int rc = internal_dfa_match( + if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE) + { + rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb); + if (rc != 0) return rc; + RWS = (int *)rws; + } + + local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free); + local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE; + rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE; + + rc = internal_dfa_match( mb, /* fixed match data */ code, /* this subexpression's code */ ptr, /* where we currently are */ (PCRE2_SIZE)(ptr - start_subject), /* start offset */ local_offsets, /* offset vector */ - sizeof(local_offsets)/sizeof(PCRE2_SIZE), /* size of same */ + RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */ local_workspace, /* workspace vector */ - sizeof(local_workspace)/sizeof(int), /* size of same */ - 
rlevel); /* function recursion level */ + RWS_RSIZE, /* size of same */ + rlevel, /* function recursion level */ + RWS); /* recursion workspace */ + + rws->free += RWS_RSIZE + RWS_OVEC_OSIZE; if (rc >= 0) { @@ -3063,6 +3223,7 @@ pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length, PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data, pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount) { +int rc; const pcre2_real_code *re = (const pcre2_real_code *)code; PCRE2_SPTR start_match; @@ -3071,9 +3232,9 @@ PCRE2_SPTR bumpalong_limit; PCRE2_SPTR req_cu_ptr; BOOL utf, anchored, startline, firstline; - BOOL has_first_cu = FALSE; BOOL has_req_cu = FALSE; + PCRE2_UCHAR first_cu = 0; PCRE2_UCHAR first_cu2 = 0; PCRE2_UCHAR req_cu = 0; @@ -3088,6 +3249,17 @@ pcre2_callout_block cb; dfa_match_block actual_match_block; dfa_match_block *mb = &actual_match_block; +/* Set up a starting block of memory for use during recursive calls to +internal_dfa_match(). By putting this on the stack, it minimizes resource use +in the case when it is not needed. If this is too small, more memory is +obtained from the heap. At the start of each block is an anchor structure.*/ + +int base_recursion_workspace[RWS_BASE_SIZE]; +RWS_anchor *rws = (RWS_anchor *)base_recursion_workspace; +rws->next = NULL; +rws->size = RWS_BASE_SIZE; +rws->free = RWS_BASE_SIZE - RWS_ANCHOR_SIZE; + /* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated subject string. 
*/ @@ -3184,6 +3356,7 @@ if (mcontext == NULL) mb->memctl = re->memctl; mb->match_limit = PRIV(default_match_context).match_limit; mb->match_limit_depth = PRIV(default_match_context).depth_limit; + mb->heap_limit = PRIV(default_match_context).heap_limit; } else { @@ -3198,6 +3371,7 @@ else mb->memctl = mcontext->memctl; mb->match_limit = mcontext->match_limit; mb->match_limit_depth = mcontext->depth_limit; + mb->heap_limit = mcontext->heap_limit; } if (mb->match_limit > re->limit_match) @@ -3206,6 +3380,9 @@ if (mb->match_limit > re->limit_match) if (mb->match_limit_depth > re->limit_depth) mb->match_limit_depth = re->limit_depth; +if (mb->heap_limit > re->limit_heap) + mb->heap_limit = re->limit_heap; + mb->start_code = (PCRE2_UCHAR *)((uint8_t *)re + sizeof(pcre2_real_code)) + re->name_count * re->name_entry_size; mb->tables = re->tables; @@ -3215,6 +3392,7 @@ mb->start_offset = start_offset; mb->moptions = options; mb->poptions = re->overall_options; mb->match_call_count = 0; +mb->heap_used = 0; /* Process the \R and newline settings. */ @@ -3351,8 +3529,6 @@ a match. */ for (;;) { - int rc; - /* ----------------- Start of match optimizations ---------------- */ /* There are some optimizations that avoid running the match if a known @@ -3544,7 +3720,7 @@ for (;;) in characters, we treat it as code units to avoid spending too much time in this optimization. */ - if (end_subject - start_match < re->minlength) return PCRE2_ERROR_NOMATCH; + if (end_subject - start_match < re->minlength) goto NOMATCH_EXIT; /* If req_cu is set, we know that that code unit must appear in the subject for the match to succeed. 
If the first code unit is set, req_cu @@ -3621,7 +3797,8 @@ for (;;) (uint32_t)match_data->oveccount * 2, /* actual size of same */ workspace, /* workspace vector */ (int)wscount, /* size of same */ - 0); /* function recurse level */ + 0, /* function recurse level */ + base_recursion_workspace); /* initial workspace for recursion */ /* Anything other than "no match" means we are done, always; otherwise, carry on only if not anchored. */ @@ -3637,7 +3814,7 @@ for (;;) match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject); match_data->startchar = (PCRE2_SIZE)(start_match - subject); match_data->rc = rc; - return rc; + goto EXIT; } /* Advance to the next subject character unless we are at the end of a line @@ -3668,8 +3845,18 @@ for (;;) } /* "Bumpalong" loop */ +NOMATCH_EXIT: +rc = PCRE2_ERROR_NOMATCH; -return PCRE2_ERROR_NOMATCH; +EXIT: +while (rws->next != NULL) + { + RWS_anchor *next = rws->next; + rws->next = next->next; + mb->memctl.free(next, mb->memctl.memory_data); + } + +return rc; } /* End of pcre2_dfa_match.c */ diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h index 3db9d60..f9e18f3 100644 --- a/src/pcre2_internal.h +++ b/src/pcre2_internal.h @@ -253,6 +253,11 @@ maximum size of this can be limited. */ #define START_FRAMES_SIZE 20480 +/* Similarly, for DFA matching, an initial internal workspace vector is +allocated on the stack. */ + +#define DFA_START_RWS_SIZE 30720 + /* Define the default BSR convention. 
*/ #ifdef BSR_ANYCRLF diff --git a/src/pcre2_intmodedep.h b/src/pcre2_intmodedep.h index f5805aa..ce95e68 100644 --- a/src/pcre2_intmodedep.h +++ b/src/pcre2_intmodedep.h @@ -896,6 +896,8 @@ typedef struct dfa_match_block { PCRE2_SPTR last_used_ptr; /* Latest consulted character */ const uint8_t *tables; /* Character tables */ PCRE2_SIZE start_offset; /* The start offset value */ + PCRE2_SIZE heap_limit; /* As it says */ + PCRE2_SIZE heap_used; /* As it says */ uint32_t match_limit; /* As it says */ uint32_t match_limit_depth; /* As it says */ uint32_t match_call_count; /* Number of calls of internal function */ diff --git a/src/pcre2test.c b/src/pcre2test.c index ad3db2c..fe6ef79 100644 --- a/src/pcre2test.c +++ b/src/pcre2test.c @@ -5760,6 +5760,8 @@ PCRE2_SET_HEAP_LIMIT(dat_context, max); for (;;) { + uint32_t stack_start = 0; + if (errnumber == PCRE2_ERROR_HEAPLIMIT) { PCRE2_SET_HEAP_LIMIT(dat_context, mid); @@ -5775,6 +5777,7 @@ for (;;) if ((dat_datctl.control & CTL_DFA) != 0) { + stack_start = DFA_START_RWS_SIZE/1024; if (dfa_workspace == NULL) dfa_workspace = (int *)malloc(DFA_WS_DIMENSION*sizeof(int)); if (dfa_matched++ == 0) @@ -5789,11 +5792,21 @@ for (;;) dat_datctl.options, match_data, PTR(dat_context)); else + { + stack_start = START_FRAMES_SIZE/1024; PCRE2_MATCH(capcount, compiled_code, pp, ulen, dat_datctl.offset, dat_datctl.options, match_data, PTR(dat_context)); + } if (capcount == errnumber) { + if ((mid & 0x80000000u) != 0) + { + fprintf(outfile, "Can't find minimum %s limit: check pattern for " + "restriction\n", msg); + break; + } + min = mid; mid = (mid == max - 1)? max : (max != UINT32_MAX)? (min + max)/2 : mid*2; } @@ -5802,11 +5815,12 @@ for (;;) capcount == PCRE2_ERROR_PARTIAL) { /* If we've not hit the error with a heap limit less than the size of the - initial stack frame vector, the heap is not being used, so the minimum - limit is zero; there's no need to go on. The other limits are always - greater than zero. 
*/ + initial stack frame vector (for pcre2_match()) or the initial stack + workspace vector (for pcre2_dfa_match()), the heap is not being used, so + the minimum limit is zero; there's no need to go on. The other limits are + always greater than zero. */ - if (errnumber == PCRE2_ERROR_HEAPLIMIT && mid < START_FRAMES_SIZE/1024) + if (errnumber == PCRE2_ERROR_HEAPLIMIT && mid < stack_start) { fprintf(outfile, "Minimum %s limit = 0\n", msg); break; @@ -6771,7 +6785,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0) PCRE2_SIZE end = pmatch[i].rm_eo; for (j = last_printed + 1; j < i; j++) fprintf(outfile, "%2d: \n", (int)j); - last_printed = i; + last_printed = i; if (start > end) { start = pmatch[i].rm_eo; @@ -7139,18 +7153,16 @@ else for (gmatched = 0;; gmatched++) (double)CLOCKS_PER_SEC); } - /* Find the heap, match and depth limits if requested. The match and heap - limits are not relevant for DFA matching and the depth and heap limits are - not relevant for JIT. The return from check_match_limit() is the return from - the final call to pcre2_match() or pcre2_dfa_match(). */ + /* Find the heap, match and depth limits if requested. The depth and heap + limits are not relevant for JIT. The return from check_match_limit() is the + return from the final call to pcre2_match() or pcre2_dfa_match(). 
*/ if ((dat_datctl.control & CTL_FINDLIMITS) != 0) { capcount = 0; /* This stops compiler warnings */ - if ((dat_datctl.control & CTL_DFA) == 0 && - (FLD(compiled_code, executable_jit) == NULL || - (dat_datctl.options & PCRE2_NO_JIT) != 0)) + if (FLD(compiled_code, executable_jit) == NULL || + (dat_datctl.options & PCRE2_NO_JIT) != 0) { (void)check_match_limit(pp, arg_ulen, PCRE2_ERROR_HEAPLIMIT, "heap"); } @@ -7165,6 +7177,12 @@ else for (gmatched = 0;; gmatched++) capcount = check_match_limit(pp, arg_ulen, PCRE2_ERROR_DEPTHLIMIT, "depth"); } + + if (capcount == 0) + { + fprintf(outfile, "Matched, but offsets vector is too small to show all matches\n"); + capcount = dat_datctl.oveccount; + } } /* Otherwise just run a single match, setting up a callout if required (the @@ -7877,7 +7895,7 @@ else (void)PCRE2_CONFIG(PCRE2_CONFIG_NEWLINE, &optval); print_newline_config(optval, FALSE); (void)PCRE2_CONFIG(PCRE2_CONFIG_BSR, &optval); -printf(" \\R matches %s\n", +printf(" \\R matches %s\n", (optval == PCRE2_BSR_ANYCRLF)? 
"CR, LF, or CRLF only" : "all Unicode newlines"); (void)PCRE2_CONFIG(PCRE2_CONFIG_NEVER_BACKSLASH_C, &optval); diff --git a/testdata/testinput6 b/testdata/testinput6 index e2f00c0..af1dc03 100644 --- a/testdata/testinput6 +++ b/testdata/testinput6 @@ -4874,6 +4874,14 @@ \= Expect depth limit exceeded a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] +/(*LIMIT_HEAP=0)^((.)(?1)|.)$/ +\= Expect heap limit exceeded + a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] + +/(*LIMIT_HEAP=50000)^((.)(?1)|.)$/ +\= Expect success + a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] + /(02-)?[0-9]{3}-[0-9]{3}/ 02-123-123 diff --git a/testdata/testoutput6 b/testdata/testoutput6 index b409fe0..32287b1 100644 --- a/testdata/testoutput6 +++ b/testdata/testoutput6 @@ -7667,12 +7667,23 @@ No match a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] Failed: error -53: matching depth limit exceeded +/(*LIMIT_HEAP=0)^((.)(?1)|.)$/ +\= Expect heap limit exceeded + a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] +Failed: error -63: heap limit exceeded + +/(*LIMIT_HEAP=50000)^((.)(?1)|.)$/ +\= Expect success + a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] + 0: a[00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00] + /(02-)?[0-9]{3}-[0-9]{3}/ 02-123-123 0: 02-123-123 /^(a(?2))(b)(?1)/ abbab\=find_limits +Minimum heap limit = 0 Minimum match limit = 4 Minimum depth limit = 2 0: abbab
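
Illustrative note (not part of the patch): the workspace scheme introduced
above starts with a block on the stack and chains heap blocks behind it when
the stack block runs out, subject to a heap limit given in kilobytes. The
standalone sketch below shows that idea in isolation. Every name in it
(block_anchor, heap_budget, grow_chain, free_chain, demo_chain) and the error
values -1 and -2 are hypothetical stand-ins chosen for this example, not PCRE2
identifiers. Unlike the patch, the sketch also checks that a reused block is
large enough, and it uses a union to keep the stack block suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical anchor placed at the start of every workspace block.
   All sizes are counted in ints, as in the patch. */
typedef struct block_anchor {
  struct block_anchor *next;
  unsigned int size;   /* total ints in this block, anchor included */
  unsigned int free;   /* ints not yet handed out */
} block_anchor;

#define ANCHOR_INTS \
  ((unsigned int)((sizeof(block_anchor) + sizeof(int) - 1) / sizeof(int)))

/* Hypothetical budget: the limit is in kilobytes, usage is in ints. */
typedef struct heap_budget {
  size_t limit_kb;
  size_t used_ints;
} heap_budget;

/* Move *cur to a block with at least `need` free ints.
   Returns 0 on success, -1 if the heap limit forbids it, -2 if malloc fails. */
static int grow_chain(block_anchor **cur, unsigned int need, heap_budget *hb)
{
  block_anchor *b = *cur;
  block_anchor *nb = b->next;

  assert(b->size >= ANCHOR_INTS);

  if (nb != NULL)
    {
    /* Reuse a block left over from an earlier, deeper call (extra check
       added in this sketch). */
    if (nb->size < need + ANCHOR_INTS) return -1;
    }
  else
    {
    size_t limit_ints = (1024 / sizeof(int)) * hb->limit_kb;
    size_t left = (limit_ints > hb->used_ints) ? limit_ints - hb->used_ints : 0;
    size_t newsize = (size_t)b->size * 2;      /* double each time */

    if (newsize > left) newsize = left;        /* clamp to the budget */
    if (newsize < (size_t)need + ANCHOR_INTS) return -1;

    nb = malloc(newsize * sizeof(int));
    if (nb == NULL) return -2;
    hb->used_ints += newsize;
    nb->next = NULL;
    nb->size = (unsigned int)newsize;
    b->next = nb;
    }

  nb->free = nb->size - ANCHOR_INTS;
  *cur = nb;
  return 0;
}

/* Release every heap block chained behind the stack-resident base block. */
static void free_chain(block_anchor *base)
{
  while (base->next != NULL)
    {
    block_anchor *n = base->next;
    base->next = n->next;
    free(n);
    }
}

/* Small driver: a 64-int stack block that must grow to satisfy 100 ints. */
int demo_chain(size_t limit_kb)
{
  union { block_anchor anchor; int ints[64]; } base;  /* aligned stack block */
  block_anchor *cur = &base.anchor;
  heap_budget hb = { limit_kb, 0 };
  int rc;

  cur->next = NULL;
  cur->size = 64;
  cur->free = 64 - ANCHOR_INTS;

  rc = grow_chain(&cur, 100, &hb);
  free_chain(&base.anchor);
  return rc;
}
```

With a limit of 0 kilobytes the driver cannot obtain a heap block and reports
the limit error; with a limit of 1 kilobyte the doubled block of 128 ints fits
within the budget and the call succeeds. This parallels the two LIMIT_HEAP
tests added to testinput6, though the numbers here are specific to the sketch.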