From ed9f34b06b260323c07b5f34c40ec75d82b39412 Mon Sep 17 00:00:00 2001
From: "Philip.Hazel"
Date: Fri, 31 Mar 2017 16:49:33 +0000
Subject: [PATCH] Documentation update
---
Makefile.am | 2 -
doc/html/index.html | 3 -
doc/html/pcre2compat.html | 64 ++++----
doc/html/pcre2jit.html | 6 +-
doc/html/pcre2limits.html | 12 +-
doc/html/pcre2perform.html | 121 +++++++++++----
doc/index.html.src | 3 -
doc/pcre2.txt | 311 +++++++++++++++++++++----------------
doc/pcre2perform.3 | 130 +++++++++++-----
doc/pcre2stack.3 | 212 -------------------------
10 files changed, 391 insertions(+), 473 deletions(-)
delete mode 100644 doc/pcre2stack.3
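
Note on the "COMPILED PATTERN MEMORY USAGE" text touched by this patch: the
documentation says that (abc|def){2,4} is compiled as if it were
(abc|def)(abc|def)((abc|def)(abc|def)?)?. The sketch below only illustrates
that the two forms accept the same strings. It uses Python's standard re
module as a stand-in rather than PCRE2 itself, and the helper names
(subjects, same_language) are invented for this note. It says nothing about
compiled size, which is the actual concern of that section.

```python
import re
from itertools import product

# The quantified pattern and the expanded form quoted in the documentation.
QUANTIFIED = re.compile(r"(abc|def){2,4}")
EXPANDED = re.compile(r"(abc|def)(abc|def)((abc|def)(abc|def)?)?")


def subjects(max_repeats=5):
    """Yield (n, s) for every concatenation s of n pieces "abc" or "def"."""
    for n in range(max_repeats + 1):
        for parts in product(("abc", "def"), repeat=n):
            yield n, "".join(parts)


def same_language(max_repeats=5):
    """Return True if both patterns fully match exactly the same subjects,
    namely those built from 2, 3 or 4 pieces."""
    for n, s in subjects(max_repeats):
        quantified_ok = QUANTIFIED.fullmatch(s) is not None
        expanded_ok = EXPANDED.fullmatch(s) is not None
        if quantified_ok != expanded_ok or quantified_ok != (2 <= n <= 4):
            return False
    return True


if __name__ == "__main__":
    print("patterns agree:", same_language())
```

The expanded form has four copies of the alternation, which is why large or
nested repeat counts inflate the compiled pattern in the way the text
describes.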
diff --git a/Makefile.am b/Makefile.am
index a370db2..fa57eeb 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -103,7 +103,6 @@ dist_html_DATA = \
doc/html/pcre2posix.html \
doc/html/pcre2sample.html \
doc/html/pcre2serialize.html \
- doc/html/pcre2stack.html \
doc/html/pcre2syntax.html \
doc/html/pcre2test.html \
doc/html/pcre2unicode.html
@@ -187,7 +186,6 @@ dist_man_MANS = \
doc/pcre2posix.3 \
doc/pcre2sample.3 \
doc/pcre2serialize.3 \
- doc/pcre2stack.3 \
doc/pcre2syntax.3 \
doc/pcre2test.1 \
doc/pcre2unicode.3
diff --git a/doc/html/index.html b/doc/html/index.html
index 3920426..3517671 100644
--- a/doc/html/index.html
+++ b/doc/html/index.html
@@ -68,9 +68,6 @@ first.
pcre2serialize |
Serializing functions for saving precompiled patterns |
-pcre2stack |
- Discussion of PCRE2's stack usage |
-
pcre2syntax |
Syntax quick-reference summary |
diff --git a/doc/html/pcre2compat.html b/doc/html/pcre2compat.html
index 993dfd1..b55ab82 100644
--- a/doc/html/pcre2compat.html
+++ b/doc/html/pcre2compat.html
@@ -18,7 +18,8 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
-versions 5.10 and above.
+version 5.24, but as both Perl and PCRE2 are continually changing, the
+information may sometimes be out of date.
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
@@ -27,17 +28,18 @@ have are given in the
page.
-2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
-do not mean what you might think. For example, (?!a){3} does not assert that
-the next three characters are not "a". It just asserts that the next character
-is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
-just once). Perl allows repeat quantifiers on other assertions such as \b, but
-these do not seem to have any use.
+2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
+they do not mean what you might think. For example, (?!a){3} does not assert
+that the next three characters are not "a". It just asserts that the next
+character is not "a" three times (in principle: PCRE2 optimizes this to run the
+assertion just once). Perl allows some repeat quantifiers on other assertions,
+for example, \b* (but not \b{3}), but these do not seem to have any use.
-3. Capturing subpatterns that occur inside negative lookahead assertions are
-counted, but their entries in the offsets vector are never set. Perl sometimes
-(but not always) sets its numerical variables from inside negative assertions.
+3. Capturing subpatterns that occur inside negative lookaround assertions are
+counted, but their entries in the offsets vector are set only if the assertion
+is a condition. Perl has changed its behaviour in this regard from time to
+time.
4. The following Perl escape sequences are not supported: \l, \u, \L,
@@ -50,13 +52,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
-built with Unicode support. The properties that can be tested with \p and \P
-are limited to the general category properties such as Lu and Nd, script names
-such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
-the Cs (surrogate) property, which Perl does not; the Perl documentation says
-"Because Perl hides the need for the user to understand the internal
-representation of Unicode characters, there is no need to implement the
-somewhat messy concept of surrogates."
+built with Unicode support (the default). The properties that can be tested
+with \p and \P are limited to the general category properties such as Lu and
+Nd, script names such as Greek or Han, and the derived properties Any and L&.
+PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
+documentation says "Because Perl hides the need for the user to understand the
+internal representation of Unicode characters, there is no need to implement
+the somewhat messy concept of surrogates."
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
@@ -75,23 +77,15 @@ The \Q...\E sequence is recognized both inside and outside character classes.
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
-constructions. However, there is support for recursive patterns. This is not
-available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
-feature allows an external function to be called during pattern matching. See
-the
+constructions. However, PCRE2 does have a "callout" feature, which
+allows an external function to be called during pattern matching. See the
pcre2callout
documentation for details.
-8. Subroutine calls (whether recursive or not) are treated as atomic groups.
-Atomic recursion is like Python, but unlike Perl. Captured values that are set
-outside a subroutine call can be referenced from inside in PCRE2, but not in
-Perl. There is a discussion that explains these differences in more detail in
-the
-section on recursion differences from Perl
-in the
-pcre2pattern
-page.
+8. Subroutine calls (whether recursive or not) were treated as atomic groups up
+to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
+into subroutine calls is now supported, as in Perl.
9. If any of the backtracking control verbs are used in a subpattern that is
@@ -147,14 +141,14 @@ certainly user mistakes.
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \p{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
-in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
+in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
letters, regardless of case, when case independence is specified.
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
-of which (such as named parentheses) have been in PCRE2 for some time. This
-list is with respect to Perl 5.10:
+of which (such as named parentheses) were in PCRE2 for some time before. This
+list is with respect to Perl 5.24:
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
@@ -220,9 +214,9 @@ Cambridge, England.
REVISION
-Last updated: 18 October 2016
+Last updated: 29 March 2017
-Copyright © 1997-2016 University of Cambridge.
+Copyright © 1997-2017 University of Cambridge.
Return to the PCRE2 index page.
diff --git a/doc/html/pcre2jit.html b/doc/html/pcre2jit.html
index 4a6d4ff..5eae042 100644
--- a/doc/html/pcre2jit.html
+++ b/doc/html/pcre2jit.html
@@ -173,7 +173,7 @@ below for a discussion of JIT stack usage.
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted
-are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
+are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
when JIT matching is used.
CONTROLLING THE JIT STACK
@@ -436,9 +436,9 @@ Cambridge, England.
REVISION
-Last updated: 05 June 2016
+Last updated: 30 March 2017
-Copyright © 1997-2016 University of Cambridge.
+Copyright © 1997-2017 University of Cambridge.
Return to the PCRE2 index page.
diff --git a/doc/html/pcre2limits.html b/doc/html/pcre2limits.html
index d7e382b..640fe3d 100644
--- a/doc/html/pcre2limits.html
+++ b/doc/html/pcre2limits.html
@@ -44,14 +44,6 @@ integer type, usually defined as size_t. Its maximum value (that is
and unset offsets.
-Note that when using the traditional matching function, PCRE2 uses recursion to
-handle subpatterns and indefinite repetition. This means that the available
-stack space may limit the size of a subject string that can be processed by
-certain patterns. For a discussion of stack issues, see the
-pcre2stack
-documentation.
-
-
All values in repeating quantifiers must be less than 65536.
@@ -94,9 +86,9 @@ Cambridge, England.
REVISION
-Last updated: 26 October 2016
+Last updated: 30 March 2017
-Copyright © 1997-2016 University of Cambridge.
+Copyright © 1997-2017 University of Cambridge.
Return to the PCRE2 index page.
diff --git a/doc/html/pcre2perform.html b/doc/html/pcre2perform.html
index ac9d23c..ad5d065 100644
--- a/doc/html/pcre2perform.html
+++ b/doc/html/pcre2perform.html
@@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
- PCRE2 PERFORMANCE
- COMPILED PATTERN MEMORY USAGE
-
- STACK USAGE AT RUN TIME
+
- STACK AND HEAP USAGE AT RUN TIME
- PROCESSING TIME
- AUTHOR
- REVISION
@@ -29,11 +29,11 @@ of them.
COMPILED PATTERN MEMORY USAGE
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
-so that most simple patterns do not use much memory. However, there is one case
-where the memory usage of a compiled pattern can be unexpectedly large. If a
-parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
-a limited maximum, the whole subpattern is repeated in the compiled code. For
-example, the pattern
+so that most simple patterns do not use much memory for storing the compiled
+version. However, there is one case where the memory usage of a compiled
+pattern can be unexpectedly large. If a parenthesized subpattern has a
+quantifier with a minimum greater than 1 and/or a limited maximum, the whole
+subpattern is repeated in the compiled code. For example, the pattern
(abc|def){2,4}
@@ -52,13 +52,13 @@ example, the very simple pattern
((ab){1,1000}c){1,3}
-uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
-with its default internal pointer size of two bytes, the size limit on a
-compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
-is reached with the above pattern if the outer repetition is increased from 3
-to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
-larger compiled patterns, but it is better to try to rewrite your pattern to
-use less memory if you can.
+uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
+compiled with its default internal pointer size of two bytes, the size limit on
+a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
+this is reached with the above pattern if the outer repetition is increased
+from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
+handle larger compiled patterns, but it is better to try to rewrite your
+pattern to use less memory if you can.
One way of reducing the memory usage for such patterns is to make use of
@@ -68,25 +68,33 @@ facility. Re-writing the above pattern as
((ab)(?2){0,999}c)(?1){0,2}
-reduces the memory requirements to 18K, and indeed it remains under 20K even
-with the outer repetition increased to 100. However, this pattern is not
-exactly equivalent, because the "subroutine" calls are treated as
-atomic groups
-into which there can be no backtracking if there is a subsequent matching
-failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
-Furthermore, there is a noticeable loss of speed when executing the modified
-pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
-speed is acceptable, this kind of rewriting will allow you to process patterns
-that PCRE2 cannot otherwise handle.
+reduces the memory requirements to around 16K, and indeed it remains under 20K
+even with the outer repetition increased to 100. However, this kind of pattern
+is not always exactly equivalent, because any captures within subroutine calls
+are lost when the subroutine completes. If this is not a problem, this kind of
+rewriting will allow you to process patterns that PCRE2 cannot otherwise
+handle. The matching performance of the two different versions of the pattern
+is roughly the same. (This applies from release 10.30 - things were different
+in earlier releases.)
-
STACK USAGE AT RUN TIME
+
STACK AND HEAP USAGE AT RUN TIME
-When pcre2_match() is used for matching, certain kinds of pattern can
-cause it to use large amounts of the process stack. In some environments the
-default process stack is quite small, and if it runs out the result is often
-SIGSEGV. Rewriting your pattern can often help. The
-pcre2stack
-documentation discusses this issue in detail.
+From release 10.30, the interpretive (non-JIT) version of pcre2_match()
+uses very little system stack at run time. In earlier releases recursive
+function calls could use a great deal of stack, and this could cause problems,
+but this usage has been eliminated. Backtracking positions are now explicitly
+remembered in memory frames controlled by the code. An initial 10K vector of
+frames is allocated on the system stack (enough for about 50 frames for small
+patterns), but if this is insufficient, heap memory is used. Rewriting patterns
+to be time-efficient, as described below, may also reduce the memory
+requirements.
+
+
+In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
+function calls, but only for processing atomic groups, lookaround assertions,
+and recursion within the pattern. Too much nested recursion may cause stack
+issues. The "match depth" parameter can be used to limit the depth of function
+recursion in pcre2_dfa_match().
PROCESSING TIME
@@ -175,7 +183,54 @@ appreciable time with strings longer than about 20 characters.
In many cases, the solution to this kind of performance issue is to use an
-atomic group or a possessive quantifier.
+atomic group or a possessive quantifier. This can often reduce memory
+requirements as well. As another example, consider this pattern:
+
+ ([^<]|<(?!inet))+
+
+It matches from wherever it starts until it encounters "<inet" or the end of
+the data, and is the kind of pattern that might be used when processing an XML
+file. Each iteration of the outer parentheses matches either one character that
+is not "<" or a "<" that is not followed by "inet". However, each time a
+parenthesis is processed, a backtracking position is passed, so this
+formulation uses a memory frame for each matched character. For a long string,
+a lot of memory is required. Consider now this rewritten pattern, which matches
+exactly the same strings:
+
+ ([^<]++|<(?!inet))+
+
+This runs much faster, because sequences of characters that do not contain "<"
+are "swallowed" in one item inside the parentheses, and a possessive quantifier
+is used to stop any backtracking into the runs of non-"<" characters. This
+version also uses a lot less memory because entry to a new set of parentheses
+happens only when a "<" character that is not followed by "inet" is encountered
+(and we assume this is relatively rare).
+
+
+This example shows that one way of optimizing performance when matching long
+subject strings is to write repeated parenthesized subpatterns to match more
+than one character whenever possible.
+
+
+SETTING RESOURCE LIMITS
+
+
+You can set limits on the amount of processing that takes place when matching,
+and on the amount of heap memory that is used. The default values of the limits
+are very large, and unlikely ever to operate. They can be changed when PCRE2 is
+built, and they can also be set when pcre2_match() or
+pcre2_dfa_match() is called. For details of these interfaces, see the
+pcre2build
+documentation and the section entitled
+"The match context"
+in the
+pcre2api
+documentation.
+
+
+The pcre2test test program has a modifier called "find_limits" which, if
+applied to a subject line, causes it to find the smallest limits that allow a
+pattern to match. This is done by repeatedly matching with different limits.
AUTHOR
@@ -188,9 +243,9 @@ Cambridge, England.
REVISION
-Last updated: 02 January 2015
+Last updated: 31 March 2017
-Copyright © 1997-2015 University of Cambridge.
+Copyright © 1997-2017 University of Cambridge.
Return to the PCRE2 index page.
diff --git a/doc/index.html.src b/doc/index.html.src
index 3920426..3517671 100644
--- a/doc/index.html.src
+++ b/doc/index.html.src
@@ -68,9 +68,6 @@ first.
pcre2serialize |
Serializing functions for saving precompiled patterns |
-pcre2stack |
- Discussion of PCRE2's stack usage |
-
pcre2syntax |
Syntax quick-reference summary |
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 17070e2..6237f74 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -4097,45 +4097,46 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
This document describes the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with
- respect to Perl versions 5.10 and above.
+ respect to Perl version 5.24, but as both Perl and PCRE2 are continually
+ changing, the information may sometimes be out of date.
- 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
+ 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
it does have are given in the pcre2unicode page.
- 2. PCRE2 allows repeat quantifiers only on parenthesized assertions,
- but they do not mean what you might think. For example, (?!a){3} does
- not assert that the next three characters are not "a". It just asserts
- that the next character is not "a" three times (in principle: PCRE2
- optimizes this to run the assertion just once). Perl allows repeat
- quantifiers on other assertions such as \b, but these do not seem to
- have any use.
+ 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
+ tions, but they do not mean what you might think. For example, (?!a){3}
+ does not assert that the next three characters are not "a". It just
+ asserts that the next character is not "a" three times (in principle:
+ PCRE2 optimizes this to run the assertion just once). Perl allows some
+ repeat quantifiers on other assertions, for example, \b* (but not
+ \b{3}), but these do not seem to have any use.
- 3. Capturing subpatterns that occur inside negative lookahead asser-
- tions are counted, but their entries in the offsets vector are never
- set. Perl sometimes (but not always) sets its numerical variables from
- inside negative assertions.
+ 3. Capturing subpatterns that occur inside negative lookaround asser-
+ tions are counted, but their entries in the offsets vector are set only
+ if the assertion is a condition. Perl has changed its behaviour in this
+ regard from time to time.
- 4. The following Perl escape sequences are not supported: \l, \u, \L,
- \U, and \N when followed by a character name or Unicode value. (\N on
+ 4. The following Perl escape sequences are not supported: \l, \u, \L,
+ \U, and \N when followed by a character name or Unicode value. (\N on
its own, matching a non-newline character, is supported.) In fact these
- are implemented by Perl's general string-handling and are not part of
- its pattern matching engine. If any of these are encountered by PCRE2,
+ are implemented by Perl's general string-handling and are not part of
+ its pattern matching engine. If any of these are encountered by PCRE2,
an error is generated by default. However, if the PCRE2_ALT_BSUX option
is set, \U and \u are interpreted as ECMAScript interprets them.
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
- is built with Unicode support. The properties that can be tested with
- \p and \P are limited to the general category properties such as Lu and
- Nd, script names such as Greek or Han, and the derived properties Any
- and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
- not; the Perl documentation says "Because Perl hides the need for the
- user to understand the internal representation of Unicode characters,
- there is no need to implement the somewhat messy concept of surro-
- gates."
+ is built with Unicode support (the default). The properties that can be
+ tested with \p and \P are limited to the general category properties
+ such as Lu and Nd, script names such as Greek or Han, and the derived
+ properties Any and L&. PCRE2 does support the Cs (surrogate) property,
+ which Perl does not; the Perl documentation says "Because Perl hides
+ the need for the user to understand the internal representation of Uni-
+ code characters, there is no need to implement the somewhat messy con-
+ cept of surrogates."
- 6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
- acters in between are treated as literals. This is slightly different
- from Perl in that $ and @ are also handled as literals inside the
+ 6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
+ acters in between are treated as literals. This is slightly different
+ from Perl in that $ and @ are also handled as literals inside the
quotes. In Perl, they cause variable interpolation (but of course PCRE2
does not have variables). Note the following examples:
@@ -4146,22 +4147,17 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
- The \Q...\E sequence is recognized both inside and outside character
+ The \Q...\E sequence is recognized both inside and outside character
classes.
- 7. Fairly obviously, PCRE2 does not support the (?{code}) and
- (??{code}) constructions. However, there is support for recursive pat-
- terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
- the PCRE2 "callout" feature allows an external function to be called
- during pattern matching. See the pcre2callout documentation for
- details.
+ 7. Fairly obviously, PCRE2 does not support the (?{code}) and
+ (??{code}) constructions. However, PCRE2 does have a "callout"
+ feature, which allows an external function to be called during pattern
+ matching. See the pcre2callout documentation for details.
- 8. Subroutine calls (whether recursive or not) are treated as atomic
- groups. Atomic recursion is like Python, but unlike Perl. Captured
- values that are set outside a subroutine call can be referenced from
- inside in PCRE2, but not in Perl. There is a discussion that explains
- these differences in more detail in the section on recursion differ-
- ences from Perl in the pcre2pattern page.
+ 8. Subroutine calls (whether recursive or not) were treated as atomic
+ groups up to PCRE2 release 10.23, but from release 10.30 this changed,
+ and backtracking into subroutine calls is now supported, as in Perl.
9. If any of the backtracking control verbs are used in a subpattern
that is called as a subroutine (whether or not recursively), their
@@ -4211,14 +4207,14 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
16. In PCRE2, the upper/lower case character properties Lu and Ll are
not affected when case-independent matching is specified. For example,
\p{Lu} always matches an upper case letter. I think Perl has changed in
- this respect; in the release at the time of writing (5.16), \p{Lu} and
+ this respect; in the release at the time of writing (5.24), \p{Lu} and
\p{Ll} match all letters, regardless of case, when case independence is
specified.
17. PCRE2 provides some extensions to the Perl regular expression
facilities. Perl 5.10 includes new features that are not in earlier
- versions of Perl, some of which (such as named parentheses) have been
- in PCRE2 for some time. This list is with respect to Perl 5.10:
+ versions of Perl, some of which (such as named parentheses) were in
+ PCRE2 for some time before. This list is with respect to Perl 5.24:
(a) Although lookbehind assertions in PCRE2 must match fixed length
strings, each alternative branch of a lookbehind assertion can match a
@@ -4271,8 +4267,8 @@ AUTHOR
REVISION
- Last updated: 18 October 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 29 March 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -4420,8 +4416,8 @@ RETURN VALUES FROM JIT MATCHING
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
searching a very large pattern tree goes on for too long, as it is in
the same circumstance when JIT is not used, but the details of exactly
- what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
- code is never returned when JIT matching is used.
+ what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
+ is never returned when JIT matching is used.
CONTROLLING THE JIT STACK
@@ -4668,8 +4664,8 @@ AUTHOR
REVISION
- Last updated: 05 June 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 30 March 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -4706,12 +4702,6 @@ SIZE AND OTHER LIMITATIONS
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
terminated strings and unset offsets.
- Note that when using the traditional matching function, PCRE2 uses
- recursion to handle subpatterns and indefinite repetition. This means
- that the available stack space may limit the size of a subject string
- that can be processed by certain patterns. For a discussion of stack
- issues, see the pcre2stack documentation.
-
All values in repeating quantifiers must be less than 65536.
The maximum length of a lookbehind assertion is 65535 characters.
@@ -4745,8 +4735,8 @@ AUTHOR
REVISION
- Last updated: 26 October 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 30 March 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -8485,11 +8475,12 @@ PCRE2 PERFORMANCE
COMPILED PATTERN MEMORY USAGE
Patterns are compiled by PCRE2 into a reasonably efficient interpretive
- code, so that most simple patterns do not use much memory. However,
- there is one case where the memory usage of a compiled pattern can be
- unexpectedly large. If a parenthesized subpattern has a quantifier with
- a minimum greater than 1 and/or a limited maximum, the whole subpattern
- is repeated in the compiled code. For example, the pattern
+ code, so that most simple patterns do not use much memory for storing
+ the compiled version. However, there is one case where the memory usage
+ of a compiled pattern can be unexpectedly large. If a parenthesized
+ subpattern has a quantifier with a minimum greater than 1 and/or a lim-
+ ited maximum, the whole subpattern is repeated in the compiled code.
+ For example, the pattern
(abc|def){2,4}
@@ -8497,134 +8488,186 @@ COMPILED PATTERN MEMORY USAGE
(abc|def)(abc|def)((abc|def)(abc|def)?)?
- (Technical aside: It is done this way so that backtrack points within
+ (Technical aside: It is done this way so that backtrack points within
each of the repetitions can be independently maintained.)
- For regular expressions whose quantifiers use only small numbers, this
- is not usually a problem. However, if the numbers are large, and par-
- ticularly if such repetitions are nested, the memory usage can become
+ For regular expressions whose quantifiers use only small numbers, this
+ is not usually a problem. However, if the numbers are large, and par-
+ ticularly if such repetitions are nested, the memory usage can become
an embarrassment. For example, the very simple pattern
((ab){1,1000}c){1,3}
- uses 51K bytes when compiled using the 8-bit library. When PCRE2 is
- compiled with its default internal pointer size of two bytes, the size
- limit on a compiled pattern is 64K code units in the 8-bit and 16-bit
- libraries, and this is reached with the above pattern if the outer rep-
- etition is increased from 3 to 4. PCRE2 can be compiled to use larger
- internal pointers and thus handle larger compiled patterns, but it is
- better to try to rewrite your pattern to use less memory if you can.
+ uses over 50K bytes when compiled using the 8-bit library. When PCRE2
+ is compiled with its default internal pointer size of two bytes, the
+ size limit on a compiled pattern is 64K code units in the 8-bit and
+ 16-bit libraries, and this is reached with the above pattern if the
+ outer repetition is increased from 3 to 4. PCRE2 can be compiled to use
+ larger internal pointers and thus handle larger compiled patterns, but
+ it is better to try to rewrite your pattern to use less memory if you
+ can.
One way of reducing the memory usage for such patterns is to make use
of PCRE2's "subroutine" facility. Re-writing the above pattern as
((ab)(?2){0,999}c)(?1){0,2}
- reduces the memory requirements to 18K, and indeed it remains under 20K
- even with the outer repetition increased to 100. However, this pattern
- is not exactly equivalent, because the "subroutine" calls are treated
- as atomic groups into which there can be no backtracking if there is a
- subsequent matching failure. Therefore, PCRE2 cannot do this kind of
- rewriting automatically. Furthermore, there is a noticeable loss of
- speed when executing the modified pattern. Nevertheless, if the atomic
- grouping is not a problem and the loss of speed is acceptable, this
- kind of rewriting will allow you to process patterns that PCRE2 cannot
- otherwise handle.
+ reduces the memory requirements to around 16K, and indeed it remains
+ under 20K even with the outer repetition increased to 100. However,
+ this kind of pattern is not always exactly equivalent, because any cap-
+ tures within subroutine calls are lost when the subroutine completes.
+ If this is not a problem, this kind of rewriting will allow you to
+ process patterns that PCRE2 cannot otherwise handle. The matching
+ performance of the two different versions of the pattern is roughly the
+ same. (This applies from release 10.30 - things were different in ear-
+ lier releases.)
-STACK USAGE AT RUN TIME
+STACK AND HEAP USAGE AT RUN TIME
- When pcre2_match() is used for matching, certain kinds of pattern can
- cause it to use large amounts of the process stack. In some environ-
- ments the default process stack is quite small, and if it runs out the
- result is often SIGSEGV. Rewriting your pattern can often help. The
- pcre2stack documentation discusses this issue in detail.
+ From release 10.30, the interpretive (non-JIT) version of pcre2_match()
+ uses very little system stack at run time. In earlier releases recur-
+ sive function calls could use a great deal of stack, and this could
+ cause problems, but this usage has been eliminated. Backtracking posi-
+ tions are now explicitly remembered in memory frames controlled by the
+ code. An initial 10K vector of frames is allocated on the system stack
+ (enough for about 50 frames for small patterns), but if this is insuf-
+ ficient, heap memory is used. Rewriting patterns to be time-efficient,
+ as described below, may also reduce the memory requirements.
+
+ In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
+ function calls, but only for processing atomic groups, lookaround
+ assertions, and recursion within the pattern. Too much nested recursion
+ may cause stack issues. The "match depth" parameter can be used to
+ limit the depth of function recursion in pcre2_dfa_match().
PROCESSING TIME
- Certain items in regular expression patterns are processed more effi-
+ Certain items in regular expression patterns are processed more effi-
ciently than others. It is more efficient to use a character class like
- [aeiou] than a set of single-character alternatives such as
- (a|e|i|o|u). In general, the simplest construction that provides the
+ [aeiou] than a set of single-character alternatives such as
+ (a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book
- contains a lot of useful general discussion about optimizing regular
- expressions for efficient performance. This document contains a few
+ contains a lot of useful general discussion about optimizing regular
+ expressions for efficient performance. This document contains a few
observations about PCRE2.
- Using Unicode character properties (the \p, \P, and \X escapes) is
- slow, because PCRE2 has to use a multi-stage table lookup whenever it
- needs a character's property. If you can find an alternative pattern
+ Using Unicode character properties (the \p, \P, and \X escapes) is
+ slow, because PCRE2 has to use a multi-stage table lookup whenever it
+ needs a character's property. If you can find an alternative pattern
that does not use character properties, it will probably be faster.
- By default, the escape sequences \b, \d, \s, and \w, and the POSIX
- character classes such as [:alpha:] do not use Unicode properties,
+ By default, the escape sequences \b, \d, \s, and \w, and the POSIX
+ character classes such as [:alpha:] do not use Unicode properties,
partly for backwards compatibility, and partly for performance reasons.
- However, you can set the PCRE2_UCP option or start the pattern with
- (*UCP) if you want Unicode character properties to be used. This can
- double the matching time for items such as \d, when matched with
- pcre2_match(); the performance loss is less with a DFA matching func-
+ However, you can set the PCRE2_UCP option or start the pattern with
+ (*UCP) if you want Unicode character properties to be used. This can
+ double the matching time for items such as \d, when matched with
+ pcre2_match(); the performance loss is less with a DFA matching func-
tion, and in both cases there is not much difference for \b.
- When a pattern begins with .* not in atomic parentheses, nor in paren-
- theses that are the subject of a backreference, and the PCRE2_DOTALL
- option is set, the pattern is implicitly anchored by PCRE2, since it
- can match only at the start of a subject string. If the pattern has
+ When a pattern begins with .* not in atomic parentheses, nor in paren-
+ theses that are the subject of a backreference, and the PCRE2_DOTALL
+ option is set, the pattern is implicitly anchored by PCRE2, since it
+ can match only at the start of a subject string. If the pattern has
multiple top-level branches, they must all be anchorable. The optimiza-
- tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
+ tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
- If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
+ If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
because the dot metacharacter does not then match a newline, and if the
- subject string contains newlines, the pattern may match from the char-
+ subject string contains newlines, the pattern may match from the char-
acter immediately following one of them instead of from the very start.
For example, the pattern
.*second
- matches the subject "first\nand second" (where \n stands for a newline
- character), with the match starting at the seventh character. In order
- to do this, PCRE2 has to retry the match starting after every newline
+ matches the subject "first\nand second" (where \n stands for a newline
+ character), with the match starting at the seventh character. In order
+ to do this, PCRE2 has to retry the match starting after every newline
in the subject.
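
The retry positions described above can be located with a simple scan for newlines. The function below is a simplified sketch, not the code PCRE2 uses, and its name is invented for the example. It returns the offset just after the next newline, which is the next place where a pattern starting with .* could begin to match when PCRE2_DOTALL is not set.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the character just after the next newline at or
   beyond `from`, or -1 if there is no further newline in the subject. */
long next_start_after_newline(const char *subject, size_t length, size_t from)
{
    const char *newline;

    if (from >= length) return -1;
    newline = memchr(subject + from, '\n', length - from);
    if (newline == NULL) return -1;
    return (long)(newline - subject) + 1;
}
```

For the subject "first\nand second" (16 characters), a call with from set to 0 returns offset 6, which is the seventh character, as in the example above. A subject with no newlines gives -1 at once, but only after the whole subject has been scanned, which is the cost that explicit anchoring avoids.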
- If you are using such a pattern with subject strings that do not con-
- tain newlines, the best performance is obtained by setting
- PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
+ If you are using such a pattern with subject strings that do not con-
+ tain newlines, the best performance is obtained by setting
+ PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
explicit anchoring. That saves PCRE2 from having to scan along the sub-
ject looking for a newline to restart at.
- Beware of patterns that contain nested indefinite repeats. These can
- take a long time to run when applied to a string that does not match.
+ Beware of patterns that contain nested indefinite repeats. These can
+ take a long time to run when applied to a string that does not match.
Consider the pattern fragment
^(a+)*
- This can match "aaaa" in 16 different ways, and this number increases
- very rapidly as the string gets longer. (The * repeat can match 0, 1,
- 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
- repeats can match different numbers of times.) When the remainder of
- the pattern is such that the entire match is going to fail, PCRE2 has
- in principle to try every possible variation, and this can take an
+ This can match "aaaa" in 16 different ways, and this number increases
+ very rapidly as the string gets longer. (The * repeat can match 0, 1,
+ 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
+ repeats can match different numbers of times.) When the remainder of
+ the pattern is such that the entire match is going to fail, PCRE2 has
+ in principle to try every possible variation, and this can take an
extremely long time, even for relatively short strings.
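
The figure of 16 can be checked by brute-force counting. The sketch below does not use PCRE2 at all; it counts the ways in which a prefix of a string of n "a" characters can be divided into consecutive non-empty runs, one run for each iteration of the outer * repeat. The function names are invented for the example.

```c
#include <assert.h>

/* Number of ways to divide exactly k characters into an ordered sequence
   of non-empty runs, each run being one match of a+. */
unsigned long splits_of(int k)
{
    unsigned long total = 0;

    if (k == 0) return 1;                /* zero iterations: one way */
    for (int first = 1; first <= k; first++)
        total += splits_of(k - first);   /* the first a+ takes `first` chars */
    return total;
}

/* ^(a+)* may stop after any prefix, so sum over prefix lengths 0 to n. */
unsigned long ways_to_match(int n)
{
    unsigned long total = 0;

    assert(n >= 0);
    for (int k = 0; k <= n; k++) total += splits_of(k);
    return total;
}
```

ways_to_match(4) gives 16, and each additional character doubles the count, so a string of 20 characters already allows more than a million variations.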
An optimization catches some of the more simple cases such as
(a+)*b
- where a literal character follows. Before embarking on the standard
- matching procedure, PCRE2 checks that there is a "b" later in the sub-
- ject string, and if there is not, it fails the match immediately. How-
- ever, when there is no following literal this optimization cannot be
+ where a literal character follows. Before embarking on the standard
+ matching procedure, PCRE2 checks that there is a "b" later in the sub-
+ ject string, and if there is not, it fails the match immediately. How-
+ ever, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of
(a+)*\d
- with the pattern above. The former gives a failure almost instantly
- when applied to a whole line of "a" characters, whereas the latter
+ with the pattern above. The former gives a failure almost instantly
+ when applied to a whole line of "a" characters, whereas the latter
takes an appreciable time with strings longer than about 20 characters.
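
The effect of the "required literal" check can be imitated with a single scan of the subject before any matching is attempted. This is a simplified sketch under the assumption that the pattern needs one known literal character; the function name is invented, and PCRE2's own implementation is more elaborate.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the match can be failed immediately because the required
   literal does not occur in the subject, or 0 if matching must proceed. */
int can_fail_early(const char *subject, size_t length, char required)
{
    return memchr(subject, required, length) == NULL;
}
```

For (a+)*b and a subject consisting only of "a" characters, can_fail_early() with 'b' as the required character returns 1 after one linear pass. No such shortcut exists for (a+)*\d, because \d is not a single literal character.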
In many cases, the solution to this kind of performance issue is to use
- an atomic group or a possessive quantifier.
+ an atomic group or a possessive quantifier. This can often reduce mem-
+ ory requirements as well. As another example, consider this pattern:
+
+ ([^<]|<(?!inet))+
+
+ It matches from wherever it starts until it encounters "
-.\"
-atomic groups
-.\"
-into which there can be no backtracking if there is a subsequent matching
-failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
-Furthermore, there is a noticeable loss of speed when executing the modified
-pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
-speed is acceptable, this kind of rewriting will allow you to process patterns
-that PCRE2 cannot otherwise handle.
+reduces the memory requirements to around 16K, and indeed it remains under 20K
+even with the outer repetition increased to 100. However, this kind of pattern
+is not always exactly equivalent, because any captures within subroutine calls
+are lost when the subroutine completes. If this is not a problem, this kind of
+rewriting will allow you to process patterns that PCRE2 cannot otherwise
+handle. The matching performance of the two different versions of the pattern
+is roughly the same. (This applies from release 10.30 - things were different
+in earlier releases.)
.
.
-.SH "STACK USAGE AT RUN TIME"
+.SH "STACK AND HEAP USAGE AT RUN TIME"
.rs
.sp
-When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
-cause it to use large amounts of the process stack. In some environments the
-default process stack is quite small, and if it runs out the result is often
-SIGSEGV. Rewriting your pattern can often help. The
-.\" HREF
-\fBpcre2stack\fP
-.\"
-documentation discusses this issue in detail.
+From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
+uses very little system stack at run time. In earlier releases recursive
+function calls could use a great deal of stack, and this could cause problems,
+but this usage has been eliminated. Backtracking positions are now explicitly
+remembered in memory frames controlled by the code. An initial 10K vector of
+frames is allocated on the system stack (enough for about 50 frames for small
+patterns), but if this is insufficient, heap memory is used. Rewriting patterns
+to be time-efficient, as described below, may also reduce the memory
+requirements.
+.P
+In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
+function calls, but only for processing atomic groups, lookaround assertions,
+and recursion within the pattern. Too much nested recursion may cause stack
+issues. The "match depth" parameter can be used to limit the depth of function
+recursion in \fBpcre2_dfa_match()\fP.
.
.
.SH "PROCESSING TIME"
@@ -160,7 +162,59 @@ applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters.
.P
In many cases, the solution to this kind of performance issue is to use an
-atomic group or a possessive quantifier.
+atomic group or a possessive quantifier. This can often reduce memory
+requirements as well. As another example, consider this pattern:
+.sp
+ ([^<]|<(?!inet))+
+.sp
+It matches from wherever it starts until it encounters "
+.\"
+"The match context"
+.\"
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.P
+The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
+applied to a subject line, causes it to find the smallest limits that allow a
+pattern to match. This is done by repeatedly matching with different limits.
.
.
.SH AUTHOR
@@ -177,6 +231,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 02 January 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 31 March 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2stack.3 b/doc/pcre2stack.3
deleted file mode 100644
index 89d101b..0000000
--- a/doc/pcre2stack.3
+++ /dev/null
@@ -1,212 +0,0 @@
-.TH PCRE2STACK 3 "23 December 2016" "PCRE2 10.23"
-.SH NAME
-PCRE2 - Perl-compatible regular expressions (revised API)
-.SH "PCRE2 DISCUSSION OF STACK USAGE"
-.rs
-.sp
-When you call \fBpcre2_match()\fP, it makes use of an internal function called
-\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
-in order to remember the state of the match so that it can back up and try a
-different alternative after a failure. As matching proceeds deeper and deeper
-into the tree of possibilities, the recursion depth increases. The
-\fBmatch()\fP function is also called in other circumstances, for example,
-whenever a parenthesized sub-pattern is entered, and in certain cases of
-repetition.
-.P
-Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
-as a* it may be called several times at the same level, after matching
-different numbers of a's. Furthermore, in a number of cases where the result of
-the recursive call would immediately be passed back as the result of the
-current call (a "tail recursion"), the function is just restarted instead.
-.P
-Each time the internal \fBmatch()\fP function is called recursively, it uses
-memory from the process stack. For certain kinds of pattern and data, very
-large amounts of stack may be needed, despite the recognition of "tail
-recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
-of the GCC compiler, the stack requirements are greatly increased.
-.P
-The above comments apply when \fBpcre2_match()\fP is run in its normal
-interpretive manner. If the compiled pattern was processed by
-\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
-options passed to \fBpcre2_match()\fP were not incompatible, the matching
-process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
-this case, the memory requirements are handled entirely differently. See the
-.\" HREF
-\fBpcre2jit\fP
-.\"
-documentation for details.
-.P
-The \fBpcre2_dfa_match()\fP function operates in a different way to
-\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
-recursion or subroutine call in the pattern. This includes the processing of
-assertion and "once-only" subpatterns, which are handled like subroutine calls.
-Normally, these are never very deep, and the limit on the complexity of
-\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
-However, it is possible to write patterns with runaway infinite recursions;
-such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack unless a
-limit is applied (see below).
-.P
-The comments in the next three sections do not apply to
-\fBpcre2_dfa_match()\fP; they are relevant only for \fBpcre2_match()\fP without
-the JIT optimization.
-.
-.
-.SS "Reducing \fBpcre2_match()\fP's stack usage"
-.rs
-.sp
-You can often reduce the amount of recursion, and therefore the
-amount of stack used, by modifying the pattern that is being matched. Consider,
-for example, this pattern:
-.sp
- ([^<]|<(?!inet))+
-.sp
-It matches from wherever it starts until it encounters "
-.\"
-"The match context"
-.\"
-in the
-.\" HREF
-\fBpcre2api\fP
-.\"
-documentation. Since the block sizes are always the same, it may be possible to
-implement a customized memory handler that is more efficient than the standard
-function. The memory blocks obtained for this purpose are retained and re-used
-if possible while \fBpcre2_match()\fP is running. They are all freed just
-before it exits.
-.
-.
-.SS "Limiting \fBpcre2_match()\fP's stack usage"
-.rs
-.sp
-You can set limits on the number of times the internal \fBmatch()\fP function
-is called, both in total and recursively. If a limit is exceeded,
-\fBpcre2_match()\fP returns an error code. Setting suitable limits should
-prevent it from running out of stack. The default values of the limits are very
-large, and unlikely ever to operate. They can be changed when PCRE2 is built,
-and they can also be set when \fBpcre2_match()\fP is called. For details of
-these interfaces, see the
-.\" HREF
-\fBpcre2build\fP
-.\"
-documentation and the section entitled
-.\" HTML
-.\"
-"The match context"
-.\"
-in the
-.\" HREF
-\fBpcre2api\fP
-.\"
-documentation.
-.P
-As a very rough rule of thumb, you should reckon on about 500 bytes per
-recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
-the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
-around 128000 recursions.
-.P
-The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
-applied to a subject line, causes it to find the smallest limits that allow a a
-pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
-different limits.
-.
-.
-.SS "Limiting \fBpcre2_dfa_match()\fP's stack usage"
-.rs
-.sp
-The recursion limit, as described above for \fBpcre2_match()\fP, also applies
-to \fBpcre2_dfa_match()\fP, whose use of recursive function calls for
-recursions in the pattern can lead to runaway stack usage. The non-recursive
-match limit is not relevant for DFA matching, and is ignored.
-.
-.
-.SS "Changing stack size in Unix-like systems"
-.rs
-.sp
-In Unix-like environments, there is not often a problem with the stack unless
-very long strings are involved, though the default limit on stack size varies
-from system to system. Values from 8Mb to 64Mb are common. You can find your
-default limit by running the command:
-.sp
- ulimit -s
-.sp
-Unfortunately, the effect of running out of stack is often SIGSEGV, though
-sometimes a more explicit error message is given. You can normally increase the
-limit on stack size by code such as this:
-.sp
- struct rlimit rlim;
- getrlimit(RLIMIT_STACK, &rlim);
- rlim.rlim_cur = 100*1024*1024;
- setrlimit(RLIMIT_STACK, &rlim);
-.sp
-This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
-attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
-do this before calling \fBpcre2_match()\fP.
-.
-.
-.SS "Changing stack size in Mac OS X"
-.rs
-.sp
-Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
-is also possible to set a stack size when linking a program. There is a
-discussion about stack sizes in Mac OS X at this web site:
-.\" HTML
-.\"
-http://developer.apple.com/qa/qa2005/qa1419.html.
-.\"
-.
-.
-.SH AUTHOR
-.rs
-.sp
-.nf
-Philip Hazel
-University Computing Service
-Cambridge, England.
-.fi
-.
-.
-.SH REVISION
-.rs
-.sp
-.nf
-Last updated: 23 December 2016
-Copyright (c) 1997-2016 University of Cambridge.
-.fi