Documentation update.

2017-03-17 16:55:58 +00:00 · 2017-03-17 16:55:58 +00:00 · 73735b81a3
parent b6b716b540
commit 73735b81a3
3 changed files with 79 additions and 89 deletions
--- a/53
+++ b/53
@ -88,10 +88,10 @@ I had a flash of inspiration as to how I could run the real compile function in
 a "fake" mode that enables it to compute how much memory it would need, while
 in most cases only ever using a small amount of working memory, and without too
 many tests of the mode that might slow it down. So I refactored the compiling
-functions to work this way. This got rid of about 600 lines of source. It
-should make future maintenance and development easier. As this was such a major
-change, I never released 6.8, instead upping the number to 7.0 (other quite
-major changes were also present in the 7.0 release).
+functions to work this way. This got rid of about 600 lines of source and made
+further maintenance and development easier. As this was such a major change, I
+never released 6.8, instead upping the number to 7.0 (other quite major changes
+were also present in the 7.0 release).

 A side effect of this work was that the previous limit of 200 on the nesting
 depth of parentheses was removed. However, there was a downside: compiling ran
@ -122,7 +122,7 @@ all the named subpatterns and their corresponding group numbers. This means
 that the actual compile (both the memory-computing dummy run and the real
 compile) has full knowledge of group names and numbers throughout. Several
 dozen lines of messy code were eliminated, though the new pre-pass was not
-short. In particular, parsing and skipping over [] classes is complicated.
+short. In particular, parsing and skipping over [] classes was complicated.

 While working on 10.22 I realized that I could simplify yet again by moving
 more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@ -162,7 +162,7 @@ simpler than before.
 Most errors can be diagnosed during the parsing scan. For those that cannot
 (for example, "lookbehind assertion is not fixed length"), the parsed code
 contains offsets into the pattern so that the actual compiling code can
-identify where errors occur.
+report where errors are.


 The elements of the parsed pattern vector
@ -217,10 +217,10 @@ The following have data in the lower 16 bits, and may be followed by other data
 elements:

 META_ALT              | alternation
-META_BACKREF
-META_CAPTURE
-META_ESCAPE
-META_RECURSE
+META_BACKREF          back reference
+META_CAPTURE          start of capturing group
+META_ESCAPE           non-literal escape sequence
+META_RECURSE          recursion call

 If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
 is the length of its branch, for which OP_REVERSE must be generated.
@ -232,8 +232,8 @@ META_BACKREF is followed by an offset if the back reference group number is 10
 or more. The offsets of the first occurrences of references to groups whose
 numbers are less than 10 are put in cb->small_ref_offset[] (only the first
 occurrence is useful). On 64-bit systems this avoids using more than two parsed
-pattern elements for items such as \3. The offset is used when an error is
-given for a reference to a non-existent group.
+pattern elements for items such as \3. The offset is used when an error occurs
+because the reference is to a non-existent group.

 META_RECURSE is always followed by an offset, for use in error messages.

@ -286,7 +286,7 @@ group; this is used when generating OP_REVERSE for that branch.
 META_LOOKBEHIND       (?<=
 META_LOOKBEHINDNOT    (?<!

-The following are followed by two values, the minimum and maximum. Repeat
+The following are followed by two elements, the minimum and maximum. Repeat
 values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
 represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:

@ -369,7 +369,7 @@ unit, the most significant unit is first.

 In this description, we assume the "normal" compilation options. Data values
 that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
-(most significant byte first), or one code unit in 16-bit and 32-bit modes.
+(most significant byte first), and one code unit in 16-bit and 32-bit modes.


 Opcodes with no following data
@ -409,7 +409,7 @@ These items are all just one unit long
  OP_ACCEPT              ) These are Perl 5.10's "backtracking control
  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
  OP_FAIL                ) parentheses, it may be preceded by one or more
-  OP_PRUNE               ) OP_CLOSE, each followed by a count that
+  OP_PRUNE               ) OP_CLOSE, each followed by a number that
  OP_SKIP                ) indicates which parentheses must be closed.
  OP_THEN                )

@ -679,7 +679,7 @@ Once-only (atomic) groups

 These are just like other subpatterns, but they start with the opcode OP_ONCE.
 The check for matching an empty string in an unbounded repeat is handled
-entirely at runtime, so there are just this one opcode for atomic groups.
+entirely at runtime, so there is just this one opcode for atomic groups.


 Assertions
@ -742,14 +742,21 @@ Recursion
 Recursion either matches the current pattern, or some subexpression. The opcode
 OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
 bracket from the start of the whole pattern. OP_RECURSE is also used for
-"subroutine" calls, even though they are not strictly a recursion. Repeated
-recursions are automatically wrapped inside OP_ONCE brackets, because otherwise
-some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE
-brackets, but it is nevertheless still treated as an atomic group.
+"subroutine" calls, even though they are not strictly a recursion. Up till
+release 10.30 recursions were treated as atomic groups, making them
+incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
+backtracking into recursions is supported.
+
+Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
+forced no backtracking, but also allowed repetition to be handled as for other
+bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
+their minimum repetitions, and then wrapped in non-capturing brackets for the
+remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
+treated as (?1)(?1)(?:(?1)){0,2}.


-Callout
-------
+Callouts
+--------

 A callout can nowadays have either a numerical argument or a string argument.
 These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
@ -787,4 +794,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
 correct length, in order to catch updating errors.

 Philip Hazel
-March 2017
+17 March 2017
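
The rewriting of repeated recursions described in the Recursion hunk above can
be sketched as follows. This is an illustrative Python model that works on
pattern text, not PCRE2's actual compile code. The function name
expand_repeated_recursion and the {0,} rendering of an unlimited remainder are
assumptions made for the example; only the two worked cases, (?1){3} and
(?1){2,4}, come from the text.

```python
def expand_repeated_recursion(item, minimum, maximum=None):
    """Rewrite a quantified recursion such as (?1){2,4} as plain pattern text.

    item     the recursion item as pattern text, e.g. "(?1)"
    minimum  the minimum repeat count
    maximum  the maximum repeat count, or None for "unlimited"
    """
    if minimum < 0 or (maximum is not None and maximum < minimum):
        raise ValueError("invalid repeat range")

    # The minimum number of repetitions is written out as plain copies.
    expanded = item * minimum

    # Anything beyond the minimum goes into one non-capturing group.
    if maximum is None:
        # Assumption: an open-ended remainder is rendered as {0,}.
        expanded += "(?:" + item + "){0,}"
    elif maximum > minimum:
        expanded += "(?:" + item + "){0," + str(maximum - minimum) + "}"
    return expanded


print(expand_repeated_recursion("(?1)", 3, 3))  # (?1)(?1)(?1)
print(expand_repeated_recursion("(?1)", 2, 4))  # (?1)(?1)(?:(?1)){0,2}
```

Writing out the minimum as separate copies means each recursion is an ordinary
item that can be backtracked into, and only the optional remainder needs the
bracketed-group repeat handling.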
--- a/24
+++ b/24
@ -1,10 +1,6 @@
 Building PCRE2 without using autotools
 --------------------------------------

-This document has been converted from the PCRE1 document. I have removed a
-number of sections about building in various environments, as they applied only
-to PCRE1 and are probably out of date.
-
 This document contains the following sections:

  General
@ -183,21 +179,9 @@ can skip ahead to the CMake section.

 STACK SIZE IN WINDOWS ENVIRONMENTS

-The default processor stack size of 1Mb in some Windows environments is too
-small for matching patterns that need much recursion. In particular, test 2 may
-fail because of this. Normally, running out of stack causes a crash, but there
-have been cases where the test program has just died silently. See your linker
-documentation for how to increase stack size if you experience problems. If you
-are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
-compiler, you can increase the stack size for pcre2test and pcre2grep by
-setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
-example). The Linux default of 8Mb is a reasonable choice for the stack, though
-even that can be too small for some pattern/subject combinations.
-
-PCRE2 has a compile configuration option to disable the use of stack for
-recursion so that heap is used instead. However, pattern matching is
-significantly slower when this is done. There is more about stack usage in the
-"pcre2stack" documentation.
+Prior to release 10.30 the default system stack size of 1Mb in some Windows
+environments caused issues with some tests. This should no longer be the case
+for 10.30 and later releases.


 LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
 recommended download site.

 =============================
-Last Updated: 13 October 2016
+Last Updated: 17 March 2017
--- a/91
+++ b/91
@ -15,8 +15,8 @@ subscribe or manage your subscription here:

   https://lists.exim.org/mailman/listinfo/pcre-dev

-Please read the NEWS file if you are upgrading from a previous release.
-The contents of this README file are:
+Please read the NEWS file if you are upgrading from a previous release. The
+contents of this README file are:

  The PCRE2 APIs
  Documentation for PCRE2
@ -44,8 +44,8 @@ wrappers.

 The distribution does contain a set of C wrapper functions for the 8-bit
 library that are based on the POSIX regular expression API (see the pcre2posix
-man page). These can be found in a library called libpcre2-posix. Note that this
-just provides a POSIX calling interface to PCRE2; the regular expressions
+man page). These can be found in a library called libpcre2-posix. Note that
+this just provides a POSIX calling interface to PCRE2; the regular expressions
 themselves still follow Perl syntax and semantics. The POSIX API is restricted,
 and does not give full access to all of PCRE2's facilities.

@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
 Building PCRE2 on non-Unix-like systems
 ---------------------------------------

-For a non-Unix-like system, please read the comments in the file
-NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
-"make" you may be able to build PCRE2 using autotools in the same way as for
-many Unix-like systems.
+For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
+your system supports the use of "configure" and "make" you may be able to build
+PCRE2 using autotools in the same way as for many Unix-like systems.

 PCRE2 can also be configured using CMake, which can be run in various ways
 (command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
  architectures. If you try to enable it on an unsupported architecture, there
  will be a compile time error.

-. If you do not want to make use of the support for UTF-8 Unicode character
-  strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-  library, or UTF-32 Unicode character strings in the 32-bit library, you can
-  add --disable-unicode to the "configure" command. This reduces the size of
-  the libraries. It is not possible to configure one library with Unicode
-  support, and another without, in the same configuration.
+. If you do not want to make use of the default support for UTF-8 Unicode
+  character strings in the 8-bit library, UTF-16 Unicode character strings in
+  the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
+  library, you can add --disable-unicode to the "configure" command. This
+  reduces the size of the libraries. It is not possible to configure one
+  library with Unicode support, and another without, in the same configuration.
+  It is also not possible to use --enable-ebcdic (see below) with Unicode
+  support, so if this option is set, you must also use --disable-unicode.

  When Unicode support is available, the use of a UTF encoding still has to be
  enabled by setting the PCRE2_UTF option at run time or starting a pattern
  with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
-  either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
-  not possible to use both --enable-unicode and --enable-ebcdic at the same
-  time.
+  either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.

  As well as supporting UTF strings, Unicode support includes support for the
  \P, \p, and \X sequences that recognize Unicode character properties.
@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
  --with-match-limit=500000

  on the "configure" command. This is just the default; individual calls to
-  pcre2_match() can supply their own value. There is more discussion on the
-  pcre2api man page.
+  pcre2_match() can supply their own value. There is more discussion in the
+  pcre2api man page (search for pcre2_set_match_limit).

-. There is a separate counter that limits the depth of recursive function calls
-  during a matching process. This also has a default of ten million, which is
-  essentially "unlimited". You can change the default by setting, for example,
+. There is a separate counter that limits the depth of nested backtracking
+  during a matching process, which in turn limits the amount of memory that is
+  used. This also has a default of ten million, which is essentially
+  "unlimited". You can change the default by setting, for example,

-  --with-match-limit-recursion=500000
+  --with-match-limit-depth=5000

-  Recursive function calls use up the runtime stack; running out of stack can
-  cause programs to crash in strange ways. There is a discussion about stack
-  sizes in the pcre2stack man page.
+  There is more discussion in the pcre2api man page (search for
+  pcre2_set_depth_limit).

 . In the 8-bit library, the default maximum compiled pattern size is around
  64K bytes. You can increase this by adding --with-link-size=3 to the
@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
  performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
  link size setting is ignored, as 4-byte offsets are always used.

-. You can build PCRE2 so that its internal match() function that is called from
-  pcre2_match() does not call itself recursively. Instead, it uses memory
-  blocks obtained from the heap to save data that would otherwise be saved on
-  the stack. To build PCRE2 like this, use
-
-  --disable-stack-for-recursion
-
-  on the "configure" command. PCRE2 runs more slowly in this mode, but it may
-  be necessary in environments with limited stack sizes. This applies only to
-  the normal execution of the pcre2_match() function; if JIT support is being
-  successfully used, it is not relevant. Equally, it does not apply to
-  pcre2_dfa_match(), which does not use deeply nested recursion. There is a
-  discussion about stack sizes in the pcre2stack man page.
-
 . For speed, PCRE2 uses four tables for manipulating and identifying characters
  whose code point values are less than 256. By default, it uses a set of
  tables for ASCII encoding that is part of the distribution. If you specify
@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
  string. Otherwise, it is assumed to be a file name, and the contents of the
  file are the test string.

+. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
+  which caused pcre2_match() to use individual blocks on the heap for
+  backtracking instead of recursive function calls (which use the stack). This
+  is now obsolete since pcre2_match() was refactored always to use the heap (in
+  a much more efficient way than before). This option is retained for backwards
+  compatibility, but has no effect other than to output a warning.
+
 The "configure" script builds the following files for the basic C library:

 . Makefile             the makefile that builds the library
@ -662,25 +654,32 @@ Unicode support is enabled.
 Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
 16-bit and 32-bit modes. These are tests that generate different output in
 8-bit mode. Each pair is for general cases and Unicode support, respectively.
+
 Test 13 checks the handling of non-UTF characters greater than 255 by
 pcre2_dfa_match() in 16-bit and 32-bit modes.

-Test 14 contains a number of tests that must not be run with JIT. They check,
+Test 14 contains some special UTF and UCP tests that give different output for
+the different widths.
+
+Test 15 contains a number of tests that must not be run with JIT. They check,
 among other non-JIT things, the match-limiting features of the interpretive
 matcher.

-Test 15 is run only when JIT support is not available. It checks that an
+Test 16 is run only when JIT support is not available. It checks that an
 attempt to use JIT has the expected behaviour.

-Test 16 is run only when JIT support is available. It checks JIT complete and
+Test 17 is run only when JIT support is available. It checks JIT complete and
 partial modes, match-limiting under JIT, and other JIT-specific features.

-Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
+Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
 the 8-bit library, without and with Unicode support, respectively.

-Test 19 checks the serialization functions by writing a set of compiled
+Test 20 checks the serialization functions by writing a set of compiled
 patterns to a file, and then reloading and checking them.

+Tests 21 and 22 test \C support when the use of \C is not locked out, without
+and with UTF support, respectively. Test 23 tests \C when it is locked out.
+

 Character tables
 ----------------
@ -866,4 +865,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 01 November 2016
+Last updated: 17 March 2017