Documentation update.

Philip.Hazel 2017-03-17 16:55:58 +00:00
parent b6b716b540
commit 73735b81a3
3 changed files with 79 additions and 89 deletions

53
HACKING

@ -88,10 +88,10 @@ I had a flash of inspiration as to how I could run the real compile function in
a "fake" mode that enables it to compute how much memory it would need, while a "fake" mode that enables it to compute how much memory it would need, while
in most cases only ever using a small amount of working memory, and without too in most cases only ever using a small amount of working memory, and without too
many tests of the mode that might slow it down. So I refactored the compiling many tests of the mode that might slow it down. So I refactored the compiling
functions to work this way. This got rid of about 600 lines of source. It functions to work this way. This got rid of about 600 lines of source and made
should make future maintenance and development easier. As this was such a major further maintenance and development easier. As this was such a major change, I
change, I never released 6.8, instead upping the number to 7.0 (other quite never released 6.8, instead upping the number to 7.0 (other quite major changes
major changes were also present in the 7.0 release). were also present in the 7.0 release).
A side effect of this work was that the previous limit of 200 on the nesting A side effect of this work was that the previous limit of 200 on the nesting
depth of parentheses was removed. However, there was a downside: compiling ran depth of parentheses was removed. However, there was a downside: compiling ran
@ -122,7 +122,7 @@ all the named subpatterns and their corresponding group numbers. This means
that the actual compile (both the memory-computing dummy run and the real that the actual compile (both the memory-computing dummy run and the real
compile) has full knowledge of group names and numbers throughout. Several compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes is complicated. short. In particular, parsing and skipping over [] classes was complicated.
While working on 10.22 I realized that I could simplify yet again by moving While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@ -162,7 +162,7 @@ simpler than before.
Most errors can be diagnosed during the parsing scan. For those that cannot Most errors can be diagnosed during the parsing scan. For those that cannot
(for example, "lookbehind assertion is not fixed length"), the parsed code (for example, "lookbehind assertion is not fixed length"), the parsed code
contains offsets into the pattern so that the actual compiling code can contains offsets into the pattern so that the actual compiling code can
identify where errors occur. report where errors are.
The elements of the parsed pattern vector The elements of the parsed pattern vector
@ -217,10 +217,10 @@ The following have data in the lower 16 bits, and may be followed by other data
elements: elements:
META_ALT | alternation META_ALT | alternation
META_BACKREF META_BACKREF back reference
META_CAPTURE META_CAPTURE start of capturing group
META_ESCAPE META_ESCAPE non-literal escape sequence
META_RECURSE META_RECURSE recursion call
If the data for META_ALT is non-zero, it is inside a lookbehind, and the data If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
is the length of its branch, for which OP_REVERSE must be generated. is the length of its branch, for which OP_REVERSE must be generated.
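The split between a META code and its 16-bit data can be sketched as follows. This is only an illustration in Python, not the PCRE2 source. The numeric value given to META_ALT, the mask covering the upper bits, and the helper names make_element(), meta_code() and meta_data() are all assumptions made for the example; the text above states only that the data is held in the lower 16 bits.

```python
# Hypothetical value for illustration only; the real constant is defined in
# the PCRE2 source and is not given in this document.
META_ALT = 0x80010000

DATA_MASK = 0x0000FFFF   # lower 16 bits hold the data
CODE_MASK = 0xFFFF0000   # assumed: the remaining bits identify the item


def make_element(meta, data):
    """Combine a META code with a 16-bit data value."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data must fit in 16 bits")
    return meta | data


def meta_code(element):
    """Return the part of the element that identifies the item."""
    return element & CODE_MASK


def meta_data(element):
    """Return the 16-bit data carried by the element."""
    return element & DATA_MASK


# A META_ALT inside a lookbehind whose branch length is 5.
element = make_element(META_ALT, 5)
```

With this layout a zero data value can stand for "not inside a lookbehind" without needing an extra element.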
@ -232,8 +232,8 @@ META_BACKREF is followed by an offset if the back reference group number is 10
or more. The offsets of the first occurrences of references to groups whose or more. The offsets of the first occurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error is pattern elements for items such as \3. The offset is used when an error occurs
given for a reference to a non-existent group. because the reference is to a non-existent group.
META_RECURSE is always followed by an offset, for use in error messages. META_RECURSE is always followed by an offset, for use in error messages.
@ -286,7 +286,7 @@ group; this is used when generating OP_REVERSE for that branch.
META_LOOKBEHIND (?<= META_LOOKBEHIND (?<=
META_LOOKBEHINDNOT (?<! META_LOOKBEHINDNOT (?<!
The following are followed by two values, the minimum and maximum. Repeat The following are followed by two elements, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT: represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
@ -369,7 +369,7 @@ unit, the most significant unit is first.
In this description, we assume the "normal" compilation options. Data values In this description, we assume the "normal" compilation options. Data values
that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
(most significant byte first), or one code unit in 16-bit and 32-bit modes. (most significant byte first), and one code unit in 16-bit and 32-bit modes.
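As a rough illustration of the count layout just described, the following Python sketch returns the code units that would hold a count in each mode. The function name encode_count() and its parameters are hypothetical and are not part of PCRE2; the sketch also assumes that counts never exceed 16 bits, which follows from the two-byte size in 8-bit mode.

```python
def encode_count(value, code_unit_bits):
    """Return the list of code units that hold a count such as a quantifier."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("count must fit in 16 bits")
    if code_unit_bits == 8:
        # Two bytes, most significant byte first.
        return [value >> 8, value & 0xFF]
    if code_unit_bits in (16, 32):
        # A single code unit is wide enough to hold the whole count.
        return [value]
    raise ValueError("code unit width must be 8, 16 or 32")
```

For example, a count of 300 (0x012C) occupies the two bytes 1 and 44 in 8-bit mode, but only one code unit in the wider modes.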
Opcodes with no following data Opcodes with no following data
@ -409,7 +409,7 @@ These items are all just one unit long
OP_ACCEPT ) These are Perl 5.10's "backtracking control OP_ACCEPT ) These are Perl 5.10's "backtracking control
OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
OP_FAIL ) parentheses, it may be preceded by one or more OP_FAIL ) parentheses, it may be preceded by one or more
OP_PRUNE ) OP_CLOSE, each followed by a count that OP_PRUNE ) OP_CLOSE, each followed by a number that
OP_SKIP ) indicates which parentheses must be closed. OP_SKIP ) indicates which parentheses must be closed.
OP_THEN ) OP_THEN )
@ -679,7 +679,7 @@ Once-only (atomic) groups
These are just like other subpatterns, but they start with the opcode OP_ONCE. These are just like other subpatterns, but they start with the opcode OP_ONCE.
The check for matching an empty string in an unbounded repeat is handled The check for matching an empty string in an unbounded repeat is handled
entirely at runtime, so there are just this one opcode for atomic groups. entirely at runtime, so there is just this one opcode for atomic groups.
Assertions Assertions
@ -742,14 +742,21 @@ Recursion
Recursion either matches the current pattern, or some subexpression. The opcode Recursion either matches the current pattern, or some subexpression. The opcode
OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
bracket from the start of the whole pattern. OP_RECURSE is also used for bracket from the start of the whole pattern. OP_RECURSE is also used for
"subroutine" calls, even though they are not strictly a recursion. Repeated "subroutine" calls, even though they are not strictly a recursion. Up till
recursions are automatically wrapped inside OP_ONCE brackets, because otherwise release 10.30 recursions were treated as atomic groups, making them
some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
brackets, but it is nevertheless still treated as an atomic group. backtracking into recursions is supported.
Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
forced no backtracking, but also allowed repetition to be handled as for other
bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
their minimum repetitions, and then wrapped in non-capturing brackets for the
remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
treated as (?1)(?1)(?:(?1)){0,2}.
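The rewriting of repeated recursions described above can be sketched at the pattern-text level as follows. This is a minimal Python illustration, not the way PCRE2 does it internally (the compiler works on the parsed pattern, not on strings). The function name expand_repeated_recursion() is hypothetical, and only finite maximum values are handled, because the text above does not say how an unlimited maximum is treated.

```python
def expand_repeated_recursion(item, minimum, maximum):
    """Rewrite item{minimum,maximum} in the style used from 10.30 onwards."""
    if minimum < 0 or maximum < minimum:
        raise ValueError("need 0 <= minimum <= maximum")
    # Duplicate the recursion for the minimum number of repetitions.
    result = item * minimum
    # Wrap one more copy in non-capturing brackets for the remainder.
    remainder = maximum - minimum
    if remainder > 0:
        result += "(?:%s){0,%d}" % (item, remainder)
    return result
```

Calling expand_repeated_recursion("(?1)", 3, 3) and expand_repeated_recursion("(?1)", 2, 4) reproduces the two examples given in the text.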
Callout Callouts
------- --------
A callout can nowadays have either a numerical argument or a string argument. A callout can nowadays have either a numerical argument or a string argument.
These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
@ -787,4 +794,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
correct length, in order to catch updating errors. correct length, in order to catch updating errors.
Philip Hazel Philip Hazel
March 2017 17 March 2017


@ -1,10 +1,6 @@
Building PCRE2 without using autotools Building PCRE2 without using autotools
-------------------------------------- --------------------------------------
This document has been converted from the PCRE1 document. I have removed a
number of sections about building in various environments, as they applied only
to PCRE1 and are probably out of date.
This document contains the following sections: This document contains the following sections:
General General
@ -183,21 +179,9 @@ can skip ahead to the CMake section.
STACK SIZE IN WINDOWS ENVIRONMENTS STACK SIZE IN WINDOWS ENVIRONMENTS
The default processor stack size of 1Mb in some Windows environments is too Prior to release 10.30 the default system stack size of 1Mb in some Windows
small for matching patterns that need much recursion. In particular, test 2 may environments caused issues with some tests. This should no longer be the case
fail because of this. Normally, running out of stack causes a crash, but there for 10.30 and later releases.
have been cases where the test program has just died silently. See your linker
documentation for how to increase stack size if you experience problems. If you
are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
compiler, you can increase the stack size for pcre2test and pcre2grep by
setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
example). The Linux default of 8Mb is a reasonable choice for the stack, though
even that can be too small for some pattern/subject combinations.
PCRE2 has a compile configuration option to disable the use of stack for
recursion so that heap is used instead. However, pattern matching is
significantly slower when this is done. There is more about stack usage in the
"pcre2stack" documentation.
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
@ -393,4 +377,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the
recommended download site. recommended download site.
============================= =============================
Last Updated: 13 October 2016 Last Updated: 17 March 2017

91
README

@ -15,8 +15,8 @@ subscribe or manage your subscription here:
https://lists.exim.org/mailman/listinfo/pcre-dev https://lists.exim.org/mailman/listinfo/pcre-dev
Please read the NEWS file if you are upgrading from a previous release. Please read the NEWS file if you are upgrading from a previous release. The
The contents of this README file are: contents of this README file are:
The PCRE2 APIs The PCRE2 APIs
Documentation for PCRE2 Documentation for PCRE2
@ -44,8 +44,8 @@ wrappers.
The distribution does contain a set of C wrapper functions for the 8-bit The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix library that are based on the POSIX regular expression API (see the pcre2posix
man page). These can be found in a library called libpcre2-posix. Note that this man page). These can be found in a library called libpcre2-posix. Note that
just provides a POSIX calling interface to PCRE2; the regular expressions this just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted, themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities. and does not give full access to all of PCRE2's facilities.
@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
Building PCRE2 on non-Unix-like systems Building PCRE2 on non-Unix-like systems
--------------------------------------- ---------------------------------------
For a non-Unix-like system, please read the comments in the file For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and your system supports the use of "configure" and "make" you may be able to build
"make" you may be able to build PCRE2 using autotools in the same way as for PCRE2 using autotools in the same way as for many Unix-like systems.
many Unix-like systems.
PCRE2 can also be configured using CMake, which can be run in various ways PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file (command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@ -174,19 +173,19 @@ library. They are also documented in the pcre2build man page.
architectures. If you try to enable it on an unsupported architecture, there architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error. will be a compile time error.
. If you do not want to make use of the support for UTF-8 Unicode character . If you do not want to make use of the default support for UTF-8 Unicode
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit character strings in the 8-bit library, UTF-16 Unicode character strings in
library, or UTF-32 Unicode character strings in the 32-bit library, you can the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
add --disable-unicode to the "configure" command. This reduces the size of library, you can add --disable-unicode to the "configure" command. This
the libraries. It is not possible to configure one library with Unicode reduces the size of the libraries. It is not possible to configure one
support, and another without, in the same configuration. library with Unicode support, and another without, in the same configuration.
It is also not possible to use --enable-ebcdic (see below) with Unicode
support, so if this option is set, you must also use --disable-unicode.
When Unicode support is available, the use of a UTF encoding still has to be When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
not possible to use both --enable-unicode and --enable-ebcdic at the same
time.
As well as supporting UTF strings, Unicode support includes support for the As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties. \P, \p, and \X sequences that recognize Unicode character properties.
@ -232,18 +231,18 @@ library. They are also documented in the pcre2build man page.
--with-match-limit=500000 --with-match-limit=500000
on the "configure" command. This is just the default; individual calls to on the "configure" command. This is just the default; individual calls to
pcre2_match() can supply their own value. There is more discussion on the pcre2_match() can supply their own value. There is more discussion in the
pcre2api man page. pcre2api man page (search for pcre2_set_match_limit).
. There is a separate counter that limits the depth of recursive function calls . There is a separate counter that limits the depth of nested backtracking
during a matching process. This also has a default of ten million, which is during a matching process, which in turn limits the amount of memory that is
essentially "unlimited". You can change the default by setting, for example, used. This also has a default of ten million, which is essentially
"unlimited". You can change the default by setting, for example,
--with-match-limit-recursion=500000 --with-match-limit-depth=5000
Recursive function calls use up the runtime stack; running out of stack can There is more discussion in the pcre2api man page (search for
cause programs to crash in strange ways. There is a discussion about stack pcre2_set_depth_limit).
sizes in the pcre2stack man page.
. In the 8-bit library, the default maximum compiled pattern size is around . In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the 64K bytes. You can increase this by adding --with-link-size=3 to the
@ -254,20 +253,6 @@ library. They are also documented in the pcre2build man page.
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used. link size setting is ignored, as 4-byte offsets are always used.
. You can build PCRE2 so that its internal match() function that is called from
pcre2_match() does not call itself recursively. Instead, it uses memory
blocks obtained from the heap to save data that would otherwise be saved on
the stack. To build PCRE2 like this, use
--disable-stack-for-recursion
on the "configure" command. PCRE2 runs more slowly in this mode, but it may
be necessary in environments with limited stack sizes. This applies only to
the normal execution of the pcre2_match() function; if JIT support is being
successfully used, it is not relevant. Equally, it does not apply to
pcre2_dfa_match(), which does not use deeply nested recursion. There is a
discussion about stack sizes in the pcre2stack man page.
. For speed, PCRE2 uses four tables for manipulating and identifying characters . For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of whose code point values are less than 256. By default, it uses a set of
tables for ASCII encoding that is part of the distribution. If you specify tables for ASCII encoding that is part of the distribution. If you specify
@ -389,6 +374,13 @@ library. They are also documented in the pcre2build man page.
string. Otherwise, it is assumed to be a file name, and the contents of the string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string. file are the test string.
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
which caused pcre2_match() to use individual blocks on the heap for
backtracking instead of recursive function calls (which use the stack). This
is now obsolete since pcre2_match() was refactored always to use the heap (in
a much more efficient way than before). This option is retained for backwards
compatibility, but has no effect other than to output a warning.
The "configure" script builds the following files for the basic C library: The "configure" script builds the following files for the basic C library:
. Makefile the makefile that builds the library . Makefile the makefile that builds the library
@ -662,25 +654,32 @@ Unicode support is enabled.
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in 16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair are for general cases and Unicode support, respectively. 8-bit mode. Each pair are for general cases and Unicode support, respectively.
Test 13 checks the handling of non-UTF characters greater than 255 by Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes. pcre2_dfa_match() in 16-bit and 32-bit modes.
Test 14 contains a number of tests that must not be run with JIT. They check, Test 14 contains some special UTF and UCP tests that give different output for
the different widths.
Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive among other non-JIT things, the match-limiting features of the interpretive
matcher. matcher.
Test 15 is run only when JIT support is not available. It checks that an Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour. attempt to use JIT has the expected behaviour.
Test 16 is run only when JIT support is available. It checks JIT complete and Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features. partial modes, match-limiting under JIT, and other JIT-specific features.
Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively. the 8-bit library, without and with Unicode support, respectively.
Test 19 checks the serialization functions by writing a set of compiled Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them. patterns to a file, and then reloading and checking them.
Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.
Character tables Character tables
---------------- ----------------
@ -866,4 +865,4 @@ The distribution should contain the files listed below.
Philip Hazel Philip Hazel
Email local part: ph10 Email local part: ph10
Email domain: cam.ac.uk Email domain: cam.ac.uk
Last updated: 01 November 2016 Last updated: 17 March 2017