Update maintenance documentation

Philip Hazel 2022-04-25 15:07:14 +01:00
parent f65df06305
commit 104fe2fead
2 changed files with 56 additions and 61 deletions

61
HACKING

@ -8,8 +8,8 @@ library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
confusion with PCRE1.
releases carried on the 8.xx series, up to the final 8.45 release. PCRE2
releases started at 10.00 to avoid confusion with PCRE1.
Historical note 1
@ -38,8 +38,8 @@ Historical note 2
By contrast, the code originally written by Henry Spencer (which was
subsequently heavily modified for Perl) compiles the expression twice: once in
a dummy mode in order to find out how much store will be needed, and then for
real. (The Perl version probably doesn't do this any more; I'm talking about
the original library.) The execution function operates by backtracking and
real. (The Perl version may or may not still do this; I'm talking about the
original library.) The execution function operates by backtracking and
maximizing (or, optionally, minimizing, in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.
@ -151,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
(which of course is only actually true if all code units are literals) plus one
at the end. There is a default parsed pattern vector on the system stack, but
if this is not big enough, heap memory is used.
at the end. A default parsed pattern vector is defined on the system stack, to
minimize memory handling, but if this is not big enough, heap memory is used.
As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It
@ -187,7 +187,7 @@ META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
META_CLASS_END ] end of non-empty class
META_CLASS_NOT [^ start non-empty negative class
META_COMMIT (*COMMIT)
META_COMMIT (*COMMIT) - no argument (see below for with argument)
META_COND_ASSERT (?(?assertion)
META_DOLLAR $ metacharacter
META_DOT . metacharacter
@ -201,14 +201,14 @@ META_NOCAPTURE (?: no capture parens
META_PLUS +
META_PLUS_PLUS ++
META_PLUS_QUERY +?
META_PRUNE (*PRUNE) - no argument
META_PRUNE (*PRUNE) - no argument (see below for with argument)
META_QUERY ?
META_QUERY_PLUS ?+
META_QUERY_QUERY ??
META_RANGE_ESCAPED hyphen in class range with at least one escape
META_RANGE_LITERAL hyphen in class range defined literally
META_SKIP (*SKIP) - no argument
META_THEN (*THEN) - no argument
META_SKIP (*SKIP) - no argument (see below for with argument)
META_THEN (*THEN) - no argument (see below for with argument)
The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
@ -229,7 +229,8 @@ If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
is the length of its branch, for which OP_REVERSE must be generated.
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element.
their data in the lower 16 bits of the element. META_RECURSE is followed by an
offset, for use in error messages.
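
As a rough illustration of the layout described above, a parsed pattern
element can be treated as a 32-bit value whose lower 16 bits carry the data
(here a capture group number). The names, the use of the upper 16 bits for the
item code, and the code value itself are assumptions made for this sketch;
they are not taken from the PCRE2 source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical masks. The text states only that the data occupies the lower
16 bits; treating the upper 16 bits as the item code is an assumption. */

#define SKETCH_CODE_MASK 0xffff0000u
#define SKETCH_DATA_MASK 0x0000ffffu

/* Arbitrary item code chosen for illustration, not a real META_xxx value. */

#define SKETCH_META_CAPTURE 0x80010000u

/* Combine an item code with a group number that fits in 16 bits. */

static uint32_t sketch_make_element(uint32_t code, uint32_t group)
{
assert((code & SKETCH_DATA_MASK) == 0);   /* code must leave data bits clear */
assert(group <= SKETCH_DATA_MASK);        /* group must fit in 16 bits */
return code | group;
}

/* Extract the two halves again. */

static uint32_t sketch_element_code(uint32_t element)
{
return element & SKETCH_CODE_MASK;
}

static uint32_t sketch_element_data(uint32_t element)
{
return element & SKETCH_DATA_MASK;
}
```

With these helpers, an element built for group 12 yields 12 again from
sketch_element_data(), and the item code from sketch_element_code().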
META_BACKREF is followed by an offset if the back reference group number is 10
or more. The offsets of the first occurrences of references to groups whose
@ -238,8 +239,6 @@ occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error occurs
because the reference is to a non-existent group.
META_RECURSE is always followed by an offset, for use in error messages.
META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
element contains the 16-bit type and data property values, packed together.
ESC_g and ESC_k are used only for named references - numerical ones are turned
@ -291,9 +290,9 @@ META_LOOKBEHIND (?<= start of lookbehind
META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
META_LOOKBEHINDNOT (?<! start of negative lookbehind
The following are followed by two elements, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
The following are followed by two elements, the minimum and maximum. The
maximum value is limited to 65535 (MAX_REPEAT). A maximum value of "unlimited"
is represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
META_MINMAX {n,m} repeat
META_MINMAX_PLUS {n,m}+ repeat
@ -347,11 +346,11 @@ support is not available for this kind of matching.
Changeable options
------------------
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
others) may be changed in the middle of patterns by items such as (?i). Their
processing is handled entirely at compile time by generating different opcodes
for the different settings. The runtime functions do not need to keep track of
an option's state.
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
some others may be changed in the middle of patterns by items such as (?i).
Their processing is handled entirely at compile time by generating different
opcodes for the different settings. The runtime functions do not need to keep
track of an option's state.
PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
are tracked and processed during the parsing pre-pass. The others are handled
@ -437,7 +436,7 @@ Backtracking control verbs
--------------------------
Verbs with no arguments generate opcodes with no following data (as listed
in the section above).
in the section above).
(*MARK:NAME) generates OP_MARK followed by the mark name, preceded by a
length in one code unit, and followed by a binary zero. The name length is
@ -468,8 +467,8 @@ Caseless matching (positive or negative) of characters that have more than two
case-equivalent code points (which is possible only in UTF mode) is handled by
compiling a Unicode property item (see below), with the pseudo-property
PT_CLIST. The value of this property is an offset in a vector called
"ucd_caseless_sets" which identifies the start of a short list of equivalent
characters, terminated by the value NOTACHAR (0xffffffff).
"ucd_caseless_sets" which identifies the start of a short list of case
equivalent characters, terminated by the value NOTACHAR (0xffffffff).
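
The list lookup described above might be sketched as follows. Only the
terminator value NOTACHAR (0xffffffff) comes from the text; the table contents
and the function name are illustrative stand-ins, not the real
"ucd_caseless_sets" vector or any PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOTACHAR 0xffffffffu   /* list terminator, as given in the text */

/* Illustrative table only: two short lists of case-equivalent code points,
each ended by NOTACHAR. The first list starts at offset 1 (K, k, KELVIN SIGN)
and the second at offset 5 (S, s, LATIN SMALL LETTER LONG S). */

static const uint32_t sketch_sets[] = {
  NOTACHAR,
  0x004b, 0x006b, 0x212a, NOTACHAR,
  0x0053, 0x0073, 0x017f, NOTACHAR
};

/* Return 1 if c is in the list that starts at the given offset, else 0. */

static int sketch_in_caseless_set(const uint32_t *sets, size_t offset,
  uint32_t c)
{
const uint32_t *p;
assert(sets != NULL);
for (p = sets + offset; *p != NOTACHAR; p++)
  if (*p == c) return 1;
return 0;
}
```

For example, sketch_in_caseless_set(sketch_sets, 1, 0x212a) reports a match,
whereas the same character looked up at offset 5 does not.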
Repeating single characters
@ -546,9 +545,9 @@ Each is followed by two code units that encode the desired property as a type
and a value. The types are a set of #defines of the form PT_xxx, and the values
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
The value is relevant only for PT_GC (General Category), PT_PC (Particular
Category), PT_SC (Script), PT_BIDICL (Bidi Class), and the pseudo-property
PT_CLIST, which is used to identify a list of case-equivalent characters when
there are three or more.
Category), PT_SC (Script), PT_BIDICL (Bidi Class), PT_BOOL (Boolean property),
and the pseudo-property PT_CLIST, which is used to identify a list of
case-equivalent characters when there are three or more (see above).
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@ -666,9 +665,9 @@ a count that immediately follows the offset.
There are several opcodes that mark the end of a subpattern group. OP_KET is
used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively, and OP_KETRPOS for possessive repetitions (see below for more
respectively, and OP_KETRPOS for possessive repetitions (see below for more
details). All four are followed by a LINK_SIZE value giving (as a positive
number) the offset back to the matching bracket opcode.
number) the offset back to the matching opening bracket opcode.
If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@ -719,7 +718,7 @@ Assertions
Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK,
OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK,
OP_ASSERTBACK_NA, and OP_ASSERTBACK_NOT, and the first opcode inside the
assertion is OP_REVERSE, followed by a count of the number of characters to
move back the pointer in the subject string. In ASCII or UTF-32 mode, the count
@ -828,4 +827,4 @@ not a real opcode, but is used to check at compile time that tables indexed by
opcode are the correct length, in order to catch updating errors.
Philip Hazel
December 2021
April 2022


@ -78,9 +78,9 @@ utf8.c
A short, freestanding C program for converting a Unicode code point into a
sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a
hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes.
If its argument is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
treats them as a UTF-8 character and outputs the equivalent code point in
hex. See comments at its head for details.
If its argument is a sequence of concatenated UTF-8 bytes (e.g. 12e188b4) it
treats them as a UTF-8 string and outputs the equivalent code points in hex.
See comments at its head for details.
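
For readers unfamiliar with the encoding, the code point to UTF-8 direction
can be sketched as below. This is an independent minimal illustration, not the
code in utf8.c; the function name is hypothetical, and the choice to reject
surrogates and values above 0x10ffff is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out, which must have room for 4 bytes.
Returns the number of bytes written (1 to 4), or 0 if cp is a surrogate or is
greater than 0x10ffff. */

static size_t sketch_utf8_encode(uint32_t cp, unsigned char *out)
{
assert(out != NULL);

if (cp < 0x80)
  {
  out[0] = (unsigned char)cp;
  return 1;
  }

if (cp < 0x800)
  {
  out[0] = (unsigned char)(0xc0 | (cp >> 6));
  out[1] = (unsigned char)(0x80 | (cp & 0x3f));
  return 2;
  }

if (cp < 0x10000)
  {
  if (cp >= 0xd800 && cp <= 0xdfff) return 0;   /* surrogate */
  out[0] = (unsigned char)(0xe0 | (cp >> 12));
  out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
  out[2] = (unsigned char)(0x80 | (cp & 0x3f));
  return 3;
  }

if (cp <= 0x10ffff)
  {
  out[0] = (unsigned char)(0xf0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
  out[3] = (unsigned char)(0x80 | (cp & 0x3f));
  return 4;
  }

return 0;   /* out of range */
}
```

Encoding 0x1234 this way gives the three bytes e1 88 b4, which matches the
byte sequence used in the example above.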
Updating to a new Unicode release
@ -94,8 +94,9 @@ directory.
Note: Previously, it was necessary to update lists of scripts and their
abbreviations by hand before running the Python scripts. This is no longer
necessary because the scripts have been upgraded to extract this information
themselves. Also, there used to be explicit lists of script in two of the man
pages. This is no longer the case.
themselves. Also, there used to be explicit lists of scripts in two of the man
pages. This is no longer the case; the pcre2test program can now output a list
of supported scripts.
You can give an output file name as an argument to the following scripts, but
by default:
@ -129,8 +130,8 @@ files should eventually be installed in the main testdata directory.
Preparing for a PCRE2 release
=============================
This section contains a checklist of things that I consult before building a
distribution for a new release.
This section contains a checklist of things that I do before building a new
release.
. Ensure that the version number and version date are correct in configure.ac.
@ -139,17 +140,16 @@ distribution for a new release.
. If new build options or new source files have been added, ensure that they
are added to the CMake files as well as to the autoconf files. The relevant
files are CMakeLists.txt and config-cmake.h.in. After making a release
tarball, test it out with CMake if there have been changes here.
files are CMakeLists.txt and config-cmake.h.in. After making a release, test
it out with CMake if there have been changes here.
. Run ./autogen.sh to ensure everything is up-to-date.
. Compile and test with many different config options, and combinations of
options. Also, test with valgrind by running "RunTest valgrind" and
"RunGrepTest valgrind" (which takes quite a long time). The script
maint/ManyConfigTests now encapsulates this testing. It runs tests with
different configurations, and it also runs some of them with valgrind, all of
which can take quite some time.
"RunGrepTest valgrind". The script maint/ManyConfigTests now encapsulates
this testing. It runs tests with different configurations, and it also runs
some of them with valgrind, all of which can take quite some time.
. Run tests in both 32-bit and 64-bit environments if possible. I can no longer
run 32-bit tests.
@ -164,7 +164,8 @@ distribution for a new release.
-fsanitize=signed-integer-overflow
. Do a test build using CMake. Remove src/config.h first, lest it override the
version that CMake creates. Do NOT use parallel make.
version that CMake creates. Also do a CMake unity build to check that it
still works: [c]cmake -DCMAKE_UNITY_BUILD=ON sets up a unity build.
. Run perltest.sh on the test data for tests 1 and 4. The output should match
the PCRE2 test output, apart from the version identification at the start of
@ -183,11 +184,12 @@ distribution for a new release.
systems. For example, on Solaris it is helpful to test using Sun's cc
compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
for test 2. Since I retired I can no longer do much of this, but instead I
rely on putting out release candidates for testing by the community.
for test 2. Since I retired I can no longer do much of this. There are
automated tests under Ubuntu, Alpine, and Windows that are now set up as
GitHub actions. Check that they are running clean.
. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
of PCRE2 and should be checked before putting out a release.
of PCRE2 and should also be checked before putting out a release.
Updating version info for libtool
@ -243,10 +245,11 @@ it reports them and then aborts. Otherwise it removes trailing spaces from
sources and refreshes the HTML documentation. Update the GitHub repository with
"git push".
Once PrepareRelease has run clean, run "make distcheck" to create the tarball
Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
and the zipball. I then sign these files. Double-check with "git status" that
the repository is fully up-to-date, then create a new tag on GitHub. Upload the
tarball, zipball, and the signatures as "assets" of the GitHub release.
the repository is fully up-to-date, then create a new tag and a release on
GitHub. Upload the tarballs, zipball, and the signatures as "assets" of the
GitHub release.
When the new release is out, don't forget to tell webmaster@pcre.org and the
mailing list.
@ -365,8 +368,6 @@ years.
See Unicode TR 29. The last two are very much aimed at natural language.
. (?[...]) extended classes: big project.
. Allow a callout to specify a number of characters to skip. This can be done
compatibly via an extra callout field.
@ -436,13 +437,8 @@ years.
with lookarounds for \b and \B. Ideally the setting should last till the end
of the group, which means remembering all previous settings; maybe a fixed
amount of stack would do - how deep would anyone want to nest these things?
See GitHub issue #13 for a compendium of character class issues.
. Recognize the short script names. They are already listed in maint/
Multistage2.py because they are needed for scanning the script extensions
file.
. Use script extensions for \p?
See GitHub issue #13 for a compendium of character class issues, including
(?[...]) extended classes.
. A user suggested something like --with-build-info to set a build information
string that could be retrieved by pcre2_config(). However, there's no
@ -461,4 +457,4 @@ years.
Philip Hazel
Email local part: Philip.Hazel
Email domain: gmail.com
Last updated: 10 January 2022
Last updated: 25 April 2022